Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegexpError "invalid pattern in look-behind" for certain Regexps since 9.1.16.0 #5086

Closed
naag opened this issue Mar 13, 2018 · 7 comments
Closed
Assignees

Comments

@naag
Copy link

naag commented Mar 13, 2018

We've been running a regular expression matching certain file extensions since JRuby 9.1.12.0, but after upgrading to 9.1.16.0, it breaks. 9.1.15.0 is the last release where it works.

We've stripped down the regular expression a lot and it still fails with RegexpError: invalid pattern in look-behind: /^.*(?<!css)$/i for the most basic case. Example:

$ irb
jruby-9.1.16.0 :001 > "foo.html" =~ /^.*(?<!css)$/i
RegexpError: invalid pattern in look-behind: /^.*(?<!css)$/i
	from org/jruby/RubyRegexp.java:1097:in `=~'
	from org/jruby/RubyString.java:1613:in `=~'
	from (irb):1:in `<eval>'
	from org/jruby/RubyKernel.java:995:in `eval'
	from org/jruby/RubyKernel.java:1316:in `loop'
	from org/jruby/RubyKernel.java:1138:in `catch'
	from org/jruby/RubyKernel.java:1138:in `catch'
	from /usr/local/rvm/rubies/jruby-9.1.16.0/bin/irb:13:in `<main>'

Another expression that fails is /^.*(?<!tiff)$/i. Both work if we change the last letter from s to something else or t to something else. They also work in case sensitive mode. MRI 2.4.3 works fine with all of these variants.

Is there any other input you need?

Environment

$ jruby -v
jruby 9.1.16.0 (2.3.3) 2018-02-21 8f3f95a Java HotSpot(TM) 64-Bit Server VM 25.161-b12 on 1.8.0_161-b12 +jit [linux-x86_64]

$ env | grep '\(JRUBY\|JAVA\)' | sort
JAVA_HOME=/usr/lib/jvm/java-8-oracle
JRUBY_OPTS=--server -J-Xmn128m -J-Xms1536m -J-Xmx1536m

$ uname -a
Linux 6ee27f8ee9cd 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.2 LTS
Release:	16.04
Codename:	xenial
@enebo enebo added this to the JRuby 9.1.17.0 milestone Mar 13, 2018
lopex added a commit that referenced this issue Mar 14, 2018

Verified

This commit was signed with the committer’s verified signature.
headius Charles Oliver Nutter
… Regexps since 9.1.16.0
@lopex
Copy link
Contributor

lopex commented Mar 14, 2018

MRI has special case for us-ascii regexps and strings with 7-bit coderange where it uses regexp encoding for the match, whereas we're creating new regexp with actual string encoding. So ultimately this could be a big performance penalty for two reasons:

  • we're polluting regexp cache and/or creating new regexps every time in case of dregexps
  • we're matching using utf-8 encoding unnecessarily which can be tens of times slower and even more if the match is case insensitive.

Fixed in 36b44df

@naag
Copy link
Author

naag commented Mar 14, 2018

Thanks a lot @lopex, going to test it tomorrow with our code base!

@lopex
Copy link
Contributor

lopex commented Mar 14, 2018

I think the root cause also deserves some explanation. Character series like ss and ff are special because in Unicode for example ss is https://en.wikipedia.org/wiki/%C3%9F.
In case insensitive mode ss is unfolded to [S, ſ, ß, ẞ] alternatives (having different length in bytes). Another factor that triggered this issue is look-behind limitations - it doesnt allow variable length match as in (?<!.*) for example. For some reason Onigmo treats (?<!css) as a variable (not to be confused with different) length alternative in case insensitive mode even though all alternatives are fixed in size, why then is (?<!ss) ok ? this is a mystery right now.

So right now there's some caviats regarding MRI "us-ascii regexps by default":

  • "foo" =~ /(?<!css)/i # works
  • "foą" =~ /(?<!css)/i # blows
  • "foo" =~ /(?<!css)/iu # blows

If either string or regexp happens to end up with unicode encoding, look-behind will blow.

@lopex
Copy link
Contributor

lopex commented Mar 15, 2018

Also, great majority of regexps will end up being up to multiple times faster by default, with cases like e ("_" * 1000) + "" =~ /[a-z]+/i being 35 times faster.

@naag
Copy link
Author

naag commented Mar 15, 2018

Thank you for the detailed explanation! Unfortunately I was unable to build JRuby locally, but I can try again when 9.1.17.0 is released :-)

@naag naag closed this as completed Mar 15, 2018
@enebo
Copy link
Member

enebo commented Mar 15, 2018

@naag I just corrected an issue with our nightlies link: http://jruby.org/nightly (click stable).

@lopex
Copy link
Contributor

lopex commented Mar 18, 2018

Onigmo issue: k-takata/Onigmo#92

headius pushed a commit that referenced this issue Mar 21, 2018
kares added a commit to kares/jruby that referenced this issue Mar 27, 2018
* jruby-9.1:
  [fix] cast nsec nanos to long to avoid "overflow" with double value
  Handle this deprecation differently.
  Default to Java 9 bytecode for any java.specification.version>1.8.
  WrapperMethod is still needed for visibility.
  Revert "Finally eliminate use of WrapperMethod."
  Eliminate deprecation warnings in test suite.
  Finally eliminate use of WrapperMethod.
  Fix most deprecated calls.
  Handle error when attempting to connect to IP6 with default INET4.
  Add test_coverage to jruby.index.
  Add test for null filename in coverage. jruby#5111
  Do not attempt to add coverage for null filename. Fixes jruby#5111.
  Add basic specs for Exception#backtrace_locations.
  Exception.backtrace_locations should persist and be mutable.
  Return nil if no backtrace has been captured. Fixes jruby#5099.
  fix for jruby#5086, RegexpError invalid pattern in look-behind for certain Regexps since 9.1.16.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants