RegexpError "invalid pattern in look-behind" for certain Regexps since 9.1.16.0 #5086

naag · 2018-03-13T20:46:44Z

We've been running a regular expression matching certain file extensions since JRuby 9.1.12.0, but after upgrading to 9.1.16.0, it breaks. 9.1.15.0 is the last release where it works.

We've stripped down the regular expression a lot and it still fails with RegexpError: invalid pattern in look-behind: /^.*(?<!css)$/i for the most basic case. Example:

$ irb
jruby-9.1.16.0 :001 > "foo.html" =~ /^.*(?<!css)$/i
RegexpError: invalid pattern in look-behind: /^.*(?<!css)$/i
	from org/jruby/RubyRegexp.java:1097:in `=~'
	from org/jruby/RubyString.java:1613:in `=~'
	from (irb):1:in `<eval>'
	from org/jruby/RubyKernel.java:995:in `eval'
	from org/jruby/RubyKernel.java:1316:in `loop'
	from org/jruby/RubyKernel.java:1138:in `catch'
	from org/jruby/RubyKernel.java:1138:in `catch'
	from /usr/local/rvm/rubies/jruby-9.1.16.0/bin/irb:13:in `<main>'

Another expression that fails is /^.*(?<!tiff)$/i. Both work if we change the last letter (the s, or the t in the second case) to something else. They also work in case-sensitive mode. MRI 2.4.3 works fine with all of these variants.
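For anyone who wants to check the variants above on their own interpreter, here is a minimal sketch of a probe script. The helper names `try_match` and `report` and the `csx` variant are made up for illustration. The comments on each variant restate what this report says about JRuby 9.1.16.0, not output captured from the script.

```ruby
# Hypothetical helper: attempt a match and report the outcome instead of raising.
def try_match(source, options, string)
  # Build the regexp at run time so a compile-time RegexpError is rescued too.
  Regexp.new(source, options) =~ string
rescue RegexpError => e
  "RegexpError: #{e.message}"
end

# Pattern source plus Regexp options for each variant discussed in the report.
VARIANTS = [
  ['^.*(?<!css)$',  Regexp::IGNORECASE], # reported as failing on 9.1.16.0
  ['^.*(?<!tiff)$', Regexp::IGNORECASE], # reported as failing on 9.1.16.0
  ['^.*(?<!csx)$',  Regexp::IGNORECASE], # illustrative "changed letter" variant
  ['^.*(?<!css)$',  0]                   # case-sensitive variant
].freeze

# Returns one [source, options, outcome] triple per variant.
def report(string = "foo.html")
  VARIANTS.map { |source, options| [source, options, try_match(source, options, string)] }
end

report.each do |source, options, outcome|
  puts "#{source} (options=#{options}): #{outcome.inspect}"
end
```

An integer or nil outcome means the match ran; a string starting with "RegexpError" means the interpreter rejected the pattern.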

Is there any other input you need?

Environment

$ jruby -v
jruby 9.1.16.0 (2.3.3) 2018-02-21 8f3f95a Java HotSpot(TM) 64-Bit Server VM 25.161-b12 on 1.8.0_161-b12 +jit [linux-x86_64]

$ env | grep '\(JRUBY\|JAVA\)' | sort
JAVA_HOME=/usr/lib/jvm/java-8-oracle
JRUBY_OPTS=--server -J-Xmn128m -J-Xms1536m -J-Xmx1536m

$ uname -a
Linux 6ee27f8ee9cd 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.2 LTS
Release:	16.04
Codename:	xenial

lopex · 2018-03-14T20:26:34Z

MRI has a special case for US-ASCII regexps matched against strings with a 7-bit code range: it uses the regexp's encoding for the match, whereas we're creating a new regexp with the string's actual encoding. So ultimately our approach could carry a big performance penalty, for two reasons:

1. We're polluting the regexp cache and/or creating new regexps every time in the case of dregexps.
2. We're matching using UTF-8 encoding unnecessarily, which can be tens of times slower, and even more so if the match is case insensitive.
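To make the encoding rule described above concrete, here is a rough sketch in plain Ruby. The method `match_encoding` is hypothetical and only restates the description in this comment; it is not MRI's or JRuby's actual implementation. `Regexp#encoding`, `Regexp#fixed_encoding?` and `String#ascii_only?` are standard Ruby methods.

```ruby
# Hypothetical sketch of the rule described above: which encoding the match
# would run under. This is an assumption for illustration, not interpreter code.
def match_encoding(regexp, string)
  # A fixed-encoding regexp (e.g. one written with /u) keeps its own encoding.
  return regexp.encoding if regexp.fixed_encoding?

  if regexp.encoding == Encoding::US_ASCII && string.ascii_only?
    # Special case: US-ASCII regexp and a 7-bit string, so reuse the regexp as is.
    regexp.encoding
  else
    # Otherwise the regexp has to be prepared again for the string's encoding.
    string.encoding
  end
end

ascii_re = /^.*(?<!css)$/i
puts ascii_re.encoding                           # an ASCII-only literal is US-ASCII
puts match_encoding(ascii_re, "foo.html")        # 7-bit string
puts match_encoding(ascii_re, "fo\u0105.html")   # string with a non-ASCII character
```

The second call is the case where the regexp can be reused unchanged; the third is the case where a regexp in the string's encoding is needed.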

Fixed in 36b44df

naag · 2018-03-14T21:36:20Z

Thanks a lot @lopex, going to test it tomorrow with our code base!

lopex · 2018-03-14T23:37:46Z

I think the root cause also deserves some explanation. Character sequences like ss and ff are special because in Unicode, for example, ss corresponds to https://en.wikipedia.org/wiki/%C3%9F.
In case-insensitive mode, ss is unfolded to the alternatives [S, ſ, ß, ẞ] (which have different lengths in bytes). Another factor that triggered this issue is a look-behind limitation: it doesn't allow a variable-length match, as in (?<!.*) for example. For some reason Onigmo treats (?<!css) as a variable-length (not to be confused with different-length) alternative in case-insensitive mode, even though all alternatives are fixed in size. Why (?<!ss) is OK, then, is a mystery right now.
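As a small illustration of the "different length in bytes" remark, the following sketch prints the UTF-8 byte sizes of the characters mentioned above. The constant `FOLD_CANDIDATES` and the helper `byte_lengths` are made-up names, and the characters are written as `\u` escapes so the script itself stays ASCII-only.

```ruby
# Characters from the comment above, keyed to their Unicode names.
FOLD_CANDIDATES = {
  "ss"     => "two ASCII letters",
  "\u017F" => "LATIN SMALL LETTER LONG S",
  "\u00DF" => "LATIN SMALL LETTER SHARP S",
  "\u1E9E" => "LATIN CAPITAL LETTER SHARP S"
}.freeze

# Hypothetical helper: map each candidate to its size in bytes (UTF-8 here).
def byte_lengths(candidates = FOLD_CANDIDATES)
  candidates.keys.map { |text| [text, text.bytesize] }.to_h
end

byte_lengths.each do |text, size|
  puts "#{FOLD_CANDIDATES[text]}: #{size} byte(s)"
end
```

This only shows that the candidates differ in encoded length; it does not reproduce how Onigmo computes look-behind lengths.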

So right now there are some caveats regarding MRI "us-ascii regexps by default":

"foo" =~ /(?<!css)/i # works
"foą" =~ /(?<!css)/i # blows
"foo" =~ /(?<!css)/iu # blows

If either the string or the regexp happens to end up with a Unicode encoding, the look-behind will blow up.
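The three cases above can be probed without crashing the session with a sketch like the one below. The helpers `outcome` and `caveat_report` are hypothetical names. The regexps are built with interpolation so they are compiled at run time, which lets a RegexpError be rescued whether it is raised at compile time or at match time. Results may differ between interpreters and versions, so no expected output is shown.

```ruby
# Hypothetical helper: run the block and turn a RegexpError into a symbol.
def outcome
  result = yield
  result.nil? ? :no_match : result
rescue RegexpError
  :regexp_error
end

# Probe the three combinations from the comment above for a given pattern source.
def caveat_report(source = '(?<!css)')
  {
    "ascii string, /i"     => outcome { "foo" =~ /#{source}/i },
    "non-ascii string, /i" => outcome { "fo\u0105" =~ /#{source}/i },
    "ascii string, /iu"    => outcome { "foo" =~ /#{source}/iu }
  }
end

caveat_report.each { |label, result| puts "#{label}: #{result.inspect}" }
```

An integer is the match position, `:no_match` means the pattern compiled but did not match, and `:regexp_error` means the look-behind was rejected.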

lopex · 2018-03-15T00:07:08Z

Also, the great majority of regexps will end up being multiple times faster by default, with cases like ("_" * 1000) + "" =~ /[a-z]+/i being 35 times faster.

naag · 2018-03-15T13:42:20Z

Thank you for the detailed explanation! Unfortunately I was unable to build JRuby locally, but I can try again when 9.1.17.0 is released :-)

enebo · 2018-03-15T14:23:53Z

@naag I just corrected an issue with our nightlies link: http://jruby.org/nightly (click stable).

lopex · 2018-03-18T18:05:30Z

Onigmo issue: k-takata/Onigmo#92

* jruby-9.1: [fix] cast nsec nanos to long to avoid "overflow" with double value Handle this deprecation differently. Default to Java 9 bytecode for any java.specification.version>1.8. WrapperMethod is still needed for visibility. Revert "Finally eliminate use of WrapperMethod." Eliminate deprecation warnings in test suite. Finally eliminate use of WrapperMethod. Fix most deprecated calls. Handle error when attempting to connect to IP6 with default INET4. Add test_coverage to jruby.index. Add test for null filename in coverage. jruby#5111 Do not attempt to add coverage for null filename. Fixes jruby#5111. Add basic specs for Exception#backtrace_locations. Exception.backtrace_locations should persist and be mutable. Return nil if no backtrace has been captured. Fixes jruby#5099. fix for jruby#5086, RegexpError invalid pattern in look-behind for certain Regexps since 9.1.16.0

enebo added this to the JRuby 9.1.17.0 milestone Mar 13, 2018

enebo assigned lopex Mar 13, 2018

enebo added the regression label Mar 13, 2018

lopex added a commit that referenced this issue Mar 14, 2018

fix for #5086, RegexpError invalid pattern in look-behind for certain Regexps since 9.1.16.0 (36b44df)

naag closed this as completed Mar 15, 2018

lopex mentioned this issue Sep 12, 2018

Invalid byte sequence in UTF-8 for StringIO objects #5309

Closed
