UTF-16LE Regexp not working as expected #2863

zako42 · 2015-04-21T18:46:29Z

Not able to create a UTF-16LE encoded Regexp object. I tried the following in Jruby 1.7.19 in IRB:

abc8 = "abc"
abc8.encoding  # <Encoding:UTF-8>
regex8 = Regexp.new(abc8)  # succeeds

abc16 = "abc".encode("UTF-16LE")
abc16.encoding  # <Encoding:UTF-16LE>
regex16 = Regexp.new(abc16)
# in MRI 1.9.3-p545 this succeeds
# in jruby 1.7.19:  Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE

The text was updated successfully, but these errors were encountered:

headius · 2015-04-29T04:27:05Z

Works ok on master, but that's not too surprising (reimplemented encodings since 1.7).

I would suspect the problem is in our Regexp encoding negotiation (prepareRegexp and such); it's not allowing this combination when it should.

azolotko · 2015-05-14T21:13:37Z

Here's the stack trace:

Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE
    from org/jruby/RubyString.java:1228:in `%'
    from org/jruby/RubyKernel.java:986:in `sprintf'
    from org/jruby/RubyKernel.java:615:in `printf'
    from org/jruby/RubyKernel.java:1511:in `loop'
    from org/jruby/RubyKernel.java:1274:in `catch'
    from org/jruby/RubyKernel.java:1274:in `catch'
    from /Users/azolotko/.rbenv/versions/jruby-1.7.20/bin/jirb:13:in `(root)'

As you can see, the error is raised when IRB tries to printf the result of the evaluation, i.e. => /abc/.

The actual difference between JRuby and MRI seems to be in the result of the Regexp#inspect:
regex16.inspect in JRuby 1.7.20 returns "\u612F\u6200\u6300\u2F00", when MRI does "/a\x00b\x00c\x00/"

azolotko · 2015-05-14T22:35:02Z

JRuby 1.7.20:

>> regex16.inspect.encoding
=> #<Encoding:UTF-16LE>

MRI 2.2.2p95:

irb(main):004:0> regex16.inspect.encoding
=> #<Encoding:US-ASCII>

MRI forces US-ASCII if the Regexp encoding is not compatible with ASCII: https://github.com/ruby/ruby/blob/ruby_2_2/re.c#L434
https://github.com/ruby/ruby/blob/ruby_2_2/re.c#L411

But JRuby uses the original encoding:
https://github.com/jruby/jruby/blob/jruby-1_7/core/src/main/java/org/jruby/RubyRegexp.java#L1843
https://github.com/jruby/jruby/blob/jruby-1_7/core/src/main/java/org/jruby/RubyRegexp.java#L1314

enebo · 2015-05-15T13:41:05Z

@azolotko you did such a good job find this is there any chance you can submit a pull request too?

azolotko · 2015-05-15T20:00:05Z

@enebo appendRegexpString19 looks scary. Let's see if I can do this.

azolotko · 2015-06-06T08:35:24Z

Sorry guys, I didn't manage to tackle the issue and can't spend more time on this. Maybe someone else will have a better luck.

enebo · 2016-04-11T17:59:13Z

Ok, I am not going to fix this before 1.7.25 goes out but I learned some stuff so I will document something at least:

abc8 = "abc"
p abc8.encoding  # <Encoding:UTF-8>
regex8 = Regexp.new(abc8)  # succeeds

p regex8.inspect
p regex8.inspect.encoding


abc16 = "abc".encode("UTF-16LE")
p abc16.encoding  # <Encoding:UTF-16LE>
regex16 = Regexp.new(abc16)

p regex16.inspect
p regex16.inspect.encoding

Both 1.7 and 9k are both wrong. 9k does not throw the original reported exception but it is still not properly converting the encoding to the same one as MRI. I did just move over the 9k logic from regexpDescription to 1.7 as a test and the exception goes away but the result is still wrong; so I do not think that is appropriate to unbreak it less.

Results:

(mri 1.9)

#<Encoding:US-ASCII>
"/abc/"
#<Encoding:US-ASCII>
#<Encoding:UTF-16LE>
"/a\x00b\x00c\x00/"
#<Encoding:US-ASCII>

mri2.3

#<Encoding:UTF-8>
"/abc/"
#<Encoding:US-ASCII>
#<Encoding:UTF-16LE>
"/a\x00b\x00c\x00/"
#<Encoding:US-ASCII>

Only difference in MRI is that default encoding changed so we see UTF-8 in first print out.

JRuby 1.7.24 has original reported problem Encoding::ComatibilityError.

9.0.5.0

#<Encoding:UTF-8>
"/abc/"
#<Encoding:UTF-8>
#<Encoding:UTF-16LE>
"\u612F\u6200\u6300\u2F00"
#<Encoding:UTF-16LE>

So difference is as @azolotko said inspect will convert the inspect of a regexp to US-ASCII and then print out mbc's as /x codes.

headius · 2017-03-15T22:53:45Z

This works on 1.7 HEAD, so I'll call it fixed in 1.7.26.

headius added JRuby 1.7.x core encoding labels Apr 29, 2015

enebo added this to the JRuby 1.7.21 milestone May 15, 2015

enebo modified the milestones: JRuby 1.7.21, JRuby 1.7.22 Jul 7, 2015

enebo modified the milestones: JRuby 1.7.22, JRuby 1.7.23 Aug 20, 2015

enebo modified the milestones: JRuby 1.7.23, JRuby 1.7.24 Nov 24, 2015

enebo modified the milestones: JRuby 1.7.24, JRuby 1.7.25 Jan 20, 2016

enebo added ruby 2.3 JRuby 9000 labels Apr 11, 2016

enebo modified the milestones: JRuby 1.7.26, JRuby 1.7.25 Apr 11, 2016

enebo modified the milestones: JRuby 1.7.26, JRuby 1.7.27 Aug 26, 2016

headius closed this as completed Mar 15, 2017

headius modified the milestones: JRuby 1.7.26, JRuby 1.7.27 Mar 15, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-16LE Regexp not working as expected #2863

UTF-16LE Regexp not working as expected #2863

zako42 commented Apr 21, 2015

headius commented Apr 29, 2015

azolotko commented May 14, 2015

azolotko commented May 14, 2015

enebo commented May 15, 2015

azolotko commented May 15, 2015

azolotko commented Jun 6, 2015

enebo commented Apr 11, 2016

headius commented Mar 15, 2017

UTF-16LE Regexp not working as expected #2863

UTF-16LE Regexp not working as expected #2863

Comments

zako42 commented Apr 21, 2015

headius commented Apr 29, 2015

azolotko commented May 14, 2015

azolotko commented May 14, 2015

enebo commented May 15, 2015

azolotko commented May 15, 2015

azolotko commented Jun 6, 2015

enebo commented Apr 11, 2016

headius commented Mar 15, 2017