Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-16LE Regexp not working as expected #2863

Closed
zako42 opened this issue Apr 21, 2015 · 8 comments
Closed

UTF-16LE Regexp not working as expected #2863

zako42 opened this issue Apr 21, 2015 · 8 comments

Comments

@zako42
Copy link

zako42 commented Apr 21, 2015

Not able to create a UTF-16LE encoded Regexp object. I tried the following in Jruby 1.7.19 in IRB:

abc8 = "abc"
abc8.encoding  # <Encoding:UTF-8>
regex8 = Regexp.new(abc8)  # succeeds

abc16 = "abc".encode("UTF-16LE")
abc16.encoding  # <Encoding:UTF-16LE>
regex16 = Regexp.new(abc16)
# in MRI 1.9.3-p545 this succeeds
# in jruby 1.7.19:  Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE
@headius
Copy link
Member

headius commented Apr 29, 2015

Works ok on master, but that's not too surprising (reimplemented encodings since 1.7).

I would suspect the problem is in our Regexp encoding negotiation (prepareRegexp and such); it's not allowing this combination when it should.

@azolotko
Copy link
Contributor

Here's the stack trace:

Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE
    from org/jruby/RubyString.java:1228:in `%'
    from org/jruby/RubyKernel.java:986:in `sprintf'
    from org/jruby/RubyKernel.java:615:in `printf'
    from org/jruby/RubyKernel.java:1511:in `loop'
    from org/jruby/RubyKernel.java:1274:in `catch'
    from org/jruby/RubyKernel.java:1274:in `catch'
    from /Users/azolotko/.rbenv/versions/jruby-1.7.20/bin/jirb:13:in `(root)'

As you can see, the error is raised when IRB tries to printf the result of the evaluation, i.e. => /abc/.

The actual difference between JRuby and MRI seems to be in the result of the Regexp#inspect:
regex16.inspect in JRuby 1.7.20 returns "\u612F\u6200\u6300\u2F00", when MRI does "/a\x00b\x00c\x00/"

@azolotko
Copy link
Contributor

JRuby 1.7.20:

>> regex16.inspect.encoding
=> #<Encoding:UTF-16LE>

MRI 2.2.2p95:

irb(main):004:0> regex16.inspect.encoding
=> #<Encoding:US-ASCII>

MRI forces US-ASCII if the Regexp encoding is not compatible with ASCII: https://github.com/ruby/ruby/blob/ruby_2_2/re.c#L434
https://github.com/ruby/ruby/blob/ruby_2_2/re.c#L411

But JRuby uses the original encoding:
https://github.com/jruby/jruby/blob/jruby-1_7/core/src/main/java/org/jruby/RubyRegexp.java#L1843
https://github.com/jruby/jruby/blob/jruby-1_7/core/src/main/java/org/jruby/RubyRegexp.java#L1314

@enebo enebo added this to the JRuby 1.7.21 milestone May 15, 2015
@enebo
Copy link
Member

enebo commented May 15, 2015

@azolotko you did such a good job find this is there any chance you can submit a pull request too?

@azolotko
Copy link
Contributor

@enebo appendRegexpString19 looks scary. Let's see if I can do this.

@azolotko
Copy link
Contributor

azolotko commented Jun 6, 2015

Sorry guys, I didn't manage to tackle the issue and can't spend more time on this. Maybe someone else will have a better luck.

@enebo enebo modified the milestones: JRuby 1.7.21, JRuby 1.7.22 Jul 7, 2015
@enebo enebo modified the milestones: JRuby 1.7.22, JRuby 1.7.23 Aug 20, 2015
@enebo enebo modified the milestones: JRuby 1.7.23, JRuby 1.7.24 Nov 24, 2015
@enebo enebo modified the milestones: JRuby 1.7.24, JRuby 1.7.25 Jan 20, 2016
@enebo enebo modified the milestones: JRuby 1.7.26, JRuby 1.7.25 Apr 11, 2016
@enebo
Copy link
Member

enebo commented Apr 11, 2016

Ok, I am not going to fix this before 1.7.25 goes out but I learned some stuff so I will document something at least:

abc8 = "abc"
p abc8.encoding  # <Encoding:UTF-8>
regex8 = Regexp.new(abc8)  # succeeds

p regex8.inspect
p regex8.inspect.encoding


abc16 = "abc".encode("UTF-16LE")
p abc16.encoding  # <Encoding:UTF-16LE>
regex16 = Regexp.new(abc16)

p regex16.inspect
p regex16.inspect.encoding

Both 1.7 and 9k are both wrong. 9k does not throw the original reported exception but it is still not properly converting the encoding to the same one as MRI. I did just move over the 9k logic from regexpDescription to 1.7 as a test and the exception goes away but the result is still wrong; so I do not think that is appropriate to unbreak it less.

Results:

(mri 1.9)

#<Encoding:US-ASCII>
"/abc/"
#<Encoding:US-ASCII>
#<Encoding:UTF-16LE>
"/a\x00b\x00c\x00/"
#<Encoding:US-ASCII>

mri2.3

#<Encoding:UTF-8>
"/abc/"
#<Encoding:US-ASCII>
#<Encoding:UTF-16LE>
"/a\x00b\x00c\x00/"
#<Encoding:US-ASCII>

Only difference in MRI is that default encoding changed so we see UTF-8 in first print out.

JRuby 1.7.24 has original reported problem Encoding::ComatibilityError.

9.0.5.0

#<Encoding:UTF-8>
"/abc/"
#<Encoding:UTF-8>
#<Encoding:UTF-16LE>
"\u612F\u6200\u6300\u2F00"
#<Encoding:UTF-16LE>

So difference is as @azolotko said inspect will convert the inspect of a regexp to US-ASCII and then print out mbc's as /x codes.

@enebo enebo modified the milestones: JRuby 1.7.26, JRuby 1.7.27 Aug 26, 2016
@headius
Copy link
Member

headius commented Mar 15, 2017

This works on 1.7 HEAD, so I'll call it fixed in 1.7.26.

@headius headius closed this as completed Mar 15, 2017
@headius headius modified the milestones: JRuby 1.7.26, JRuby 1.7.27 Mar 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants