JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828

jmiettinen · 2017-10-25T14:08:40Z

When JRuby creates symbols for undefined local variable, the symbols' ByteList have US-ASCII encoding but bytes in it may not actually be within US-ASCII range.

Environment

Reproduces at least on JRuby 1.7.27 && JRuby 9.1.12.0. Master seems to currently be the same.

This is JRuby internal, so platform is mostly irrevelant. However:

uname -a says Darwin jmiettinen.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
locale says LC_ALL="fi_FI.UTF-8"

Expected Behavior

Given this small script (named utf8_fail.rb in my example outputs):

# encoding: utf-8                                                                                                                                                                                                                                                             
begin
  öÖa
rescue => e
  puts e.message.encoding
  puts e.message
  puts /foo/ === e.message
end

I would expect to get the following output (this is from 1.9.3-p448 and 2.3.1):

UTF-8
undefined local variable or method `öÖa' for main:Object
false

Actual Behavior

However, when the same file is run with JRuby 1.7.27 / JRuby 9.1.12.0, we get problems with bytes in the created symbol öÖa:

US-ASCII
undefined local variable or method `??a' for main:Object
ArgumentError: invalid byte sequence in US-ASCII
     === at org/jruby/RubyRegexp.java:1078
  <main> at utf8_fail.rb:8

Here the error message differs and RubyRegexp notices that there are some non-ASCII bytes in the string with US-ASCII encoding and throws ArgumentError.

If we run this through hexdump (ruby utf8_fail.rb 2>&1| hexdump -C), we get

00000000  55 53 2d 41 53 43 49 49  0a 75 6e 64 65 66 69 6e  |US-ASCII.undefin|
00000010  65 64 20 6c 6f 63 61 6c  20 76 61 72 69 61 62 6c  |ed local variabl|
00000020  65 20 6f 72 20 6d 65 74  68 6f 64 20 60 f6 d6 61  |e or method `..a|
00000030  27 20 66 6f 72 20 6d 61  69 6e 3a 4f 62 6a 65 63  |' for main:Objec|
00000040  74 0a 41 72 67 75 6d 65  6e 74 45 72 72 6f 72 3a  |t.ArgumentError:|
00000050  20 69 6e 76 61 6c 69 64  20 62 79 74 65 20 73 65  | invalid byte se|
00000060  71 75 65 6e 63 65 20 69  6e 20 55 53 2d 41 53 43  |quence in US-ASC|
00000070  49 49 0a 20 20 20 20 20  3d 3d 3d 20 61 74 20 6f  |II.     === at o|
00000080  72 67 2f 6a 72 75 62 79  2f 52 75 62 79 52 65 67  |rg/jruby/RubyReg|
00000090  65 78 70 2e 6a 61 76 61  3a 31 30 37 38 0a 20 20  |exp.java:1078.  |
000000a0  3c 6d 61 69 6e 3e 20 61  74 20 75 74 66 38 5f 66  |<main> at utf8_f|
000000b0  61 69 6c 2e 72 62 3a 38  0a                       |ail.rb:8.|
000000b9

Here it can be seen that codepoints for ö and Ö (f6 and d6) are copied just directly to the ByteList used in that message.

The text was updated successfully, but these errors were encountered:

headius · 2017-10-26T18:01:29Z

Nice...this shouldn't be hard to fix. Currently we still process most identifiers as Java strings, though @enebo has been experimenting with moving that to a ByteList (byte[] + encoding) or Symbol. I'm guessing we don't track encoding properly here and then produce the error wrong as a result.

headius · 2017-10-26T19:33:16Z

This is unlikely to get fixed for 9.1.14 because there's some wide-ranging changes I'm not sure I'd be comfortable with. Is this affecting you or did you just happen to notice it?

jmiettinen · 2017-10-26T19:35:20Z

We have a work-around so it's no biggie.

cmthakur · 2018-01-25T05:42:34Z

@jmiettinen Would you please let us know the workaround so that we can fix the issue until it gets fixed in upcoming versions?

jmiettinen · 2018-01-25T07:28:18Z

I am not sure if our work-around helps as it was related to printing backtrace and message of an Exception. But it might give you some ideas @cmthakur.

We encountered the problem in situation where we had

class Exception
  def stacktrace
    "#{message}\n\t#{(backtrace || []).join("\n\t")}"
  end
end

which was to be written to an output stream. But writing did not succeed as we'd get bytes outside ASCII range. We just ended up sanitizing the result by replacing all bytes > 127 with ? . One could try to interpret those bytes as UTF-8 bytes too which they most probably are. So here's the simple approach:

# Usage: sanitize_nonascii_stacktrace(exception.stacktrace)
ASCII_ENCODING = Encoding.find("US-ASCII")
def sanitize_nonascii_stacktrace(stacktrace)
  if stacktrace.encoding == ASCII_ENCODING && !stacktrace.valid_encoding?
    question_mark_byte = '?'.encode(ASCII_ENCODING).getbyte(0)
    (0...stacktrace.bytesize).each do |index|
      if stacktrace.getbyte(index) > 0x7f
        stacktrace.setbyte(index, question_mark_byte)
      end
    end
  end
  stacktrace
end

headius · 2018-01-25T17:58:51Z

@enebo actually chatted with me about the encoding of exception messages and backtraces, relating to his work on the bytelist_love branch.

enebo · 2018-01-25T19:52:13Z

@jmiettinen I have been working towards merging a large branch now to upcoming 9.2.x which will largely solve these problems. The main issue we have today is that at some point all data for method and variable names end up as a Java String and we run into lots of scenarios where we try and make it back into a Ruby String or symbol and we have lost the ability to regain its encoding. The new code works around this by leveraging our symbol tables so we can use the strings we are passing around to regain the original symbol we used (thus getting the encoding back).

I actually suspect once this lands we will spend many point releases correcting missing piece of logic for this but we have a MASSIVE codebase. Your particular issue I will record against the feature.

enebo · 2018-01-25T19:57:53Z

Oh I should add we have no current plans to entertain this for 9.1.x. It is a lot of work and a few small (which no one should experience) breaking changes. 9.1.x will still see innovations but just not in this area. It is too icky to guarantee on a stable line.

kares · 2018-05-24T09:44:30Z

thanks for @enebo this is now fixed (on master -- JRuby >= 9.2) and outputs exactly as MRI :

UTF-8
undefined local variable or method `öÖa' for main:Object
false

headius added this to the JRuby 9.2.0.0 milestone Oct 26, 2017

enebo mentioned this issue Feb 27, 2018

Properly represent all encodings in methods/constants/variables/symbols #4965

Closed

kares closed this as completed May 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828

JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828

jmiettinen commented Oct 25, 2017 •

edited

Loading

headius commented Oct 26, 2017

headius commented Oct 26, 2017

jmiettinen commented Oct 26, 2017

cmthakur commented Jan 25, 2018

jmiettinen commented Jan 25, 2018

headius commented Jan 25, 2018

enebo commented Jan 25, 2018

enebo commented Jan 25, 2018

kares commented May 24, 2018

JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828

JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828

Comments

jmiettinen commented Oct 25, 2017 • edited Loading

Environment

Expected Behavior

Actual Behavior

headius commented Oct 26, 2017

headius commented Oct 26, 2017

jmiettinen commented Oct 26, 2017

cmthakur commented Jan 25, 2018

jmiettinen commented Jan 25, 2018

headius commented Jan 25, 2018

enebo commented Jan 25, 2018

enebo commented Jan 25, 2018

kares commented May 24, 2018

jmiettinen commented Oct 25, 2017 •

edited

Loading