JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828
Comments
Nice...this shouldn't be hard to fix. Currently we still process most identifiers as Java strings, though @enebo has been experimenting with moving that to a ByteList (byte[] + encoding) or Symbol. I'm guessing we don't track encoding properly here and then produce the error message incorrectly as a result. |
This is unlikely to get fixed for 9.1.14 because there are some wide-ranging changes I'm not sure I'd be comfortable with. Is this affecting you or did you just happen to notice it? |
We have a work-around so it's no biggie. |
@jmiettinen Would you please let us know the workaround so that we can work around the issue until it gets fixed in upcoming versions? |
I am not sure if our work-around helps, as it was related to printing the backtrace and message of an exception. We encountered the problem in a situation where we had
class Exception
  def stacktrace
    "#{message}\n\t#{(backtrace || []).join("\n\t")}"
  end
end
which was to be written to an output stream. But writing did not succeed, as we'd get bytes outside the ASCII range. We just ended up sanitizing the result by replacing all bytes > 127 with ?:
# Usage: sanitize_nonascii_stacktrace(exception.stacktrace)
ASCII_ENCODING = Encoding.find("US-ASCII")
def sanitize_nonascii_stacktrace(stacktrace)
  # Only touch strings that claim to be US-ASCII but contain invalid bytes.
  if stacktrace.encoding == ASCII_ENCODING && !stacktrace.valid_encoding?
    question_mark_byte = '?'.encode(ASCII_ENCODING).getbyte(0)
    # Overwrite every byte above 0x7f in place with '?'.
    (0...stacktrace.bytesize).each do |index|
      if stacktrace.getbyte(index) > 0x7f
        stacktrace.setbyte(index, question_mark_byte)
      end
    end
  end
  stacktrace
end |
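For anyone who would rather not mutate the string in place (the helper above fails on a frozen string), here is a minimal sketch of a non-mutating variant. The name sanitized_ascii_copy is hypothetical and not from this thread, and I have not verified it on the affected JRuby versions; it relies only on String#force_encoding and String#encode with replacement options.

```ruby
# Hypothetical helper (not from the thread): returns a sanitized copy instead
# of mutating its argument, so it also works on frozen strings.
def sanitized_ascii_copy(text)
  # Leave everything alone unless it is a US-ASCII string with invalid bytes.
  return text unless text.encoding == Encoding::US_ASCII && !text.valid_encoding?

  # Reinterpret the bytes as binary, then transcode to US-ASCII, replacing
  # every byte that has no US-ASCII equivalent (anything above 0x7f) with "?".
  text.dup.force_encoding(Encoding::BINARY)
      .encode(Encoding::US_ASCII, invalid: :replace, undef: :replace, replace: "?")
end

# Simulated broken message: bytes f6 and d6 labeled as US-ASCII.
broken = "undefined local variable \xF6\xD6a".dup.force_encoding(Encoding::US_ASCII)
clean = sanitized_ascii_copy(broken)
puts clean
```

The original string is left untouched, so the caller can still log the raw bytes elsewhere if needed.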
@enebo actually chatted with me about the encoding of exception messages and backtraces, relating to his work on the bytelist_love branch. |
@jmiettinen I have been working towards merging a large branch into the upcoming 9.2.x which will largely solve these problems. The main issue we have today is that at some point all data for method and variable names ends up as a Java String, and we run into lots of scenarios where we try to turn it back into a Ruby String or symbol and have lost the ability to regain its encoding. The new code works around this by leveraging our symbol tables so we can use the strings we are passing around to regain the original symbol we used (thus getting the encoding back). I actually suspect once this lands we will spend many point releases correcting missing pieces of logic for this, but we have a MASSIVE codebase. Your particular issue I will record against the feature. |
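To make the symbol-table idea concrete, here is a toy sketch in plain Ruby. It is only an illustration of the concept: JRuby's real symbol table is Java code, the class ToySymbolTable and its methods are hypothetical, and the "lossy" step is simulated by relabeling the bytes as US-ASCII rather than by a real round trip through a Java String.

```ruby
# Hypothetical illustration only; not JRuby's actual implementation.
class ToySymbolTable
  def initialize
    @names_by_bytes = {}
  end

  # Remember the identifier under an encoding-free key (its raw bytes).
  def intern(name)
    @names_by_bytes[name.b] ||= name.dup.freeze
  end

  # Given only raw bytes (the encoding was lost along the way), return the
  # originally interned name, or nil if it was never interned.
  def recover(raw)
    @names_by_bytes[raw.b]
  end
end

table = ToySymbolTable.new
table.intern("\u00F6\u00D6a") # "öÖa", tagged as UTF-8

# Simulate the lossy step: same bytes, but mislabeled as US-ASCII.
lossy = "\u00F6\u00D6a".b.force_encoding(Encoding::US_ASCII)
recovered = table.recover(lossy)

puts lossy.valid_encoding?     # false
puts recovered.encoding        # UTF-8
puts recovered.valid_encoding? # true
```

The point is just that the table keeps the original name with its encoding, so a lookup by the passed-around string can hand the encoding back.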
Oh I should add we have no current plans to entertain this for 9.1.x. It is a lot of work and a few small (which no one should experience) breaking changes. 9.1.x will still see innovations but just not in this area. It is too icky to guarantee on a stable line. |
Thanks to @enebo, this is now fixed (on master, JRuby >= 9.2) and outputs exactly as MRI does.
|
When JRuby creates symbols for undefined local variables, the symbols' ByteLists have US-ASCII encoding, but the bytes in them may not actually be within the US-ASCII range.
Environment
Reproduces at least on JRuby 1.7.27 and JRuby 9.1.12.0. Master currently seems to behave the same.
This is JRuby internal, so the platform is mostly irrelevant. However:
uname -a says Darwin jmiettinen.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
locale says LC_ALL="fi_FI.UTF-8"
Expected Behavior
Given this small script (named utf8_fail.rb in my example outputs):
I would expect to get the following output (this is from 1.9.3-p448 and 2.3.1):
Actual Behavior
However, when the same file is run with JRuby 1.7.27 / JRuby 9.1.12.0, we get problems with bytes in the created symbol öÖa:
Here the error message differs, and RubyRegexp notices that there are some non-ASCII bytes in the string with US-ASCII encoding and throws ArgumentError.
If we run this through hexdump (ruby utf8_fail.rb 2>&1| hexdump -C), we get:
Here it can be seen that the codepoints for ö and Ö (f6 and d6) are copied directly to the ByteList used in that message.
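As a rough illustration of the byte-level observation above, the following plain-Ruby sketch compares the codepoints of öÖa with its UTF-8 bytes and builds a string whose bytes are the codepoints themselves, labeled US-ASCII. This only mimics the symptom under the assumption that the codepoints were copied byte-for-byte; it does not exercise the JRuby code path in question.

```ruby
name = "\u00F6\u00D6a" # "öÖa"

# Codepoints versus the bytes a correct UTF-8 encoding would contain.
codepoints = name.codepoints.map { |c| format("%02x", c) }
utf8_bytes = name.bytes.map { |b| format("%02x", b) }

# Copy each codepoint into a single byte and label the result US-ASCII,
# mimicking what the hexdump suggests happened.
broken = name.codepoints.pack("C*").force_encoding(Encoding::US_ASCII)
broken_bytes = broken.bytes.map { |b| format("%02x", b) }

puts codepoints.join(" ")   # f6 d6 61
puts utf8_bytes.join(" ")   # c3 b6 c3 96 61
puts broken_bytes.join(" ") # f6 d6 61
puts broken.valid_encoding? # false
```

Bytes f6 and d6 are above 0x7f, so the US-ASCII label cannot be valid for this string.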