-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workaround for JRuby bug with UTF-8 encoding on non-English Windows #3767
Workaround for JRuby bug with UTF-8 encoding on non-English Windows #3767
Conversation
We discovered this with a customer yesterday. DNS resolving (using `Resolv::DNS#getaddresses`) failed. It turned out that the underlying problem was that the way `win32/registry.rb` retrieves error messages from Windows is flawed, under certain circumstances. The bug only appears if: - `JAVA_OPTS` includes `-Dfile.encoding=UTF-8` - The environment you run it on is a _non-English_ Windows installation, or more specifically: a Windows installation where the error message(s) include some non-ASCII character. The specific use case here is a Swedish Windows installation, where the error message for error number 2 (File not found) reads as "Filen du söker finns inte", or something similar. (The debugging here was complicated by the fact that I don't have a Swedish Windows myself, only the customer). Here is a test script that reproduces the problem given that you have a Swedish Windows: ```ruby require 'win32/resolv' puts Encoding.locale_charmap puts Win32::Resolv.send(:get_info).inspect ``` Since I don't have a Swedish Windows, I managed to reproduce the same error locally by adding this code into the `Error` constructor: ```ruby msg = "Malm\xf6, G\xf6teborg och V\xe4xj\xf6" ``` That gives me the same error: `ArgumentError: invalid byte sequence in UTF-8`. The original bug occurs in the `msg.tr` call, since the string it operates on is _marked_ as UTF-8 (because of the `force_encoding` call), where it is in fact Windows-1252 encoded (since we call `FormatMessageA`, which returns Windows-1252 encoded strings...). IMHO, this is clearly a bug. --- Now, to the solution. The "correct" approach would be to not use the ANSI version of the system call _at all_, but instead use the "wide" version. This is the approach MRI has taken, the bug was fixed there almost 3 years ago. :) ruby/ruby@9db6beb is the specific commit, fixing the bug reported in this issue: https://bugs.ruby-lang.org/issues/8508 I tried that first, by copying over their current version of the `Error` class. The only problem is that it depends on `win32/importer`, which we don't have, which in turn depends on `fiddle/import` which we don't seem to have either. You get the picture. Therefore, I tried with a simpler workaround instead: _if_ we get an exception in this method, let's just retry by fetching the `en-US` error message instead. (inspired by how the code on the MRI side now looks: https://github.com/ruby/ruby/blob/trunk/ext/win32/lib/win32/registry.rb) I have tested this with the customer; it works correctly as far as the code no longer crashes. _But_, it fails to retrieve the English error message (possibly because it's not available on that Windows installation). I am willing to live with that for now; if someone wants to fix that part as well, be my guest. The critical part for me and the customer is that the code must not crash on non-ASCII Windows installations.
@perlun that particular commit (e.g. diff) looks like it would just work wouldn't it? Or are there some companion commits which will break if we just replace the *A with *W here? |
It wasn't applicable straight away, since it relied on stuff we don't use (like Anyway, I did a bit of a "compromise" now, bringing in the essence of their change but making it work in our codebase. Have tested it locally and this version now works fine on my machine at least; will verify with customer as well. |
Verified with customer - works even better than my previous approach, the non-ASCII error message is properly retrieved now with this change. |
@perlun great! I think we have not updated this impl because of the deps it uses in future revisions. Without supporting those we might need to get a little more inventive like this and probably change any other A signatures to W signatures. This gets you past your current hurdle though so let's roll with it. Thanks for the PR and followup work. |
Thanks Thomas! You probably get this question all the time, but when do you anticipate 1.7.25 to be officially released? (it's no rush for me personally, just curious - the customer already received a working |
@perlun yeah we anticipate it within the next week or so. We actually have slipped a few times but we are spending some extra time to make sure we get appveyor CI runs more green on windows. In theory, 1.7.25 and 9.1 will be the most tested releases against windows in a long time. We will not end up fixing all windows issues but hopefully will be on a much better support trajectory. |
@enebo - long weeks over there? 😜 |
(nvm - I just realized you had released it yesterday. Bummer!) |
@perlun sorry. Did I miss something? I don't get the bummer comment? This was merged. And yeah things did not go as planned on release timewise :| |
I meant it like: a shame I wasn't aware the release was already made. Complaining over my own failures. 😃 You're doing a great job, thanks for all of it. |
We discovered this with a customer yesterday. DNS resolving (using
Resolv::DNS#getaddresses
) failed. It turned out that the underlying problem was that the waywin32/registry.rb
retrieves error messages from Windows is flawed, under certain circumstances.The bug only appears if:
JAVA_OPTS
includes-Dfile.encoding=UTF-8
Here is a test script that reproduces the problem given that you have a Swedish Windows:
Since I don't have a Swedish Windows, I managed to reproduce the same error locally by adding this code into the
Error
constructor:That gives me the same error:
ArgumentError: invalid byte sequence in UTF-8
. The original bug occurs in themsg.tr
call, since the string it operates on is marked as UTF-8 (because of theforce_encoding
call), where it is in fact Windows-1252 encoded (since we callFormatMessageA
, which returns Windows-1252 encoded strings...). IMHO, this is clearly a bug.Now, to the solution. The "correct" approach would be to not use the ANSI version of the system call at all, but instead use the "wide" version. This is the approach MRI has taken, the bug was fixed there almost 3 years ago. :) ruby/ruby@9db6beb is the specific commit, fixing the bug reported in this issue: https://bugs.ruby-lang.org/issues/8508
I tried that first, by copying over their current version of the
Error
class. The only problem is that it depends onwin32/importer
, which we don't have, which in turn depends onfiddle/import
which we don't seem to have either. You get the picture.Therefore, I tried with a simpler workaround instead: if we get an exception in this method, let's just retry by fetching the
en-US
error message instead. (inspired by how the code on the MRI side now looks: https://github.com/ruby/ruby/blob/trunk/ext/win32/lib/win32/registry.rb)I have tested this with the customer; it works correctly as far as the code no longer crashes. But, it fails to retrieve the English error message (possibly because it's not available on that Windows installation). I am willing to live with that for now; if someone wants to fix that part as well, be my guest. The critical part for me and the customer is that the code must not crash on non-ASCII Windows installations.