Marshal::dump adds string encoding #4047
Comments
MRI is still embedding the encoding, but doing it using the shortcut symbols "ET". The E means "Encoding" and the T means "true", which marks a UTF-8 string ("EF", false, marks US-ASCII). The "encoding" symbol in JRuby and the "E" symbol in MRI get assigned a link. So the problem isn't that we're including the encoding, it's that we're not treating 7-bit Windows-1252 strings as US-ASCII during marshaling.
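To make the two forms concrete, here is a small sketch that dumps the same one-character string in three encodings. It reflects MRI's marshal output as far as I can tell; the variable names are only for illustration. On MRI the UTF-8 dump should end in the short `ET`, the US-ASCII dump in `EF`, and the Windows-1252 dump should spell out an `encoding` symbol followed by the encoding name.

```ruby
# The same one-character string in three encodings.
# String#encode returns a new string, so the literal is left untouched.
utf8    = "a".encode("UTF-8")
ascii   = "a".encode("US-ASCII")
win1252 = "a".encode("Windows-1252")

# Marshal.dump returns a binary string; inspect makes the bytes readable.
dumps = {
  utf8:    Marshal.dump(utf8),
  ascii:   Marshal.dump(ascii),
  win1252: Marshal.dump(win1252)
}

dumps.each do |label, bytes|
  puts "#{label.to_s.ljust(8)} #{bytes.bytesize} bytes  #{bytes.inspect}"
end
```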
The other possibility here is that the parser on Windows is giving 7-bit strings a non-"US-ASCII" encoding. I believe MRI normalized literal 7-bit strings to always say they're US-ASCII at some point, and we may not have that change everywhere.
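A rough sketch of what "treat 7-bit strings as US-ASCII" could look like from user code. The helper name `compact_for_marshal` is hypothetical and this is not how JRuby or MRI implement it internally; it only re-tags a copy of the string when every byte is 7-bit, which is safe because the bytes themselves do not change.

```ruby
# Hypothetical helper: re-tag a copy of a 7-bit string as US-ASCII so that
# Marshal can use its short encoding marker. Other strings pass through as-is.
def compact_for_marshal(str)
  return str unless str.ascii_only?
  str.dup.force_encoding(Encoding::US_ASCII)
end

original = "abc".encode("Windows-1252")
compact  = compact_for_marshal(original)

puts original.encoding            # still Windows-1252, the input is not modified
puts compact.encoding             # US-ASCII
puts Marshal.dump(original).bytesize
puts Marshal.dump(compact).bytesize
```

Note that the round-tripped string comes back tagged US-ASCII rather than Windows-1252, so this only fits cases where the original encoding label does not matter.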
oooh oooh I know this one :) You are using irb and we must not be transcoding the incoming string to the proper internal/external encoding. If I take your example and run it from a file in both 1.7 and 9k I do not see any Windows code pages.
Indeed, you are right! I had never expected that irb/jirb would cause this difference in behaviour. I think this difference should be documented in the Readme for JRuby, because it is confusing. BTW, I don't quite understand why it has to be different for irb. Take for example the expression
At what point does it become necessary to format the resulting string differently from the usual (non-irb) behaviour? It can't even be the encoding of the terminal window running irb, because the same problem happens in a console window running the Windows character encoding, and in a mintty window configured to run UTF-8.
@rovf this is definitely a bug, but it involves how we read stdin and convert it to a string internally. We must not be doing that correctly on Windows. I believe on non-Windows we transcode all input to the proper internal (which will usually be external) encoding, but on Windows we still fall back to non-native IO, so we are just not doing it in that code path. tl;dr a bug we will fix in reading IO on Windows from a console/tty
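For illustration only, the difference between tagging input with an external encoding and transcoding it to an internal one can be reproduced with a pipe standing in for the console. The helper `read_line_with` is hypothetical, and this says nothing about JRuby's actual console code path; it just uses `IO#set_encoding` to show the two outcomes.

```ruby
# Hypothetical helper: write one line into a pipe and read it back with the
# given external encoding and, optionally, an internal encoding to transcode to.
def read_line_with(external, internal = nil)
  reader, writer = IO.pipe
  writer.write("a\n")
  writer.close
  if internal
    reader.set_encoding(external, internal)  # transcode external -> internal on read
  else
    reader.set_encoding(external)            # only tag the bytes as external
  end
  reader.gets.chomp
ensure
  reader.close if reader && !reader.closed?
end

tagged     = read_line_with("Windows-1252")
transcoded = read_line_with("Windows-1252", "UTF-8")

puts "#{tagged.encoding}: #{Marshal.dump(tagged).inspect}"
puts "#{transcoded.encoding}: #{Marshal.dump(transcoded).inspect}"
```

If the console path only tags the input, every string typed into irb would carry the code page's name into the dump, which would match what the report describes.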
Environment
and
When I do a Marshal.dump in JRuby 9.0.4.0 on Windows:
or in JRuby 1.7.24:
I see that the string encoding is also part of the marshalled string (and, interestingly, the encoding name is different in the two JRuby versions). If I do the same in MRI, the encoding information is far less verbose, for instance with MRI Ruby 2.2.4:
When marshalling data structures which contain lots of very short strings, the repetition of encoding names takes up a lot of space in the resulting (marshalled) string.
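A quick way to gauge the overhead is to dump an array of many short strings once per encoding and compare sizes. This is a minimal sketch; the helper `dump_size` and the count of 1,000 are arbitrary choices for illustration, and the exact numbers will depend on the Ruby implementation and version.

```ruby
COUNT = 1_000

# Hypothetical helper: build COUNT distinct one-character strings in the given
# encoding and return the size in bytes of the marshalled array.
def dump_size(encoding_name, count = COUNT)
  strings = Array.new(count) { "a".encode(encoding_name) }
  Marshal.dump(strings).bytesize
end

utf8_size = dump_size("UTF-8")
win_size  = dump_size("Windows-1252")

puts "UTF-8:        #{utf8_size} bytes"
puts "Windows-1252: #{win_size} bytes"
puts "overhead:     #{win_size - utf8_size} bytes for #{COUNT} strings"
```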
JRuby should behave in this respect like MRI Ruby, to save space in the dumped data.