Marshal::dump adds string encoding #4047

Open
rovf opened this issue Aug 1, 2016 · 5 comments

rovf commented Aug 1, 2016

Environment

jruby 1.7.24 (1.9.3p551) 2016-01-20 bd68d85 on Java HotSpot(TM) 64-Bit Server VM 1.7.0_79-b15 +jit [Windows 7-amd64]

and

jruby 9.0.4.0 (2.2.2) 2015-11-12 b9fb7aa Java HotSpot(TM) 64-Bit Server VM 24.79-b02 on 1.7.0_79-b15 +jit [Windows 7-amd64]

When I do a Marshal.dump in JRuby 9.0.4.0 on Windows:

irb(main):001:0> Marshal.dump(['abc','xyz','UUUUUU'])
=> "\x04\b[\bI\"\babc\x06:\rencoding\"\nCP850I\"\bxyz\x06;\x00\"\nCP850I\"\vUUUUUU\x06;\x00\"\nCP850"
irb(main):002:0>

or in JRuby 1.7.24:

irb(main):005:0> Marshal.dump(['abc','xyz','UUUUUU'])
=> "\x04\b[\bI\"\babc\x06:\rencoding\"\x11Windows-1252I\"\bxyz\x06;\x00\"\x11Windows-1252I\"\vUUUUUU\x06;\x00\"\x11Windows-1252"

I see that the string encoding is also part of the marshalled string (and, interestingly, the encoding name differs between the two JRuby versions). If I do the same in MRI, the encoding information is far less verbose, for instance with MRI Ruby 2.2.4:

irb(main):042:0>  Marshal.dump(['abc','xyz','UUUUUU'])
=> "\x04\b[\bI\"\babc\x06:\x06ETI\"\bxyz\x06;\x00TI\"\vUUUUUU\x06;\x00T"

When marshalling data structures which contain lots of very short strings, the repetition of encoding names takes up a lot of space in the resulting (marshalled) string.

JRuby should behave in this respect like MRI Ruby, to save space in the dumped data.
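
To put a rough number on the overhead, here is a minimal sketch that dumps the same three strings under two encoding labels and compares the sizes. `force_encoding` only relabels the bytes and stands in for what irb on Windows produced; `WORDS` and `dump_with_encoding` are names made up for this illustration, and exact byte counts may differ between Ruby implementations.

```ruby
# Sketch: compare Marshal.dump sizes for the same short strings
# when they carry different encoding labels.
WORDS = %w[abc xyz UUUUUU].freeze

# Relabel each word with the given encoding (bytes unchanged), then dump.
def dump_with_encoding(words, encoding_name)
  Marshal.dump(words.map { |w| w.dup.force_encoding(encoding_name) })
end

utf8_dump   = dump_with_encoding(WORDS, 'UTF-8')
cp1252_dump = dump_with_encoding(WORDS, 'Windows-1252')

puts "UTF-8:        #{utf8_dump.bytesize} bytes"
puts "Windows-1252: #{cp1252_dump.bytesize} bytes"
```

The Windows-1252 dump is larger because the encoding name is written out in full, while UTF-8 strings get the short form.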

Member

headius commented Aug 24, 2016

MRI is still embedding the encoding, but it does so with the shortcut symbol "E" followed by "T". The E stands for "encoding" and the T is `true`, which marks the string as UTF-8 (an "F", `false`, would mark it as US-ASCII).

The "encoding" symbol in JRuby and the "E" symbol in MRI are each assigned symbol link \x00 and are reused twice more in both outputs (the ";\x00" sequences), each time followed by the encoding value (which for MRI is just "T" again).

So the problem isn't that we're including the encoding, it's that we're not treating 7-bit Windows-1252 strings as US-ASCII during marshaling.
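
As a reference point, here is a small sketch of how a string's encoding shows up in the dump. To my knowledge, on MRI the value after E is a boolean: T (`true`) for UTF-8 and F (`false`) for US-ASCII. Any other encoding gets the long `encoding` symbol plus its name, and binary strings carry no encoding entry at all. `dump_as` is a helper invented for this example.

```ruby
# Sketch: how the encoding label of a String appears in Marshal output.
# force_encoding only relabels the bytes; the content stays "abc".
def dump_as(text, encoding_name)
  Marshal.dump(text.dup.force_encoding(encoding_name))
end

%w[UTF-8 US-ASCII ASCII-8BIT Windows-1252].each do |name|
  puts "#{name.ljust(12)} #{dump_as('abc', name).inspect}"
end
```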

Member

headius commented Aug 24, 2016

The other possibility here is that the parser on Windows is giving 7-bit strings a non-"US-ASCII" encoding. I believe MRI normalized literal 7-bit strings to always say they're US-ASCII at some point, and we may not have that change everywhere.

Member

enebo commented Aug 24, 2016

oooh oooh I know this one :)

You are using irb, and we must not be transcoding the incoming string to the proper internal/external encoding. If I take your example and run it from a file in both 1.7 and 9k, I do not see any Windows code pages.
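
One way to check this on a given machine, as a sketch: save the following to a file and run it, then paste the same lines into irb/jirb and compare the reported encodings. `encoding_report` is a helper name made up for this example.

```ruby
# Sketch: report the encodings that decide how a string literal is tagged.
def encoding_report
  {
    source_encoding:  __ENCODING__,              # encoding of the code being evaluated
    literal_encoding: 'abc'.encoding,            # label a plain literal receives
    default_external: Encoding.default_external, # default for data read from outside
    default_internal: Encoding.default_internal  # nil unless explicitly configured
  }
end

encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
puts Marshal.dump(%w[abc xyz UUUUUU]).inspect
```

If `literal_encoding` differs between the file run and the interactive session, the dump will differ as well.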

Author

rovf commented Aug 25, 2016

Indeed, you are right! I never expected that irb/jirb would cause this difference in behaviour. I think it should be documented in the Readme for JRuby, because it is confusing.

BTW, I don't quite understand the reason why it must be different for irb. Take for example the expression

Marshal.dump(['abc','xyz','UUUUUU']).inspect

At what point does it become necessary to treat the resulting string differently from the usual (non-irb) behaviour? It can't even be the encoding of the terminal window running irb, because the same problem happens in a console window using the Windows character encoding and in a mintty window configured for UTF-8.

Member

enebo commented Aug 26, 2016

@rovf this is definitely a bug, but it involves how we read from stdin and convert the input to a string internally. We must not be doing that correctly on Windows. I believe that on non-Windows platforms we transcode all input to the proper internal encoding (which will usually be the external one), but on Windows we still fall back to non-native IO, so that code path is simply not doing the transcoding.

tl;dr: this is a bug in reading IO from a console/tty on Windows, and we will fix it.
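
Until that is fixed, a possible user-side workaround is to transcode strings to UTF-8 before dumping. This is only a sketch: `normalize_for_dump` is a hypothetical helper, not a JRuby or Ruby API, and it assumes the strings are valid in their labelled encoding (`String#encode` raises otherwise). Binary strings are left alone.

```ruby
# Hypothetical workaround (not a JRuby API): transcode strings to UTF-8
# before dumping so the short encoding marker is used.
def normalize_for_dump(obj)
  case obj
  when String
    # Leave binary data untouched; transcode everything else to UTF-8.
    obj.encoding == Encoding::BINARY ? obj : obj.encode(Encoding::UTF_8)
  when Array
    obj.map { |item| normalize_for_dump(item) }
  when Hash
    obj.each_with_object({}) do |(key, value), out|
      out[normalize_for_dump(key)] = normalize_for_dump(value)
    end
  else
    obj
  end
end

# Simulate strings labelled the way the Windows console produced them.
console_like = %w[abc xyz UUUUUU].map { |s| s.dup.force_encoding('Windows-1252') }
puts Marshal.dump(normalize_for_dump(console_like)).inspect
```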
