Marshal::dump adds string encoding #4047

rovf · 2016-08-01T08:54:22Z

Environment

jruby 1.7.24 (1.9.3p551) 2016-01-20 bd68d85 on Java HotSpot(TM) 64-Bit Server VM 1.7.0_79-b15 +jit [Windows 7-amd64]

and

jruby 9.0.4.0 (2.2.2) 2015-11-12 b9fb7aa Java HotSpot(TM) 64-Bit Server VM 24.79-b02 on 1.7.0_79-b15 +jit [Windows 7-amd64]

When I do a Marshal.dump in JRuby 9.0.4.0 on Windows:

irb(main):001:0> Marshal.dump(['abc','xyz','UUUUUU'])
=> "\x04\b[\bI\"\babc\x06:\rencoding\"\nCP850I\"\bxyz\x06;\x00\"\nCP850I\"\vUUUUUU\x06;\x00\"\nCP850"
irb(main):002:0>

or in JRuby 1.7.24:

irb(main):005:0> Marshal.dump(['abc','xyz','UUUUUU'])
=> "\x04\b[\bI\"\babc\x06:\rencoding\"\x11Windows-1252I\"\bxyz\x06;\x00\"\x11Windows-1252I\"\vUUUUUU\x06;\x00\"\x11Windows-1252"

I see that the string encoding is also part of the marshalled string (and, interestingly, the encoding name is different in the two JRuby versions). If I do the same in Ruby, I don't get this encoding information in such verbosity, for instance with MRI Ruby 2.2.4:

irb(main):042:0> Marshal.dump(['abc','xyz','UUUUUU'])
=> "\x04\b[\bI\"\babc\x06:\x06ETI\"\bxyz\x06;\x00TI\"\vUUUUUU\x06;\x00T"

When marshalling data structures which contain lots of very short strings, the repetition of encoding names takes up a lot of space in the resulting (marshalled) string.
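To get a rough feel for the overhead, the following sketch compares the marshalled size of many short strings tagged as UTF-8 against the same strings tagged as Windows-1252. The helper name `marshalled_size` and the sample data are made up for illustration. The strings are re-tagged with `force_encoding` to simulate what the console produces, so the script shows the same effect on MRI as well.

```ruby
# Compare marshalled sizes for many short strings under two encodings.
# marshalled_size is a hypothetical helper written for this illustration.
def marshalled_size(count, encoding)
  # .dup gives every element its own String object, so Marshal has to
  # write each one out in full instead of emitting an object link.
  strings = Array.new(count) { "ab".dup.force_encoding(encoding) }
  Marshal.dump(strings).bytesize
end

utf8_size   = marshalled_size(1000, Encoding::UTF_8)
cp1252_size = marshalled_size(1000, Encoding::Windows_1252)

puts "UTF-8:        #{utf8_size} bytes"
puts "Windows-1252: #{cp1252_size} bytes"
puts "overhead:     #{cp1252_size - utf8_size} bytes"
```

The difference grows linearly with the number of strings, because the encoding name is written out again for every string even though the `encoding` symbol itself is only written once.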

JRuby should behave in this respect like MRI Ruby, to save space in the dumped data.

headius · 2016-08-24T21:16:55Z

MRI is still embedding the encoding, but doing it using the shortcut symbol "E" followed by a boolean. The E means "Encoding" and the T means "true", which marks a UTF-8 string (an F, "false", would mark US-ASCII).

The "encoding" symbol in JRuby and the "E" symbol in MRI get assigned link \x00 and used twice more in both outputs, followed by the encoding name (which for MRI is just "T" again).

So the problem isn't that we're including the encoding, it's that we're not treating 7-bit Windows-1252 strings as US-ASCII during marshaling.
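A small sketch of how the encoding ends up in the dump, for anyone who wants to reproduce this without a Windows console. The helper `dump_with_encoding` is hypothetical; it simply re-tags a copy of the string before dumping. In the Marshal format, `E` with `T` (true) marks UTF-8, `E` with `F` (false) marks US-ASCII, any other encoding is written as an `encoding` instance variable holding the full name, and binary strings carry no encoding entry at all.

```ruby
# dump_with_encoding is a hypothetical helper for this illustration:
# it tags a copy of the text with the given encoding and marshals it.
def dump_with_encoding(text, encoding)
  Marshal.dump(text.dup.force_encoding(encoding))
end

utf8   = dump_with_encoding("abc", Encoding::UTF_8)
ascii  = dump_with_encoding("abc", Encoding::US_ASCII)
cp1252 = dump_with_encoding("abc", Encoding::Windows_1252)
binary = dump_with_encoding("abc", Encoding::BINARY)

p utf8    # short form, E => true  (UTF-8)
p ascii   # short form, E => false (US-ASCII)
p cp1252  # long form, :encoding => "Windows-1252"
p binary  # plain string record, no encoding instance variable
```

Only the UTF-8 and US-ASCII cases get the one-byte shortcut, which is why a string tagged with a Windows code page is so much more verbose even when it contains only 7-bit characters.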

headius · 2016-08-24T21:18:27Z

The other possibility here is that the parser on Windows is giving 7-bit strings a non-"US-ASCII" encoding. I believe MRI normalized literal 7-bit strings to always say they're US-ASCII at some point, and we may not have that change everywhere.

enebo · 2016-08-24T21:50:08Z

oooh oooh I know this one :)

You are using irb and we must not be transcoding the incoming string to proper internal/external encoding. If I take your example and run from a file in both 1.7 and 9k I do not see any windows code pages.
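One way to compare the two situations is to print the encodings involved, once from a script file and once by pasting the same lines into irb/jirb. This is only a diagnostic sketch; `encoding_report` is a made-up helper name.

```ruby
# encoding_report is a hypothetical helper: it collects the encodings
# that decide how a string literal gets tagged.
def encoding_report
  {
    literal: 'abc'.encoding,                      # encoding of a plain literal
    source: __ENCODING__,                         # encoding of the current source
    default_external: Encoding.default_external,  # used for IO by default
    default_internal: Encoding.default_internal   # nil unless configured
  }
end

encoding_report.each { |name, value| puts "#{name}: #{value.inspect}" }
```

If the `literal` entry differs between the file run and the interactive session, the string is picking up its encoding from how the input was read rather than from Marshal.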

rovf · 2016-08-25T06:56:00Z

Indeed, you are right! I would never have expected irb/jirb to cause this difference in behaviour. I think it should be documented in the JRuby README, because it is confusing.

BTW, I don't quite understand the reason why it must be different for irb. Take for example the expression

Marshal.dump(['abc','xyz','UUUUUU']).inspect

At what point does it become necessary to format the resulting string differently from the usual (non-irb) behaviour? It can't even be the encoding of the terminal window running irb, because the same problem happens both in a console window using the Windows character encoding and in a mintty window configured for UTF-8.

enebo · 2016-08-26T15:08:30Z

@rovf this is definitely a bug, but it involves how we read stdin and convert it to a string internally. We must not be doing that correctly on Windows. I believe that on non-Windows platforms we transcode all input to the proper internal encoding (which will usually be the external one), but on Windows we still fall back to non-native IO, so we are just not doing it in that code path.

tl;dr a bug we will fix on reading IO on windows from a console/tty
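Until the input path is fixed, one possible workaround on the application side is to normalize strings before dumping them. The sketch below is an assumption about what such a workaround could look like, not something JRuby provides; `normalize_for_dump` is a hypothetical helper. It re-tags 7-bit strings as UTF-8 and transcodes everything else.

```ruby
# Hypothetical helper, not part of JRuby or the standard library.
# Note: transcoding a binary (ASCII-8BIT) string that contains high
# bytes raises Encoding::UndefinedConversionError.
def normalize_for_dump(str)
  if str.ascii_only?
    # Pure 7-bit content: the bytes are identical in UTF-8, so only
    # the encoding tag on a copy needs to change.
    str.dup.force_encoding(Encoding::UTF_8)
  else
    # Otherwise convert the bytes themselves to UTF-8.
    str.encode(Encoding::UTF_8)
  end
end

# Simulate a string as read from a Windows console.
console_like = "abc".dup.force_encoding(Encoding::Windows_1252)

p Marshal.dump(console_like)                      # long form with encoding name
p Marshal.dump(normalize_for_dump(console_like))  # short E/T form
```

This changes the encoding recorded in the dump, so it only makes sense when the consumer of the data is happy to receive UTF-8 strings.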

headius added core encoding JRuby 1.7.x JRuby 9000 needs tests labels Aug 24, 2016
