Unmarshaled symbol has the wrong encoding #1329

Closed
DavidEGrayson opened this issue Dec 12, 2013 · 9 comments

@DavidEGrayson
Contributor

JRuby behaves differently than MRI when it unmarshals symbols. The symbol always seems to have the US-ASCII encoding, even if it contains non-ASCII Unicode characters.

To reproduce this, I needed two separate scripts. (It seems that the state of JRuby's symbol table affects how Marshal.load behaves.)

In test1.rb, I have:

# coding: UTF-8
mu = 'µ'.to_sym
File.open('mu.dat', 'wb') { |f| f.write(Marshal.dump(mu)) }

In test2.rb, I have:

dump = File.open('mu.dat', 'rb') { |f| f.read }
p dump.bytes.to_a
mu = Marshal.load(dump)
puts mu.to_s.encoding

Here is the output I get from running these scripts, and also information about the versions of Ruby I am using:

$ jruby -v && jruby test1.rb && jruby test2.rb
jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot(TM) 64-Bit Server VM 1.7.0_07-b10 [Windows 8-amd64]
[4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84]
US-ASCII
$ ruby -v && ruby test1.rb && ruby test2.rb
ruby 2.0.0p0 (2013-02-24) [x64-mingw32]
[4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84]
UTF-8

From this we can see that both JRuby and MRI are marshaling the data in the same way, but when JRuby unmarshals it, it is setting the encoding to US-ASCII instead of UTF-8.
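The marshaling half can also be checked on its own, without writing a file. This is just a sketch: it dumps the same symbol and compares the result with the byte list printed above. The variable names are mine, and I have only reasoned about what MRI produces here.

```ruby
# Bytes copied from the output of test2.rb above.
expected = [4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84]

# "\u00B5" is the mu character; the \u escape makes the string UTF-8
# regardless of the source file's encoding.
actual = Marshal.dump("\u00B5".to_sym).bytes.to_a

if actual == expected
  puts "dump bytes match"
else
  puts "dump bytes differ: #{actual.inspect}"
end
```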

This issue came up because I am trying to use YARD to generate documentation for JRuby code that has special characters in a few method alias names. When I run "yard doc", the data about those methods is marshaled and written to the disk, and when I run "yard server --reload" it gets unmarshaled badly.

One workaround for this issue is to create a symbol with the proper encoding before running Marshal.load.
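A minimal sketch of that workaround, assuming the symbol name is known ahead of time. The dump bytes are the ones printed by test2.rb; I have not verified this on every affected JRuby version.

```ruby
# Bytes copied from the output of test2.rb above.
dump = [4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84].pack('C*')

# Workaround: intern the symbol with the UTF-8 encoding first, so it is
# already in the symbol table before Marshal.load looks it up.
"\u00B5".to_sym

mu = Marshal.load(dump)
puts mu.encoding
```

This only helps when you know which symbols the data contains, so it is not a general fix.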

Sorry if this is a duplicate. This could be related to the issue with symbol literal encoding that I just reported, #1328. I also see there is another open issue about method that is probably related to symbol encoding: #914.

@enebo
Member

enebo commented Dec 12, 2013

I suspect this is a direct consequence of #1328 since that issue establishes that the encoding of the symbol is wrong in the first place. So it is marshalling that improperly encoded symbol just fine.

@DavidEGrayson
Contributor Author

Thanks for taking a look at this. In test1.rb, I worked around issue #1328 by using a string literal and calling to_sym. In test2.rb I printed the values of the bytes to make sure that the marshaling was done correctly; both JRuby and MRI produce the same bytes when they marshal that symbol. So it has to be a problem with the unmarshaling.

To simplify the issue, we could just consider the following code:

Marshal.load("\x04\x08\x49\x3a\x07\xc2\xb5\x06\x3a\x06\x45\x54").encoding.to_s

When I run it with JRuby I get "US-ASCII" and when I run it with MRI I get "UTF-8".

@enebo
Member

enebo commented Dec 12, 2013

Sorry I missed that you used to_sym in that example. This indeed appears to be a problem...

@DavidEGrayson
Contributor Author

I sat down and studied the marshaled data we are working with and figured out what each byte means.

\x04\x08  Marshal format version 4.8.
\x49      TYPE_IVAR: We are going to have an object followed by its instance variables.
:         TYPE_SYMBOL: The object is a symbol.
\x07      long: 2: The symbol has two bytes.
\xc2\xb5  The actual bytes for the symbol (mu in UTF-8).
\x06      long: 1. The symbol has one instance variable.
:         TYPE_SYMBOL.  This symbol is going to be the name of the instance variable.
\x06      long: 1.  The symbol has one byte.
E         actual bytes for name symbol.
T         true.  The value of the :E instance variable is true.

(I figured all this out by studying the MRI source code. I have not looked up the official Marshal format documentation if there is one, so some parts of my description might show that ignorance.)

To summarize, it looks like we have a symbol whose value consists of some UTF-8 bytes and it is marshaled as if it has one instance variable :E whose value is true. This is not a real instance variable; I suspect it is just a bit of a hack for adding encoding data to marshaled symbols without breaking the data format. MRI effectively ignores all "instance variables" for symbols except the last one.
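To double-check the breakdown above, here is a rough sketch that walks the twelve bytes by hand. `small_long` is a hypothetical helper of mine, not something from MRI or JRuby, and it only handles the one-byte integer forms that show up in this example.

```ruby
# Hypothetical helper: decodes only the single-byte forms of Marshal's
# integer ("long") encoding: 0 for zero, and 6..127 meaning (byte - 5).
def small_long(byte)
  return 0 if byte.zero?
  unless (6..127).cover?(byte)
    raise ArgumentError, "multi-byte or negative long not handled: #{byte}"
  end
  byte - 5
end

bytes = [4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84]

major, minor = bytes.shift(2)                    # format version
type_ivar    = bytes.shift.chr                   # TYPE_IVAR marker
type_symbol  = bytes.shift.chr                   # TYPE_SYMBOL marker
name_len     = small_long(bytes.shift)           # length of the symbol name
name_bytes   = bytes.shift(name_len)             # raw bytes of the name
ivar_count   = small_long(bytes.shift)           # number of "instance variables"
ivar_type    = bytes.shift.chr                   # TYPE_SYMBOL for the ivar name
ivar_len     = small_long(bytes.shift)           # length of the ivar name
ivar_name    = bytes.shift(ivar_len).pack('C*')  # the ivar name itself
ivar_value   = bytes.shift.chr                   # marker for the ivar value

puts "version #{major}.#{minor}"
puts "markers #{type_ivar} #{type_symbol}"
puts "name bytes #{name_bytes.inspect} (#{name_len} bytes)"
puts "#{ivar_count} ivar: #{ivar_type}#{ivar_name} = #{ivar_value}"
puts "leftover bytes: #{bytes.length}"
```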

When unmarshalling the symbol in r_symreal in marshal.c, MRI passes the name and value of the instance variable (:E and true) as the arguments to id2encidx to figure out what encoding to use for the symbol. You can see the code for id2encidx here:

https://github.com/ruby/ruby/blob/v2_0_0_352/marshal.c#L1237-1250

It seems that there are four types of things that are recognized:

  • If the instance variable is named :E and it is true, the symbol's encoding is UTF-8. That is what is happening here in my example.
  • If the instance variable is named :E and it is false, the symbol's encoding is US-ASCII.
  • If the instance variable is named :encoding then its value should be a string that says what encoding to use.
  • If no instance variables are supplied, or they are malformed in a way that does not raise an exception, then the encoding doesn't get set explicitly in r_symreal. It is just whatever encoding was returned by r_bytes0 (which I'd have to look at carefully to understand).
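Roughly, the cases above could be written in Ruby like this. It is my own paraphrase of the behavior, not a translation of the C code and not existing JRuby code; `encoding_from_ivar` is a made-up name, and returning nil stands in for "leave the encoding alone".

```ruby
# Hypothetical helper: maps a marshaled "instance variable" of a symbol
# to an encoding, or nil when the encoding should not be set explicitly.
def encoding_from_ivar(name, value)
  case name
  when :E
    return Encoding::UTF_8 if value == true
    return Encoding::US_ASCII if value == false
    nil # any other value is ignored
  when :encoding
    begin
      Encoding.find(value) # value is expected to be an encoding name
    rescue ArgumentError
      nil # unknown encoding name
    end
  end
end

# Applying it to the example: the name bytes plus the :E => true pair.
raw = [194, 181].pack('C*')
enc = encoding_from_ivar(:E, true)
raw.force_encoding(enc) if enc
sym = raw.to_sym
puts sym.encoding
```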

From looking at the JRuby code, I don't think it has the equivalent of MRI's id2encidx. It just sets the encoding of all symbols it unmarshals to US-ASCII. I would like to change JRuby's behavior to be the same as MRI's and make a pull request.

@enebo, let me know if you have any objections or advice! I hope I can write a good pull request for this. I have never contributed to JRuby before.

@enebo
Member

enebo commented Dec 13, 2013

It looks like you have dug into this just fine and it is pretty clear how MRI works in seeing that method. It will be great if you can make a PR to correct this in our marshalling.

@DavidEGrayson
Contributor Author

I tested this again today with MRI 2.2.0p0 and JRuby 9.0.0.0.pre1 and nothing has changed. I could probably make a pull request that fixes it 99.9% of the time, but issue #1348 is still open so it would not be a complete fix. As described in issue #1348, symbols retrieved from the JRuby symbol table will sometimes have the wrong encoding, so we cannot guarantee that unmarshalled symbols will have the right encoding until that is fixed.

@DavidEGrayson
Contributor Author

I just tested it again, and this problem is still present in JRuby 9.0.0.0.

@anthonylebrun

anthonylebrun commented Mar 4, 2017

Phew, this bug has been haunting me for the last few days.

I've been using the u2f gem with the latest JRuby (9.1.7.0) and jruby-openssl. When trying to register my U2F device, I get OpenSSL::X509::CertificateError: No message available errors when the DER certificate file generated by my YubiKey gets passed to OpenSSL::X509::Certificate.new.

Turns out this is essentially a catch-all exception for "couldn't parse your certificate" in the jruby openssl implementation.

I set up a dummy project to test this with MRI Ruby 1.9.3 and a few newer versions and everything ran smoothly. I was eventually able to get around the issue by calling String#force_encoding('UTF-8') on the certificate.

Thanks @DavidEGrayson for doing the research on this. I hope this issue does get resolved once and for all eventually :)

@headius
Member

headius commented Apr 27, 2017

This appears to be working ok in 9.1.8.0 (the encoding comes back as UTF-8) and we have further fixes for encodings in Strings (see #1348, #4564). Marking this as fixed in 9.1.9.0. I'm marking this fixed as of 9.1.8.0.

@anthonylebrun Sorry to hear about your trouble! If it's still happening for you with 9.1.8.0, try master (9.1.9.0). If that also breaks, open an issue and we'll look into it!
