JSON: Handle non-standard UTF-8 string escapes #6429

Closed
wants to merge 1 commit

Conversation

hinrik
Contributor

hinrik commented Jul 22, 2018

I came across \u escapes like this in Facebook's JSON-exported user
data. Here's another example of these in the wild:

http://seclists.org/wireshark/2018/Jul/36

This issue is present in the standard Ruby/Python/Go/PHP parsers too.
The only parser I found that handles it is Perl's JSON module:

https://github.com/makamaka/JSON/blob/master/lib/JSON/backportPP.pm#L811
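For illustration, here's roughly how such an escape currently comes out of the parser (using 'Ö' as an example):

```crystal
require "json"

# "\u00c3\u0096" is the UTF-8 byte sequence 0xC3 0x96 for 'Ö', escaped
# one byte at a time. The parser reads each escape as a codepoint instead:
JSON.parse(%q("\u00c3\u0096")).as_s # => "Ã\u{96}" (U+00C3, U+0096), not "Ö"
```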

@asterite
Member

Maybe if the issue is present in so many languages, and it's a non-standard thing, it's not an issue? Where in the spec of JSON is this?

@hinrik
Contributor Author

hinrik commented Jul 22, 2018

Where in the spec of JSON is this?

The standard only specifies UTF-16 surrogate pairs: https://tools.ietf.org/html/rfc7159#section-7.
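For reference, the surrogate pair decoding the RFC describes works out like this (a sketch of its own G clef example):

```crystal
# "\uD834\uDD1E" per RFC 7159: combine the high and low surrogates
# into a single codepoint.
hi = 0xD834
lo = 0xDD1E
codepoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
codepoint.to_s(16) # => "1d11e"
codepoint.chr      # => '𝄞' (MUSICAL SYMBOL G CLEF)
```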

Maybe if the issue is present in so many languages, and it's a non-standard thing, it's not an issue?

It was an issue for me at least, as I had to use Perl instead of Crystal for a text processing task. :)

Sure, this kind of output is non-standard and rare, but a major website is outputting it to its users, and it's possible to work around it as Perl has done, so why not do so? I'd say the robustness principle should apply.

```diff
@@ -154,7 +154,7 @@ abstract class JSON::Lexer
       when '\0'
         raise "Unterminated string"
       when '\\'
-        @buffer << consume_string_escape_sequence
+        consume_string_escape_sequence
```
Member

Why isn't this using skip: true?

Contributor Author

Because it's inside consume_string_with_buffer, which cares about collecting the contents of the string into a buffer. Only consume_string_skip uses skip: true.

@asterite
Member

Sounds good. But I don't understand the change. Why is it called skip? Could you add a few comments explaining the change, and this particular sequence?

@hinrik
Copy link
Contributor Author

hinrik commented Jul 22, 2018

Yeah, I can try to make it a bit clearer. There wasn't an obvious simple change to make here, because normally the string-consuming functions return a Char, but for these UTF-8 escapes I'm adding the bytes directly to the string @buffer. Perhaps it's still possible to refactor the code to be a bit cleaner (e.g. to avoid so many if !skip checks in consume_string_escape_sequence).

To have those functions continue to return a Char, you'd either have to rely on lookahead (which would require other changes to json/lexer/io_based.cr) or be smarter about the contents of the UTF-8 escapes, to know when we've parsed enough of them to constitute a full character.
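A simplified sketch of the shape I mean (hypothetical helper, not the actual lexer code):

```crystal
# Escapes below 0x80 can be emitted as a Char, but \u0080..\u00ff would
# have to go into the buffer as a raw byte, since a single escape may be
# only one byte of a multi-byte UTF-8 character.
def append_escape(buffer : IO::Memory, code : Int32)
  if code < 0x80
    buffer << code.chr           # plain ASCII: a complete character
  else
    buffer.write_byte code.to_u8 # one byte of a possible UTF-8 sequence
  end
end

buffer = IO::Memory.new
append_escape(buffer, 0xC3) # \u00c3
append_escape(buffer, 0x96) # \u0096
buffer.to_s # => "Ö"
```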

@asterite
Member

@hinrik I'm sorry, but reviewing the code and reading the spec, I don't understand the reason for this change. JSON says that to encode a character you can use \uXXXX. That means \u00c3 is the Unicode character with codepoint C3 hexadecimal, which is 'Ã'. I can't find anywhere that says it should be interpreted as a byte value. The example given in the spec, "\uD834\uDD1E", which should expand into a G clef, works as expected right now. That other parsers (except Perl, I don't know why) work like Crystal confirms my belief. Additionally, the character you want, the one you used in the spec, 'Ö', has a codepoint of 214, so it can be expressed in JSON as "\u00d6".

That Facebook generates invalid JSON is not our problem.
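Spelling those values out (illustrative):

```crystal
require "json"

'Ö'.ord.to_s(16)                    # => "d6" (214), so "\u00d6" expresses it directly
JSON.parse(%q("\u00d6")).as_s       # => "Ö"
JSON.parse(%q("\uD834\uDD1E")).as_s # => "𝄞" (the surrogate pair case already works)
```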

@straight-shoota
Member

This simply doesn't seem correct. There would be no way to disambiguate if "\u00c3\u0096" should be interpreted as Ã\u{96} (codepoints 195 and 150) or Ö (codepoint 214).
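In bytes (illustrative):

```crystal
# Both readings are well-formed, so nothing in the input says which one
# the producer meant.
"\u{c3}\u{96}".bytes # => [195, 131, 194, 150] (codepoints 195 and 150, per the spec)
"Ö".bytes            # => [195, 150]           (what the producer intended)
```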

@asterite
Member

Exactly.

@hinrik
Contributor Author

hinrik commented Jul 22, 2018

You're right. From what I can tell, it only worked with Perl due to some accidental re-encoding: for backwards compatibility reasons, Perl outputs latin1 to stdout by default. If I tell it to output UTF-8, I get the same output as in Crystal.

@jhass
Member

jhass commented Jul 24, 2018

Yeah, you should work around this by preprocessing the JSON before even parsing it, converting the improper escapes into proper ones. Something along these lines:
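(A rough sketch; the helper name and regex are illustrative, not stdlib API. It decodes runs of \u00XX escapes whose byte values form valid UTF-8 back into real characters, then parses normally.)

```crystal
require "json"

def fix_utf8_escapes(json : String) : String
  # Find maximal runs of \u00XX escapes with high byte values (0x80..0xFF).
  json.gsub(/(?:\\u00[89a-fA-F][0-9a-fA-F])+/) do |run|
    hex = run.scan(/\\u00([0-9a-fA-F]{2})/).map { |m| m[1].to_u8(16) }
    bytes = Bytes.new(hex.size) { |i| hex[i] }
    decoded = String.new(bytes)
    decoded.valid_encoding? ? decoded : run # leave invalid runs untouched
  end
end

puts JSON.parse(fix_utf8_escapes(%q({"name":"\u00c3\u0096rn"})))["name"] # => Örn
```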

@bcardiff
Member

bcardiff commented Aug 6, 2018

Closing, since there seems to be nothing wrong in the stdlib. At most, someone could post some re-encoding snippets.
