JSON: Handle non-standard UTF-8 string escapes #6429

Closed
wants to merge 1 commit

Conversation

hinrik
Contributor

hinrik commented Jul 22, 2018

I came across \u escapes like this in Facebook's JSON-exported user
data. Here's another example of these in the wild:

http://seclists.org/wireshark/2018/Jul/36

This issue is present in the standard Ruby/Python/Go/PHP parsers too.
The only parser I found that handles it is Perl's JSON module:

https://github.com/makamaka/JSON/blob/master/lib/JSON/backportPP.pm#L811
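For illustration, here's roughly how such an escape currently comes out of the parser (using 'Ö' as an example):

```crystal
require "json"

# "\u00c3\u0096" is the UTF-8 byte sequence 0xC3 0x96 for 'Ö', escaped
# one byte at a time. The parser reads each escape as a codepoint instead:
JSON.parse(%q("\u00c3\u0096")).as_s # => "Ã\u{96}" (U+00C3, U+0096), not "Ö"
```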

@asterite
Member

Maybe if the issue is present in so many languages, and it's a non-standard thing, it's not an issue? Where in the spec of JSON is this?

@hinrik
Contributor Author

hinrik commented Jul 22, 2018

Where in the spec of JSON is this?

The standard only specifies UTF-16 surrogate pairs: https://tools.ietf.org/html/rfc7159#section-7.
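For reference, the surrogate pair decoding the RFC describes works out like this (a sketch of its own G clef example):

```crystal
# "\uD834\uDD1E" per RFC 7159: combine the high and low surrogates
# into a single codepoint.
hi = 0xD834
lo = 0xDD1E
codepoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
codepoint.to_s(16) # => "1d11e"
codepoint.chr      # => '𝄞' (MUSICAL SYMBOL G CLEF)
```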

Maybe if the issue is present in so many languages, and it's a non-standard thing, it's not an issue?

It was an issue for me at least, as I had to use Perl instead of Crystal for a text processing task. :)

Sure, this kind of output is non-standard and rare, but a major website is outputting it to its users, and it's possible to work around it as Perl has done, so why not do so? I'd say the robustness principle should apply.

```diff
@@ -154,7 +154,7 @@ abstract class JSON::Lexer
       when '\0'
         raise "Unterminated string"
       when '\\'
-        @buffer << consume_string_escape_sequence
+        consume_string_escape_sequence
```
Member

Why isn't this using skip: true?

Contributor Author

Because it's inside consume_string_with_buffer, which cares about collecting the contents of the string into a buffer. Only consume_string_skip uses skip: true.

@asterite
Member

Sounds good. But I don't understand the change. Why is it called skip? Could you add a few comments explaining the change, and this particular sequence?

@hinrik
Copy link
Contributor Author

hinrik commented Jul 22, 2018

Yeah, I can try to make it a bit clearer. There wasn't an obvious simple change to make here, because normally the string-consuming functions return a Char, but for these UTF-8 escapes I'm adding the bytes directly to the string @buffer. Perhaps it's still possible to refactor the code to be a bit cleaner (e.g. to avoid so many if !skip checks in consume_string_escape_sequence).

To have those functions continue to return a Char, you'd either have to rely on lookahead (which would require other changes to json/lexer/io_based.cr) or be smarter about the contents of the UTF-8 escapes, to know when we've parsed enough of them to constitute a full character.
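A simplified sketch of the shape I mean (hypothetical helper, not the actual lexer code):

```crystal
# Escapes below 0x80 can be emitted as a Char, but \u0080..\u00ff would
# have to go into the buffer as a raw byte, since a single escape may be
# only one byte of a multi-byte UTF-8 character.
def append_escape(buffer : IO::Memory, code : Int32)
  if code < 0x80
    buffer << code.chr           # plain ASCII: a complete character
  else
    buffer.write_byte code.to_u8 # one byte of a possible UTF-8 sequence
  end
end

buffer = IO::Memory.new
append_escape(buffer, 0xC3) # \u00c3
append_escape(buffer, 0x96) # \u0096
buffer.to_s # => "Ö"
```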

@asterite
Member

@hinrik I'm sorry, but reviewing the code and reading the spec, I don't understand the reason for this change. JSON says that to encode a character you can use \uXXXX. That means \u00c3 is the Unicode character with codepoint C3 hexadecimal, which is 'Ã'. I can't find anywhere that says it should be interpreted as a byte value. The example given in the spec, "\uD834\uDD1E", which should expand into a G clef, works as expected right now. That other parsers (except Perl, I don't know why) work like Crystal confirms my belief. Additionally, the character you want, the one you used in the spec, 'Ö', has a codepoint of 214, so it can be expressed in JSON as "\u00d6".

That Facebook generates invalid JSON is not our problem.
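Spelling those values out (illustrative):

```crystal
require "json"

'Ö'.ord.to_s(16)                    # => "d6" (214), so "\u00d6" expresses it directly
JSON.parse(%q("\u00d6")).as_s       # => "Ö"
JSON.parse(%q("\uD834\uDD1E")).as_s # => "𝄞" (the surrogate pair case already works)
```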

@straight-shoota
Member

This simply doesn't seem correct. There would be no way to disambiguate if "\u00c3\u0096" should be interpreted as Ã\u{96} (codepoints 195 and 150) or Ö (codepoint 214).
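In bytes (illustrative):

```crystal
# Both readings are well-formed, so nothing in the input says which one
# the producer meant.
"\u{c3}\u{96}".bytes # => [195, 131, 194, 150] (codepoints 195 and 150, per the spec)
"Ö".bytes            # => [195, 150]           (what the producer intended)
```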

@asterite
Member

Exactly.

@hinrik
Contributor Author

hinrik commented Jul 22, 2018

You're right. From what I can tell, it only worked with Perl due to some accidental re-encoding: for backwards compatibility reasons, Perl outputs latin1 to stdout by default. If I tell it to output UTF-8, I get the same output as in Crystal.

@jhass
Member

jhass commented Jul 24, 2018

Yeah, you should work around this by preprocessing the JSON before even parsing it, converting the improper escapes into proper ones. Something along these lines:
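(A rough sketch; the helper name and regex are illustrative, not stdlib API. It decodes runs of \u00XX escapes whose byte values form valid UTF-8 back into real characters, then parses normally.)

```crystal
require "json"

def fix_utf8_escapes(json : String) : String
  # Find maximal runs of \u00XX escapes with high byte values (0x80..0xFF).
  json.gsub(/(?:\\u00[89a-fA-F][0-9a-fA-F])+/) do |run|
    hex = run.scan(/\\u00([0-9a-fA-F]{2})/).map { |m| m[1].to_u8(16) }
    bytes = Bytes.new(hex.size) { |i| hex[i] }
    decoded = String.new(bytes)
    decoded.valid_encoding? ? decoded : run # leave invalid runs untouched
  end
end

puts JSON.parse(fix_utf8_escapes(%q({"name":"\u00c3\u0096rn"})))["name"] # => Örn
```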

@bcardiff
Member

bcardiff commented Aug 6, 2018

Closing, since there seems to be nothing wrong in the stdlib. At most, someone could post some re-encoding snippets.
