JSON: Handle non-standard UTF-8 string escapes #6429
Conversation
I came across `\u` escapes like this in Facebook's JSON-exported user data. Here's another example of these in the wild: http://seclists.org/wireshark/2018/Jul/36 This issue is present in the standard Ruby/Python/Go/PHP parsers too. The only parser I found that handles it is Perl's JSON module: https://github.com/makamaka/JSON/blob/master/lib/JSON/backportPP.pm#L811
Force-pushed from dd301f8 to 4a43092
Maybe if the issue is present in so many languages, and it's a non-standard thing, it's not an issue? Where in the JSON spec is this?
The standard only specifies UTF-16 surrogate pairs: https://tools.ietf.org/html/rfc7159#section-7.
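To make the two forms concrete (an illustrative aside in Crystal; the behavior shown is what any conforming parser does):

```crystal
require "json"

# Standard escaping per RFC 7159 §7: U+1D11E ("𝄞") becomes the
# UTF-16 surrogate pair \uD834\uDD1E.
puts JSON.parse(%q("\uD834\uDD1E")).as_s # => 𝄞

# The non-standard style escapes each UTF-8 byte (F0 9D 84 9E)
# separately. A conforming parser decodes four separate codepoints
# (mojibake), not one character:
p JSON.parse(%q("\u00f0\u009d\u0084\u009e")).as_s.codepoints # => [240, 157, 132, 158]
```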
It was an issue for me at least, as I had to use Perl instead of Crystal for a text-processing task. :) Sure, this kind of output is non-standard and rare, but a major website is outputting it to its users, and it's possible to work around it as Perl has done, so why not do so? I'd say the robustness principle should apply.
```diff
@@ -154,7 +154,7 @@ abstract class JSON::Lexer
       when '\0'
         raise "Unterminated string"
       when '\\'
-        @buffer << consume_string_escape_sequence
+        consume_string_escape_sequence
```
Why isn't this using `skip: true`?
Because it's inside `consume_string_with_buffer`, which cares about collecting the contents of the string into a buffer. Only `consume_string_skip` uses `skip: true`.
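For readers outside the codebase, a guessed, simplified sketch of that split (not the actual `JSON::Lexer` source; the `Char::Reader`-based shape here is an assumption):

```crystal
# Guessed simplification — not the real stdlib code. The collecting mode
# must keep every character it reads, which is why it cannot pass
# skip: true; only the skipping mode can.

# Collecting mode: everything read goes into the result string.
def consume_string_with_buffer(reader : Char::Reader) : String
  String.build do |buffer|
    loop do
      char = reader.current_char
      reader.next_char
      case char
      when '"'  then break
      when '\0' then raise "Unterminated string"
      else buffer << char # the real lexer also expands \uXXXX escapes here
      end
    end
  end
end

# Skipping mode: same scan, nothing retained.
def consume_string_skip(reader : Char::Reader) : Nil
  loop do
    char = reader.current_char
    reader.next_char
    break if char == '"'
    raise "Unterminated string" if char == '\0'
  end
end

reader = Char::Reader.new(%(hello" tail))
puts consume_string_with_buffer(reader) # => hello
```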
Sounds good. But I don't understand the change. Why is it called `skip`? Could you add a few comments explaining the change, and this particular sequence?
Yeah, I can try to make it a bit clearer. There wasn't an obvious simple change to make here, because normally the string-consuming functions return a `Char` that the caller appends to the buffer. To have those functions continue to return a `Char` […]
@hinrik I'm sorry, but reviewing the code and seeing the spec, I don't understand the reason for this change. JSON says that to encode a char you can use a `\u` escape with the character's code point (a UTF-16 surrogate pair for anything outside the BMP). That Facebook generates invalid JSON is not our problem.
This simply doesn't seem correct. There would be no way to disambiguate: a sequence like `\u00c3\u00a1` could legitimately mean the two codepoints U+00C3 U+00A1 ("Ã¡"), or, in the broken scheme, the UTF-8 bytes of "á".
Exactly.
You're right. From what I can tell, it only worked in Perl due to some accidental re-encoding: for backwards-compatibility reasons, Perl outputs latin-1 to stdout by default. If I tell it to output UTF-8, I get the same output as in Crystal.
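That accidental fix is easy to sanity-check (an aside in Crystal rather than Perl):

```crystal
# Decoding "\u00c3\u00a1" per the spec yields the two codepoints
# U+00C3 U+00A1 ("Ã¡"). Re-encoding those codepoints as latin-1 gives
# the bytes 0xC3 0xA1 — exactly the UTF-8 encoding of "á" — so a
# latin-1 stdout shows the "right" text by accident.
mojibake = "\u00c3\u00a1" # what a conforming parser produces
p mojibake.encode("ISO-8859-1") == "á".to_slice # => true
```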
Yeah, you should work around this by preprocessing the JSON before even parsing it, converting the improper escapes into proper ones.
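A minimal sketch of such a preprocessing step (a hypothetical helper, not a stdlib API; it inherits the ambiguity discussed above, since a producer could have meant those codepoints literally):

```crystal
require "json"

# Hypothetical helper: collapse runs of \u00XX escapes in the 0x80..0xFF
# range back into the UTF-8 characters those bytes spell out, assuming
# the producer escaped raw UTF-8 bytes one at a time. No validation is
# done here; a malformed byte run yields an invalid string.
def repair_byte_escapes(json : String) : String
  json.gsub(/(?:\\u00[89a-f][0-9a-f])+/i) do |run|
    bytes = run.scan(/\\u00([0-9a-f]{2})/i).map { |md| md[1].to_u8(16) }
    String.new(Slice.new(bytes.to_unsafe, bytes.size))
  end
end

# "á" (U+00E1) is the UTF-8 bytes C3 A1, which the broken producer
# writes as \u00c3\u00a1:
puts JSON.parse(repair_byte_escapes(%q({"s":"\u00c3\u00a1"})))["s"] # => á
```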
Closing, since there seems to be nothing wrong in the stdlib. Someone can post some re-encoding snippets at most.