Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve decoding of HTML entities #5064

Merged
merged 9 commits into from Oct 4, 2017

Conversation

straight-shoota
Copy link
Member

@straight-shoota straight-shoota commented Sep 30, 2017

This PR adds a few fixes for several edge cases that were broken from #5055 or before

  • named character references including numbers are replaced
  • non-matching named character references with trailing semicolon are left in place
  • invalid numeric character references are replaced with replacement character
  • numeric character references resulting in noncharacters or control characters (except space) are left in place
  • numeric character references between 0x80 and 0x9F are replaced with compatibility replacements for old numeric entities from Windows-1252
  • the API documentation states that this method recognizes character references from HTML5. Other doctypes have different entities and even different rules for numeric references.

The code is based on the HTML5 tokenizing guidelines and inspired by the implementations for Go and PHP.

src/html.cr Outdated

# see https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
private def self.decode_codepoint(codepoint)
if 0x80 <= codepoint <= 0x9F
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case switch might look neater here.

src/html.cr Outdated
"\uFFFD"
elsif 0xFDD0 <= codepoint <= 0xFDEF || # unicode noncharacters
codepoint & 0xFFFF >= 0xFFFE || # last two of each plane (nonchars) disallowed
(codepoint < 0x0020 && codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C) || # unicode control characters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codepoint != 0x0009 && codepoint != 0x000A && codepoint != 0x000C

->

!{0x0009, 0x000A, 0x000C}.includes?(codepoint)

end

it "unescapes characters above Char::MAX_CODEPOINT" do
str = HTML.unescape("limit &#x110000;")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it, if 10FFFF is not a character, then why should something right after it be a character

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a typo. It doesn't decode &#x110000; but replaces it with replacement character \uFFFD.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question still stands. &#x10FFFF;->&#x10FFFF; but &#x110000; -> \uFFFD - why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&#x10FFFF; is a valid codepoint, yet not supported as a character reference because it's a unicode noncharacter. Therefore it is not decoded.
&#x110000; is an invalid codepoint and therefore replaced by replacement character.

See Numeric character reference end state for details.

@straight-shoota
Copy link
Member Author

@Sija comment
Thanks, case looks better I think.
Unfortunately I couldn't put codepoint & 0xFFFF >= 0xFFFE in a when clause (maybe with some case-fu? but probably not possible). I moved all checks for disallowed characters into the else part to keep them together.

@RX14 RX14 added this to the Next milestone Oct 4, 2017
@RX14 RX14 merged commit 17ac8a2 into crystal-lang:master Oct 4, 2017
@straight-shoota straight-shoota deleted the jm-html-entities branch October 4, 2017 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants