Let HTML.unescape unescape all HTML5 entities #5055

asterite · 2017-09-28T20:21:04Z

Both Go and Python unescape all HTML5 named entities. Let's do the same.

I also improved the code a bit to use fewer regexes. Ideally, though, we shouldn't use regexes at all, but I'm too lazy to optimize this right now. We can always optimize it later. For example, Go doesn't use a regex and it's about 10 times faster.

Closes #3409

RX14 · 2017-09-28T20:57:53Z

src/html.cr

+    string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match|
+      if code = match[1]?
+        HTML::SINGLE_CHAR_ENTITIES[code]? ||
+          HTML::DOUBLE_CHAR_ENTITIES[code]? ||


Do we really need to separate these hashes? They're :nodoc: anyway, so why not make it one hash and one hash lookup?

I think a single hash of strings would occupy more memory. A Char is just four bytes, while a String is at minimum 9 bytes plus an indirection. Go implements it like this too and I think it's fine. In any case, the slowness comes from the regex.

It might be faster and more space-efficient to encode the double-char entities as code points in an unused Unicode section, and to use the code point minus the section's offset as an index to retrieve the actual strings from a tuple or array, instead of doing two hash lookups.

Please send a PR with optimizations later. This is just about getting it right at a reasonable performance.
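The single-table idea suggested in this thread could look roughly like the sketch below. It is written in Python only because Python ships the HTML5 entity table (`html.entities.html5`); it is not the Crystal code from this PR. `PUA_START`, `MULTI`, `TABLE` and `lookup` are hypothetical names, and the choice of the private use area starting at U+E000 as the "unused Unicode section" is an assumption. The memory argument made above is about Crystal's Char versus String, so this only illustrates the indexing scheme, not any saving.

```python
from html.entities import html5  # maps names such as "amp;" to their replacement text

# Assumption: use the private use area starting at U+E000 as the "unused section".
PUA_START = 0xE000

# Distinct replacement strings that consist of more than one code point.
MULTI = tuple(sorted({value for value in html5.values() if len(value) > 1}))
MULTI_INDEX = {value: index for index, value in enumerate(MULTI)}

# The scheme only works if no real single-character entity falls in the reserved range.
assert not any(
    len(value) == 1 and PUA_START <= ord(value) < PUA_START + len(MULTI)
    for value in html5.values()
)

# One table: every entity maps to exactly one character. Multi-character
# replacements are stored as a placeholder code point PUA_START + index.
TABLE = {
    name: value if len(value) == 1 else chr(PUA_START + MULTI_INDEX[value])
    for name, value in html5.items()
}


def lookup(name):
    """Return the replacement text for an entity name, or None if unknown."""
    char = TABLE.get(name)  # the only hash lookup
    if char is None:
        return None
    index = ord(char) - PUA_START
    if 0 <= index < len(MULTI):
        return MULTI[index]  # placeholder: fetch the real string by offset
    return char
```

With this layout a miss or a single-character hit costs one hash lookup, and a double-character hit costs one hash lookup plus one array index.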

RX14 · 2017-09-28T21:00:31Z

src/html.cr

-        else
-          "&#x#{$1};"
-        end
+    string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match|


You should remove the first ;? and make it just ; since it'll never match inside HTML::*_ENTITIES without the ; anyway.

There are entities without ';'

For example, &amp

I apologise for attempting to apply logic to HTML. I simply saw the start of the hash and assumed they all terminated with ;.

I have a secret: I had it implemented like you asked me to until I realized &amp and a few others existed :-P
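For anyone who wants to see which names are accepted without the trailing ;, one option is to inspect the table Python ships. This is only an illustration based on Python's `html.entities.html5` mapping (where legacy names appear both with and without the ;), not the Crystal tables in this PR; `undelimited` is a name made up for the example.

```python
from html.entities import html5  # keys look like "amp;" and, for legacy names, also "amp"

# Entity names that the table accepts without a trailing ';'.
undelimited = sorted(name for name in html5 if not name.endswith(";"))

print("amp" in undelimited)     # the case mentioned above
print("amp;" in html5)          # the delimited form is listed as well
print("hearts" in undelimited)  # newer names are only listed with the ';'
print(len(undelimited))         # how many undelimited names the table has
```

So a regex that insists on the ; would miss every name in `undelimited`.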

straight-shoota · 2017-09-28T21:37:20Z

I'd suggest renaming this method to HTML.decode, as detailed in #3409 (issue comment).

straight-shoota · 2017-09-28T21:54:54Z

src/html.cr

-        else
-          "&#x#{$1};"
-        end
+    string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match|


This regex won't work for an entity that is not delimited by ; but is immediately followed by an ASCII letter. For example, &ampd should return &d.
This is difficult to handle properly. It requires either a custom parse tree or multiple checks to see if one of those undelimited character sequences is matched. There are 106 of them, ranging from 3 to 7 characters.
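A minimal sketch of the "multiple checks" approach, assuming a longest-prefix search over the undelimited names. It is written in Python against `html.entities.html5` purely for illustration and is not the Crystal implementation; `decode_named`, `UNDELIMITED` and `NAMED` are hypothetical names, numeric entities are left out, and the pattern also allows digits (for names like frac12), which the regex in the diff does not.

```python
import re
from html.entities import html5  # maps "amp;", "amp", "lt;", ... to replacement text

# Names that are valid without a trailing ';' (the undelimited entities).
UNDELIMITED = {name for name in html5 if not name.endswith(";")}
MAX_UNDELIMITED = max(len(name) for name in UNDELIMITED)

# Assumption: greedy capture of letters and digits, optionally ending in ';'.
NAMED = re.compile(r"&([a-zA-Z0-9]+;?)")


def decode_named(text):
    """Decode named entities only, falling back to the longest undelimited prefix."""

    def replace(match):
        name = match.group(1)
        if name in html5:  # exact hit, e.g. "amp;" or "amp"
            return html5[name]
        # The greedy capture swallowed trailing characters ("ampd"):
        # try progressively shorter prefixes, longest first.
        for end in range(min(len(name), MAX_UNDELIMITED), 1, -1):
            prefix = name[:end]
            if prefix in UNDELIMITED:
                return html5[prefix] + name[end:]
        return match.group(0)  # unknown entity: leave the text untouched

    return NAMED.sub(replace, text)


print(decode_named("&ampd"))      # &d
print(decode_named("&amphello"))  # &hello
```

The prefix loop only runs when the exact lookup fails, so well-formed input such as &amp; still costs a single table lookup.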

Sija · 2017-09-29T11:10:50Z

spec/std/html_spec.cr

+      str.should eq(" ⊐̸ ")
+    end
+
+    it "unescapes &ampd" do


typo: &ampd -> &amp

I don't think it's a typo. At least not in that sense. It was copied from my comment.
If anything, it should be changed to unescapes &amphello.

asterite self-assigned this Sep 28, 2017

RX14 requested changes Sep 28, 2017

RX14 approved these changes Sep 28, 2017

straight-shoota requested changes Sep 28, 2017

Let HTML.unescape unescape all HTML5 entities

8e71b05

asterite merged commit ee0b844 into crystal-lang:master Sep 29, 2017

Sija reviewed Sep 29, 2017

straight-shoota mentioned this pull request Sep 30, 2017

Improve decoding of HTML entities #5064

Merged

jeromegn mentioned this pull request Feb 11, 2018

Add HTML entity support to Slang jeromegn/slang#39

Open
