Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let HTML.unescape unescape all HTML5 entities #5055

Merged
merged 1 commit into from Sep 29, 2017
Merged

Let HTML.unescape unescape all HTML5 entities #5055

merged 1 commit into from Sep 29, 2017

Conversation

asterite
Copy link
Member

Both Go and Python unescape all HTML5 named entities. Let's do the same.

I also improved a bit the code to use less regexes. Ideally, though, we shouldn't use regexes at all, but I'm super lazy now to optimize this. We can always optimize it later. For example Go doesn't use regex and it's about 10 times faster.

Closes #3409

@asterite asterite self-assigned this Sep 28, 2017
src/html.cr Outdated
string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match|
if code = match[1]?
HTML::SINGLE_CHAR_ENTITIES[code]? ||
HTML::DOUBLE_CHAR_ENTITIES[code]? ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to seperate these hashes? They're :nodoc: anyway so why not make it 1 hash and 1 hash lookup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it occupies more size in memory. Chars are just four bytes, while a string is minimum 9 bytes plus an indirection. Go implements it like that too and I think it's fine. In any case, the slowness comes from the regex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be faster and more space efficient to encode the double char entities with an unused unicode section and use the section index minus offset to retrieve the actual strings from a tuple or array instead of two hash lookups.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please send a PR with optimizations later. This is just about getting it right at a reasonable performance.

src/html.cr Outdated
else
"&#x#{$1};"
end
string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should remove the first ;? and make it just ; since it'll never match inside HTML::*_ENTITIES without the ; anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are entities without ';'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like for example &amp

Copy link
Contributor

@RX14 RX14 Sep 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologise for attempting to apply logic to HTML. I simply saw the start of the hash and assumed they all terminated with ;.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a secret: I had it implemented like you asked me too until I realized &amp and a few others existed :-P

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex won't work for an entity that is not delimited by ; but immediately followed by an ascii letter. For example &ampd should return &d.
This is diffiecult to handle properly. It requires either a custom parse tree or multiple checks if one of those undelimited character sequences is matched. There are 106 of them ranging from 3 to 7 charachters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@straight-shoota
Copy link
Member

I'd suggest to rename this method to HTML.decode as detailed in #3409 (issue comment).

src/html.cr Outdated
else
"&#x#{$1};"
end
string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex won't work for an entity that is not delimited by ; but immediately followed by an ascii letter. For example &ampd should return &d.
This is diffiecult to handle properly. It requires either a custom parse tree or multiple checks if one of those undelimited character sequences is matched. There are 106 of them ranging from 3 to 7 charachters.

@asterite asterite merged commit ee0b844 into crystal-lang:master Sep 29, 2017
str.should eq(" ⊐̸ ")
end

it "unescapes &ampd" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: &ampd -> &amp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a typo. At least not in that sense. It was copied from my comment.
If at all, it should be changed to unescapes &amphello.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants