-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Let HTML.unescape unescape all HTML5 entities #5055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/html.cr
Outdated
string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match| | ||
if code = match[1]? | ||
HTML::SINGLE_CHAR_ENTITIES[code]? || | ||
HTML::DOUBLE_CHAR_ENTITIES[code]? || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to seperate these hashes? They're :nodoc:
anyway so why not make it 1 hash and 1 hash lookup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it occupies more size in memory. Chars are just four bytes, while a string is minimum 9 bytes plus an indirection. Go implements it like that too and I think it's fine. In any case, the slowness comes from the regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be faster and more space efficient to encode the double char entities with an unused unicode section and use the section index minus offset to retrieve the actual strings from a tuple or array instead of two hash lookups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please send a PR with optimizations later. This is just about getting it right at a reasonable performance.
src/html.cr
Outdated
else | ||
"&#x#{$1};" | ||
end | ||
string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should remove the first ;?
and make it just ;
since it'll never match inside HTML::*_ENTITIES
without the ;
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are entities without ';'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like for example &
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologise for attempting to apply logic to HTML. I simply saw the start of the hash and assumed they all terminated with ;
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a secret: I had it implemented like you asked me too until I realized & and a few others existed :-P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex won't work for an entity that is not delimited by ;
but immediately followed by an ascii letter. For example &d
should return &d
.
This is diffiecult to handle properly. It requires either a custom parse tree or multiple checks if one of those undelimited character sequences is matched. There are 106 of them ranging from 3 to 7 charachters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
I'd suggest to rename this method to |
src/html.cr
Outdated
else | ||
"&#x#{$1};" | ||
end | ||
string.gsub(/&(?:([a-zA-Z]+;?)|\#([0-9]+);?|\#[xX]([0-9A-Fa-f]+);?)/) do |string, match| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex won't work for an entity that is not delimited by ;
but immediately followed by an ascii letter. For example &d
should return &d
.
This is diffiecult to handle properly. It requires either a custom parse tree or multiple checks if one of those undelimited character sequences is matched. There are 106 of them ranging from 3 to 7 charachters.
str.should eq(" ⊐̸ ") | ||
end | ||
|
||
it "unescapes &d" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: &d
-> &
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a typo. At least not in that sense. It was copied from my comment.
If at all, it should be changed to unescapes &hello
.
Both Go and Python unescape all HTML5 named entities. Let's do the same.
I also improved a bit the code to use less regexes. Ideally, though, we shouldn't use regexes at all, but I'm super lazy now to optimize this. We can always optimize it later. For example Go doesn't use regex and it's about 10 times faster.
Closes #3409