Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support to many html entities in HTML.unescape #3409

Closed

Conversation

dukex
Copy link
Contributor

@dukex dukex commented Oct 11, 2016

Hi again,
Like I said in my PR #3374, I'm creating this PR to fix the nbsp and to add entities described in w3.org/TR/html401/sgml/entities.html to HTML.unescape.

Someone have a better idea to organize this entities?

/cc @chaniks @RX14

@dukex dukex changed the title add support to many html entities add support to many html entities in HTML.unescape Oct 11, 2016
@dukex dukex force-pushed the add-support-to-all-html-entities branch from 968a24e to 00b8c93 Compare October 11, 2016 10:44
@kostya
Copy link
Contributor

kostya commented Oct 11, 2016

i think this is little crazy in std, maybe shard?

@asterite
Copy link
Member

I also think it's a bit too much, not even Ruby does this replacement :-)

when "nbsp" then " "
when "apos" then "'"
when "nbsp" then '\u{A0}'
when "iexcl" then '\u{A1}'
Copy link
Contributor

@kostya kostya Oct 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just interesting, what would be faster, case or hash lookup here? seems case should be slower, because of strcmp?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know Java optimises this by using a case on the strings hash, then strcmp to prevent hash collisions. Its a simple compiler optimisation step that we might want to persue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case I'd put this hash in another file, load it at runtime once and do a hash lookup to simplify the code.

@RX14
Copy link
Contributor

RX14 commented Oct 11, 2016

I don't think that this is too much for the standard library. Its a standardized, verifiably correct, verifiably complete set of replacements specific to html. It shouldn't require much, if any maintenance to the list, as I doubt html will be adding any more of these now Unicode is prevalent.

@kostya
Copy link
Contributor

kostya commented Oct 11, 2016

but may be then implement html5 spec? which allowed entities without ; on the end.
https://html.spec.whatwg.org/multipage/syntax.html#named-character-references

@RX14
Copy link
Contributor

RX14 commented Oct 11, 2016

Personaly, I would rewrite this without the slow regex.

@dukex dukex force-pushed the add-support-to-all-html-entities branch from 00b8c93 to 6fc7cce Compare October 12, 2016 02:06
@dukex
Copy link
Contributor Author

dukex commented Oct 12, 2016

Hi guys, thank you for your comments, I update my patch, please review again

@dukex
Copy link
Contributor Author

dukex commented Oct 12, 2016

@RX14 Any idea how avoid the regex?

@@ -57,6 +53,7 @@ module HTML
else
"&#x#{$1};"
end
when /\w{2,8}/ then HTML::ENTITIES[match]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it should be HTML::ENTITIES[match]? || "&#{match};"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 Thank you, I write a test for this case

@dukex dukex force-pushed the add-support-to-all-html-entities branch from 6fc7cce to 5e42a83 Compare October 12, 2016 03:51
@chaniks
Copy link

chaniks commented Oct 12, 2016

@dukex To avoid regex:

  • You can use a pointer and move it until the first & char, and if the next char is # then try digit conversion, otherwise try entities.
  • Of course, you can use String#each_char, Char::Reader, String.build(), and etc., to make your life easier.

However, I believe the regex is not so bad for the starting point. As long as the code stays simple and readable, we can always tune the performance easily. But once the code is rewritten for the performance, it usually takes more effort to add features—and sometimes people give up touching it.

I can't say which is better. Those companies I worked with usually had problems with less-readable codes, not slow codes. But this is an open-source project and also a std library for the language. I can't just apply the same rule.

Of course, if you can achieve both, that might be the best. 😄

@straight-shoota
Copy link
Member

straight-shoota commented Jun 28, 2017

HTML.unescape should do exactly the reverse of HTML.escape with both (un-)escaping HTML syntax characters. As illustrated in #4555 this should include only < > & " ' and their escapes.

A conversion of arbitrary HTML entities to unescaped characters is a totally different application and a complete implementation is probably to extensive and specific to be included in the standard library.

Besides, it can already be accomplished through the XML parser:

XML.parse_html("&lt;s&gt;&yacute;&nbsp;&ndash;").content # => "<s>ý –"

@akzhan
Copy link
Contributor

akzhan commented Jun 28, 2017

Both #3374 and #3409 are wrong. I'm sure that #3374 must be rolled back.

@straight-shoota
Copy link
Member

straight-shoota commented Jun 29, 2017

Okay, I have been looking at implementations of HTML unescape in other languages. And it seems that most of them are actually resolving all (or almost all) HTML entities: Ruby CGI.unescape_html, Python html.unescape, Go html.UnescapeString. Others do do not even have such a method (Phoenix, Rack).
In PHP there are htmlspecialchars_decode as an inverse of htmlspecialchars and html_entity_decode which decodes all entities (inverse of htmlentities).

I guess this makes some sense and HTML.unescape(HTML.escape(string)) should be guaranteed to equal string nevertheless. But perhaps it would be better to have a different name to make clear, that escape and unescape are not inverse but perform different substitutions.

Still, I would question if this method should be implemented in the standard lib anyway. There are not many use cases I can think of, where you would need to unescape HTML entities outside a HTML parser. In fact, at least in repos on Github it seems this method is used mostly in examples. As I have shown above, decoding of HTML entities can be easily accomplished through the XML parser.

@da99
Copy link

da99 commented Sep 23, 2017

Note: HTML.escape and XML.parse_html do not fully parse the following into < brackets. @kostya's myhtml shard and modern browsers do parse all of it

    bracket = "
      < &lt &lt; &LT &LT; &#60 &#060 &#0060
      &#00060 &#000060 &#0000060 &#60; &#060; &#0060; &#00060;
      &#000060; &#0000060; &#x3c &#x03c &#x003c &#x0003c &#x00003c
      &#x000003c &#x3c; &#x03c; &#x003c; &#x0003c; &#x00003c;
      &#x000003c; &#X3c &#X03c &#X003c &#X0003c &#X00003c &#X000003c
      &#X3c; &#X03c; &#X003c; &#X0003c; &#X00003c; &#X000003c;
      &#x3C &#x03C &#x003C &#x0003C &#x00003C &#x000003C &#x3C; &#x03C;
      &#x003C; &#x0003C; &#x00003C; &#x000003C; &#X3C &#X03C
      &#X003C &#X0003C &#X00003C &#X000003C &#X3C; &#X03C; &#X003C; &#X0003C;
      &#X00003C; &#X000003C; \x3c \x3C \u003c \u003C
    "

require "html"
require "xml"
require "myhtml"

puts HTML.unescape(bracket)
puts XML.parse_html(bracket).content
puts Myhtml.decode_html_entities(bracket)

screenshot from 2017-09-22 23-09-56

@akzhan
Copy link
Contributor

akzhan commented Sep 23, 2017

Triage: I think that this pull request must be closed without any merge.

@RX14
Copy link
Contributor

RX14 commented Sep 24, 2017

Why?

@straight-shoota
Copy link
Member

straight-shoota commented Sep 26, 2017

@da99 As it was already mentioned above, HTML entity expansion is quite a complex feature to implement.
The difference between libxml (XML.parse_html) and myhtml is that the latter seems to work in quirks mode accepting &LT as an entitity, while per HTML standard it is not.
So as a summary: HTML.unsescpae is incomplete, XML.parse_html is standardHTML4 compliant, myhtml is fault-tolerant similarly to modern browsersHTML5 compliant.

@asterite
Copy link
Member

I don't think HTML entity expansion is difficult to implement. We just need to have a table of all possible values. Both Python and Go implement this.

@kostya
Copy link
Contributor

kostya commented Sep 26, 2017

@straight-shoota &LT is entity in html5, see: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references

@straight-shoota
Copy link
Member

Thanks @kostya! I didn't know that. I've updated my comment above.

Okay, so for HTML entities, there is a list from W3C of currently 2331 entities which could (maybe even automatically) be inserted into a crystal source file.

The complexity comes from the different entities between standards (HTML 4.0.1, XHTML, HTML5 or even XML?). If we decide to support only HTML5 (as Go seems to do, though it is not documented), the process becomes much more straightforward.

I'd like to point out that unescaping is IMHO incorrectly phrased because (un)escaping only applies to HTML special characters < > & " '. Converting named entities into unicode codepoints means decoding them.
It seems, many libraries call it unescape, but decode would be more precise. There are however example like PHP's html_entity_decode, html-entities for Node or htmlentities for Ruby
This naming would ensure that HTML.escape is something entirely different and it would allow to also implement HTML.encode which converts unicode characters to HTML entities.

@RX14
Copy link
Contributor

RX14 commented Sep 26, 2017

👍 on the naming, it makes great sense to me. Not sure how useful encode would be though, so perhaps just escape and decode.

@asterite
Copy link
Member

I can't see why we need unescape and decode. Just unescape should be enough. I'll send a PR soon.

The reason is that you want HTML.escape so that you can safely insert text inside some HTML document. The use case of unescape is: I fetch a webpage and I want to show parts of it (maybe some paragraphs?) to the user. The user would like to see unicode characters instead of escape codes. Just unescaping a few characters but leaving all those &aacute; things makes little sense.

@dukex dukex deleted the add-support-to-all-html-entities branch October 2, 2017 22:32
@dukex
Copy link
Contributor Author

dukex commented Oct 2, 2017

Thanks @asterite

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants