add support to many html entities in HTML.unescape #3409

dukex · 2016-10-11T04:12:15Z

Hi again,
Like I said in my PR #3374, I'm creating this PR to fix the nbsp and to add entities described in w3.org/TR/html401/sgml/entities.html to HTML.unescape.

Someone have a better idea to organize this entities?

/cc @chaniks @RX14

kostya · 2016-10-11T11:08:37Z

i think this is little crazy in std, maybe shard?

asterite · 2016-10-11T14:07:31Z

I also think it's a bit too much, not even Ruby does this replacement :-)

kostya · 2016-10-11T14:13:45Z

src/html.cr

-      when "nbsp" then " "
+      when "apos"     then "'"
+      when "nbsp"     then '\u{A0}'
+      when "iexcl"    then '\u{A1}'


just interesting, what would be faster, case or hash lookup here? seems case should be slower, because of strcmp?

I know Java optimises this by using a case on the strings hash, then strcmp to prevent hash collisions. Its a simple compiler optimisation step that we might want to persue.

In any case I'd put this hash in another file, load it at runtime once and do a hash lookup to simplify the code.

RX14 · 2016-10-11T15:25:37Z

I don't think that this is too much for the standard library. Its a standardized, verifiably correct, verifiably complete set of replacements specific to html. It shouldn't require much, if any maintenance to the list, as I doubt html will be adding any more of these now Unicode is prevalent.

kostya · 2016-10-11T15:27:17Z

but may be then implement html5 spec? which allowed entities without ; on the end.
https://html.spec.whatwg.org/multipage/syntax.html#named-character-references

RX14 · 2016-10-11T15:39:04Z

Personaly, I would rewrite this without the slow regex.

dukex · 2016-10-12T02:07:54Z

Hi guys, thank you for your comments, I update my patch, please review again

dukex · 2016-10-12T02:08:33Z

@RX14 Any idea how avoid the regex?

kostya · 2016-10-12T02:13:26Z

src/html.cr

@@ -57,6 +53,7 @@ module HTML
        else
          "&#x#{$1};"
        end
+      when /\w{2,8}/ then HTML::ENTITIES[match]


i think it should be HTML::ENTITIES[match]? || "&#{match};"

😮 Thank you, I write a test for this case

chaniks · 2016-10-12T13:28:01Z

@dukex To avoid regex:

You can use a pointer and move it until the first & char, and if the next char is # then try digit conversion, otherwise try entities.
Of course, you can use String#each_char, Char::Reader, String.build(), and etc., to make your life easier.

However, I believe the regex is not so bad for the starting point. As long as the code stays simple and readable, we can always tune the performance easily. But once the code is rewritten for the performance, it usually takes more effort to add features—and sometimes people give up touching it.

I can't say which is better. Those companies I worked with usually had problems with less-readable codes, not slow codes. But this is an open-source project and also a std library for the language. I can't just apply the same rule.

Of course, if you can achieve both, that might be the best. 😄

straight-shoota · 2017-06-28T18:48:14Z

HTML.unescape should do exactly the reverse of HTML.escape with both (un-)escaping HTML syntax characters. As illustrated in #4555 this should include only < > & " ' and their escapes.

A conversion of arbitrary HTML entities to unescaped characters is a totally different application and a complete implementation is probably to extensive and specific to be included in the standard library.

Besides, it can already be accomplished through the XML parser:

XML.parse_html("&lt;s&gt;&yacute;&nbsp;&ndash;").content # => "<s>ý –"

akzhan · 2017-06-28T19:06:09Z

Both #3374 and #3409 are wrong. I'm sure that #3374 must be rolled back.

straight-shoota · 2017-06-29T15:54:28Z

Okay, I have been looking at implementations of HTML unescape in other languages. And it seems that most of them are actually resolving all (or almost all) HTML entities: Ruby CGI.unescape_html, Python html.unescape, Go html.UnescapeString. Others do do not even have such a method (Phoenix, Rack).
In PHP there are htmlspecialchars_decode as an inverse of htmlspecialchars and html_entity_decode which decodes all entities (inverse of htmlentities).

I guess this makes some sense and HTML.unescape(HTML.escape(string)) should be guaranteed to equal string nevertheless. But perhaps it would be better to have a different name to make clear, that escape and unescape are not inverse but perform different substitutions.

Still, I would question if this method should be implemented in the standard lib anyway. There are not many use cases I can think of, where you would need to unescape HTML entities outside a HTML parser. In fact, at least in repos on Github it seems this method is used mostly in examples. As I have shown above, decoding of HTML entities can be easily accomplished through the XML parser.

da99 · 2017-09-23T04:03:38Z

Note: HTML.escape and XML.parse_html do not fully parse the following into < brackets. @kostya's myhtml shard and modern browsers do parse all of it

    bracket = "
      < &lt &lt; &LT &LT; &#60 &#060 &#0060
      &#00060 &#000060 &#0000060 &#60; &#060; &#0060; &#00060;
      &#000060; &#0000060; &#x3c &#x03c &#x003c &#x0003c &#x00003c
      &#x000003c &#x3c; &#x03c; &#x003c; &#x0003c; &#x00003c;
      &#x000003c; &#X3c &#X03c &#X003c &#X0003c &#X00003c &#X000003c
      &#X3c; &#X03c; &#X003c; &#X0003c; &#X00003c; &#X000003c;
      &#x3C &#x03C &#x003C &#x0003C &#x00003C &#x000003C &#x3C; &#x03C;
      &#x003C; &#x0003C; &#x00003C; &#x000003C; &#X3C &#X03C
      &#X003C &#X0003C &#X00003C &#X000003C &#X3C; &#X03C; &#X003C; &#X0003C;
      &#X00003C; &#X000003C; \x3c \x3C \u003c \u003C
    "

require "html"
require "xml"
require "myhtml"

puts HTML.unescape(bracket)
puts XML.parse_html(bracket).content
puts Myhtml.decode_html_entities(bracket)

akzhan · 2017-09-23T22:16:18Z

Triage: I think that this pull request must be closed without any merge.

RX14 · 2017-09-24T11:13:30Z

Why?

straight-shoota · 2017-09-26T15:22:21Z

@da99 As it was already mentioned above, HTML entity expansion is quite a complex feature to implement.
The difference between libxml (XML.parse_html) and myhtml is that the latter seems to work in quirks mode accepting &LT as an entitity, while per HTML standard it is not.
So as a summary: HTML.unsescpae is incomplete, XML.parse_html is ~~standard~~HTML4 compliant, myhtml is ~~fault-tolerant similarly to modern browsers~~HTML5 compliant.

asterite · 2017-09-26T15:30:09Z

I don't think HTML entity expansion is difficult to implement. We just need to have a table of all possible values. Both Python and Go implement this.

kostya · 2017-09-26T15:34:32Z

@straight-shoota &LT is entity in html5, see: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references

straight-shoota · 2017-09-26T16:38:11Z

Thanks @kostya! I didn't know that. I've updated my comment above.

Okay, so for HTML entities, there is a list from W3C of currently 2331 entities which could (maybe even automatically) be inserted into a crystal source file.

The complexity comes from the different entities between standards (HTML 4.0.1, XHTML, HTML5 or even XML?). If we decide to support only HTML5 (as Go seems to do, though it is not documented), the process becomes much more straightforward.

I'd like to point out that unescaping is IMHO incorrectly phrased because (un)escaping only applies to HTML special characters < > & " '. Converting named entities into unicode codepoints means decoding them.
It seems, many libraries call it unescape, but decode would be more precise. There are however example like PHP's html_entity_decode, html-entities for Node or htmlentities for Ruby
This naming would ensure that HTML.escape is something entirely different and it would allow to also implement HTML.encode which converts unicode characters to HTML entities.

RX14 · 2017-09-26T21:43:09Z

👍 on the naming, it makes great sense to me. Not sure how useful encode would be though, so perhaps just escape and decode.

asterite · 2017-09-28T20:17:09Z

I can't see why we need unescape and decode. Just unescape should be enough. I'll send a PR soon.

The reason is that you want HTML.escape so that you can safely insert text inside some HTML document. The use case of unescape is: I fetch a webpage and I want to show parts of it (maybe some paragraphs?) to the user. The user would like to see unicode characters instead of escape codes. Just unescaping a few characters but leaving all those á things makes little sense.

dukex · 2017-10-02T22:34:38Z

Thanks @asterite

dukex changed the title ~~add support to many html entities~~ add support to many html entities in HTML.unescape Oct 11, 2016

dukex force-pushed the add-support-to-all-html-entities branch from 968a24e to 00b8c93 Compare October 11, 2016 10:44

kostya reviewed Oct 11, 2016

View reviewed changes

dukex force-pushed the add-support-to-all-html-entities branch from 00b8c93 to 6fc7cce Compare October 12, 2016 02:06

kostya reviewed Oct 12, 2016

View reviewed changes

add support to many html entities

5e42a83

dukex force-pushed the add-support-to-all-html-entities branch from 6fc7cce to 5e42a83 Compare October 12, 2016 03:51

asterite mentioned this pull request Sep 28, 2017

Let HTML.unescape unescape all HTML5 entities #5055

Merged

asterite closed this in #5055 Sep 29, 2017

dukex deleted the add-support-to-all-html-entities branch October 2, 2017 22:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add support to many html entities in HTML.unescape #3409

add support to many html entities in HTML.unescape #3409

dukex commented Oct 11, 2016 •

edited

Loading

kostya commented Oct 11, 2016

asterite commented Oct 11, 2016

kostya Oct 11, 2016 •

edited

Loading

RX14 Oct 11, 2016

asterite Oct 11, 2016

RX14 commented Oct 11, 2016 •

edited

Loading

kostya commented Oct 11, 2016 •

edited

Loading

RX14 commented Oct 11, 2016

dukex commented Oct 12, 2016

dukex commented Oct 12, 2016

kostya Oct 12, 2016

dukex Oct 12, 2016

chaniks commented Oct 12, 2016 •

edited

Loading

straight-shoota commented Jun 28, 2017 •

edited

Loading

akzhan commented Jun 28, 2017

straight-shoota commented Jun 29, 2017 •

edited

Loading

da99 commented Sep 23, 2017 •

edited

Loading

akzhan commented Sep 23, 2017

RX14 commented Sep 24, 2017

straight-shoota commented Sep 26, 2017 •

edited

Loading

asterite commented Sep 26, 2017

kostya commented Sep 26, 2017

straight-shoota commented Sep 26, 2017

RX14 commented Sep 26, 2017

asterite commented Sep 28, 2017

dukex commented Oct 2, 2017

add support to many html entities in HTML.unescape #3409

add support to many html entities in HTML.unescape #3409

Conversation

dukex commented Oct 11, 2016 • edited Loading

kostya commented Oct 11, 2016

asterite commented Oct 11, 2016

kostya Oct 11, 2016 • edited Loading

Choose a reason for hiding this comment

RX14 Oct 11, 2016

Choose a reason for hiding this comment

asterite Oct 11, 2016

Choose a reason for hiding this comment

RX14 commented Oct 11, 2016 • edited Loading

kostya commented Oct 11, 2016 • edited Loading

RX14 commented Oct 11, 2016

dukex commented Oct 12, 2016

dukex commented Oct 12, 2016

kostya Oct 12, 2016

Choose a reason for hiding this comment

dukex Oct 12, 2016

Choose a reason for hiding this comment

chaniks commented Oct 12, 2016 • edited Loading

straight-shoota commented Jun 28, 2017 • edited Loading

akzhan commented Jun 28, 2017

straight-shoota commented Jun 29, 2017 • edited Loading

da99 commented Sep 23, 2017 • edited Loading

akzhan commented Sep 23, 2017

RX14 commented Sep 24, 2017

straight-shoota commented Sep 26, 2017 • edited Loading

asterite commented Sep 26, 2017

kostya commented Sep 26, 2017

straight-shoota commented Sep 26, 2017

RX14 commented Sep 26, 2017

asterite commented Sep 28, 2017

dukex commented Oct 2, 2017

dukex commented Oct 11, 2016 •

edited

Loading

kostya Oct 11, 2016 •

edited

Loading

RX14 commented Oct 11, 2016 •

edited

Loading

kostya commented Oct 11, 2016 •

edited

Loading

chaniks commented Oct 12, 2016 •

edited

Loading

straight-shoota commented Jun 28, 2017 •

edited

Loading

straight-shoota commented Jun 29, 2017 •

edited

Loading

da99 commented Sep 23, 2017 •

edited

Loading

straight-shoota commented Sep 26, 2017 •

edited

Loading