Fixes #3107 #3226

amedeiros · 2016-09-01T15:05:19Z

No description provided.

RX14 · 2016-09-01T15:16:29Z

src/string.cr

+  def gsub(hash : Hash(String, _))
+    string = self
+    hash.each do |key, value|
+      string = string.gsub(key, value)


I think that iterating the string for every replacement is not the best performing solution. You probably want to find the last character of every key, and place them in an array, then iterate the string checking against each character. If a match is found perform strcmp calls to determine if any of the full strings match.

@RX14 👍 I just left a note at the same time. I was not excited about this change; it did not seem very performant to me.

Also, the String#gsub addition is not related in any way to the scope of this PR.

It's a sufficiently small diff that it could get merged as 1 PR with 2 commits.

I was curious to see the difference between the added String#gsub(Hash(String, _)) and the already existing String#gsub(Hash(Char, _)):

require "benchmark"

string = "Apples " * 10000
hash = { "p" => 'a', "l" => 's' }
inverse = hash.invert

Benchmark.bm do |x|
  x.report("Gsub with Hash String => Char 2000 times") do
    2000.times do
      string.gsub(hash)
    end
  end

  x.report("Gsub with Hash Char => String 2000 times") do
    2000.times do
      string.gsub(inverse)
    end
  end
end

I have run this four times, and so far my implementation seems faster on my MacBook. Keep in mind I am new to Crystal, so maybe there is a better way to do this benchmark.

                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.400000   0.270000   1.670000 (  1.499639)
Gsub with Hash Char => String 2000 times   1.620000   0.160000   1.780000 (  1.669893)

                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.390000   0.270000   1.660000 (  1.482408)
Gsub with Hash Char => String 2000 times   1.610000   0.150000   1.760000 (  1.669062)

                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.380000   0.270000   1.650000 (  1.471932)
Gsub with Hash Char => String 2000 times   1.620000   0.150000   1.770000 (  1.677077)

                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.390000   0.270000   1.660000 (  1.482940)
Gsub with Hash Char => String 2000 times   1.610000   0.150000   1.760000 (  1.668073)

Definitely not as slow as I would have thought:

require "benchmark"

string = "Apples Oranges " * 100
hash = { "p" => 'a', "l" => 's', "es" => 'c', "Oranges" => 'f' }
inverse = hash.invert
string_inverse = string.gsub(hash)

class String
  def gsub(hash : Hash(String, _))
    string = self
    hash.each do |key, value|
      string = string.gsub(key, value)
    end
    string
  end
end

Benchmark.ips do |x|
  x.report("Hash(String => Char)") do
    string.gsub(hash)
  end

  x.report("Hash(Char => String)") do
    string_inverse.gsub(inverse)
  end
end

$ ./test
Hash(String => Char)  37.59k (± 0.59%)  1.82× slower
Hash(Char => String)   68.4k (± 0.86%)       fastest

With single char replacements:

$ ./test
Hash(String => Char)  40.93k (± 0.45%)  1.45× slower
Hash(Char => String)  59.46k (± 2.19%)       fastest

I reduced the string length and used Benchmark.ips because it has much nicer output. It's a bit slower, but not by as much as I would have thought.
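The per-key approach in the diff and the single-pass idea raised in the review differ in behavior as well as in speed. The following is a minimal sketch in Ruby (the language of the irb session later in this thread, not the PR's Crystal). It does not implement the last-character scan described above; it swaps in a regex alternation built with Regexp.union as a simpler way to get one pass. The method names gsub_sequential and gsub_single_pass are hypothetical and exist only for illustration.

```ruby
# Sequential approach, mirroring the diff: one full pass over the string per key.
# Later replacements can rewrite the output of earlier ones.
def gsub_sequential(source, replacements)
  replacements.reduce(source) { |acc, (key, value)| acc.gsub(key, value.to_s) }
end

# Single-pass alternative (assumption: regex alternation, not the
# last-character scan from the review comment).
def gsub_single_pass(source, replacements)
  # Try longer keys first so "es" wins over "e" at the same position.
  keys = replacements.keys.sort_by { |key| -key.length }
  pattern = Regexp.union(keys)
  source.gsub(pattern) { |match| replacements[match].to_s }
end

replacements = { "p" => "a", "a" => "s" }
puts gsub_sequential("pal", replacements)   # prints "ssl": the "a" produced from "p" is replaced again
puts gsub_single_pass("pal", replacements)  # prints "asl": each position is replaced at most once
```

The difference in output matters for anything like unescaping, where the result of one replacement must not be treated as input to the next.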

amedeiros · 2016-09-01T15:16:49Z

I am not sure how I feel about the String addition; it seems like there could be a better way to do that.

asterite · 2016-09-01T18:40:04Z

@amedeiros Thank you! But this is missing encoded entities like &#60; (where 60 could be replaced by other numbers, like &#8209;). So a Hash alone won't be enough.

oprypin · 2016-09-01T21:44:58Z

It also misses hundreds of other named entities.

Basically, this implementation can undo Crystal's escape method, but not generally unescape HTML.

amedeiros · 2016-09-02T00:49:38Z

Ok. I will go back to it when I get a little time.

amedeiros · 2016-09-02T00:50:59Z

Does this also mean that the escape method that is in the standard library is not complete either?

asterite · 2016-09-02T01:01:09Z

No.

The thing is that to escape HTML some entities must absolutely be escaped, for example < and ", so they are not confused with HTML tags. However, all characters can be encoded with &#hexa;, but this is not necessary.

For the inverse operation, all encoded entities must be decoded back. You can't leave a &lt; undecoded, or a &#8209; undecoded. This is an asymmetric operation.

For example:

$ irb
irb(main):001:0> require "cgi"
=> true
irb(main):002:0> CGI.escapeHTML "<hello world>"
=> "&lt;hello world&gt;"
irb(main):003:0> unescaped = CGI.unescapeHTML "&lt;&#104;ello world&gt;"
=> "<hello world>"
irb(main):004:0> CGI.escapeHTML unescaped
=> "&lt;hello world&gt;"

Note that we unescaped "&lt;&#104;ello world&gt;" and Ruby correctly turned it into "<hello world>", but when escaping it again we didn't get that &#104; in there, because it's not needed (but someone might use it).
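To make the "a Hash is not enough" point concrete, here is a minimal Ruby sketch of a decoder that handles decimal and hexadecimal numeric references in addition to a small named table. It is an illustration under stated assumptions, not Ruby's CGI implementation and not the code that was eventually merged for Crystal. The names NAMED_ENTITIES and unescape_html_sketch are hypothetical, and the named table is deliberately tiny, so it still misses the hundreds of other named entities mentioned earlier in the thread.

```ruby
# Hypothetical, deliberately incomplete table of named entities.
NAMED_ENTITIES = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => "\"", "apos" => "'"
}.freeze

# Decodes &#NNN; (decimal), &#xHHH; (hex) and the few names listed above.
# Unknown names are left untouched. Code points outside the Unicode range
# are not validated here and would raise RangeError in Array#pack.
def unescape_html_sketch(text)
  text.gsub(/&(?:#(\d+)|#[xX](\h+)|(\w+));/) do |entity|
    decimal, hex, name = $1, $2, $3
    if decimal
      [decimal.to_i].pack("U")
    elsif hex
      [hex.to_i(16)].pack("U")
    else
      NAMED_ENTITIES.fetch(name, entity)
    end
  end
end

puts unescape_html_sketch("&lt;&#104;ello world&gt;")  # prints <hello world>
```

The numeric branches are what a plain lookup table cannot express: the set of valid numbers is far too large to enumerate as keys.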

amedeiros · 2016-09-02T01:09:39Z

Awesome, thank you for the explanation. I was just looking over the Ruby implementation, which is indeed much more involved. I can close this pull request and start over on a new branch so it's clean, if that works best.

kostya · 2016-09-02T16:11:32Z

Maybe translate it from Ruby (https://github.com/threedaymonk/htmlentities) as a shard.

Fixes #3107

f274adb

RX14 reviewed Sep 1, 2016
amedeiros closed this Sep 2, 2016

amedeiros deleted the feature/#3107_html_unescape branch September 29, 2016 23:21

dukex mentioned this pull request Oct 3, 2016

Add HTML.unescape [Closes #3107] #3374

Merged


amedeiros commented Sep 1, 2016

RX14 Sep 1, 2016

amedeiros Sep 1, 2016

Sija Sep 1, 2016

RX14 Sep 1, 2016

amedeiros Sep 1, 2016

RX14 Sep 1, 2016

amedeiros commented Sep 1, 2016

asterite commented Sep 1, 2016

oprypin commented Sep 1, 2016

amedeiros commented Sep 2, 2016

amedeiros commented Sep 2, 2016

asterite commented Sep 2, 2016

amedeiros commented Sep 2, 2016

kostya commented Sep 2, 2016
