Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #3107 #3226

Closed
wants to merge 1 commit into from
Closed

Fixes #3107 #3226

wants to merge 1 commit into from

Conversation

amedeiros
Copy link

No description provided.

def gsub(hash : Hash(String, _))
string = self
hash.each do |key, value|
string = string.gsub(key, value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that iterating the string for every replacement is not the best performing solution. You probably want to find the last character of every key, and place them in an array, then iterate the string checking against each character. If a match is found perform strcmp calls to determine if any of the full strings match.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RX14 👍 I just left a note at the same time. I was not excited about this change it did not seem very performant to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, String#gsub addition is not related in any way to the scope of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a sufficiently small diff that it could get merged as 1 PR with 2 commits.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious to see what the difference between the additional String#gsub(Hash(String, _)) versus the already existing String#gsub(Hash(Char, _))

require "benchmark"

string  = "Apples " * 10000
hash    = { "p" => 'a', "l" => 's' }
inverse = hash.invert

Benchmark.bm do |x|
  x.report("Gsub with Hash String => Char 2000 times") do
    2000.times do
      string.gsub(hash)
    end
  end

  x.report("Gsub with Hash Char => String 2000 times") do
    2000.times do
      string.gsub(inverse)
    end
  end
end

I have run this four times and so far my implementation seems faster on my MacBook. Keep in mind I am new to crystal maybe there is a better way to do this benchmark.

                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.400000   0.270000   1.670000 (  1.499639)
Gsub with Hash Char => String 2000 times   1.620000   0.160000   1.780000 (  1.669893)
                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.390000   0.270000   1.660000 (  1.482408)
Gsub with Hash Char => String 2000 times   1.610000   0.150000   1.760000 (  1.669062)
                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.380000   0.270000   1.650000 (  1.471932)
Gsub with Hash Char => String 2000 times   1.620000   0.150000   1.770000 (  1.677077)
                                               user     system      total        real
Gsub with Hash String => Char 2000 times   1.390000   0.270000   1.660000 (  1.482940)
Gsub with Hash Char => String 2000 times   1.610000   0.150000   1.760000 (  1.668073)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not as slow as I would have thought:

require "benchmark"

string  = "Apples Oranges " * 100
hash    = { "p" => 'a', "l" => 's', "es" => 'c', "Oranges" => 'f' }

inverse = hash.invert
string_inverse = string.gsub(hash)

class String
  def gsub(hash : Hash(String, _))
    string = self
    hash.each do |key, value|
      string = string.gsub(key, value)
    end

    string
  end
end

Benchmark.ips do |x|
  x.report("Hash(String => Char)") do
    string.gsub(hash)
  end

  x.report("Hash(Char => String)") do
    string_inverse.gsub(inverse)
  end
end
$ ./test
Hash(String => Char)  37.59k (± 0.59%)  1.82× slower
Hash(Char => String)   68.4k (± 0.86%)       fastest

With single char replacements:

$ ./test
Hash(String => Char)  40.93k (± 0.45%)  1.45× slower
Hash(Char => String)  59.46k (± 2.19%)       fastest

I reduced the string length, and used Benchmark.ips because it has a much nicer output. it's a bit slower but not by as much as I would have thought.

@amedeiros
Copy link
Author

I am not sure how I feel about the String addition seems like there could be a better way to do that.

@asterite
Copy link
Member

asterite commented Sep 1, 2016

@amedeiros Thank you! But this is missing encoded entities like < (where 60 could be replaced by other numbers, like ‑). So a Hash won't just be enough.

@oprypin
Copy link
Member

oprypin commented Sep 1, 2016

It also misses hundreds of other named entities.

Basically, this implementation can undo Crystal's escape method, but not generally unescape HTML.

@amedeiros
Copy link
Author

Ok. I will go back to it when I get a little time.

@amedeiros
Copy link
Author

Does this also mean that the escape method that is in the standard library is not complete either?

@asterite
Copy link
Member

asterite commented Sep 2, 2016

No.

The thing is that to escape HTML some entities must absolutely be escaped, for example < and ", so they are not confused with HTML tags. However, all characters can be encoded with &#hexa;, but this is not necessary.

For the inverse operation, all encoded entities must be decoded back. You can't leave a &lt; unencoded, or an &#8209; unencoded. This is an asymmetric operation.

For example:

$ irb
reqirb(main):001:0> require "cgi"
=> true
irb(main):002:0> CGI.escapeHTML "<hello world>"
=> "&lt;hello world&gt;"
irb(main):003:0> unescaped = CGI.unescapeHTML "&lt;&#104;ello world&gt;"
=> "<hello world>"
irb(main):004:0> CGI.escapeHTML unescaped
=> "&lt;hello world&gt;"

Note that we unescaped "&lt;&#104;ello world&gt;" and Ruby correctly turned it into "<hello world>", but when escaping it again we didn't get that &#104; in there, because it's not needed (but someone might use it).

@amedeiros
Copy link
Author

Awesome thank you for the explanation. I was just looking over the Ruby implementation which is indeed much more involved. I can close this pull request and make a new branch start over so it's clean if that works best.

@amedeiros amedeiros closed this Sep 2, 2016
@kostya
Copy link
Contributor

kostya commented Sep 2, 2016

may be translate it from ruby https://github.com/threedaymonk/htmlentities as shard

@amedeiros amedeiros deleted the feature/#3107_html_unescape branch September 29, 2016 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants