Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HTML.escape(string) in terms of HTML.escape(string, io) #4674

Closed
wants to merge 1 commit into from

Conversation

chastell
Copy link
Contributor

@chastell chastell commented Jul 4, 2017

While working on my PolyConf slides 😉 I discovered that HTML.escape(string) is implemeted in terms of String#gsub, rather than in terms of HTML.escape(string, io). The below benchmark (with string examples straight from this wonderful source) suggests implementing HTML.escape(string) in terms of HTML.escape(string, io) is significantly faster while not sacrificing much readability.

require "benchmark"
require "html"

strings = {
  no_html_short: "A paragraph without HTML characters that need to be escaped.",
  no_html_long: "Another paragraph without characters that need to be escaped. This paragraph is a bit longer, as sometimes there can be large paragraphs that don't any special characters, e.g., in novels or whatever.",
  html_short: "Here->An <<example>> of rust codefn foo(u: &u32) -> &u32 {u}",
  html_long: "A somewhat longer paragraph containing a character that needs to be escaped, because e.g. the author mentions the movie Fast&Furious in it. This paragraph is also quite long to match the non-html one.",
}

strings.each do |name, string|
  Benchmark.ips do |bench|
    bench.report("HTML.escape_io #{name}") do
      io = IO::Memory.new
      HTML.escape(string, io)
      io.to_s
    end
    bench.report("HTML.escape #{name}") do
      HTML.escape(string)
    end
  end
end

On current master (d24f79c), realigned for readability:

HTML.escape_io no_html_short   1.34M (743.83ns) (± 2.78%)       fastest
   HTML.escape no_html_short   1.02M (976.64ns) (± 3.55%)  1.31× slower
HTML.escape_io no_html_long  373.33k (  2.68µs) (± 1.70%)       fastest
   HTML.escape no_html_long  312.29k (   3.2µs) (± 1.73%)  1.20× slower
HTML.escape_io html_short      1.26M (793.38ns) (± 3.16%)       fastest
   HTML.escape html_short      1.03M (970.21ns) (± 2.64%)  1.22× slower
HTML.escape_io html_long     369.11k (  2.71µs) (± 2.09%)       fastest
   HTML.escape html_long     322.86k (   3.1µs) (± 1.50%)  1.14× slower

I’m not 100% sure what are the ramifications of choosing IO::Memory#to_s over String#gsub’s solution based on String.build(bytesize) (I’d guess potentially using up to twice the memory?), though, so do feel free to correct me on the usefulness of this, of course. ❤️

@straight-shoota
Copy link
Member

The overhead seems to come from using String::Builder instead of IO::Memory which seems odd since it's purpose is to build strings...

@straight-shoota
Copy link
Member

I've taken a look at the link where the samples come from... are these benchmarks seriously showing that in Rust a regex is faster than a plain loop with simple replacements??

@akzhan
Copy link
Contributor

akzhan commented Jul 5, 2017

@straight-shoota big improvement by byte vs char replacement. good catch.

@chastell
Copy link
Contributor Author

chastell commented Jul 5, 2017

@straight-shoota

The overhead seems to come from using String::Builder instead of IO::Memory

It might also come from the longer callstack of HTML.escape(string)String#gsub(hash)String#gsub(&block)String.build(capacity)String::Builder.build(capacity).

It might be also that in the particular case of replacing HTML (single-byte) special characters with (multibyte) entities the String.build(bytesize) almost always allocates too few bytes upfront (but then IO::Memory doesn’t preallocate anything…).

are these benchmarks seriously showing that in Rust a regex is faster than a plain loop with simple replacements??

Yup. 🤷

@sdogruyol
Copy link
Member

I'm really curious to learn how come Rust is faster in Regex @chastell ?

@RX14
Copy link
Contributor

RX14 commented Jul 5, 2017

@sdogruyol See the linked reddit thread here.

@sdogruyol
Copy link
Member

thank you @RX14 👍

@chastell
Copy link
Contributor Author

chastell commented Jul 6, 2017

Sooo… anything else I should do with this PR? 😉

(I’ll happily look into String::Builder performance separately, this is super interesting!)

@@ -30,7 +30,9 @@ module HTML
# HTML.escape("Crystal & You") # => "Crystal &amp; You"
# ```
def self.escape(string : String) : String
string.gsub(SUBSTITUTIONS)
io = IO::Memory.new
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better (more idiomatic) to use String.build { |io| escape(string, io) } here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IO::Memory is somehow faster as the OP shows...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.build is the idiomatic way to build strings. If alternatives are faster, then String.build should be improved (global overall benefit), instead of using alternatives in stdlib (duplications).

@asterite
Copy link
Member

I'm getting different results:

HTML.escape_io no_html_short 780.35k (  1.28µs) (± 8.47%)  1.39× slower
   HTML.escape no_html_short   1.09M (920.18ns) (± 5.61%)       fastest
HTML.escape_io no_html_long 252.86k (  3.95µs) (± 9.09%)  1.16× slower
   HTML.escape no_html_long 293.18k (  3.41µs) (± 3.37%)       fastest
HTML.escape_io html_short 711.76k (   1.4µs) (± 8.08%)  1.19× slower
   HTML.escape html_short 846.82k (  1.18µs) (± 5.36%)       fastest
HTML.escape_io html_long 271.04k (  3.69µs) (± 1.76%)  1.03× slower
   HTML.escape html_long 279.72k (  3.58µs) (± 8.31%)       fastest

String::Builder can be a bit fast (it's missing overriding write_byte for performance), but I still see it faster than IO::Memory...

This is all on a MacBook Pro.

@RX14
Copy link
Contributor

RX14 commented Aug 18, 2017

Same here (linux, i5-3570k):

HTML.escape_io no_html_short   1.45M (691.44ns) (± 2.51%)  1.19× slower
   HTML.escape no_html_short   1.72M (580.67ns) (± 2.84%)       fastest
 HTML.escape_io no_html_long 409.16k (  2.44µs) (± 2.04%)  1.13× slower
    HTML.escape no_html_long 461.91k (  2.16µs) (± 4.16%)       fastest
   HTML.escape_io html_short    1.3M (770.96ns) (± 2.82%)  1.18× slower
      HTML.escape html_short   1.53M (652.86ns) (± 2.79%)       fastest
    HTML.escape_io html_long 419.95k (  2.38µs) (± 1.94%)  1.08× slower
       HTML.escape html_long  452.9k (  2.21µs) (± 3.69%)       fastest

@asterite
Copy link
Member

Closing this because the PR doesn't seem to optimize the code (at least for me and @RX14 it results in slower code). It's also longer code.

@asterite asterite closed this Sep 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants