Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csv.rb uses postfix exceptions when converters are specified #1816

Closed
headius opened this issue Jul 15, 2014 · 18 comments
Closed

csv.rb uses postfix exceptions when converters are specified #1816

headius opened this issue Jul 15, 2014 · 18 comments

Comments

@headius
Copy link
Member

headius commented Jul 15, 2014

See code in csv.rb, in CSV#convert_fields and the lambda-based converters around line 947.

Simple benchmark to show the devastating effect:

system ~/projects/jruby $ jruby -rbenchmark -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10_000; loop { puts Benchmark.measure { CSV.parse(data, converters: :integer) } }'
 19.900000   0.230000  20.130000 ( 12.954000)
 12.390000   0.070000  12.460000 ( 11.322000)
^C
system ~/projects/jruby $ rvm ruby-2.1.1 do ruby -rbenchmark -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10_000; loop { puts Benchmark.measure { CSV.parse(data, converters: :integer) } }'
  0.820000   0.000000   0.820000 (  0.828674)
  0.820000   0.000000   0.820000 (  0.828970)

Without converters, we perform about like MRI.

@headius headius added this to the JRuby 9000 milestone Jul 15, 2014
@headius
Copy link
Member Author

headius commented Jul 15, 2014

Link to the converters, where the postfix rescues live: https://github.com/jruby/jruby/blob/master/lib/ruby/2.1/csv.rb#L947

Link to the convert_fields method: https://github.com/jruby/jruby/blob/master/lib/ruby/2.1/csv.rb#L2165

Note that if working on this, JRuby master is currently a little unstable...better to use JRuby 1.7.x

@enebo
Copy link
Member

enebo commented Jul 15, 2014

Admittedly, a JRuby peformance bug, but we also do not invoke lambdas the same speed as ordinary methods; So if anyone also wants to change converters to simple classes with a convert method it might be interesting to see the perf difference from that activity too (but postfix exception is the bigger target obviously).

@headius
Copy link
Member Author

headius commented Jul 15, 2014

Copying @JEG2. Modifying the Integer converter to use to_i and a nil check (as opposed to the postfix rescue) is around 25% faster in MRI too.

@JEG2
Copy link

JEG2 commented Jul 15, 2014

But Integer() does more than a nil check, right?

>> Integer("bug")
ArgumentError: invalid value for Integer(): "bug"
    from (irb):1:in `Integer'
    from (irb):1
    from /Users/jeg2/.rubies/ruby-2.1.2/bin/irb:11:in `<main>'

@headius
Copy link
Member Author

headius commented Jul 15, 2014

@JEG2 I believe @tenderlove and I came up with a reasonable no-exception "maybe convert" logic for Psych, but I don't recall exactly where it was. I believe it involved some simple regexp matches to filter out strings we know won't parse.

f.to_i || f is almost reasonable, but it will allow through strings that only start with a digit.

@JEG2
Copy link

JEG2 commented Jul 15, 2014

So do we need to change this in CSV to make JRuby perform better?

@tenderlove
Copy link
Contributor

Looks like changing CSV makes MRI faster:

diff --git a/lib/csv.rb b/lib/csv.rb
index 60b22e7..af9c918 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -945,7 +945,13 @@ class CSV
   # can be nested with other combo fields.
   #
   Converters  = { integer:   lambda { |f|
-                    Integer(f.encode(ConverterEncoding)) rescue f
+                    begin
+                      num = f.encode(ConverterEncoding)
+                      return f unless num =~ /^[-+]?\d+$/
+                      Integer(num)
+                    rescue
+                      f
+                    end
                   },
                   float:     lambda { |f|
                     Float(f.encode(ConverterEncoding)) rescue f

Before:

[aaron@higgins ruby (trunk)]$ ruby -Ilib -rbenchmark -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10_000; loop { puts Benchmark.measure { CSV.parse(data, converters: :integer) } }'
  1.830000   0.020000   1.850000 (  1.845980)
  1.840000   0.010000   1.850000 (  1.860188)
  1.830000   0.000000   1.830000 (  1.849151)
  1.820000   0.010000   1.830000 (  1.831595)

After:

[aaron@higgins ruby (trunk)]$ ruby -Ilib -rbenchmark -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10_000; loop { puts Benchmark.measure { CSV.parse(data, converters: :integer) } }'
  1.460000   0.010000   1.470000 (  1.473279)
  1.440000   0.010000   1.450000 (  1.466242)
  1.460000   0.010000   1.470000 (  1.470897)
  1.500000   0.000000   1.500000 (  1.506282)
  1.460000   0.010000   1.470000 (  1.476430)

Should make all rubies faster AFAICT.

@headius
Copy link
Member Author

headius commented Jul 15, 2014

@JEG2 @tenderlove Yes, as mentioned in #1816 (comment) fixing it makes MRI faster too. About 25% for MRI, and about 100x faster for JRuby. Seems like a win.

@headius
Copy link
Member Author

headius commented Jul 15, 2014

Ideally the other converters would not be so prone to throw-away exceptions too.

@JEG2
Copy link

JEG2 commented Jul 15, 2014

I'll be honest and say that I'm less inclined to view this as a win.

Integer() has the desired behavior here and trying to reimplement it sounds error prone to me. For example, @tenderlove's Regexp misses this input: "tricky\n42". I think we can fix that by using /\A[+-]\d+\z/ instead, but I'm worried about how deep this rabbit hole goes.

However, if we need this, I guess we need it. I'm happy to apply a patch if someone wants to try and work out the equivalent code.

@JEG2
Copy link

JEG2 commented Jul 15, 2014

Just in case I wasn't very clear there, I feel we're trading simpler, easier to maintain code for faster code with this change. I've never felt the pain of slow CSV conversions, so it's hard for me to justify doing that. I will though, if needed.

@tenderlove
Copy link
Contributor

For me, it's two wins: speed and easier to use debugging output. Check out the output with ruby -d:

Before the patch:

[aaron@higgins ruby (trunk)]$ ruby -Ilib -d -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10; CSV.parse(data, converters: :integer)'
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1194 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1203 - cannot load such file -- rubygems/defaults/ruby
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"

After:

[aaron@higgins ruby (trunk)]$ ruby -Ilib -d -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10; CSV.parse(data, converters: :integer)'
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1194 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1203 - cannot load such file -- rubygems/defaults/ruby
[aaron@higgins ruby (trunk)]$

@headius
Copy link
Member Author

headius commented Jul 15, 2014

@tenderlove You win for best example of why rescue nil here is a problem. I always end up fixating on perf, but that output is completely unusable.

@headius
Copy link
Member Author

headius commented Jul 15, 2014

@JEG2 The problem here is that Integer does something else we don't want: it raises an exception if it can't handle the input. Your use of rescue f is basically hacking around Integer behavior we don't want, rather than implementing the behavior we do want. And rescue f is well-known to be a really bad antipattern for Ruby...so bad matz wishes he hadn't added it.

If Integer returned false or nil when it can't parse, your logic would be simplest of all: Integer(f) || f. Unfortunately that's not the case, and so we have to hack around the decision to raise a (heavy on all impls...some more than others) throw-away exception for every field that fails to convert.

Perhaps there's a feature request possible here, like a form of Integer that takes an option, or a more robust String#to_i. That's not in scope for me :-)

@JEG2
Copy link

JEG2 commented Jul 15, 2014

As I said, I'll apply a patch. 😄

@headius
Copy link
Member Author

headius commented Jul 15, 2014

@JEG2 Great, thank you very much!

I'll leave this open until we can import the modified version from MRI.

@enebo enebo modified the milestone: JRuby 9.0.0.0 Jul 14, 2015
@headius headius added this to the JRuby 1.7.23 milestone Sep 25, 2015
@headius
Copy link
Member Author

headius commented Sep 25, 2015

Marking to fix for 1.7.23 (and 9.0.2.0).

@enebo enebo modified the milestones: JRuby 1.7.23, JRuby 1.7.24 Nov 24, 2015
@enebo enebo modified the milestones: JRuby 1.7.24, JRuby 1.7.25 Jan 20, 2016
@enebo enebo modified the milestones: JRuby 1.7.25, JRuby 1.7.24 Jan 20, 2016
@enebo
Copy link
Member

enebo commented Apr 11, 2016

I am closing this. We did indirectly address performance issue itself on 9k already without applying the better error reporting logic which was mentioned above. I think the train for improving general performance on 1.7 should be spent on 9k at this point. It's dead...

@enebo enebo closed this as completed Apr 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants