Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Rabin-Karp algorithm for String#rindex #3876

Merged

Conversation

makenowjust
Copy link
Contributor

Reverse version of this.

Benchmark is here.

My result is:

$ crystal run --release rindex_bench.cr
For long string:
       rindex_old 258.31  (  3.87ms) (± 6.69%) 11.04× slower
rindex_rabin_karp   2.85k ( 350.8µs) (± 6.77%)       fastest
For short string:
       short_rindex_old   1.92M (520.84ns) (± 6.08%)  2.66× slower
short_rindex_rabin_karp   5.11M (195.87ns) (± 8.64%)       fastest

(I recommend you run this benchmark on your environment ;)

@drhuffman12
Copy link

Is it faster for short and long strings?

@makenowjust
Copy link
Contributor Author

Old implementation is to scan string from head to end, and short string benchmark has no match but long string benchmark has match in middle of string, so long string benchmark is better score. Generally, the longer string length is, the better score.

@asterite
Copy link
Member

Rabin-Karp is also faster for shorter strings. The old algorithm was very naive, for every char in the string check if there's a match from there on, so O(n*m), and Rabin-Karp will almost always be faster because of it's rolling hash algorithm.

Note that I knew the old algorithm was slow, it just wasn't a priority to improve it. There are many parts in the std that could use an optimization review.

@makenowjust By the way, there's also String#byte_index that currently has a very slow algorithm but could use Rabin-Karp (and it's easier, because it's byte index, not char index)

@makenowjust
Copy link
Contributor Author

@asterite I have known it. I'll work tomorrow. Good night...

@drhuffman12
Copy link

Awesome! Crystal is already quite fast; but I won't complain if it gets even faster! :)

Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the spec that was mysteriously removed, if you can squash the format tool commit to the parent one we will be fine to merge this as is.

Thanks for the performance addition and sending clear benchmarks with them 👍

@@ -653,7 +653,8 @@ describe "String" do

describe "with offset" do
assert { "foo baro baz".rindex("o b", 6).should eq(2) }
assert { "foo baro baz".rindex("fg").should be_nil }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this spec was removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is this in "with offset" section, but it has no offset option and it exists already in another line, so I thought it is just copy-and-paste mistaking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, you are right.

@bcardiff bcardiff self-assigned this Jan 12, 2017
@MaxLap
Copy link
Contributor

MaxLap commented Jan 13, 2017

I posted a possible improvement to the algorithm in the previous (for index) change #3873. I think it should be reviewed before merging this copy of it.

@makenowjust makenowjust force-pushed the feature/string/rindex-rabin-karp branch from 52dbfe9 to e460317 Compare January 13, 2017 00:58
@makenowjust
Copy link
Contributor Author

@MaxLap Your idea is awesome, but I think we cannot apply it to String#rindex.

@makenowjust makenowjust force-pushed the feature/string/rindex-rabin-karp branch from e460317 to 85a6044 Compare January 13, 2017 02:03
@makenowjust
Copy link
Contributor Author

@bcardiff Rebased

@MaxLap
Copy link
Contributor

MaxLap commented Jan 13, 2017

Ah, i see what you mean. A check could still be made to do another step unless the current byte is a valid starting one, but there may be less to gain here. You could still try it if you want.

if (byte & 0x80) == 0 || (byte & 0x40) != 0 is the same as if (byte & 0xC0) != 0x80. But maybe the compiler handles that one itself.

@makenowjust makenowjust force-pushed the feature/string/rindex-rabin-karp branch 2 times, most recently from e61369e to 991c6bd Compare January 13, 2017 09:18
@sdogruyol
Copy link
Member

Keep up the great work people 🎉 😍

@makenowjust makenowjust force-pushed the feature/string/rindex-rabin-karp branch from 991c6bd to dac6c94 Compare January 13, 2017 09:27
@makenowjust
Copy link
Contributor Author

@MaxLap Thanks. I applied some optimizations.

@bcardiff bcardiff merged commit 88e5cf2 into crystal-lang:master Jan 13, 2017
@bcardiff
Copy link
Member

Thanks @makenowjust and @MaxLap ! ⚡️ ⚡️

@bcardiff bcardiff added this to the Next milestone Jan 13, 2017
@makenowjust makenowjust deleted the feature/string/rindex-rabin-karp branch January 13, 2017 20:55
@kostya
Copy link
Contributor

kostya commented Jan 22, 2017

little changed behavior after this:

0.20.4
crystal eval 'p "".includes?("")'
false
0.20.5
crystal eval 'p "".includes?("")'
true

not sure what is really correct.

@MaxLap
Copy link
Contributor

MaxLap commented Jan 22, 2017

Really? for me, on 0.20.5, I still get true O_o

@kostya
Copy link
Contributor

kostya commented Jan 22, 2017

ops i miss versions, fixed
so, true, i think correct, ok

@MaxLap
Copy link
Contributor

MaxLap commented Jan 22, 2017

True would be the same result as "".include?("") in ruby.

It may be interesting to add this edge case to the specs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants