
Refactor String#char_bytesize_at #5501

Closed

Conversation

@chastell (Contributor) commented Jan 1, 2018

I tried to wrap my mind around how String#char_bytesize_at works and ended up refactoring it to this version to facilitate understanding. I used the UTF-8 bit patterns to make it clearer what’s happening in this method.

Note: I’m not sure why the current implementation uses the 0x90, 0xc2 and 0xf4 values for comparisons, so my assumption that this method counts the bytesize of a UTF-8 sequence might be wrong.
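For readers less familiar with the encoding: the leading byte of a UTF-8 sequence announces the sequence's total length in its high bits. A minimal Python sketch of that bit-pattern dispatch (illustrative only; the PR itself is Crystal, and the helper name is made up):

```python
def char_bytesize_at(data, index):
    """Bytesize of the UTF-8 sequence whose leading byte is at index.

    Illustrative sketch of the bit-pattern approach, not the actual
    Crystal implementation; returns 1 for invalid leading bytes.
    """
    byte = data[index]
    if byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: ASCII, 1 byte
        return 1
    if byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    return 1  # continuation byte or invalid leading byte
```

Note that this classifies only the leading byte; it deliberately ignores the overlong-encoding and continuation-byte subtleties discussed below.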

@chastell force-pushed the refactor_String#char_bytesize_at branch from 24442fc to cceb19b on January 1, 2018 at 14:42
@straight-shoota (Member)

The code looks clearer, but this method is used pretty often and therefore needs to be as performant as possible. Can you provide benchmarks to compare the implementations?

@chastell (Contributor Author) commented Jan 1, 2018

Right, I actually had this thought right after opening this PR. 🤦‍♀️

A quick bench to the tune of

string = String.new(Bytes[104, 101, 108, 108, 111, 32, 255, 32, 255, 32, 119, 111, 114, 108, 100, 33])

Benchmark.ips do |bench|
  bench.report("old") { (0...16).map { |index| string.old(index) } }
  bench.report("new") { (0...16).map { |index| string.new(index) } }
end

shows a 6% slowdown:

old   8.85M (112.95ns) (± 3.25%)       fastest
new   8.37M (119.48ns) (± 2.30%)  1.06× slower

I’ll work on this PR to make it faster, but in the meantime: how can I best create a plausible set of representative strings for benching?

@chastell (Contributor Author) commented Jan 1, 2018

Interestingly, on a random string of bytes the new implementation is actually significantly faster, which strongly suggests I’m missing something here:

string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { string.size.times { |index| string.old(index) } }
  bench.report("new") { string.size.times { |index| string.new(index) } }
end
old 345.98M (  2.89ns) (± 3.16%)  1.40× slower
new 484.79M (  2.06ns) (± 2.10%)       fastest

old  349.6M (  2.86ns) (± 2.75%)  1.38× slower
new 481.84M (  2.08ns) (± 2.84%)       fastest

old  347.7M (  2.88ns) (± 3.09%)  1.38× slower
new  479.3M (  2.09ns) (± 4.00%)       fastest

🤔

@straight-shoota (Member)

The individual benchmark measurements are so small (only nanoseconds!) that there won't be any reliable results. You could wrap the individual examples in 1000.times { ... } to get a more measurable number.
And it would be great to have a benchmark covering consecutive calls through each_byte_index_and_char_index, like String#size (when the size isn't already cached).

Did you look at other implementations? No need to reinvent every wheel. Char::Reader#decode_char_at is also very similar, although it works a bit differently. I guess it makes sense to keep both implemented differently, but improvements to the algorithm could be added there as well.

@chastell (Contributor Author) commented Jan 1, 2018

Ah, yes, the first String#size call in my benchmark precomputes the size. 🤦‍♂️ 🤦‍♂️ 🤦‍♂️

When I removed it, a random 256-byte string takes exactly the same amount of time:

string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { 256.times { |index| string.old(index) } }
  bench.report("new") { 256.times { |index| string.new(index) } }
end
old 552.94M (  1.81ns) (± 2.21%)  1.00× slower
new 554.01M (  1.81ns) (± 1.99%)       fastest

old 555.95M (   1.8ns) (± 2.39%)       fastest
new 555.41M (   1.8ns) (± 2.35%)  1.00× slower

old 562.24M (  1.78ns) (± 2.58%)  1.00× slower
new  562.4M (  1.78ns) (± 2.33%)       fastest

The individual benchmark measurements are so small (only nanoseconds!) that there won't be any reliable results. You could wrap the individual examples in 1000.times { ... } to get a more measurable number.

Hmmm, I thought this was exactly what Benchmark.ips does, no? It even shows the standard deviation, so we know how reliable the results are.

I’ll look at your other pointers, thanks so much!

@chastell (Contributor Author) commented Jan 1, 2018

OK, I think the culprit here might be that the whole block is compiled out as a no-op. 🤦‍♂️

(This is a very instructive start to the new year, thanks for bearing with me…)

@chastell (Contributor Author) commented Jan 1, 2018

OK, just for posterity:

string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { (0...256).map { |index| string.old(index) } }
  bench.report("new") { (0...256).map { |index| string.new(index) } }
end
old  933.0k (  1.07µs) (± 2.65%)  1.16× slower
new   1.08M (927.91ns) (± 6.22%)       fastest

old 938.74k (  1.07µs) (± 2.42%)       fastest
new 908.57k (   1.1µs) (± 2.02%)  1.03× slower

old 906.63k (   1.1µs) (± 2.65%)       fastest
new 883.05k (  1.13µs) (± 2.42%)  1.03× slower

old  934.1k (  1.07µs) (± 2.34%)       fastest
new 922.33k (  1.08µs) (± 2.30%)  1.01× slower

old 914.57k (  1.09µs) (± 2.56%)  1.17× slower
new   1.07M (931.93ns) (± 5.93%)       fastest

Not bad. 😉

I’m still not sure why the current implementation uses byte values like 0x90, 0xc2 and 0xf4, though. 🤷‍♂️

@straight-shoota (Member) commented Jan 1, 2018

Benchmark.ips runs each example multiple times, but each run is measured on its own, and these measurements are used to calculate the average and deviation. When each measurement spans too little time (say, less than a few microseconds) the results will not be reliable.

Unfortunately, the results from #size are not useful because it is cached and the cache is already populated in the constructor. You could add a non-caching implementation of the size method to properly test it.

@RX14 (Contributor) commented Jan 1, 2018

@straight-shoota this is absolutely not true. Not every iteration of the benchmark is timed separately. First the benchmark warms up and calculates approximately how fast each iteration is. It then uses its approximation to work out roughly how many times the benchmark needs to run to reach 100ms, then runs the benchmark in 100ms blocks. That means only one Time measurement every 100ms. There is no need to use 256.times in Benchmark.ips. Please look at the source code before stating "facts".

@straight-shoota (Member)

Oh sorry, I mixed that up. Yes, this is only true for Benchmark.bm.

I guess I'm doing too many things simultaneously... no real multithreading support.

@RX14 (Contributor) commented Jan 1, 2018

Oh, and 1.8ns is way too short for 256 iterations of anything. You can't fit more than 10 clock cycles in 1.8ns even at 4GHz. It's obviously being optimized out.
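The cycle arithmetic checks out; a quick back-of-the-envelope calculation (values taken from the comment above):

```python
# At an assumed 4 GHz clock, how many cycles fit into the measured
# 1.8 ns per "iteration"? Far too few for 256 loop iterations of
# real work, so the loop body must have been optimized out.
clock_hz = 4e9
per_iteration_s = 1.8e-9
cycles = per_iteration_s * clock_hz
print(round(cycles, 1))  # 7.2
```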

@straight-shoota (Member) commented Jan 1, 2018

@RX14 The 1.8 ns is just reading @length

@RX14 (Contributor) commented Jan 1, 2018

@straight-shoota how?

@RX14 (Contributor) commented Jan 1, 2018

We need a JMH-style blackhole class to get accurate benchmark results.

@chastell (Contributor Author) commented Jan 1, 2018

OK, I reimplemented it for speed and the results for a random 10000-byte string are quite promising:

old  19.92k ( 50.19µs) (± 1.97%)  1.01× slower
new   20.1k ( 49.75µs) (± 1.57%)       fastest

old  20.25k ( 49.38µs) (± 1.05%)       fastest
new  20.14k ( 49.65µs) (± 1.38%)  1.01× slower

old  20.18k ( 49.56µs) (± 0.69%)  1.00× slower
new  20.25k ( 49.39µs) (± 0.97%)       fastest

Thanks so much for the pointer to Char::Reader#decode_char_at, that explains the shape of the current implementation! I’m still not sure why 0xc1 is a valid byte sequence there, though. 🤷‍♂️

@straight-shoota (Member) commented Jan 1, 2018

I'd assume 0xC2 might just have been a typo or some other mistake (off by one). It's very unlikely to cause any issues, which is why it was not noticed.

I think you can safely skip the 0b0_0000000 <= byte comparison, that's the minimum for UInt8 anyway. But I like the subsequent_utf8_byte_at? method, it makes the code easier to read and shouldn't add any cost.

@asterite (Member) commented Jan 1, 2018

The values are because of how UTF-8 works. I can't explain because I'm on the cellphone, but you can find it on Wikipedia. I find the old code easier to read and understand, shorter, faster to compile and faster for LLVM to optimize (I know this because I know how the compiler works). Please don't change this method.

As a side note, we need a general benchmark suite for the whole standard library which we can easily run against this kind of PR. The benchmarks should also show the number of allocations, so ips should be improved in that regard. I'll open an issue about this later.

@chastell (Contributor Author) commented Jan 1, 2018

I think you can safely skip the 0b0_0000000 <= byte comparison, that's the minimum for UInt8 anyway.

Right, I assumed the compiler would compile it out anyway, and it made the code nicely symmetric.

The values are because of how UTF8 works. I can't explain because I'm on the cellphone, but you can find it on Wikipedia.

I even linked Wikipedia from this PR. 😉

I gave it some thought and I assume the first < 0xc2 check is because U+007F is 0x7f and U+0080 is 0xc2 0x80 (i.e., the first byte is never 0xc0 nor 0xc1, even though these values match the 110..... pattern). I assume it’s the same for 0x90 (the continuation pattern is 10......, but in practice only the last continuation byte can have a value lower than 0x90).

@asterite Am I right in the above? If so, can I add the relevant comments to the original source (instead of this refactoring)? The values seemed a bit arbitrary on my first read of the method (even when I more-or-less know how UTF-8 works).
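This reasoning can be checked against a strict decoder. A small Python demonstration (using CPython's built-in UTF-8 decoder as the reference implementation):

```python
# U+0080 is the first codepoint that needs two bytes; its canonical
# encoding starts with 0xc2, so a 0xc0 or 0xc1 leading byte could only
# ever produce an overlong encoding, which strict decoders reject.
assert "\u0080".encode("utf-8") == b"\xc2\x80"

try:
    b"\xc0\x80".decode("utf-8")  # overlong encoding of U+0000
except UnicodeDecodeError as error:
    print(error.reason)
```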

@asterite (Member) commented Jan 2, 2018

Oh, I read it too quickly to note the wikipedia link. Sorry.

I gave it some thought and I assume the first < 0xc2 check is because U+007F is 0x7f and U+0080 is 0xc2 0x80 (i.e., the first byte is never 0xc0 nor 0xc1, even though these values match the 110..... pattern)

Now I can't remember why this is true (but it is). So yes, adding comments, and maybe even using 0b11000010 instead of 0xc2, and even linking to Wikipedia in the implementation comment or explaining how UTF-8 is encoded right there (just a reminder) are all good things to do. Implementation-wise, as long as it doesn't make performance worse, it's probably good for me.

@chastell (Contributor Author) commented Jan 2, 2018

Excellent, thanks!

I assumed what this method returns is ‘the size of the UTF-8 byte sequence starting at byte_index or 1 if it’s not a valid UTF-8 sequence’.

But for the byte sequence 11110_000 10_001111 10_001111 (f0 8f 8f) the current implementation will return 3:

    if first == 0xf0 && second < 0x90
      return 3
    end

To the best of my reading this byte sequence is not valid, as three-byte sequences should have a 1110xxxx leading byte, while a 11110xxx leading byte (above) indicates a four-byte sequence. (In particular, 11110_000 10_001111 10_001111 looks like an overlong sequence trying to encode the 1111001111 codepoint, which should be encoded as 110_01111 10_001111.)

Similarly, the current implementation will return 4 even when the fourth byte is invalid, as it doesn’t validate the fourth byte at all.

Let me know if I misunderstood what the method should return on invalid sequences or whether there is room for improvement.
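A strict decoder agrees with this reading; for example, CPython's UTF-8 decoder rejects the sequence outright:

```python
# f0 8f 8f: a four-byte leading byte (11110xxx) followed by a second
# byte below 0x90, so any completion would be overlong. A strict
# decoder rejects it instead of treating it as a 3-byte sequence.
try:
    bytes([0xF0, 0x8F, 0x8F]).decode("utf-8")
except UnicodeDecodeError as error:
    print(error.reason)
```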

@chastell (Contributor Author) commented Jan 2, 2018

The current implementation will return 4 even for a three-byte sequence:

String.new(Bytes[0b11110_001, 0b10_000000, 0b10_000000]).char_bytesize_at(0)
#=> 4
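For comparison, a strict decoder treats this as truncated input rather than a complete 4-byte character; a quick Python check:

```python
# f1 80 80 is only a prefix of a four-byte sequence (the fourth byte
# is missing entirely), yet char_bytesize_at reports 4. CPython's
# strict decoder flags the data as cut short instead.
try:
    bytes([0b1111_0001, 0b10_000000, 0b10_000000]).decode("utf-8")
except UnicodeDecodeError as error:
    print(error.reason)  # unexpected end of data
```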

@asterite (Member) commented Jan 2, 2018

The method alone doesn't matter, it matters in the context it's used in. It returns the number of bytes occupied by a UTF-8 codepoint starting from the given index, regardless of whether it ends up being well-formed in the end.

@chastell (Contributor Author) commented Jan 6, 2018

It returns the number of bytes occupied by a UTF-8 codepoint starting from the given index, regardless of whether it ends up being well-formed in the end.

So why does it check whether the second and third (but not fourth) bytes are well-formed? If this doesn’t care whether the UTF-8 codepoint is well-formed (nor whether there even are four bytes in a four-byte codepoint), it could check just the first byte and be done. 🤔

@asterite (Member) commented Jan 6, 2018

Sorry, I said it wrong. It counts the number of UTF-8 codepoints in total. If there's an invalid sequence, it counts it as the maximum number of bytes that could form it (until it becomes invalid).

Sorry for not putting too much effort into this issue, but this is something that's working fine and doesn't need a refactor or improvement (at least none that's apparent). I'm trying to focus on fixing bugs and getting new stuff done.

@asterite (Member) commented Jan 6, 2018

(that is, unless you find a bug in this method that you can reproduce with a snippet of code, then we can discuss how to fix it)

@chastell (Contributor Author) commented Jan 8, 2018

It counts the number of UTF-8 codepoints in total. If there's an invalid sequence, it counts it as the maximum number of bytes that could form it (until it becomes invalid).

Hm, reading the method:

    if first < 0x80
      return 1
    end

• a valid one-byte sequence returns 1

    if first < 0xc2
      return 1
    end

• this is either a non-UTF-8 byte or a continuation byte on the first position, so it’s an invalid sequence, returns 1

    second = unsafe_byte_at(byte_index + 1)
    if (second & 0xc0) != 0x80
      return 1
    end

• any sequence (potentially UTF-8 or not) with an invalid second byte returns 1

    if first < 0xe0
      return 2
    end

• a valid two-byte sequence returns 2

    third = unsafe_byte_at(byte_index + 2)
    if (third & 0xc0) != 0x80
      return 2
    end

• a three-or-more-byte-suggesting leading byte + a valid second byte + an invalid third byte returns 2, even when the leading byte is between 0xf5 and 0xff, which it cannot possibly be for a valid UTF-8 sequence 🤔

    if first < 0xf0
      return 3
    end

• a valid three-byte sequence returns 3

    if first == 0xf0 && second < 0x90
      return 3
    end

• a sequence with exactly 1111_0000 (four-byte-suggesting) leading byte and a second byte between 1000_0000 and 1000_1111 and a third byte being a continuation byte returns 3 🤔

    if first == 0xf4 && second >= 0x90
      return 3
    end

• a sequence with exactly 1111_0100 (four-byte-suggesting) leading byte and a second byte between 1001_0000 and 1011_1111 and a third byte being a continuation byte returns 3 🤔

    return 4

• any other sequence, including three-byte sequences not covered above, and sequences with an invalid leading byte between 0xf5 and 0xff (but with valid second and third continuation bytes), returns 4 🤔

I’m puzzled by the cases denoted by 🤔 , because they do feel like errors to me. In particular, I would expect any sequence starting with 0xf5 to 0xff (invalid as UTF-8 leading bytes) to return 1, and the first == 0xf0 and first == 0xf4 checks look strange (why are those leading byte values singled out?).
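To make these branches easy to poke at, here is a direct Python transcription of the quoted Crystal code (for experimentation only; the function name mirrors the original, but this is not the actual implementation):

```python
def char_bytesize_at(data, i):
    """Python transcription of the Crystal branches quoted above."""
    first = data[i]
    if first < 0x80:
        return 1
    if first < 0xC2:
        return 1
    second = data[i + 1]
    if second & 0xC0 != 0x80:
        return 1
    if first < 0xE0:
        return 2
    third = data[i + 2]
    if third & 0xC0 != 0x80:
        return 2
    if first < 0xF0:
        return 3
    if first == 0xF0 and second < 0x90:
        return 3
    if first == 0xF4 and second >= 0x90:
        return 3
    return 4

# The puzzling cases: an invalid 0xf5 leading byte with valid
# continuation bytes yields 4, and the overlong f0 8f 8f prefix yields 3.
print(char_bytesize_at(bytes([0xF5, 0x80, 0x80, 0x80]), 0))  # 4
print(char_bytesize_at(bytes([0xF0, 0x8F, 0x8F]), 0))        # 3
```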

I'm trying to focus on fixing bugs and getting new stuff done.

I totally understand this, and I feel really bad pulling you away from other stuff. I feel like I’ve already wasted a lot of your time on this. 😢

I’m just trying to figure out how this method is supposed to work, because it looks like the checks are not very consistent in what they’re supposed to accept and what they’re supposed to reject. I’m perfectly fine if your answer is ‘I still think the above looks correct, I don’t have the time to explain why’ – it just looks to me like the method is not very consistent and thus potentially buggy.

@asterite (Member) commented Jan 8, 2018

Yes, it seems there are some cases not considered by the algorithm, probably also not covered by specs. We should also check IO#read_char and Char::Reader#decode_char_at.

@asterite (Member) commented Jan 8, 2018

Awesome explanation, by the way; you made it super clear that there's something wrong. I think we need to check that the first byte is valid for 2-, 3- and 4-byte sequences before checking the other bytes.

@straight-shoota (Member) left a review comment

The case for this change was explained in detail and it should be accepted.
But it needs a rebase against master. @chastell are you still up for it?

@RX14 (Contributor) left a review comment

This is really readable, nice!

@straight-shoota added the pr:needs-work label (A PR requires modifications by the author.) on Feb 19, 2020
@straight-shoota (Member) commented Mar 5, 2022

I rebased this PR on current master (the signature of char_bytesize_at had changed in #9713) and tried to polish it up.

While the proposed implementation is certainly a lot more readable, and apparently more performant as well, it is not accurate. It manages to recognize the general format of UTF-8 sequences, but it fails to account for malformed data. For example, some codepoints can be represented by different, technically equivalent byte sequences. Only the most minimal encoding is considered well-formed, though; all others are to be rejected. Then there's a range of invalid codepoints between U+D800 and U+DFFF (surrogate halves) which are also to be rejected. And the maximum codepoint value U+10FFFF is lower than the number of values that could technically be represented in a 4-byte UTF-8 sequence, so there's a range of invalid codepoints at the upper end.
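Each class of malformed data described above can be demonstrated against a strict decoder; a Python sketch using CPython's UTF-8 decoder as the reference:

```python
# Three classes of malformed data that a strict UTF-8 decoder rejects:
# overlong encodings, surrogate halves, and codepoints above U+10FFFF.
bad = {
    "overlong":  b"\xe0\x80\xaf",      # non-minimal encoding of U+002F
    "surrogate": b"\xed\xa0\x80",      # U+D800, a surrogate half
    "too large": b"\xf4\x90\x80\x80",  # would be U+110000 > U+10FFFF
}
for label, data in bad.items():
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        print(label, "rejected")
```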

Of course, this could all be fixed by adding more conditions to the algorithm. But that would erode the improvements in performance and readability to the point that I believe it's not worth pursuing.

Instead, I propose to look for a more efficient and proven UTF-8 decoding algorithm in #11873. This is a very important piece of code, so efficiency should be highly valued.

As a productive outcome of my evaluation of this PR, I have extracted the specs I used for verification into #11872.

Labels: kind:refactor, pr:needs-work, topic:stdlib
5 participants