
Refactor String#char_bytesize_at #5501

Closed

Conversation

@chastell (Contributor) commented Jan 1, 2018

I tried to wrap my mind around how String#char_bytesize_at works and ended up refactoring it to this version to facilitate understanding. I used the UTF-8 bit patterns to make it clearer what’s happening in this method.

Note: I’m not sure why the current implementation uses the 0x90, 0xc2 and 0xf4 values for comparisons, so my assumption that this method counts the bytesize of a UTF-8 sequence might be wrong.
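For readers less familiar with the encoding: the leading byte of a UTF-8 sequence announces the sequence's total length in its high bits. A minimal Python sketch of that bit-pattern dispatch (illustrative only; the PR itself is Crystal, and the helper name is made up):

```python
def char_bytesize_at(data, index):
    """Bytesize of the UTF-8 sequence whose leading byte is at index.

    Illustrative sketch of the bit-pattern approach, not the actual
    Crystal implementation; returns 1 for invalid leading bytes.
    """
    byte = data[index]
    if byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: ASCII, 1 byte
        return 1
    if byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    return 1  # continuation byte or invalid leading byte
```

Note that this classifies only the leading byte; it deliberately ignores the overlong-encoding and continuation-byte subtleties discussed below.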

@chastell force-pushed the refactor_String#char_bytesize_at branch from 24442fc to cceb19b on January 1, 2018 at 14:42
@straight-shoota (Member)

The code looks clearer, but this method is used pretty often and therefore needs to be as performant as possible. Can you provide benchmarks to compare the implementations?

@chastell (Contributor Author) commented Jan 1, 2018

Right, I actually had this thought right after opening this PR. 🤦‍♀️

A quick bench to the tune of

string = String.new(Bytes[104, 101, 108, 108, 111, 32, 255, 32, 255, 32, 119, 111, 114, 108, 100, 33])

Benchmark.ips do |bench|
  bench.report("old") { (0...16).map { |index| string.old(index) } }
  bench.report("new") { (0...16).map { |index| string.new(index) } }
end

shows a 6% slowdown:

old   8.85M (112.95ns) (± 3.25%)       fastest
new   8.37M (119.48ns) (± 2.30%)  1.06× slower

I’ll work on this PR to make it faster, but in the meantime: how can I best create a plausible set of representative strings for benching?

@chastell (Contributor Author) commented Jan 1, 2018

Interestingly, on a random string of bytes the new implementation is actually significantly faster, which strongly suggests I’m missing something here:

string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { string.size.times { |index| string.old(index) } }
  bench.report("new") { string.size.times { |index| string.new(index) } }
end
old 345.98M (  2.89ns) (± 3.16%)  1.40× slower
new 484.79M (  2.06ns) (± 2.10%)       fastest

old  349.6M (  2.86ns) (± 2.75%)  1.38× slower
new 481.84M (  2.08ns) (± 2.84%)       fastest

old  347.7M (  2.88ns) (± 3.09%)  1.38× slower
new  479.3M (  2.09ns) (± 4.00%)       fastest

🤔

@straight-shoota (Member)

The individual benchmark measurements are so small (only nanoseconds!) that there won't be any reliable results. You could wrap the individual examples in 1000.times { ... } to get a more measurable number.
And it would be great to have a benchmark covering consecutive calls through each_byte_index_and_char_index, like String#size (when the size isn't already cached).

Did you look at other implementations? No need to reinvent every wheel. Char::Reader#decode_char_at is also very similar, although it works a bit differently. I guess it makes sense to keep both implemented differently, but improvements to the algorithm could be added there as well.

@chastell (Contributor Author) commented Jan 1, 2018

Ah, yes, the first String#size call in my benchmark precomputes the size. 🤦‍♂️ 🤦‍♂️ 🤦‍♂️

When I removed it, a random 256-byte string takes exactly the same amount of time:

string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { 256.times { |index| string.old(index) } }
  bench.report("new") { 256.times { |index| string.new(index) } }
end
old 552.94M (  1.81ns) (± 2.21%)  1.00× slower
new 554.01M (  1.81ns) (± 1.99%)       fastest

old 555.95M (   1.8ns) (± 2.39%)       fastest
new 555.41M (   1.8ns) (± 2.35%)  1.00× slower

old 562.24M (  1.78ns) (± 2.58%)  1.00× slower
new  562.4M (  1.78ns) (± 2.33%)       fastest

The individual benchmark measurements are so small (only nanoseconds!) that there won't be any reliable results. You could wrap the individual examples in 1000.times { ... } to get a more measurable number.

Hmmm, I thought this was exactly what Benchmark.ips does, no? It even shows the standard deviation, so we know how reliable the results are.

I’ll look at your other pointers, thanks so much!

@chastell (Contributor Author) commented Jan 1, 2018

OK, I think the culprit here might be that the whole block is compiled out as a no-op. 🤦‍♂️

(This is a very instructive start to the new year, thanks for bearing with me…)

@chastell (Contributor Author) commented Jan 1, 2018

OK, just for posterity:

string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { (0...256).map { |index| string.old(index) } }
  bench.report("new") { (0...256).map { |index| string.new(index) } }
end
old  933.0k (  1.07µs) (± 2.65%)  1.16× slower
new   1.08M (927.91ns) (± 6.22%)       fastest

old 938.74k (  1.07µs) (± 2.42%)       fastest
new 908.57k (   1.1µs) (± 2.02%)  1.03× slower

old 906.63k (   1.1µs) (± 2.65%)       fastest
new 883.05k (  1.13µs) (± 2.42%)  1.03× slower

old  934.1k (  1.07µs) (± 2.34%)       fastest
new 922.33k (  1.08µs) (± 2.30%)  1.01× slower

old 914.57k (  1.09µs) (± 2.56%)  1.17× slower
new   1.07M (931.93ns) (± 5.93%)       fastest

Not bad. 😉

I’m still not sure why the current implementation uses byte values like 0x90, 0xc2 and 0xf4, though. 🤷‍♂️

@straight-shoota (Member) commented Jan 1, 2018

Benchmark.ips runs each example multiple times, but each run is measured on its own, and these measurements are used to calculate the average and deviation. When each measurement spans too little time (say, less than a few microseconds) the results will not be reliable.

Unfortunately, the results from #size are not useful because it is cached and the cache is already populated in the constructor. You could add a non-caching implementation of the size method to properly test it.

@RX14 (Contributor) commented Jan 1, 2018

@straight-shoota this is absolutely not true. Not every iteration of the benchmark is timed separately. First the benchmark warms up and calculates approximately how fast each iteration is. It then uses its approximation to work out roughly how many times the benchmark needs to run to reach 100ms, then runs the benchmark in 100ms blocks. That means only one Time measurement every 100ms. There is no need to use 256.times in Benchmark.ips. Please look at the source code before stating "facts".

@straight-shoota (Member)

Oh sorry, I mixed that up. Yes, this is only true for Benchmark.bm.

I guess I'm doing too many things simultaneously... no real multithreading support.

@RX14 (Contributor) commented Jan 1, 2018

Oh, and 1.8ns is way too short for 256 iterations of anything. You can't fit more than 10 clock cycles in 1.8ns even at 4GHz. It's obviously being optimized out.
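The cycle arithmetic checks out; a quick back-of-the-envelope calculation (values taken from the comment above):

```python
# At an assumed 4 GHz clock, how many cycles fit into the measured
# 1.8 ns per "iteration"? Far too few for 256 loop iterations of
# real work, so the loop body must have been optimized out.
clock_hz = 4e9
per_iteration_s = 1.8e-9
cycles = per_iteration_s * clock_hz
print(round(cycles, 1))  # 7.2
```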

@straight-shoota (Member) commented Jan 1, 2018

@RX14 The 1.8 ns is just reading @length

@RX14 (Contributor) commented Jan 1, 2018

@straight-shoota how?

@RX14 (Contributor) commented Jan 1, 2018

We need a JMH-style blackhole class to get accurate benchmark results.

@chastell (Contributor Author) commented Jan 1, 2018

OK, I reimplemented it for speed and the results for a random 10000-byte string are quite promising:

old  19.92k ( 50.19µs) (± 1.97%)  1.01× slower
new   20.1k ( 49.75µs) (± 1.57%)       fastest

old  20.25k ( 49.38µs) (± 1.05%)       fastest
new  20.14k ( 49.65µs) (± 1.38%)  1.01× slower

old  20.18k ( 49.56µs) (± 0.69%)  1.00× slower
new  20.25k ( 49.39µs) (± 0.97%)       fastest

Thanks so much for the pointer to Char::Reader#decode_char_at, that explains the shape of the current implementation! I’m still not sure why 0xc1 is a valid byte sequence there, though. 🤷‍♂️

@straight-shoota (Member) commented Jan 1, 2018

I'd assume 0xC2 might just have been a typo or some other mistake (off by one). It's very unlikely to cause any issues, which is why it was not noticed.

I think you can safely skip the 0b0_0000000 <= byte comparison, that's the minimum for UInt8 anyway. But I like the subsequent_utf8_byte_at? method, it makes the code easier to read and shouldn't add any cost.

@asterite (Member) commented Jan 1, 2018

The values are because of how UTF-8 works. I can't explain because I'm on the cellphone, but you can find it on Wikipedia. I find the old code easier to read and understand, shorter, faster to compile and faster for LLVM to optimize (I know this because I know how the compiler works). Please don't change this method.

As a side note, we need a general benchmark suite for the whole standard library which we can easily run against this kind of PR. The benchmarks should also show the number of allocations, so ips should be improved in that regard. I'll open an issue about this later.

@chastell (Contributor Author) commented Jan 1, 2018

I think you can safely skip the 0b0_0000000 <= byte comparison, that's the minimum for UInt8 anyway.

Right, I assumed the compiler would compile it out anyway, and it made the code nicely symmetric.

The values are because of how UTF8 works. I can't explain because I'm on the cellphone, but you can find it on Wikipedia.

I even linked Wikipedia from this PR. 😉

I gave it some thought and I assume the first < 0xc2 check is because U+007F is 0x7f and U+0080 is 0xc2 0x80 (i.e., the first byte is never 0xc0 nor 0xc1, even though these values match the 110..... pattern). I assume it’s the same for 0x90 (the continuation pattern is 10......, but in practice only the last continuation byte can have a value lower than 0x90).

@asterite Am I right in the above? If so, can I add the relevant comments to the original source (instead of this refactoring)? The values seemed a bit arbitrary on my first read of the method (even when I more-or-less know how UTF-8 works).
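This reasoning can be checked against a strict decoder. A small Python demonstration (using CPython's built-in UTF-8 decoder as the reference implementation):

```python
# U+0080 is the first codepoint that needs two bytes; its canonical
# encoding starts with 0xc2, so a 0xc0 or 0xc1 leading byte could only
# ever produce an overlong encoding, which strict decoders reject.
assert "\u0080".encode("utf-8") == b"\xc2\x80"

try:
    b"\xc0\x80".decode("utf-8")  # overlong encoding of U+0000
except UnicodeDecodeError as error:
    print(error.reason)
```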

@asterite (Member) commented Jan 2, 2018

Oh, I read it too quickly to note the wikipedia link. Sorry.

I gave it some thought and I assume the first < 0xc2 check is because U+007F is 0x7f and U+0080 is 0xc2 0x80 (i.e., the first byte is never 0xc0 nor 0xc1, even though these values match the 110..... pattern)

Now I can't remember why this is true (but it is). So yes, adding comments, and maybe even using 0b11000010 instead of 0xc2, and even linking to Wikipedia in the implementation comment or explaining how UTF-8 is encoded right there (just a reminder) are all good things to do. Implementation-wise, as long as it doesn't make performance worse, it's probably good for me.

@chastell (Contributor Author) commented Jan 2, 2018

Excellent, thanks!

I assumed what this method returns is ‘the size of the UTF-8 byte sequence starting at byte_index or 1 if it’s not a valid UTF-8 sequence’.

But for the byte sequence 11110_000 10_001111 10_001111 (f0 8f 8f) the current implementation will return 3:

    if first == 0xf0 && second < 0x90
      return 3
    end

To the best of my reading this byte sequence is not valid, as three-byte sequences should have a 1110xxxx leading byte, while a 11110xxx leading byte (above) indicates a four-byte sequence. (In particular, 11110_000 10_001111 10_001111 looks like an overlong sequence trying to encode the 1111001111 codepoint, which should be encoded as 110_01111 10_001111.)

Similarly, the current implementation will return 4 even when the fourth byte is invalid, as it doesn’t validate the fourth byte at all.

Let me know if I misunderstood what the method should return on invalid sequences or whether there is room for improvement.
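A strict decoder agrees with this reading; for example, CPython's UTF-8 decoder rejects the sequence outright:

```python
# f0 8f 8f: a four-byte leading byte (11110xxx) followed by a second
# byte below 0x90, so any completion would be overlong. A strict
# decoder rejects it instead of treating it as a 3-byte sequence.
try:
    bytes([0xF0, 0x8F, 0x8F]).decode("utf-8")
except UnicodeDecodeError as error:
    print(error.reason)
```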

@chastell (Contributor Author) commented Jan 2, 2018

The current implementation will return 4 even for a three-byte sequence:

String.new(Bytes[0b11110_001, 0b10_000000, 0b10_000000]).char_bytesize_at(0)
#=> 4
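For comparison, a strict decoder treats this as truncated input rather than a complete 4-byte character; a quick Python check:

```python
# f1 80 80 is only a prefix of a four-byte sequence (the fourth byte
# is missing entirely), yet char_bytesize_at reports 4. CPython's
# strict decoder flags the data as cut short instead.
try:
    bytes([0b1111_0001, 0b10_000000, 0b10_000000]).decode("utf-8")
except UnicodeDecodeError as error:
    print(error.reason)  # unexpected end of data
```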

@asterite (Member) commented Jan 2, 2018

The method alone doesn't matter, it matters in the context it's used in. It returns the number of bytes occupied by a UTF-8 codepoint starting from the given index, regardless of whether it ends up being well-formed in the end.

@chastell (Contributor Author) commented Jan 6, 2018

It returns the number of bytes occupied by a UTF-8 codepoint starting from the given index, regardless of whether it ends up being well-formed in the end.

So why does it check whether the second and third (but not fourth) bytes are well-formed? If this doesn’t care whether the UTF-8 codepoint is well-formed (nor whether there even are four bytes in a four-byte codepoint), it could check just the first byte and be done. 🤔

@asterite (Member) commented Jan 6, 2018

Sorry, I said it wrong. It counts the number of UTF-8 codepoints in total. If there's an invalid sequence, it counts it as the maximum number of bytes that could form it (until it becomes invalid).

Sorry for not putting too much effort into this issue, but this is something that's working fine and doesn't need a refactor or improvement (at least none that's apparent). I'm trying to focus on fixing bugs and getting new stuff done.

@asterite (Member) commented Jan 6, 2018

(that is, unless you find a bug in this method that you can reproduce with a snippet of code, then we can discuss how to fix it)

@chastell (Contributor Author) commented Jan 8, 2018

It counts the number of UTF-8 codepoints in total. If there's an invalid sequence, it counts it as the maximum number of bytes that could form it (until it becomes invalid).

Hm, reading the method:

    if first < 0x80
      return 1
    end

• a valid one-byte sequence returns 1

    if first < 0xc2
      return 1
    end

• this is either a non-UTF-8 byte or a continuation byte on the first position, so it’s an invalid sequence, returns 1

    second = unsafe_byte_at(byte_index + 1)
    if (second & 0xc0) != 0x80
      return 1
    end

• any sequence (potentially UTF-8 or not) with an invalid second byte returns 1

    if first < 0xe0
      return 2
    end

• a valid two-byte sequence returns 2

    third = unsafe_byte_at(byte_index + 2)
    if (third & 0xc0) != 0x80
      return 2
    end

• a three-or-more-byte-suggesting leading byte + a valid second byte + an invalid third byte returns 2, even when the leading byte is between 0xf5 and 0xff, which it cannot possibly be for a valid UTF-8 sequence 🤔

    if first < 0xf0
      return 3
    end

• a valid three-byte sequence returns 3

    if first == 0xf0 && second < 0x90
      return 3
    end

• a sequence with exactly 1111_0000 (four-byte-suggesting) leading byte and a second byte between 1000_0000 and 1000_1111 and a third byte being a continuation byte returns 3 🤔

    if first == 0xf4 && second >= 0x90
      return 3
    end

• a sequence with exactly 1111_0100 (four-byte-suggesting) leading byte and a second byte between 1001_0000 and 1011_1111 and a third byte being a continuation byte returns 3 🤔

    return 4

• any other sequence, including three-byte sequences not covered above, and sequences with an invalid leading byte between 0xf5 and 0xff (but with valid second and third continuation bytes), returns 4 🤔

I’m puzzled by the cases denoted by 🤔 , because they do feel like errors to me. In particular, I would expect any sequence starting with 0xf5 to 0xff (invalid as UTF-8 leading bytes) to return 1, and the first == 0xf0 and first == 0xf4 checks look strange (why are those leading byte values singled out?).
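To make these branches easy to poke at, here is a direct Python transcription of the quoted Crystal code (for experimentation only; the function name mirrors the original, but this is not the actual implementation):

```python
def char_bytesize_at(data, i):
    """Python transcription of the Crystal branches quoted above."""
    first = data[i]
    if first < 0x80:
        return 1
    if first < 0xC2:
        return 1
    second = data[i + 1]
    if second & 0xC0 != 0x80:
        return 1
    if first < 0xE0:
        return 2
    third = data[i + 2]
    if third & 0xC0 != 0x80:
        return 2
    if first < 0xF0:
        return 3
    if first == 0xF0 and second < 0x90:
        return 3
    if first == 0xF4 and second >= 0x90:
        return 3
    return 4

# The puzzling cases: an invalid 0xf5 leading byte with valid
# continuation bytes yields 4, and the overlong f0 8f 8f prefix yields 3.
print(char_bytesize_at(bytes([0xF5, 0x80, 0x80, 0x80]), 0))  # 4
print(char_bytesize_at(bytes([0xF0, 0x8F, 0x8F]), 0))        # 3
```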

I'm trying to focus on fixing bugs and getting new stuff done.

I totally understand this, and I feel really bad pulling you away from other stuff. I feel like I’ve already wasted a lot of your time on this. 😢

I’m just trying to figure out how this method is supposed to work, because it looks like the checks are not very consistent in what they’re supposed to accept and what they’re supposed to reject. I’m perfectly fine if your answer is ‘I still think the above looks correct, I don’t have the time to explain why’ – it just looks to me like the method is not very consistent and thus potentially buggy.

@asterite (Member) commented Jan 8, 2018

Yes, it seems there are some cases not considered by the algorithm, probably also not covered by specs. We should also check IO#read_char and Char::Reader#decode_char_at.

@asterite (Member) commented Jan 8, 2018

Awesome explanation, by the way; you made it super clear that there's something wrong. I think we need to check that the first byte is valid for 2-, 3- and 4-byte sequences before checking the other bytes.

@straight-shoota (Member) left a review comment

The case for this change was explained in detail and it should be accepted.
But it needs a rebase against master. @chastell are you still up for it?

@RX14 (Contributor) left a review comment

This is really readable, nice!

@straight-shoota added the pr:needs-work label (A PR requires modifications by the author.) on Feb 19, 2020
@straight-shoota (Member) commented Mar 5, 2022

I rebased this PR on current master (the signature of char_bytesize_at had changed in #9713) and tried to polish it up.

While the proposed implementation is certainly a lot more readable, and apparently more performant as well, it is not accurate. It manages to recognize the general format of UTF-8 sequences, but it fails to account for malformed data. For example, some codepoints can be represented by different, technically equivalent byte sequences. Only the most minimal encoding is considered well-formed, though; all others are to be rejected. Then there's a range of invalid codepoints between U+D800 and U+DFFF (surrogate halves) which are also to be rejected. And the maximum codepoint value U+10FFFF is lower than the number of values that could technically be represented in a 4-byte UTF-8 sequence, so there's a range of invalid codepoints at the upper end.
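Each class of malformed data described above can be demonstrated against a strict decoder; a Python sketch using CPython's UTF-8 decoder as the reference:

```python
# Three classes of malformed data that a strict UTF-8 decoder rejects:
# overlong encodings, surrogate halves, and codepoints above U+10FFFF.
bad = {
    "overlong":  b"\xe0\x80\xaf",      # non-minimal encoding of U+002F
    "surrogate": b"\xed\xa0\x80",      # U+D800, a surrogate half
    "too large": b"\xf4\x90\x80\x80",  # would be U+110000 > U+10FFFF
}
for label, data in bad.items():
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        print(label, "rejected")
```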

Of course, this could all be fixed by adding more conditions to the algorithm. But that would erode the improvements in performance and readability to the point that I believe it's not worth pursuing.

Instead, I propose to look for a more efficient and proven UTF-8 decoding algorithm in #11873. This is a very important piece of code, so efficiency should be highly valued.

As a productive outcome of my evaluation of this PR, I have extracted the specs I used for verification into #11872.

Labels: kind:refactor, pr:needs-work, topic:stdlib
5 participants