Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use slice instead of StaticArray for String::CHAR_TO_DIGIT #5616

Closed
wants to merge 1 commit into from

Conversation

S-YOU
Copy link

@S-YOU S-YOU commented Jan 20, 2018

By using Pointer and Slice instead of StaticArray,
less LLVM IR generation for String::CHAR_TO_DIGIT (1054 hits to 76)

@ysbaddaden
Copy link
Contributor

But requires a HEAP allocation.

@S-YOU
Copy link
Author

S-YOU commented Jan 21, 2018

Is there any performance benefit (or something) over having this in stack?

@asterite
Copy link
Member

What does hits mean here?

@S-YOU
Copy link
Author

S-YOU commented Jan 21, 2018

I meant "hits" as number of occurrences of CHAR_TO_DIGIT including CHAR_TO_DIGIT62 in outputted .ll file. But I now realized, that number depends on I use it or not, and how many times I used.

@S-YOU
Copy link
Author

S-YOU commented Jan 21, 2018

This is minimal code I used to generate .ll file in release mode

puts String::CHAR_TO_DIGIT

I see like following in .ll file

%.unpack90 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 1), align 1
  %.unpack91 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 2), align 2
  %.unpack92 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 3), align 1
  %.unpack93 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 4), align 4
  %.unpack94 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 5), align 1
  %.unpack95 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 6), align 2
  %.unpack96 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 7), align 1
  %.unpack97 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 8), align 8
  %.unpack98 = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 9), align 1
  %
... x 256 times

%obj1.repack1.i.i.i = getelementptr inbounds [256 x i8], [256 x i8]* %obj1.i.i.i, i64 0, i64 1
  store i8 %.unpack90, i8* %obj1.repack1.i.i.i, align 1
  %obj1.repack3.i.i.i = getelementptr inbounds [256 x i8], [256 x i8]* %obj1.i.i.i, i64 0, i64 2
  store i8 %.unpack91, i8* %obj1.repack3.i.i.i, align 1
  %obj1.repack5.i.i.i = getelementptr inbounds [256 x i8], [256 x i8]* %obj1.i.i.i, i64 0, i64 3
  store i8 %.unpack92, i8* %obj1.repack5.i.i.i, align 1
  %obj1.repack7.i.i.i = getelementptr inbounds [256 x i8], [256 x i8]* %obj1.i.i.i, i64 0, i64 4
  store i8 %.unpack93, i8* %obj1.repack7.i.i.i, align 1
  %obj1.repack9.i.i.i = getelementptr inbounds [256 x i8], [256 x i8]* %obj1.i.i.i, i64 0, i64 5
  store i8 %.unpack94, i8* %obj1.repack9.i.i.i, align 1
  %obj1.repack11.i.i.i = getelementptr inbounds [256 x i8], [256 x i8]* %obj1.i.i.i, i64 0, i64 6
... x 256 times

 %table.repack7 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 1
  %table.repack9 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 2
  %table.repack11 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 3
  %table.repack13 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 4
  %table.repack15 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 5
  %table.repack17 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 6
  %table.repack19 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 7
  %table.repack21 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 8
  %table.repack23 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 9
  %table.repack25 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 10
  %table.repack27 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 11
  %table.repack29 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 12
  %table.repack31 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 13
  %table.repack33 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 14
  %table.repack35 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 15
  %table.repack37 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 16
... x 256 times

store i8 %.unpack519, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 1), align 1
  store i8 %.unpack521, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 2), align 2
  store i8 %.unpack523, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 3), align 1
  store i8 %.unpack525, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 4), align 4
  store i8 %.unpack527, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 5), align 1
  store i8 %.unpack529, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 6), align 2
  store i8 %.unpack531, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 7), align 1
  store i8 %.unpack533, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 8), align 8
  store i8 %.unpack535, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 9), align 1
  store i8 %.unpack537, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 10), align 2
  store i8 %.unpack539, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 11), align 1
  store i8 %.unpack541, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 12), align 4
  store i8 %.unpack543, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 13), align 1
  store i8 %.unpack545, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 14), align 2
  store i8 %.unpack547, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 15), align 1
  store i8 %.unpack549, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 16), align 16
  store i8 %.unpack551, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 17), align 1
  store i8 %.unpack553, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 18), align 2
  x 256 times

%array.i.sroa.0.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 0), align 16
  %array.i.sroa.4.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 1), align 1
  %array.i.sroa.5.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 2), align 2
  %array.i.sroa.6.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 3), align 1
  %array.i.sroa.7.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 4), align 4
  %array.i.sroa.8.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 5), align 1
  %array.i.sroa.9.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 6), align 2
  %array.i.sroa.10.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 7), align 1
  %array.i.sroa.11.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 8), align 8
  %array.i.sroa.12.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 9), align 1
  %array.i.sroa.13.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 10), align 2
  %array.i.sroa.14.0.copyload = load i8, i8* getelementptr inbounds ([256 x i8], [256 x i8]* @"String::CHAR_TO_DIGIT", i64 0, i64 11), align 1
  %arra
... x 256 times

%table.repack2 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 1
  store i8 %array.i.sroa.4.0.copyload, i8* %table.repack2, align 1
  %table.repack4 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 2
  store i8 %array.i.sroa.5.0.copyload, i8* %table.repack4, align 1
  %table.repack6 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 3
  store i8 %array.i.sroa.6.0.copyload, i8* %table.repack6, align 1
  %table.repack8 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 4
  store i8 %array.i.sroa.7.0.copyload, i8* %table.repack8, align 1
  %table.repack10 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 5
  store i8 %array.i.sroa.8.0.copyload, i8* %table.repack10, align 1
  %table.repack12 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 6
  store i8 %array.i.sroa.9.0.copyload, i8* %table.repack12, align 1
  %table.repack14 = getelementptr inbounds [256 x i8], [256 x i8]* %table, i64 0, i64 7
  store i8 %array.i.sroa.10.0.copyload, i8* %table.repack14, align 1
  ... x 256 times

%.unpack514 = load i8, i8* %table.repack2, align 1
  %.unpack516 = load i8, i8* %table.repack4, align 1
  %.unpack518 = load i8, i8* %table.repack6, align 1
  %.unpack520 = load i8, i8* %table.repack8, align 1
  %.unpack522 = load i8, i8* %table.repack10, align 1
  %.unpack524 = load i8, i8* %table.repack12, align 1
  %.unpack526 = load i8, i8* %table.repack14, align 1
  %.unpack528 = load i8, i8* %table.repack16, align 1
  %.unpack530 = load i8, i8* %table.repack18, align 1
  %.unpack532 = load i8, i8* %table.repack20, align 1
  %.unpack534 = load i8, i8* %table.repack22, align 1
  %.unpack536 = load i8, i8* %table.repack24, align 1
  %.unpack538 = load i8, i8* %table.repack26, align 1
  ... x 256 times

and some more

Comparing by line numbers in .ll file

with stack: 41438
with heap: 40203 (1235 less)

By filesize

2725276   char_to_digit_stack.ll
2544674   char_to_digit_heap.ll (180k less)

@lbguilherme
Copy link
Contributor

The way LLVM generates code for static array is quite poor. But as I understand, the code is generated in a way that is easy to be optimized in the later phase. Perhaps you should compare the optimized version of the ll, or even the final optimized binary. Using heap here is clearly more runtime work and if it is better then staticarray is wrong all the way down.

@ysbaddaden
Copy link
Contributor

The difference is that StaticArray (and Slices) is bound checked, but the changed code uses direct pointer accesses.

Keep the static array and skip the malloc but operate on #to_unsafe to directly access pointers, skipping bound checks and I'm pretty sure LLVM will optimize in the same way.

But you get unsafe code, and must be perfectly sure to never have out of bounds value.

@lbguilherme
Copy link
Contributor

@ysbaddaden the previous code already had to_unsafe

@S-YOU
Copy link
Author

S-YOU commented Jan 21, 2018

How to get optimized version of ll?

@S-YOU
Copy link
Author

S-YOU commented Jan 21, 2018

I tried to compare in assembly dump, but I am not sure those really refer to CHAR_TO_DIGITS array, because every lines differ in comparison.

But I do see similar in objdump output too.

8 84 24 00 01 00 00 	movb	%al, 256(%rsp)
44 88 94 24 01 01 00 00 	movb	%r10b, 257(%rsp)
8a 84 24 f8 00 00 00 	movb	248(%rsp), %al
88 84 24 02 01 00 00 	movb	%al, 258(%rsp)
8a 84 24 f9 00 00 00 	movb	249(%rsp), %al
88 84 24 03 01 00 00 	movb	%al, 259(%rsp)
44 88 9c 24 04 01 00 00 	movb	%r11b, 260(%rsp)
44 88 8c 24 05 01 00 00 	movb	%r9b, 261(%rsp)
40 88 bc 24 06 01 00 00 	movb	%dil, 262(%rsp)
8a 84 24 fa 00 00 00 	movb	250(%rsp), %al
88 84 24 07 01 00 00 	movb	%al, 263(%rsp)
8a 84 24 fc 00 00 00 	movb	252(%rsp), %al
88 84 24 08 01 00 00 	movb	%al, 264(%rsp)
8a 84 24 fb 00 00 00 	movb	251(%rsp), %al
88 84 24 09 01 00 00 	movb	%al, 265(%rsp)
8a 84 24 ff 00 00 00 	movb	255(%rsp), %al
88 84 24 0a 01 00 00 	movb	%al, 266(%rsp)
40 88 b4 24 0b 01 00 00 	movb	%sil, 267(%rsp)
44 88 84 24 0c 01 00 00 	movb	%r8b, 268(%rsp)
8a 84 24 fe 00 00 00 	movb	254(%rsp), %al
88 84 24 0d 01 00 00 	movb	%al, 269(%rsp)
8a 84 24 fd 00 00 00 	movb	253(%rsp), %al
88 84 24 0e 01 00 00 	movb	%al, 270(%rsp)
44 88 a4 24 0f 01 00 00 	movb	%r12b, 271(%rsp)
44 88 bc 24 10 01 00 00 	movb	%r15b, 272(%rsp)
40 88 ac 24 11 01 00 00 	movb	%bpl, 273(%rsp)
8a 84 24 d4 00 00 00 	movb	212(%rsp), %al
88 84 24 12 01 00 00 	movb	%al, 274(%rsp)
8a 84 24 cf 00 00 00 	movb	207(%rsp), %al
88 84 24 13 01 00 00 	movb	%al, 275(%rsp)
8a 84 24 c5 00 00 00 	movb	197(%rsp), %al
88 84 24 14 01 00 00 	movb	%al, 276(%rsp)
8a 84 24 c4 00 00 00 	movb	196(%rsp), %al
.. many more lines

a 44 24 7f 	movb	127(%rsp), %al
88 84 24 5c 01 00 00 	movb	%al, 348(%rsp)
8a 44 24 7e 	movb	126(%rsp), %al
88 84 24 5d 01 00 00 	movb	%al, 349(%rsp)
8a 44 24 7d 	movb	125(%rsp), %al
88 84 24 5e 01 00 00 	movb	%al, 350(%rsp)
8a 44 24 7c 	movb	124(%rsp), %al
88 84 24 5f 01 00 00 	movb	%al, 351(%rsp)
8a 44 24 7b 	movb	123(%rsp), %al
88 84 24 60 01 00 00 	movb	%al, 352(%rsp)
8a 44 24 7a 	movb	122(%rsp), %al
88 84 24 61 01 00 00 	movb	%al, 353(%rsp)
8a 44 24 79 	movb	121(%rsp), %al
88 84 24 62 01 00 00 	movb	%al, 354(%rsp)
8a 44 24 78 	movb	120(%rsp), %al
88 84 24 63 01 00 00 	movb	%al, 355(%rsp)
8a 44 24 77 	movb	119(%rsp), %al
88 84 24 64 01 00 00 	movb	%al, 356(%rsp)
8a 44 24 76 	movb	118(%rsp), %al
88 84 24 65 01 00 00 	movb	%al, 357(%rsp)
8a 44 24 75 	movb	117(%rsp), %al
88 84 24 66 01 00 00 	movb	%al, 358(%rsp)
8a 44 24 74 	movb	116(%rsp), %al
... many more lines

84 24 cd 01 00 00 	movb	%al, 461(%rsp)
8a 84 24 c6 00 00 00 	movb	198(%rsp), %al
88 84 24 ce 01 00 00 	movb	%al, 462(%rsp)
8a 84 24 c7 00 00 00 	movb	199(%rsp), %al
88 84 24 cf 01 00 00 	movb	%al, 463(%rsp)
8a 84 24 c8 00 00 00 	movb	200(%rsp), %al
88 84 24 d0 01 00 00 	movb	%al, 464(%rsp)
8a 84 24 ca 00 00 00 	movb	202(%rsp), %al
88 84 24 d1 01 00 00 	movb	%al, 465(%rsp)
8a 84 24 cb 00 00 00 	movb	203(%rsp), %al
88 84 24 d2 01 00 00 	movb	%al, 466(%rsp)
8a 84 24 cc 00 00 00 	movb	204(%rsp), %al
88 84 24 d3 01 00 00 	movb	%al, 467(%rsp)
8a 84 24 cd 00 00 00 	movb	205(%rsp), %al
88 84 24 d4 01 00 00 	movb	%al, 468(%rsp)
8a 84 24 d0 00 00 00 	movb	208(%rsp), %al
88 84 24 d5 01 00 00 	movb	%al, 469(%rsp)
8a 84 24 d1 00 00 00 	movb	209(%rsp), %al
88 84 24 d6 01 00 00 	movb	%al, 470(%rsp)
8a 84 24 d2 00 00 00 	movb	210(%rsp), %al
88 84 24 d7 01 00 00 	movb	%al, 471(%rsp)
8a 84 24 d3 00 00 00 	movb	211(%rsp), %al
88 84 24 d8 01 00 00 	movb	%al, 472(%rsp)
8a 84 24 d5 00 00 00 	movb	213(%rsp), %al
88 84 24 d9 01 00 00 	movb	%al, 473(%rsp)
8a 84 24 d6 00 00 00 	movb	214(%rsp), %al
88 84 24 da 01 00 00 	movb	%al, 474(%rsp)
8a 84 24 d7 00 00 00 	movb	215(%rsp), %al
88 84 24 db 01 00 00 	movb	%al, 475(%rsp)
8a 84 24 d8 00 00 00 	movb	216(%rsp), %al
... many more lines

@S-YOU
Copy link
Author

S-YOU commented Jan 21, 2018

The core issue is StaticArray using a loop to fill a value to all items.

def self.new(&block : Int32 -> T)
    array = uninitialized self
    N.times do |i|
      array.to_unsafe[i] = yield i
    end
    array
  end

but I think those can actually done with one memset (for Int8 | UInt8) call for initialization.
I could try to send a PR to use memset, instead of this change, if that is preferable.

EDIT: I tried to change StaticArray and used memset instead of Loop but that does not change the result, so uninitialized self causing all those stuff.

@S-YOU
Copy link
Author

S-YOU commented Jan 21, 2018

Not a typical usage, but doing an assignment on benchmark went a bit slower.
Closing, Thanks for taking your time.

require "benchmark"

Benchmark.ips do |x|
  x.report("CHAR_TO_DIGIT2") { b = String::CHAR_TO_DIGIT2 }
  x.report("CHAR_TO_DIGIT") { a = String::CHAR_TO_DIGIT }
end
CHAR_TO_DIGIT2 468.55M (  2.13ns) (± 8.64%)  0 B/op   1.18× slower
CHAR_TO_DIGIT 552.43M (  1.81ns) (± 8.64%)  0 B/op        fastest

@S-YOU S-YOU closed this Jan 21, 2018
@S-YOU S-YOU deleted the chore_char_to_digit branch May 4, 2018 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants