
[WIP] Vectored I/O #6078

Closed · wants to merge 6 commits

@carlhoerberg (Contributor)

Optimize writes by using writev

WIP, feedback appreciated

@RX14 (Contributor) left a comment:

This needs specs.

I'm a little concerned that the cost of the heap allocation for the `to_a` call will be larger than the performance gained from using scatter-gather IO. Looking at `writev_syscall_helper` makes me think that `write_syscall_helper` might have been the wrong abstraction and we should have encapsulated only the loop and the errno check, not any of the slice accounting. (i.e. I'd like to use `write_syscall_helper` for writev as well.)

I think every effort should be made to avoid heap allocations if this is a performance-sensitive syscall.
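
For illustration, the narrower abstraction suggested here (just the retry loop and the errno check, with no slice accounting) might look something like the following sketch. The name `syscall_with_retry` and the error handling are assumptions, not code from the PR; they follow the `Errno` and `wait_writable` conventions Crystal used at the time:

# Hypothetical sketch: encapsulate only the syscall retry loop and the
# errno check, leaving all slice accounting to the caller.
private def syscall_with_retry(errno_msg : String)
  loop do
    result = yield
    if result == -1
      if Errno.value == Errno::EAGAIN
        wait_writable # suspend the fiber until the fd is writable again
      else
        raise Errno.new(errno_msg)
      end
    else
      return result
    end
  end
end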

private def unbuffered_write(*slices : Bytes)
  writev_syscall_helper(slices.to_a, "Error writing file") do |slices|
    iovecs = slices.map do |slice|
      iovec = LibC::IoVec.new
A Contributor commented:

`LibC::IoVec.new(iov_base: slice.to_unsafe, iov_len: slice.size)` should work and be a lot shorter.
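
Applied to the loop above, the mapping collapses to a single expression (a sketch based on the suggestion):

iovecs = slices.map do |slice|
  # Build one LibC::IoVec per slice in a single constructor call.
  LibC::IoVec.new(iov_base: slice.to_unsafe, iov_len: slice.size)
end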

return unbuffered_write(*slices)
end

count = slices.sum &.size
A Contributor commented:

Nitpick: this is changing terminology from `size` to `count`, how about just `total_size`?

@@ -38,4 +43,5 @@ lib LibC
  fun sysconf(x0 : Int) : Long
  fun unlink(x0 : Char*) : Int
  fun write(x0 : Int, x1 : Void*, x2 : SizeT) : SSizeT
  fun writev(x0 : Int, x1 : IoVec*, x2 : SizeT) : SSizeT
A Contributor commented:

This will obviously need to be copied to all the other libcs.
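
For context, the binding above relies on a `LibC::IoVec` struct mirroring POSIX `struct iovec`; the PR's definition presumably looks something like this sketch:

lib LibC
  # Mirrors POSIX struct iovec: a base pointer plus a byte length.
  struct IoVec
    iov_base : Void*
    iov_len : SizeT
  end
end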

@RX14 changed the title from "Vectored I/O" to "[WIP] Vectored I/O" on May 8, 2018
@@ -26,6 +26,22 @@ module Crystal::System::FileDescriptor
    end
  end

  private def unbuffered_write(*slices : Bytes)
    writev_syscall_helper(slices.to_a, "Error writing file") do |slices|
A Member commented:

This `to_a` defeats the purpose of the method (it allocates). Does it work without it?

@carlhoerberg (Contributor, Author) replied:

No, but if I could convert a Tuple to a StaticArray it would.

@carlhoerberg (Contributor, Author) added:

I guess we could add `Tuple#static_array`, like `Number#static_array`:

crystal/src/number.cr (lines 65 to 71 at 80a975c):

macro static_array(*nums)
  %array = uninitialized StaticArray({{@type}}, {{nums.size}})
  {% for num, i in nums %}
    %array.to_unsafe[{{i}}] = {{@type}}.new({{num}})
  {% end %}
  %array
end
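
A hypothetical `Tuple#static_array` along those lines might look like the following sketch (not part of the PR or the stdlib; `Union(*T)` is an assumption, matching how `Tuple#to_a` types its elements):

struct Tuple
  # Hypothetical: copy the tuple's elements into a stack-allocated
  # StaticArray, avoiding the heap allocation that #to_a incurs.
  def static_array
    array = uninitialized StaticArray(Union(*T), {{T.size}})
    {% for i in 0...T.size %}
      array.to_unsafe[{{i}}] = self[{{i}}]
    {% end %}
    array
  end
end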

A Member commented:

Maybe a splat argument with a type, like `*slices : Bytes`, should just be a StaticArray?

@carlhoerberg (Contributor, Author)

Yes, to my understanding `write_syscall_helper` does for writes basically what `read_fully` does for reads. Should `IO#write` always write everything, and in that case, why doesn't `IO#read` read everything? I guess there are few cases where a user would not want to read or write everything?

@RX14 (Contributor) commented May 8, 2018:

@carlhoerberg write should always write everything because partial writes are always the wrong behaviour; read shouldn't read everything because full reads are often the wrong behaviour (you can't implement most protocols efficiently without partial reads).
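
To illustrate the point about partial reads (an editorial sketch; `socket` and the `parse` step are hypothetical): a protocol parser reads whatever bytes are available and loops, rather than demanding an exact count:

buffer = Bytes.new(4096)
# IO#read returns as soon as some data is available, which is what a
# streaming parser wants; read_fully would block until the whole
# buffer was filled.
while (bytes_read = socket.read(buffer)) > 0
  parse(buffer[0, bytes_read]) # hypothetical parsing step
end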

@ysbaddaden (Contributor)

It would be nice to have a longer description in the pull request:

  • What's the problem?
  • What solution(s) are you proposing?
  • Is using writev(2) noticeably faster for small, medium and/or large collections than a naive and simpler solution, given that you must allocate 2 arrays (#to_a then #map to build the IoVec)?

def write(slices : Enumerable(Bytes))
  slices.each { |slice| write(slice) }
end

def write(*slices : Bytes)
  write(slices)
end

@straight-shoota (Member)

@ysbaddaden `Tuple#map` actually returns a Tuple, so no allocation there.

The array allocation can also be skipped if the splat argument is converted to a StaticArray. `StaticArray#map` also returns a StaticArray.

@ysbaddaden (Contributor) commented May 8, 2018:

Still, a naive solution is 6 lines of code, doesn't need external bindings, accepts any Enumerable without allocating, ... Using writev(2) had better be worth it :)

@RX14 (Contributor) commented May 8, 2018:

writev must be worth it for some use cases to have been added as a syscall.

@carlhoerberg what is your use case exactly, and can you benchmark it for us?

Before obsessing over the implementation of this PR it'd be nice to see some benchmarks.

@asterite (Member) commented May 8, 2018:

Worth reading: https://stackoverflow.com/a/10520793/4990850

The use cases are very specific.

In Ruby there's a gem for that: https://github.com/tenderlove/writev

If it's not too difficult to implement, we can have it in the standard library.

@RX14 (Contributor) commented May 8, 2018:

Hence why I think it'd be interesting to hear @carlhoerberg's use case :)

@carlhoerberg (Contributor, Author)

Benchmark:

require "benchmark"

data_set = {
  small: Tuple.new("a" * 5, "b" * 5, "c" * 5, "d" * 5, "e" * 5, "f" * 5, "g" * 5, "h" * 5, "i" * 5, "j" * 5).map &.to_slice,
  medium: Tuple.new("a" * 500, "b" * 500, "c" * 500, "d" * 500, "e" * 500, "f" * 500, "g" * 500, "h" * 500, "i" * 500, "j" * 500).map &.to_slice,
  large: Tuple.new("a" * 5000, "b" * 5000, "c" * 5000, "d" * 5000, "e" * 5000, "f" * 5000, "g" * 5000, "h" * 5000, "i" * 5000, "j" * 5000).map &.to_slice,
}

buffered = File.open("/tmp/write-buf", "w")
synced = File.open("/tmp/write-sync", "w").tap { |f| f.sync = true }
writev = File.open("/tmp/writev", "w").tap { |f| f.sync = true }

data_set.each do |type, data|
  Benchmark.ips do |x|
    x.report("write buffered #{type}") do
      data.each do |d|
        buffered.write d
      end
      buffered.flush
    end
    x.report("write sync #{type}") do
      data.each do |d|
        synced.write d
      end
    end
    x.report("writev #{type}") do
      writev.write *data
    end
  end
  puts
end

Output:

write buffered small 144.25k (  6.93µs) (± 6.05%)  0 B/op        fastest
    write sync small  15.52k ( 64.45µs) (± 6.49%)  0 B/op   9.30× slower
        writev small 127.73k (  7.83µs) (± 7.45%)  0 B/op   1.13× slower

write buffered medium  66.11k ( 15.13µs) (± 6.05%)  0 B/op        fastest
    write sync medium   11.6k ( 86.18µs) (±11.17%)  0 B/op   5.70× slower
        writev medium  62.87k ( 15.91µs) (± 7.01%)  0 B/op   1.05× slower

write buffered large   6.36k (157.11µs) (± 7.59%)  0 B/op   2.10× slower
    write sync large   6.42k (155.74µs) (± 8.60%)  0 B/op   2.08× slower
        writev large  13.34k ( 74.96µs) (±12.05%)  0 B/op        fastest

Once the combined writes exceed the buffer's 4096-byte limit, writev becomes relatively faster.

@carlhoerberg (Contributor, Author)

Benchmarking TCP writes:

require "benchmark"
require "socket"

server = TCPServer.new("localhost", 1234)

fork do
  client = TCPSocket.new("localhost", 1234)
  client.skip_to_end
end

data_set = {
  small: Tuple.new("a" * 5, "b" * 5, "c" * 5, "d" * 5, "e" * 5, "f" * 5, "g" * 5, "h" * 5, "i" * 5, "j" * 5).map &.to_slice,
  medium: Tuple.new("a" * 500, "b" * 500, "c" * 500, "d" * 500, "e" * 500, "f" * 500, "g" * 500, "h" * 500, "i" * 500, "j" * 500).map &.to_slice,
  large: Tuple.new("a" * 5000, "b" * 5000, "c" * 5000, "d" * 5000, "e" * 5000, "f" * 5000, "g" * 5000, "h" * 5000, "i" * 5000, "j" * 5000).map &.to_slice,
}

socket = server.accept
data_set.each do |type, data|
  Benchmark.ips do |x|
    socket.sync = false
    x.report("write buffered #{type}") do
      data.each do |d|
        socket.write d
      end
      socket.flush
    end
    socket.sync = true
    x.report("write sync #{type}") do
      data.each do |d|
        socket.write d
      end
    end
    x.report("writev #{type}") do
      socket.write *data
    end
  end
  puts
end

Output:

write buffered small  28.57k (  35.0µs) (± 7.25%)  0 B/op   7.18× slower
    write sync small  28.53k ( 35.06µs) (± 5.52%)  0 B/op   7.19× slower
        writev small 205.15k (  4.87µs) (± 6.60%)  0 B/op        fastest

write buffered medium  23.65k ( 42.29µs) (± 6.56%)  0 B/op   6.86× slower
    write sync medium   23.9k ( 41.85µs) (± 4.51%)  0 B/op   6.79× slower
        writev medium 162.24k (  6.16µs) (± 7.51%)  0 B/op        fastest

write buffered large  20.91k ( 47.82µs) (± 7.41%)  0 B/op   1.21× slower
    write sync large  20.98k ( 47.66µs) (± 5.44%)  0 B/op   1.20× slower
        writev large  25.25k (  39.6µs) (± 5.91%)  0 B/op        fastest

@@ -81,6 +81,34 @@ module IO::Syscall
    end
  end

  def writev_syscall_helper(slices, errno_msg : String) : Nil
A Contributor commented:

The `errno_msg` argument here is IMHO pointless, why not just use `Error writing to IO`?

@carlhoerberg (Contributor, Author)

Benchmarking TCP writes, comparing against using an IO::Memory as a buffer:

require "benchmark"
require "socket"

server = TCPServer.new("localhost", 1234)

fork do
  client = TCPSocket.new("localhost", 1234)
  client.skip_to_end
end

data_set = {
  small: StaticArray(Bytes, 1).new { ("0" * 5).to_slice },
  medium: StaticArray(Bytes, 10).new { ("0" * 5).to_slice },
  large: StaticArray(Bytes, 100).new { ("0" * 5).to_slice },
}

socket = server.accept
socket.tcp_nodelay = true
socket.sync = true
data_set.each do |type, data|
  Benchmark.ips do |x|
    x.report("write w/ io::memory #{type}") do
      mem = IO::Memory.new
      data.each do |d|
        mem.write d
      end
      socket.write mem.to_slice
    end
    x.report("write sync #{type}") do
      data.each do |d|
        socket.write d
      end
    end
    x.report("writev #{type}") do
      socket.write data
    end
  end
  puts
end

Output:

write w/ io::memory small 116.99k (  8.55µs) (± 5.37%)  177 B/op        fastest
         write sync small 105.84k (  9.45µs) (± 5.25%)    0 B/op   1.11× slower
             writev small 113.92k (  8.78µs) (± 4.29%)    0 B/op   1.03× slower

write w/ io::memory medium 102.97k (  9.71µs) (± 1.54%)  176 B/op        fastest
         write sync medium  10.17k ( 98.29µs) (± 9.34%)    0 B/op  10.12× slower
             writev medium  98.14k ( 10.19µs) (± 3.78%)    0 B/op   1.05× slower

write w/ io::memory large  86.69k ( 11.54µs) (± 1.36%)  1168 B/op        fastest
         write sync large   1.08k (928.12µs) (± 6.78%)     0 B/op  80.45× slower
             writev large  40.02k ( 24.99µs) (± 0.85%)     0 B/op   2.17× slower

@@ -26,6 +26,19 @@ module Crystal::System::FileDescriptor
    end
  end

  private def unbuffered_write(slices : Indexable(Bytes))
A Contributor commented:

Shouldn't it be `Enumerable` instead of `Indexable`? (ditto all below)

@ysbaddaden (Contributor)

Thanks for taking the time to benchmark!

@wooster0 (Contributor) commented Feb 5, 2019:

Should someone else maybe continue this PR?

@yxhuvud (Contributor) commented Nov 16, 2019:

FWIW, looking at the glibc implementation, it looks like they already use a solution quite similar to what RX14 talked about in the puts thread: https://github.com/lattera/glibc/blob/master/sysdeps/posix/writev.c. That one should work fine, and atomically, with buffered IO, I think.

The variant provided by the kernel, on the other hand, is used by musl: https://github.com/davidlazar/musl/blob/master/src/unistd/writev.c. That variant should probably not be mixed with any buffered IO, as it writes directly to the underlying file descriptor, so any competing buffered writes may end up being applied in an unexpected order.

(I accidentally pinged ysbaddaden in an earlier edit.)
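
For illustration, the glibc-style coalescing fallback described above translates to roughly this Crystal sketch (the `writev_fallback` name and signature are hypothetical):

# Coalesce all slices into one temporary buffer and issue a single
# write. This composes safely with buffered IO, at the cost of one
# heap allocation and an extra copy.
def writev_fallback(io : IO, slices : Enumerable(Bytes))
  total_size = slices.sum(&.size)
  buffer = Bytes.new(total_size)
  offset = 0
  slices.each do |slice|
    slice.copy_to(buffer.to_unsafe + offset, slice.size)
    offset += slice.size
  end
  io.write(buffer)
end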

@RX14 (Contributor) commented Nov 21, 2019:

@yxhuvud writev is a Linux syscall; I'm 100% certain that glibc implementation is just a fallback for ancient kernels with no writev support.
