Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workaround to handle zero sized files in /proc #3411

Closed
wants to merge 5 commits into from

Conversation

Val
Copy link
Contributor

@Val Val commented Oct 11, 2016

Fix #3410

@@ -417,6 +417,11 @@ class File < IO::FileDescriptor
# ```
def self.read(filename, encoding = nil, invalid = nil) : String
File.open(filename, "r") do |file|
{% if flag?(:linux) %}
# workaround to handle zero sized files in /proc Linux virtual filesystem
encoding ||= "UTF-8" if File.expand_path(filename) =~ /^\/proc/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use starts_with?? Regexes are quite slow and unnecessary for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be more clear to add a specific boolean variable that makes it enter the gets_to_end branch instead of overloading encoding?

@@ -37,6 +37,13 @@ describe "File" do
str.should eq("Hello World\n" * 20)
end

{% if !flag?(:windows) %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be {% if flag?(:linux) %}; this doesn't work on OSX.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, /proc is only pretty much guaranteed to be mounted on windows, but it can be mounted on most unixes.

@asterite
Copy link
Member

I would fix this by changing File.read to open the file and invoke gets_to_end when the file size is zero, or always. The problem has nothing to do with encoding.

@Val Val force-pushed the proposed_fix_for_#3410 branch from 8479c45 to b3c826c Compare October 11, 2016 19:38
@@ -417,6 +417,10 @@ class File < IO::FileDescriptor
# ```
def self.read(filename, encoding = nil, invalid = nil) : String
File.open(filename, "r") do |file|
{% if flag?(:linux) %}
return file.gets_to_end if File.expand_path(filename).starts_with?("/proc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now ignores encoding for /proc*.

@asterite
Copy link
Member

@Val we finally decided with @waj that the safest thing to do, even if it's a bit slower, is to always invoke gets_to_end. So basically it should be:

  def self.read(filename, encoding = nil, invalid = nil) : String
    File.open(filename, "r") do |file|
      file.set_encoding(encoding, invalid: invalid) if encoding
      file.gets_to_end
    end
  end

But I'll let you change this in your PR because you have the specs, and you found the bug :-)

@RX14
Copy link
Member

RX14 commented Oct 11, 2016

@asterite I think File.read is a common enough call that optimising it is worthwhile.
I think we should at least do some benchmarks to determine the slowdown factor before we merge this.

@Val
Copy link
Contributor Author

Val commented Oct 11, 2016

path = "/proc/self/cmdline"

def read1(filename, encoding = nil, invalid = nil) : String
  File.open(filename, "r") do |file|
    if encoding
      file.set_encoding(encoding, invalid: invalid)
      file.gets_to_end
    else
      {% if flag?(:linux) %}
      return file.gets_to_end if File.expand_path(filename).starts_with?("/proc")
      {% end %}

      size = file.size.to_i
      String.new(size) do |buffer|
        file.read Slice.new(buffer, size)
        {size.to_i, 0}
      end
    end
  end
end

def read2(filename, encoding = nil, invalid = nil) : String
  File.open(filename, "r") do |file|
    file.set_encoding(encoding, invalid: invalid) if encoding
    file.gets_to_end
  end
end

require "benchmark"
Benchmark.ips do |x|
  x.report("read1") { read1 path }
  x.report("read2") { read2 path }
end

On a Debian jessie amd64:

>  crystal build --release file_read_bench.rb
> ./file_read_bench
read1 174.99k (± 6.28%)  1.73× slower
read2 301.93k (± 5.34%)       fastest
> ./file_read_bench
read1 212.89k (± 5.30%)  1.24× slower
read2 263.87k (± 3.41%)       fastest

@RX14
Copy link
Member

RX14 commented Oct 11, 2016

@Val can you also test on various sizes of normal file, say 1kb, 1mb,10mb?

@Val
Copy link
Contributor Author

Val commented Oct 11, 2016

path = "/tmp/#{File.basename(__FILE__)}"

def read1(filename, encoding = nil, invalid = nil) : String
  File.open(filename, "r") do |file|
    if encoding
      file.set_encoding(encoding, invalid: invalid)
      file.gets_to_end
    else
      {% if flag?(:linux) %}
      return file.gets_to_end if File.expand_path(filename).starts_with?("/proc")
      {% end %}

      size = file.size.to_i
      String.new(size) do |buffer|
        file.read Slice.new(buffer, size)
        {size.to_i, 0}
      end
    end
  end
end

def read2(filename, encoding = nil, invalid = nil) : String
  File.open(filename, "r") do |file|
    file.set_encoding(encoding, invalid: invalid) if encoding
    file.gets_to_end
  end
end

BLOBS = {} of Symbol => String
kb_01 = BLOBS[:kb_01] = "X" * 1024
mb_01 = BLOBS[:mb_01] = kb_01 * 1024
BLOBS[:mb_10] = mb_01 * 10

require "benchmark"

BLOBS.each do |name, blob|
  File.open(path, "w") { |f| f.print blob }

  puts "#{name} _____________________________________\n"
  Benchmark.ips do |x|
    x.report("read1 #{name}") { read1 path }
    x.report("read2 #{name}") { read2 path }
  end
  puts
end

File.delete path

Output

> crystal build --release file_read_bench.cr 
> ./file_read_bench 
kb_01 _____________________________________

read1 kb_01 349.94k (± 2.17%)       fastest
read2 kb_01 336.11k (± 2.58%)  1.04× slower

mb_01 _____________________________________

read1 mb_01   9.43k (± 2.27%)       fastest
read2 mb_01   2.32k (± 2.69%)  4.07× slower

mb_10 _____________________________________

read1 mb_10  532.3  (± 3.12%)       fastest
read2 mb_10 157.96  (± 1.78%)  3.37× slower

@Val
Copy link
Contributor Author

Val commented Oct 11, 2016

Subtle, changing return file.gets_to_end if File.expand_path(filename).starts_with?("/tmp").

> ./file_read_bench 
kb_01 _____________________________________

read1 kb_01 298.51k (± 2.54%)  1.13× slower
read2 kb_01 337.64k (± 3.03%)       fastest

mb_01 _____________________________________

read1 mb_01   2.32k (± 4.93%)       fastest
read2 mb_01   2.31k (± 6.39%)  1.01× slower

mb_10 _____________________________________

read1 mb_10 154.89  (± 5.72%)  1.04× slower
read2 mb_10 161.59  (± 4.46%)       fastest

@Val
Copy link
Contributor Author

Val commented Oct 11, 2016

@RX14 & @asterite it seems that 416b0e7 should be reverted, what do you think?

@asterite asterite closed this in a33c0c2 Oct 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants