
Fix: Int32 overflow when using File.read on big files #3209 #3218

Closed

Conversation

@olbat (Contributor) commented Aug 31, 2016

No description provided.

@oprypin (Member) commented Aug 31, 2016

I hate that in Crystal such things might keep coming up and I don't think this is the right solution, but it's an improvement 👍

```crystal
def self.read(filename, encoding = nil, invalid = nil) : String
  File.open(filename, "r") do |file|
    if encoding
      file.set_encoding(encoding, invalid: invalid)
      file.gets_to_end
    else
      if file.size > Int32::MAX
```
@luislavena (Contributor) commented on this diff:

Shouldn't this check be performed always and not just when no encoding is defined?

I believe it would be more consistent for File.read to check the size in every scenario, whether or not an encoding is provided for the file being read.
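
A rough sketch of what that suggestion could look like, with the size check hoisted above the encoding branch (the exception and the one-shot read logic below are illustrative, not the PR's actual diff):

```crystal
# Hypothetical variant of File.read with the size check applied in both
# branches, as suggested above.
def self.read(filename, encoding = nil, invalid = nil) : String
  File.open(filename, "r") do |file|
    # Reject oversized files up front, regardless of encoding.
    raise ArgumentError.new("file is too large to fit in a String") if file.size > Int32::MAX

    if encoding
      file.set_encoding(encoding, invalid: invalid)
      file.gets_to_end
    else
      size = file.size.to_i32
      # Read the whole file in one shot into a String of known size.
      String.new(size) do |buffer|
        file.read(Slice.new(buffer, size))
        {size, 0}
      end
    end
  end
end
```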

@jhass (Member) commented Aug 31, 2016

Shouldn't we switch String to a UInt32 header at the same time?
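
For context, a rough sketch of the String header being referred to (indicative of the layout, not a verbatim copy of the source):

```crystal
# Rough shape of a String in memory (illustrative):
#
#   TYPE_ID  : Int32   # type id used by the runtime
#   bytesize : Int32   # byte length, capped at Int32::MAX (~2 GB)
#   length   : Int32   # character length
#   bytes    : UInt8[] # the UTF-8 data itself
#
# Switching bytesize to UInt32 would double the cap to ~4 GB, which is the
# suggestion above.
```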

@ysbaddaden (Contributor) commented:

I'm afraid this is a bad solution: File mustn't care about any limitation of String or Slice. This must be checked at the actual place that has the limitation.

@jhass definitely, sizes should be expressed with an unsigned integer (as is usually the case in C) so we could grow strings/slices up to 4GB. Maybe we could even consider the case where

@oprypin (Member) commented Aug 31, 2016

To me it sounds ridiculous to switch from one arbitrary limitation to another slightly bigger arbitrary limitation, especially at the price of having to deal with unsigned types (let's face it, non-Int32 types are a pain to deal with currently).

In my opinion, all sizes should be expressed with a 64-bit type (yes, ideally it would be platform-dependent, but that is also a pain to deal with in Crystal). The 32-bit plague can't go on. I've seen it in many places throughout the codebase, and this is only the beginning of the problems that will happen and may hurt the reputation of Crystal: "Oh, I can't work with big files in Crystal? Well, that's too bad."
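
To make the failure mode concrete, a tiny illustration of the overflow being discussed (the file size is made up; to_i32! is the wrapping conversion in current Crystal):

```crystal
# A file size above Int32::MAX cannot be represented by the Int32-based
# String/Slice sizes without wrapping (or raising).
size = 3_i64 * 1024 * 1024 * 1024  # e.g. a 3 GiB file
puts Int32::MAX                    # => 2147483647
puts size > Int32::MAX             # => true
puts size.to_i32!                  # => -1073741824 (silently wrapped)
```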

@ozra (Contributor) commented Aug 31, 2016

Yes, sizes and indices should definitely be signed 64-bit integers on a 64-bit-capable platform, IMO.

Or simply: 64-bit-wide integers should be the norm (for Crystal number literals too).

Int32 should only be used when doing something very specific, like heavy integer division, or for size optimization of structs/arrays or whatever.

@ysbaddaden (Contributor) commented:

I'd 👍 an Int64 for 64-bit platforms, but maybe we should use LibC::Long instead, for a platform-specific limitation. Could this avoid some performance hit on 32-bit platforms? Or would it be marginal?

Though, by using a signed integer, we do arbitrarily limit the maximum size to half of what it could be. This is especially noticeable on 32 bits.

I'd like some input from @waj and @asterite here.

@ozra (Contributor) commented Aug 31, 2016

The size limitation due to the sign bit isn't arbitrary; it's quite an acceptable one if it means fewer tedious type conflicts when coding (UInt* vs Int*).

Most hardware addressing schemes can't handle more than 47 bits, and the top bits of the address space are usually forbidden in user space anyway (they must be zero according to the spec).

I think it's safe to say that sacrificing that last bit is perfectly alright, since we won't come near touching it anyway. Such allocation sizes won't happen until 64-bit architectures have been dead for decades. (Just like we are now finally trying to retire 32-bit architectures - it's time!)

However, this is probably bike-shedding nit-picking, sorry.
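
For reference, the rough arithmetic behind the 47-bit addressing point above (illustrative):

```crystal
# 47 bits of user-space address space vs. the signed 64-bit limit.
puts (1_i64 << 47) // (1024_i64 ** 4)  # => 128 (TiB addressable with 47 bits)
puts Int64::MAX // (1024_i64 ** 6)     # => 7  (~8 EiB, the Int64 ceiling)
```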

@olbat (Contributor, Author) commented Aug 31, 2016

@luislavena this is only done in the case where no encoding is specified because, if I understood the code properly, file.gets_to_end is not affected by the bug since it reads the file a different way (using String.build plus a buffer).
Another way to fix this would be to use IO#gets_to_end in both cases, but then the file would no longer be read in "one shot" when no encoding is specified (it would be more consistent, though).
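
A sketch of that alternative, in which both branches go through IO#gets_to_end and the single-allocation fast path is dropped (illustrative only, not a proposed patch):

```crystal
# Hypothetical alternative: always read via gets_to_end, which buffers
# through String.build instead of allocating one Slice of the full size.
def self.read(filename, encoding = nil, invalid = nil) : String
  File.open(filename, "r") do |file|
    file.set_encoding(encoding, invalid: invalid) if encoding
    file.gets_to_end
  end
end
```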

@ysbaddaden @BlaXpirit @ozra Why not introduce a size_t-like type and use it wherever we have to deal with sizes? (see #3209)

That way, every modification of the primitive type used to represent sizes would be made in a single place (and the type could be chosen at compile time depending on the arch?).
Unfortunately, this implies a lot of changes in the standard library ...
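
A minimal sketch of that idea, assuming the bits64/bits32 compile-time flags are used to pick the width (the SizeT name and the flags used here are illustrative, not an existing stdlib type):

```crystal
# Hypothetical "SizeT"-style alias resolved at compile time; not part of
# the standard library, just an illustration of the idea above.
{% if flag?(:bits64) %}
  alias SizeT = Int64
{% else %}
  alias SizeT = Int32
{% end %}

# Sizes throughout the standard library would then be declared as SizeT,
# e.g. (hypothetical signature):
#
#   def read(slice : Slice(UInt8), count : SizeT)
#
# so changing the representation later only touches the alias definition.
```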

@olbat (Contributor, Author) commented Aug 31, 2016

BTW, if making this fix the right way (using 64-bit Slices/sizes, ...) takes too much time/effort, could we at least merge a quick fix like this one to avoid silent errors? (see #3209)

@RX14 (Member) commented Aug 31, 2016

@olbat considering the 0.19 release is a while away, we have time to settle on and implement the correct solution here.

@olbat (Contributor, Author) commented Aug 31, 2016

Okay, sorry, I'm not really aware of the release schedule; I was thinking that 0.19 would be released soon.

@ysbaddaden Why not LibC::SizeT instead of LibC::Long? I think it would ensure better compatibility with C bindings.
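
For comparison, what the two types resolve to on a typical 64-bit Linux target (illustrative; the exact aliases live in the per-platform lib_c bindings):

```crystal
# On 64-bit Linux, LibC::Long is a signed Int64 while LibC::SizeT is an
# unsigned UInt64, matching C's size_t as used by the C bindings.
puts sizeof(LibC::Long)   # => 8
puts sizeof(LibC::SizeT)  # => 8
```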

@akzhan (Contributor) commented Jun 14, 2017

Just a note: File.size has the LibC::OffT type.
