Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Zip::File, Zip::Reader and Zip::Writer #3901

Merged
merged 1 commit into from Jan 19, 2017
Merged

Add Zip::File, Zip::Reader and Zip::Writer #3901

merged 1 commit into from Jan 19, 2017

Conversation

asterite
Copy link
Member

Please check the docs for Zip, Zip::File, Zip::Reader and Zip::Writer. The docs explain why there are two types for reading a zip file.

I needed to add File#read_at and IO::Memory#read_at for this. This isn't strictly necessary, but provides a way to read multiple files from a zip file (stored in a File or IO::Memory) concurrently without problems, so you could for example extract multiple files concurrently. For File I used pread, and for IO::Memory I use a sub-view. Go also uses pread for reading from a zip file.

@bcardiff
Copy link
Member

I think there we are missing specs for zip with files nested in directories.
From this SO thread we should be able to write with / but both \ and / should be checked at parsing.

@asterite
Copy link
Member Author

@bcardiff Zip entries have a filename, for example "foo/bar/baz.txt". There's no notion of "directories" inside a zip file, you just add entries to it. A directory entry is assumed to have a trailing "/" and no content, but that's not enforced by the zip spec. What would the missing spec look like?

@bcardiff
Copy link
Member

I checked with a .zip created in windows and they do have forward slashes as path separator. This SO thread had better references.

Regarding missing specs, I though entries were traversed in a tree fashion. Maybe a spec could be added to clarify that.

@asterite
Copy link
Member Author

Entries are stored sequentially inside a zip file. I must admit that I learned everything about the zip file format for implementing this functionality and I also didn't know how zip file entries were stored. I don't know how much of the spec we should repeat in the docs, though... For example, in the spec it says:

4.3 General Format of a .ZIP file
---------------------------------

 4.3.3 Files MAY be stored in arbitrary order within a ZIP file.

So you can put first a file named "foo/bar/baz.txt", then a directory named "foo/", then a directory named "foo/bar/" and so on, without any order that relates to a tree in the filesystem.

There's also the thing that Zip::Reader reads zip entries sequentially from a stream, so this means the order comes from the zip file and can't be that of a tree. For Zip::File we could do something "smarter", but maybe people expect the entries to be in a specified order, because a zip file maintains an order (so if at one point I compress a file and put file "a.txt" in the beginning and in another point in the program I decompress it, I can be very sure that the first entry will be "a.txt")

@spalladino spalladino self-requested a review January 17, 2017 15:31
@Sija
Copy link
Contributor

Sija commented Jan 17, 2017

Does it support ZIP64 archives?

@asterite
Copy link
Member Author

@Sija No, ZIP64 is not supported, there's a note I left in the doc comment of the Zip module. It can be implemented later in a separate PR. ZIP64 allows you to compress files bigger than 4GB, which shouldn't be terribly common.

@Sija
Copy link
Contributor

Sija commented Jan 17, 2017

@asterite Roger that, thx for clarifying.

#
# `Zip::File` is the preferred method to read zip files if you have
# can provide a `File`, because it's a bit more flexible and provides
# more complete information for zip entries (such as comments).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] if you have can provide [...] — there's either too much or too lil' of sth ;)

Copy link
Contributor

@spalladino spalladino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asterite I've added a few comments: most of them questions or suggestions. There is only one small thing I think could be a bug, but maybe I'm wrong.


if self_bytesize < 0
raise ArgumentError.new("negative bytesize")
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this check be on bytesize instead of self_bytesize?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right! I'll fix it

io.gets_to_end.should eq(File.read(filename))
end
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a spec on the invalid arguments cases: negative offset and offset plus size out of bounds

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I see there are such specs for memory. Maybe add them to file as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I'll add a spec too :-)


def unbuffered_write(slice : Bytes)
raise IO::Error.new("can't flush read-only IO")
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flush => write?

end

typeof(Zip::File.new("file.zip"))
typeof(Zip::File.open("file.zip") { })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these two typeof calls for here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They check that these calls compile, regardless of what they do. Tests for these should almost be the same as previous tests, but I want to make sure that their body compiles fine.

typeof(Zip::Reader.open("file.zip") { })

typeof(Zip::Writer.new("file.zip"))
typeof(Zip::Writer.open("file.zip") { })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, maybe I'm missing some standard practice in the specs?

# Computes a CRC32 while reading from an underlying IO,
# optionally verifying the computed value against an
# expected one.
private class ChecksumReader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this class to be private and scoped in the Zip module? As far as I understand, CRC32 is quite useful in many other applications, so maybe we could move it to Zlib and make it public. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, maybe in a next PR. I checked some other implementations, for example Go has something similar and it's private to the zip package, maybe it's not that useful in other contexts.

module Zip
# Counts written bytes and optionally computes a CRC32
# checksum while writing to an underlying IO.
private class ChecksumWriter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as with Reader

directory_end_offset = find_directory_end_offset
entries_size, directory_offset = read_directory_end(directory_end_offset)
@entries = Array(Entry).new(entries_size)
@entries_by_filename = {} of String => Entry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zip allows duplicate entries, though it seems to be a VERY rare use case, and can still be handled by simply iterating the @entries. If we want to support it, then @entries_by_filename should be a hash from string to Array(Entry), which I don't particularly like since it makes the API more awkward. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took inspiration from Java's API. Apparently C# has it too. They don't mention the possibility of duplicated entries. Maybe it's so uncommon that it isn't worth considering, and a nicer API is better.

io.skip(22)
filename_length = read(io, UInt16)
extra_length = read(io, UInt16)
@offset + 30 + filename_length + extra_length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna go ahead and trust you in the 22 and 30 there :-P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's basically skipping 22 bytes from the header instead of reading them one by one. 30 is the size of the header without counting the filename length and extra length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a code comment about those constants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RX14 @spalladino here you go: 6952e76

zip["one.txt"].open(&.gets_to_end).should eq("One")
zip["two.txt"].open(&.gets_to_end).should eq("Two")
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was the test.zip generated? I was going to suggest adding a spec to read a zip file generated with a standard tool (otherwise we are just checking that we can read the zip files that we have written ourselves), but if this file was created somehow else, then that covers it.

However, the other way around would also apply. Can we can rely to have a zip command-line tool on the CI (at least for the major architectures), so we test that it can open zipfiles generated from Crystal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generated test.zip by creating a zip file using Mac, right-click + compress. In fact when I did that I found some bugs in the original code I had, so that's why later I decided to add a zip file that wasn't generated by the same Zip module.

I don't know if we have a zip command line tool on the CI, but it's definitely a possibility (but maybe fore the future)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding zip to CI shouldn't be hard in the future.

Additionally add File#read_at and IO::Memory#read_at
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants