Add MIME multipart parsing #2967

RX14 · 2016-07-07T16:24:18Z

This pull request adds classes for parsing and generating MIME multipart/* messages, based on RFC2046 Section 5.1.

This pr depends on #2972, #2973.

kostya · 2016-07-07T22:36:53Z

interesting how this performance compare to https://github.com/kostya/http_parser.cr/blob/master/benchmark/bench_multipart.cr

RX14 · 2016-07-07T23:03:58Z

From preliminary benchmarks it looks about half as fast:

https://gist.github.com/RX14/e74d8bb3e812485259274591d5f45d67

Not sure exactly what to improve, i'm sure the somewhat increased user-friendliness of this implementation has something to do with it.

ysbaddaden · 2016-07-08T13:27:01Z

What about having a MIME::Multipart class to handle generic MIME multipart (for form data or email parts) along with MIME::Multipart::Parser, then having a HTTP::FormData class to build a multipart/form-data body along with HTTP::FormData::Parser to parse an incoming request?

Maybe have a simpler API to build a multipart/form-data, with headers being generated automatically, or even skipped to directly write the single Content-Disposition (for FormData). Also body_part looks like a getter, what about add_part or just add or append (to follow the DOM FormData interface)?

data = HTTP::FormData.new
data.add("username", "john")

File.open("avatar.jpg", "r") do |file|
  data.add("file", file, HTTP::Headers{"content-type" => "image/jpeg"})
end

We miss a mecanism in HTTP::Client to directly write the body to the IO, but we could allow body to accept a HTTP::FormData in addition to String|Nil?

HTTP::Client.put("https://api.example.com/user/avatar", body: data)

RX14 · 2016-07-08T17:29:50Z

I'm ok with renaming HTTP::Multipart to MIME::Multipart, but I wish to keep Parser and Generator.

I was thinking of having a simpler API for multipart/form-data, but I wasn't quite sure what it would be. Being able to provide the form data as a hash is something I wish to support however. As for parsing, I intentionally wish to make the default parsing method able to handle large files without resorting to the ugly hack that rack (and many other libraries) use: writing all large files out to disk and providing uploads as files. This means callbacks with an IO given, at least for all files.

The kind of interface I am currently thinking of for parsing multipart requests would be callback-based like this:

HTTP::FormData.parse(request) do |p|
  p.field("username") do |username|
    # handle username field
  end
  p.file("upload") do |io|
    # use IO
  end
end

However this approach is a little unwieldy for parsing non-file fields so maybe this would be better, if a little less consistent:

mp = HTTP::FormData::Parser.new(request)
mp.handle_file("upload") do |io, filename(, headers)|
  # handle file somehow
end
mp.handle_files do |field, io, filename(, headers)|
  # callback for every file upload
  # cannot be used with `#handle_file`
end
mp.parse # => {"username" => {value: "rx14", headers: HTTP::Headers{...}}, ...}

My preference for generating FormData would be with a builder, or providing a Hash(String, String | IO | {value: String | IO, headers: HTTP::Headers}).

ysbaddaden · 2016-07-08T20:21:56Z

What about yielding each part as it comes, where field would be a struct with some info (eg: headers, name, filename, content type, ...) and value either being a String or an IO when the part is a file?

For example (yes, it uses a tempfile, but this is an usage example):

record UploadedFile, filename : String, file : Tempfile
params = Hash(String, String|UploadedFile).new

HTTP::FormData.parse(request) do |field, value|
  case value
  when IO
    tmp = Tempfile.new("multipart")
    IO.copy(value, tmp)
    params[field.name] = UploadedFile.new(field.filename, tmp)
  else
    params[field.name] = value
  end
end

RX14 · 2016-07-08T20:53:31Z

That kind of boilerplate was what I wanted to avoid. With that method you're essentially going to end up with people switch-casing in their block based on field names. Having callbacks based on field-name seems like the most user-friendly way to return io objects, and returning a hash seems to be the most user-friendly way to return strings/non-files. The question is whether to use callbacks for both for consistency, or do both which will likely be easier to use.

Maybe if the FormData callbacks were basically transformers from io to another storable value.

For example:

mp = HTTP::FormData::Parser.new(request)
mp.transform_file("upload") do |io, filename, headers|
  tempfile = Tempfile.new(filename)
  IO.copy(io, tempfile)
  tempfile
end
mp.parse # => {"username" => {value: "rx14", headers: HTTP::Headers{...}}, "upload" => {value: Tempfile, headers: HTTP::Headers{...}}}

would be the way to implement saving files to disk for later usage. This is flexible, because the transformer could handle the io right there, and map to nil. Not quite sure how to implement it with regards to typing though.

RX14 · 2016-07-08T22:16:36Z

Ok, After discussing on IRC I think the correct interface is like so:

username = nil
password = nil
email = nil
avatar_url = nil
HTTP::FormData.parse(request) do |p|
  p.field("username") { |u| username = u }
  p.field("password") { |p| password = p }
  p.field("email") { |e| email = e }
  p.file("avatar") do |io|
    # write avatar
    # generate url
    avatar_url = url
  end
end

It's got a little bit of boilerplate for collecting fields, but I think it's consistent. If anyone can work out a way of getting rid of the boilerplate cleanly, it would be appreciated.

RX14 · 2016-07-09T12:56:58Z

I can't rename HTTP::Multipart to MIME::Multipart because I depend on HTTP::Headers and the http header parsing stuff. Returning HTTP::Headers, and requiring http, from the MIME namespace seems weird. Neither do I want to duplicate HTTP::Headers etc. to the MIME namespace.

The most technically correct option would be to note that rfc5322 is the basis for MIME and HTTP headers, although i'm not quite sure HTTP headers and MIME headers have exactly the same format and semantics.

asterite · 2016-07-09T14:24:35Z

bin/crystal

@@ -89,7 +89,7 @@ _canonicalize_file_path() {
    local dir file
    dir=$(dirname -- "$1")
    file=$(basename -- "$1")
-    (cd "$dir" 2>/dev/null && printf '%s/%s\n' "$(pwd -P)" "$file")
+    (cd "$dir" 2>/dev/null >/dev/null && printf '%s/%s\n' "$(pwd -P)" "$file")


Why is this needed? What's a way to reproduce something that, without this change, breaks?

I set CDPATH on my system, which causes cd to output which directory it's cding to, because it can be ambiguous.

asterite · 2016-07-09T15:06:05Z

Overall looks good!

Maybe you could split this PR into several PRs so it's easy to discuss some parts of it? For example IO::Sized is never used outside its specs, so probably doesn't belong here. Then another PR for just IO::Delimited would be good. And once we accept those, we can discuss this PR, which should be much smaller.

What do you think?

asterite · 2016-07-09T15:06:34Z

Oh, and a separate PR for bin/crystal, it's not clear to me why it's needed, so it should be discussed separately.

RX14 · 2016-07-09T20:04:31Z

Ok, PR has been split into 4.

RX14 · 2016-07-15T22:39:37Z

This also needs HTTP::Request's body to be an IO to actually be useful. What issues/problems are we going to face doing that? Seems to be a bit of a pain to do from the way HTTP::Request works at the moment.

sdogruyol · 2016-07-30T18:55:31Z

Up! What's the status on this please 🙏

RX14 · 2016-07-30T22:09:09Z

Just have to document and finish the specs. Nearly there!

asterite · 2016-08-04T14:51:04Z

I wouldn't expose the IO, I would make the parser an IO that's always positioned at some multipart part. So in the beginning it would probably start in the epilogue (or maybe this should be automatically skipped, I don't know), then you read until you don't have any more data. Then you call next_part and the IO now points to the next part, and so on. So you can't really keep a reference to some part, you are always working with the entire multipart body positioned at some point.

RX14 · 2016-08-04T14:55:30Z

@asterite the Multipart::Parser doesn't do callbacks, it just passes the io into a block to try to show the lifetime of the IO::Delimited clearly. FormData::Parser does do callbacks because I just couldn't find a better way to do it while keeping it clean and not buffering the whole formdata in memory.

I also think that having the parser be an IO and a parser is a bit messy, and tries to combine two seperate concerns into one object. I believe the current method is cleaner and more intuitive.

asterite · 2016-08-04T16:32:25Z

@RX14 Oh, now I see only the form data has callbacks... I'm still not sure about that, I'll probably need to review this with @waj at some point (and I'll make a big effort to do this before 0.19.0)

sdogruyol · 2016-08-08T18:30:34Z

Up up up 👍

sdogruyol · 2016-09-01T12:33:57Z

@asterite aren't you gonna release this with 0.19.0 😢

RX14 · 2016-09-01T12:47:37Z

@sdogruyol why wouldn't it? It's still on the 0.19.0 milestone.

asterite · 2016-09-01T12:47:46Z

I'm removing the 0.19.0 milestone. I won't merge this as long as it uses callbacks. Programming against callbacks requires a different way of thinking and organizing code than in regular sequential code (or pull parsers).

asterite · 2016-09-01T13:15:49Z

@sdogruyol It's not a big deal, one can create a shard with this code and use it until we have this functionality in the std

sdogruyol · 2016-09-01T13:49:09Z

@asterite actually it's not that easy to create an efficient and correct multipart parser since it's kinda hard to deal with(i tried at least 3 times and failed).

I was actually waiting on this to release the next version of Kemal 😞

RX14 · 2016-09-01T14:04:04Z

I'm not quite sure how the callbacks in this PR are semantically different to the server callbacks in HTTP::Server etc.

For example, see these two examples:

HTTP::Server.listen(8080) do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world, got #{context.request.path}!"
end

HTTP::FormData.parse(request) do |parser|
  parser.field("field1") do |data|
    data # => "field data"
  end
end

Both use blocks (callbacks). Both of which are executed in response to an event (HTTP request/FormData section parse). Both blocks are executed before the call returns. Both accept a raw .new form which requires an extra call to start listening, or parsing. As far as I can see the only difference is that the multipart parser allows multiple callbacks (which Benchmark.bm has too, in the same style as this PR even). Where is the line between a do block and a callback anyway? When it's stored for later execution? Both examples do that.

Even if you could draw a line between this use of callbacks and all the others in the standard library (which is every single do block if you take the wikipedia definition), then I still wouldn't reject this PR just because it has a different style of programming. You use the correct tools for the correct jobs, and I have spent a lot of time thinking about this PR, and I think callbacks are the correct tool in this case.

jwoertink · 2016-09-01T15:51:21Z

Just out of curiosity on this subject, couldn't this work similar to how File.open stuff works? (at least in ruby since I haven't done it in crystal yet). Maybe there could be advantages/disadvantages to each style like callback style could be executed immediately, and object style could be lazy evaluated (similar to Rails Model.where()).

# As an object
f = File.open('filename.txt', 'w+')
# ... do stuff

# As a callback
File.open('filename.txt', 'w+') do |f|
  # ... do stuff
end

ozra · 2016-09-02T01:43:13Z

Seems like "iterating" over the fields would be best? They do come in sequentially, and so it should be possible to stream the data as it arrives? (I was hit by this when customers started uploading GB's of video to their presentations pages in a system we wrote in node.js. After changing to continually parsing and handling the multipart there was no problem uploading bluerays through it.)

RX14 · 2016-09-03T15:50:00Z

@asterite Would it be possible to get this merged in 0.19.1? Can you reply to my comment?

asterite · 2016-09-03T16:02:15Z

@RX14 No, sorry. This PR is huge for me to digest and I wouldn't have used callbacks for this.

I'd suggest to move this to a shard and use that. If with time it turns out to work well, and I or someone else didn't make an alternative implementation in the standard library, we can include it.

asterite · 2016-09-03T16:07:23Z

@RX14 The main issue I have with callback-based parsers is that you can't easily make a pull-parser out of them. But you can make it the other way around. Also, callbacks will probably involve closures, so heap memory. So pull parsers are faster and more flexible.

RX14 · 2016-09-03T16:22:54Z

@asterite Currently the Multipart part of the PR is a pull parser, I think we should merge that at least. The FormData part is a thin wrapper around the Multipart pull parser. If I separate a pull parser out of the FormData parser (which I think is kind of ugly), would you merge the PR if it still retained the callback parser?

RX14 · 2016-09-03T16:27:38Z

@asterite My problem with a pull parser for FormData is that you often have a large number of fields to your form and you end up having a massive switch statement, where the string fields are kind of ugly using gets_to_end. So I would advocate including both types of parsers for practical reasons.

RX14 · 2016-09-03T16:36:59Z

This PR is now solely for the raw multipart/mixed part of the PR. Seperate PRs for a pull-based multipart/form-data and a block-based multipart/form-data parser are incoming.

RX14 · 2016-10-01T15:45:59Z

This PR and #3243 are now in a shard: https://github.com/RX14/multipart.cr

RX14 · 2016-12-18T12:31:58Z

@asterite the shard has been in production on a site I develop, handing uploads with no complaints for a month. I'm pretty sure it's stable. Is it worth attempting to get this merged again?

asterite · 2016-12-18T17:48:35Z

@RX14 Sure! Please send another PR, we can probably merge it. We can always improve things just before 1.0, if we think it's worth it (your shard implementation is probably good enough ^_^)

sdogruyol · 2016-12-18T17:50:20Z

@asterite i second @RX14. It's been performing great at production 👍

RX14 force-pushed the feature/multipart branch from 38c3399 to 00ccf9f Compare July 7, 2016 16:53

RX14 mentioned this pull request Jul 7, 2016

HTTP::Request multipart/form-data handling #1989

Closed

RX14 force-pushed the feature/multipart branch from 00ccf9f to ab28aa8 Compare July 9, 2016 11:23

asterite reviewed Jul 9, 2016
View reviewed changes

RX14 force-pushed the feature/multipart branch 3 times, most recently from b3bce73 to ab82e2a Compare July 9, 2016 21:14

jhass added the topic:stdlib label Jul 11, 2016

sdogruyol mentioned this pull request Jul 13, 2016

Handling multipart/form-data kemalcr/kemal#40

Closed

RX14 force-pushed the feature/multipart branch 3 times, most recently from b0fb06c to 3a1367c Compare July 14, 2016 20:58

RX14 force-pushed the feature/multipart branch from 38cd703 to b1c4694 Compare July 28, 2016 23:30

RX14 force-pushed the feature/multipart branch 3 times, most recently from 8df2e45 to eaa153f Compare July 30, 2016 23:32

jhass added this to the 0.19.0 milestone Aug 4, 2016

asterite removed this from the 0.19.0 milestone Sep 1, 2016

RX14 added 2 commits September 3, 2016 17:31

Add HTTP::Multipart::Parser and convenience methods

0fdc856

Add HTTP::Multipart::Generator and convenience methods

ec774e5

RX14 force-pushed the feature/multipart branch 2 times, most recently from 3f717be to ec774e5 Compare September 3, 2016 16:35

RX14 mentioned this pull request Sep 3, 2016

Add HTTP::FormData parsing and generating #3243

Closed

RX14 closed this Dec 18, 2016

Add MIME multipart parsing #2967

Add MIME multipart parsing #2967

Conversation

RX14 commented Jul 7, 2016 • edited Loading

kostya commented Jul 7, 2016

RX14 commented Jul 7, 2016 • edited Loading

ysbaddaden commented Jul 8, 2016

RX14 commented Jul 8, 2016

ysbaddaden commented Jul 8, 2016

RX14 commented Jul 8, 2016

RX14 commented Jul 8, 2016

RX14 commented Jul 9, 2016 • edited Loading

asterite Jul 9, 2016

Choose a reason for hiding this comment

RX14 Jul 9, 2016

Choose a reason for hiding this comment

asterite commented Jul 9, 2016

asterite commented Jul 9, 2016

RX14 commented Jul 9, 2016

RX14 commented Jul 15, 2016

sdogruyol commented Jul 30, 2016

RX14 commented Jul 30, 2016

asterite commented Aug 4, 2016

RX14 commented Aug 4, 2016 • edited Loading

asterite commented Aug 4, 2016

sdogruyol commented Aug 8, 2016

sdogruyol commented Sep 1, 2016

RX14 commented Sep 1, 2016

asterite commented Sep 1, 2016

asterite commented Sep 1, 2016

sdogruyol commented Sep 1, 2016

RX14 commented Sep 1, 2016

jwoertink commented Sep 1, 2016

ozra commented Sep 2, 2016

RX14 commented Sep 3, 2016

asterite commented Sep 3, 2016

asterite commented Sep 3, 2016

RX14 commented Sep 3, 2016

RX14 commented Sep 3, 2016 • edited Loading

RX14 commented Sep 3, 2016

RX14 commented Oct 1, 2016

RX14 commented Dec 18, 2016

asterite commented Dec 18, 2016

sdogruyol commented Dec 18, 2016

RX14 commented Jul 7, 2016 •

edited

Loading

RX14 commented Jul 7, 2016 •

edited

Loading

RX14 commented Jul 9, 2016 •

edited

Loading

RX14 commented Aug 4, 2016 •

edited

Loading

RX14 commented Sep 3, 2016 •

edited

Loading