-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MIME multipart parsing #2967
Conversation
interesting how this performance compare to https://github.com/kostya/http_parser.cr/blob/master/benchmark/bench_multipart.cr |
From preliminary benchmarks it looks about half as fast: https://gist.github.com/RX14/e74d8bb3e812485259274591d5f45d67 Not sure exactly what to improve, i'm sure the somewhat increased user-friendliness of this implementation has something to do with it. |
What about having a Maybe have a simpler API to build a data = HTTP::FormData.new
data.add("username", "john")
File.open("avatar.jpg", "r") do |file|
data.add("file", file, HTTP::Headers{"content-type" => "image/jpeg"})
end We miss a mecanism in HTTP::Client.put("https://api.example.com/user/avatar", body: data) |
I'm ok with renaming I was thinking of having a simpler API for The kind of interface I am currently thinking of for parsing multipart requests would be callback-based like this: HTTP::FormData.parse(request) do |p|
p.field("username") do |username|
# handle username field
end
p.file("upload") do |io|
# use IO
end
end However this approach is a little unwieldy for parsing non-file fields so maybe this would be better, if a little less consistent: mp = HTTP::FormData::Parser.new(request)
mp.handle_file("upload") do |io, filename(, headers)|
# handle file somehow
end
mp.handle_files do |field, io, filename(, headers)|
# callback for every file upload
# cannot be used with `#handle_file`
end
mp.parse # => {"username" => {value: "rx14", headers: HTTP::Headers{...}}, ...} My preference for generating FormData would be with a builder, or providing a |
What about yielding each part as it comes, where For example (yes, it uses a tempfile, but this is an usage example): record UploadedFile, filename : String, file : Tempfile
params = Hash(String, String|UploadedFile).new
HTTP::FormData.parse(request) do |field, value|
case value
when IO
tmp = Tempfile.new("multipart")
IO.copy(value, tmp)
params[field.name] = UploadedFile.new(field.filename, tmp)
else
params[field.name] = value
end
end |
That kind of boilerplate was what I wanted to avoid. With that method you're essentially going to end up with people switch-casing in their block based on field names. Having callbacks based on field-name seems like the most user-friendly way to return io objects, and returning a hash seems to be the most user-friendly way to return strings/non-files. The question is whether to use callbacks for both for consistency, or do both which will likely be easier to use. Maybe if the FormData callbacks were basically transformers from io to another storable value. For example: mp = HTTP::FormData::Parser.new(request)
mp.transform_file("upload") do |io, filename, headers|
tempfile = Tempfile.new(filename)
IO.copy(io, tempfile)
tempfile
end
mp.parse # => {"username" => {value: "rx14", headers: HTTP::Headers{...}}, "upload" => {value: Tempfile, headers: HTTP::Headers{...}}} would be the way to implement saving files to disk for later usage. This is flexible, because the transformer could handle the io right there, and map to nil. Not quite sure how to implement it with regards to typing though. |
Ok, After discussing on IRC I think the correct interface is like so: username = nil
password = nil
email = nil
avatar_url = nil
HTTP::FormData.parse(request) do |p|
p.field("username") { |u| username = u }
p.field("password") { |p| password = p }
p.field("email") { |e| email = e }
p.file("avatar") do |io|
# write avatar
# generate url
avatar_url = url
end
end It's got a little bit of boilerplate for collecting fields, but I think it's consistent. If anyone can work out a way of getting rid of the boilerplate cleanly, it would be appreciated. |
I can't rename The most technically correct option would be to note that rfc5322 is the basis for MIME and HTTP headers, although i'm not quite sure HTTP headers and MIME headers have exactly the same format and semantics. |
@@ -89,7 +89,7 @@ _canonicalize_file_path() { | |||
local dir file | |||
dir=$(dirname -- "$1") | |||
file=$(basename -- "$1") | |||
(cd "$dir" 2>/dev/null && printf '%s/%s\n' "$(pwd -P)" "$file") | |||
(cd "$dir" 2>/dev/null >/dev/null && printf '%s/%s\n' "$(pwd -P)" "$file") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? What's a way to reproduce something that, without this change, breaks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set CDPATH
on my system, which causes cd
to output which directory it's cding to, because it can be ambiguous.
Overall looks good! Maybe you could split this PR into several PRs so it's easy to discuss some parts of it? For example What do you think? |
Oh, and a separate PR for |
Ok, PR has been split into 4. |
b3bce73
to
ab82e2a
Compare
b0fb06c
to
3a1367c
Compare
This also needs |
Up! What's the status on this please 🙏 |
Just have to document and finish the specs. Nearly there! |
8df2e45
to
eaa153f
Compare
I wouldn't expose the |
@asterite the I also think that having the parser be an IO and a parser is a bit messy, and tries to combine two seperate concerns into one object. I believe the current method is cleaner and more intuitive. |
Up up up 👍 |
@asterite aren't you gonna release this with 0.19.0 😢 |
@sdogruyol why wouldn't it? It's still on the 0.19.0 milestone. |
I'm removing the 0.19.0 milestone. I won't merge this as long as it uses callbacks. Programming against callbacks requires a different way of thinking and organizing code than in regular sequential code (or pull parsers). |
@sdogruyol It's not a big deal, one can create a shard with this code and use it until we have this functionality in the std |
@asterite actually it's not that easy to create an efficient and correct multipart parser since it's kinda hard to deal with(i tried at least 3 times and failed). I was actually waiting on this to release the next version of Kemal 😞 |
I'm not quite sure how the callbacks in this PR are semantically different to the server callbacks in For example, see these two examples: HTTP::Server.listen(8080) do |context|
context.response.content_type = "text/plain"
context.response.print "Hello world, got #{context.request.path}!"
end HTTP::FormData.parse(request) do |parser|
parser.field("field1") do |data|
data # => "field data"
end
end Both use blocks (callbacks). Both of which are executed in response to an event (HTTP request/FormData section parse). Both blocks are executed before the call returns. Both accept a raw Even if you could draw a line between this use of callbacks and all the others in the standard library (which is every single do block if you take the wikipedia definition), then I still wouldn't reject this PR just because it has a different style of programming. You use the correct tools for the correct jobs, and I have spent a lot of time thinking about this PR, and I think callbacks are the correct tool in this case. |
Just out of curiosity on this subject, couldn't this work similar to how # As an object
f = File.open('filename.txt', 'w+')
# ... do stuff
# As a callback
File.open('filename.txt', 'w+') do |f|
# ... do stuff
end |
Seems like "iterating" over the fields would be best? They do come in sequentially, and so it should be possible to stream the data as it arrives? (I was hit by this when customers started uploading GB's of video to their presentations pages in a system we wrote in node.js. After changing to continually parsing and handling the multipart there was no problem uploading bluerays through it.) |
@asterite Would it be possible to get this merged in 0.19.1? Can you reply to my comment? |
@RX14 No, sorry. This PR is huge for me to digest and I wouldn't have used callbacks for this. I'd suggest to move this to a shard and use that. If with time it turns out to work well, and I or someone else didn't make an alternative implementation in the standard library, we can include it. |
@RX14 The main issue I have with callback-based parsers is that you can't easily make a pull-parser out of them. But you can make it the other way around. Also, callbacks will probably involve closures, so heap memory. So pull parsers are faster and more flexible. |
@asterite Currently the |
@asterite My problem with a pull parser for FormData is that you often have a large number of fields to your form and you end up having a massive switch statement, where the string fields are kind of ugly using |
3f717be
to
ec774e5
Compare
This PR is now solely for the raw |
This PR and #3243 are now in a shard: https://github.com/RX14/multipart.cr |
@asterite the shard has been in production on a site I develop, handing uploads with no complaints for a month. I'm pretty sure it's stable. Is it worth attempting to get this merged again? |
@RX14 Sure! Please send another PR, we can probably merge it. We can always improve things just before 1.0, if we think it's worth it (your shard implementation is probably good enough ^_^) |
This pull request adds classes for parsing and generating MIME
multipart/*
messages, based on RFC2046 Section 5.1.This pr depends on #2972, #2973.