Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and improve URI implementation #6323

Merged
merged 13 commits into from Apr 5, 2019

Conversation

straight-shoota
Copy link
Member

@straight-shoota straight-shoota commented Jul 3, 2018

This PR applies a ton of mostly small improvements to URI and URI::Parser.

  • The type of URI#path is changed to String. It is no longer nilable. The semantic is: every URI has a path, but it may be empty. This avoids ambiguity between empty path and nil path.
  • The macro URI::Parser.step is removed. It was previously used to spec internals of URI::Parser but it no longer works because the struct can't be inherited. It doesn't make much sense to spec the internal API anyway, having unit tests for URI.parse is totally fine. Consequently, I have removed the specs for URI::Parser and integrated them into the specs for URI.parse.
  • Most notable is probably the removal of the property URI#opaque. It is simply not necessary to have an extra ivar for this. It is replaced by URI#path which contains both a path in the hierarchical part as well as a non-hierarchical, opaque path. URI#opaque? answers if the URI is considered opaque.
  • The spec suite has been reorganized and I added a bunch of new examples, borrows from gloang and akka.
  • During that process I discovered and fixed a few issues in the parser and formatter:
    • #to_s no longer hides the port if it is the default port. A default port can now be stripped using #normalize.
    • #parse downcases the scheme to make it case-insensitive. This is recommended by RFC 3986.
    • #parse decodes escape sequences in host, and #to_s applies them.
    • #parse correctly parses an empty port. Previously, scheme://host:/ would set port to 0, now it's nil.
    • #parse parses a string starting with three slashes as empty host. Previously, host was nil.
    • #to_s inserts a slash if path does not start with one and follows an authority. Paths starting with double slash // are prefixed by /. if it does not follow an authority. Otherwise the first path component would be interpreted as authority.

describe ".parse" do
# scheme
assert_uri("http:", scheme: "http")
it { URI.parse("HttP:").should eq(URI.new(scheme: "http")) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are missing a spec of what the result of #scheme on a missing scheme is. Some tests below lacks scheme but none tests what #scheme is in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean. There are many specs for no-scheme URIs. For example assert_uri("/foo", path: "/foo"). It ensures that the URI parsed from "/foo" has a path of /foo and all other properties at the default value. This means scheme should be nil.

@yxhuvud
Copy link
Contributor

yxhuvud commented Jul 3, 2018

The test failures make me wonder if eq should handle default ports.

@straight-shoota
Copy link
Member Author

Nah, the specs are only failing because of the previous incorrect behavior of ommitting the port if it was specified but equal to the default.

Comparing two URIs should be strict. If a port is specified, even if it's the default, there are chances this is on purpose. http://example.com and http://example.com:80 are not identical. You can easily use #normalize to remove default ports.

@asterite
Copy link
Member

asterite commented Jul 3, 2018

$ irb
irb(main):001:0> require "uri"
=> true
irb(main):002:0> u1 = URI.parse("http://foo.com:80")
=> #<URI::HTTP http://foo.com>
irb(main):003:0> u2 = URI.parse("http://foo.com")
=> #<URI::HTTP http://foo.com>
irb(main):004:0> u1 == u2
=> true

@asterite
Copy link
Member

asterite commented Jul 3, 2018

I'm not saying we should copy Ruby... but specifying a default port when it's the default one, and keeping it around... I don't know... there's no difference, really.

@straight-shoota
Copy link
Member Author

straight-shoota commented Jul 3, 2018

@asterite Yeah, Ruby does that. But it also uses a scheme-specific type URI::HTTP. Maybe that can serve as an excuse...

Besides, just a few counter examples:

u, err := url.Parse("http://foo.org:80")
fmt.Println(u.Port()) // => 80
System.out.println(new URI("http://foo.com:80").getPort()); // => 80
urlparse('http://foo.com:80').port # => 80

@straight-shoota
Copy link
Member Author

there's no difference, really.

For well-known schemes, probably not. But when you're using a custom scheme, it might not be given that the next consumer of a URI has the same default port configured.

URI should be a simple data container and serialize to and from string representation without any further processing. If you want to get rid of unnecessary default ports, use #normalize.

URI.parse(string).to_s == string should hold for any well-formed URI.

@asterite
Copy link
Member

asterite commented Jul 3, 2018

Sounds good

src/uri.cr Show resolved Hide resolved
@straight-shoota
Copy link
Member Author

🏓 A second review would be great!

@RX14 RX14 requested a review from bcardiff October 6, 2018 12:01
src/uri.cr Show resolved Hide resolved
@straight-shoota
Copy link
Member Author

@bcardiff 🏓

@straight-shoota
Copy link
Member Author

I'd really like to get this merged for 0.28.0. It's so annoying to have to work around URI bugs I fixed six months ago.

@straight-shoota straight-shoota added this to the 0.28.0 milestone Jan 7, 2019
@bararchy
Copy link
Contributor

bararchy commented Jan 7, 2019

I'm with @straight-shoota please let's get this merged

it { URI.new("http", "www.example.com", 80, "/hello", "a=1").to_s.should eq("http://www.example.com/hello?a=1") }
it { URI.new("mailto", opaque: "foo@example.com").to_s.should eq("mailto:foo@example.com") }

it "removes default port" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not normalize ports from default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no generic default ports in the URI specification. It depends solely on configuration which port is considered "default".

When no port is specified, the resolver is entrusted to choose whichever the port it seems fit. But when a port is explicitly specified, the resolver is asked to use that specific port, regardless of it's configuration for default ports.

Thus removing a port (even if it is considered default) mutates the URI. It looses information.
A well-formed URI should not be automatically stripped of anything unless explicitly asked (for example using #normalize).

@straight-shoota straight-shoota force-pushed the jm/fix/uri-refactor branch 2 times, most recently from 8f1be0b to a88776b Compare January 7, 2019 22:16
@asterite
Copy link
Member

asterite commented Jan 8, 2019

I think it makes sense to keep the port if the URI has it set explicitly, even if it's the default port for that scheme.

src/uri.cr Show resolved Hide resolved
@straight-shoota
Copy link
Member Author

Rebased and ready to ship.

@jkthorne
Copy link
Contributor

@straight-shoota should all the parse_ methods in the src/uri/uri_parser.cr file be private?

@jkthorne
Copy link
Contributor

I just realized that might have been out of scope for this refactor but thanks for moving the parse methods to private.

src/uri.cr Show resolved Hide resolved
@straight-shoota
Copy link
Member Author

@wontruefree Yeah I thought so at first. But then, what the heck... it's a simple thing and fits to the set of changes in this PR.

Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a rebase and fix some conflict and we can merge this.

Every URI has a path, but it may be empty. Making the ivar non-nilable
avoids any ambiguity whether an empty path is represented by an empty
string (`""`) or `nil`.
This macro was previously used to spec internals of URI::Parser, but it
no longer works because the struct can't be inherited. It doesn't make
much sense to spec the internal API anyway. Having unit tests for
URI.parse is sufficient.
The specs for URI::Parser integrated into the specs for URI.parse.
Previously, `#to_s` would omit the port if it matches the default port,
even if it is explicitly specified. This ommitance changes the URI and
must be avoided, unless explicitly asked to normalize.

Changes `#normalize` to remove the port if matches the schema's
default port.
It is not necessary to have an extra ivar for an opaque path, URI#path
is sufficient and can hold the path of both opaque and hiearchicyl URIs.

Adds URI#opaque? to tell whether the URI is considered opaque.
URI schemes are case-insensitive, this ensures that. This approach
is recommended by RFC 3986.
URI.prase decodes escape sequences in host, URI#to_s applies them where
necessary.
This is already some implicity normalization but results in the same
URI, although in a differen string representation.
An empty port was previously interpreted as port `0`. Now it's `nil`.
An URI starting with three slashes (`///`) has an empty authority but
it's still a hierarchical URI. `host` must be the empty string,
not `nil`.
`#to_s` inserts a slash if `path` does not start with one and follows a
non-nil authority.
Paths starting with double slash (`"//"`) are prefixed by ("/.`) if it
does not follow an authority. Otherwise the URI would be considered
relative to the schema and the first path component interpreted as authority.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants