-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make String and Char methods like downcase and upcase unicode aware #3560
Conversation
20bbf4a
to
2bbe968
Compare
This is great. Is there a list of exceptions to handle in the future, other than Turkic? |
Would it be useful to add an optional param (e.g.: something like |
I like go approach of extending the data structure by overriding some cases. But definitely something that could be refactor when we want to handle more than 1-char exception. @drhuffman12 for the ascii string is not just to use a subset of the table but also to iterate the strings in a more efficient way. Hence the api difference. |
Oh, yes, in fact we can add ASCII to the CaseOptions enum if you only want ascii behaviour. |
@bcardiff Yes, Go's approach is nice but it exposes some data that might be hard to digest. Also, I don't know if in the future a crazy rule appears that doesn't match with that data definition, so I'd prefer to avoid exposing that as an API. I actually considered doing that but then noticed that it's just two special cases for turkic (the |
2bbe968
to
1a81114
Compare
I added an option to only transform ASCII chars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the size of UnicodeData vs the generated crystal file? Would it be better to generate the file as part of the build process instead of store it in git.
entries = [] of Entry | ||
|
||
url = "http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt" | ||
body = HTTP::Client.get(url).body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making the crystal build process depend on an internet connection feels like a bad idea. This file should be copied into git.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only for generate once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is generated once. Unicode data doesn't change very often so it's better here to generate it before compile-time. Generating it at compile-time, maybe with a macro, would be overkill (Elixir does that, though, but Elixir compiles modules to beam so they need to do it just once)
@asterite yeah Turkic is a really big problem and it's very common 😄 (i'm a Turk btw) |
@[Flags] | ||
enum CaseOptions | ||
# Only transform ASCII characters. | ||
ASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
# (`Char#upcase`, `Char#downcase`, `Char#whitespace?`, etc.) | ||
module Unicode | ||
# Options to pass to `upcase`, `downcase`, `uppercase?` | ||
# and `lowecase?` to control their behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: lowecase?
-> lowercase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
1a81114
to
d2e361b
Compare
As the title says....
Previously:
Now:
There's also support for the special case of turkic:
This is a breaking change, because I renamed and changed some methods. Specifically,
Char#uppercase?
now checks for unicode uppercase, and there's alsoChar#ascii_uppercase?
that only checks that in the ASCII range. The same applies towhitespace?
andascii_whitespace?
. This shouldn't affect existing code but it might make them just a tiny bit slower.I also renamed
Char#alpha?
toChar#letter?
because that's the name used in unicode categories (this is the real breaking change). AlsoChar#digit?
is nowChar#number?
(again, the name used in unicode)The unicode data is put at
src/unicode/data.cr
in the form of code, and that file is generated with a script. I used a format similar to how Go stores this data, which is quite interesting because it tries to store as little data as possible, so case ranges, alternate ranges and ascending strides are stored, and then looked up with a binary search.I also added
String#blank?
because it's pretty useful and it has been asked a few times before.As a side note, there's still a lot to be done regarding unicode: normalization, grapheme clusters, etc. In particular, tests found in this article fail in most cases in Crystal... but they also fail in Ruby, and I'm sure in other languages too (it's a hard problem, or maybe just tedious/long to implement). Elixir does it well. But Elixir doesn't have a Char type, only String. We'd probably need to add a
Grapheme
type or similar that would be a series of unicode codepoints that all form a single visible character. But... later. This PR is a good start towards that :-)Fixes #1648
Fixes #2439