Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make String and Char methods like downcase and upcase unicode aware #3560

Merged
merged 1 commit into from
Nov 19, 2016

Conversation

asterite
Copy link
Member

As the title says....

Previously:

"un árbol".upcase # => "UN áRBOL"
"ЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЁЯЧСМИТЬБЮ".downcase # => "ЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЁЯЧСМИТЬБЮ"

Now:

"un árbol".upcase # => "UN ÁRBOL"
"ЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЁЯЧСМИТЬБЮ".downcase # => "йцукенгшщзхъфывапролджэёячсмитьбю"

There's also support for the special case of turkic:

"İI".downcase(Unicode::CaseOptions::Turkic) # =>"iı"
"".upcase(Unicode::CaseOptions::Turkic) # =>"İI"

This is a breaking change, because I renamed and changed some methods. Specifically, Char#uppercase? now checks for unicode uppercase, and there's also Char#ascii_uppercase? that only checks that in the ASCII range. The same applies to whitespace? and ascii_whitespace?. This shouldn't affect existing code but it might make them just a tiny bit slower.

I also renamed Char#alpha? to Char#letter? because that's the name used in unicode categories (this is the real breaking change). Also Char#digit? is now Char#number? (again, the name used in unicode)

The unicode data is put at src/unicode/data.cr in the form of code, and that file is generated with a script. I used a format similar to how Go stores this data, which is quite interesting because it tries to store as little data as possible, so case ranges, alternate ranges and ascending strides are stored, and then looked up with a binary search.

I also added String#blank? because it's pretty useful and it has been asked a few times before.

As a side note, there's still a lot to be done regarding unicode: normalization, grapheme clusters, etc. In particular, tests found in this article fail in most cases in Crystal... but they also fail in Ruby, and I'm sure in other languages too (it's a hard problem, or maybe just tedious/long to implement). Elixir does it well. But Elixir doesn't have a Char type, only String. We'd probably need to add a Grapheme type or similar that would be a series of unicode codepoints that all form a single visible character. But... later. This PR is a good start towards that :-)

Fixes #1648
Fixes #2439

@bcardiff
Copy link
Member

This is great. Is there a list of exceptions to handle in the future, other than Turkic?

@asterite
Copy link
Member Author

@bcardiff Maybe. I took this both from Ruby and Go and turkic seems to be the only exception so far. In the future if new exceptions are added we could upgrade the code, probably handle it in a different way. In Ruby it seems you can pass multiple options so that's why I used a flags enum.

@drhuffman12
Copy link

Would it be useful to add an optional param (e.g.: something like restrict_to or char_set) to Char#uppercase? (and similar methods), e.g.: Char#uppercase?(restrict_to: ASCII_SET),Char#uppercase?(restrict_to: :ascii) or Char#uppercase?(restrict_to: CHAR::SET[:ascii])?

@bcardiff
Copy link
Member

I like go approach of extending the data structure by overriding some cases. But definitely something that could be refactor when we want to handle more than 1-char exception.

@drhuffman12 for the ascii string is not just to use a subset of the table but also to iterate the strings in a more efficient way. Hence the api difference.

@asterite
Copy link
Member Author

Oh, yes, in fact we can add ASCII to the CaseOptions enum if you only want ascii behaviour.

@asterite
Copy link
Member Author

@bcardiff Yes, Go's approach is nice but it exposes some data that might be hard to digest. Also, I don't know if in the future a crazy rule appears that doesn't match with that data definition, so I'd prefer to avoid exposing that as an API. I actually considered doing that but then noticed that it's just two special cases for turkic (the i and I letters)

@asterite
Copy link
Member Author

I added an option to only transform ASCII chars

Copy link
Member

@RX14 RX14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the size of UnicodeData vs the generated crystal file? Would it be better to generate the file as part of the build process instead of store it in git.

entries = [] of Entry

url = "http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt"
body = HTTP::Client.get(url).body
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the crystal build process depend on an internet connection feels like a bad idea. This file should be copied into git.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only for generate once

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is generated once. Unicode data doesn't change very often so it's better here to generate it before compile-time. Generating it at compile-time, maybe with a macro, would be overkill (Elixir does that, though, but Elixir compiles modules to beam so they need to do it just once)

@sdogruyol
Copy link
Member

@asterite yeah Turkic is a really big problem and it's very common 😄 (i'm a Turk btw)

@[Flags]
enum CaseOptions
# Only transform ASCII characters.
ASCII

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

# (`Char#upcase`, `Char#downcase`, `Char#whitespace?`, etc.)
module Unicode
# Options to pass to `upcase`, `downcase`, `uppercase?`
# and `lowecase?` to control their behaviour.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: lowecase? -> lowercase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@asterite asterite added this to the 0.20.0 milestone Nov 19, 2016
@asterite asterite merged commit 4e3eb11 into master Nov 19, 2016
@asterite asterite deleted the feature/unicode branch November 20, 2016 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants