Make String and Char methods like downcase and upcase unicode aware #3560

asterite · 2016-11-19T15:10:42Z

As the title says....

Previously:

"un árbol".upcase # => "UN áRBOL"
"ЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЁЯЧСМИТЬБЮ".downcase # => "ЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЁЯЧСМИТЬБЮ"

Now:

"un árbol".upcase # => "UN ÁRBOL"
"ЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЁЯЧСМИТЬБЮ".downcase # => "йцукенгшщзхъфывапролджэёячсмитьбю"

There's also support for the special case of turkic:

"İI".downcase(Unicode::CaseOptions::Turkic) # =>"iı"
"iı".upcase(Unicode::CaseOptions::Turkic) # =>"İI"

This is a breaking change, because I renamed and changed some methods. Specifically, Char#uppercase? now checks for unicode uppercase, and there's also Char#ascii_uppercase? that only checks that in the ASCII range. The same applies to whitespace? and ascii_whitespace?. This shouldn't affect existing code but it might make them just a tiny bit slower.

I also renamed Char#alpha? to Char#letter? because that's the name used in unicode categories (this is the real breaking change). Also Char#digit? is now Char#number? (again, the name used in unicode)

The unicode data is put at src/unicode/data.cr in the form of code, and that file is generated with a script. I used a format similar to how Go stores this data, which is quite interesting because it tries to store as little data as possible, so case ranges, alternate ranges and ascending strides are stored, and then looked up with a binary search.

I also added String#blank? because it's pretty useful and it has been asked a few times before.

As a side note, there's still a lot to be done regarding unicode: normalization, grapheme clusters, etc. In particular, tests found in this article fail in most cases in Crystal... but they also fail in Ruby, and I'm sure in other languages too (it's a hard problem, or maybe just tedious/long to implement). Elixir does it well. But Elixir doesn't have a Char type, only String. We'd probably need to add a Grapheme type or similar that would be a series of unicode codepoints that all form a single visible character. But... later. This PR is a good start towards that :-)

Fixes #1648
Fixes #2439

bcardiff · 2016-11-19T16:09:07Z

This is great. Is there a list of exceptions to handle in the future, other than Turkic?

asterite · 2016-11-19T16:15:20Z

@bcardiff Maybe. I took this both from Ruby and Go and turkic seems to be the only exception so far. In the future if new exceptions are added we could upgrade the code, probably handle it in a different way. In Ruby it seems you can pass multiple options so that's why I used a flags enum.

drhuffman12 · 2016-11-19T16:19:24Z

Would it be useful to add an optional param (e.g.: something like restrict_to or char_set) to Char#uppercase? (and similar methods), e.g.: Char#uppercase?(restrict_to: ASCII_SET),Char#uppercase?(restrict_to: :ascii) or Char#uppercase?(restrict_to: CHAR::SET[:ascii])?

bcardiff · 2016-11-19T16:30:38Z

I like go approach of extending the data structure by overriding some cases. But definitely something that could be refactor when we want to handle more than 1-char exception.

@drhuffman12 for the ascii string is not just to use a subset of the table but also to iterate the strings in a more efficient way. Hence the api difference.

asterite · 2016-11-19T16:38:42Z

Oh, yes, in fact we can add ASCII to the CaseOptions enum if you only want ascii behaviour.

asterite · 2016-11-19T16:45:27Z

@bcardiff Yes, Go's approach is nice but it exposes some data that might be hard to digest. Also, I don't know if in the future a crazy rule appears that doesn't match with that data definition, so I'd prefer to avoid exposing that as an API. I actually considered doing that but then noticed that it's just two special cases for turkic (the i and I letters)

asterite · 2016-11-19T16:46:12Z

I added an option to only transform ASCII chars

RX14

What's the size of UnicodeData vs the generated crystal file? Would it be better to generate the file as part of the build process instead of store it in git.

RX14 · 2016-11-19T17:02:49Z

scripts/generate_unicode_data.cr

+entries = [] of Entry
+
+url = "http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt"
+body = HTTP::Client.get(url).body


Making the crystal build process depend on an internet connection feels like a bad idea. This file should be copied into git.

this is only for generate once

Yes, this is generated once. Unicode data doesn't change very often so it's better here to generate it before compile-time. Generating it at compile-time, maybe with a macro, would be overkill (Elixir does that, though, but Elixir compiles modules to beam so they need to do it just once)

sdogruyol · 2016-11-19T17:23:33Z

@asterite yeah Turkic is a really big problem and it's very common 😄 (i'm a Turk btw)

drhuffman12 · 2016-11-19T17:36:19Z

src/unicode/unicode.cr

+  @[Flags]
+  enum CaseOptions
+    # Only transform ASCII characters.
+    ASCII


Sija · 2016-11-19T18:16:11Z

src/unicode/unicode.cr

+# (`Char#upcase`, `Char#downcase`, `Char#whitespace?`, etc.)
+module Unicode
+  # Options to pass to `upcase`, `downcase`, `uppercase?`
+  # and `lowecase?` to control their behaviour.


typo: lowecase? -> lowercase?

asterite added kind:feature topic:stdlib labels Nov 19, 2016

asterite force-pushed the feature/unicode branch from 20bbf4a to 2bbe968 Compare November 19, 2016 15:38

asterite force-pushed the feature/unicode branch from 2bbe968 to 1a81114 Compare November 19, 2016 16:45

RX14 reviewed Nov 19, 2016

View reviewed changes

drhuffman12 reviewed Nov 19, 2016

View reviewed changes

src/unicode/unicode.cr

@[Flags]

enum CaseOptions

# Only transform ASCII characters.

ASCII

Copy link

drhuffman12 Nov 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sija reviewed Nov 19, 2016

View reviewed changes

Make String and Char methods like downcase and upcase unicode aware

d2e361b

asterite force-pushed the feature/unicode branch from 1a81114 to d2e361b Compare November 19, 2016 20:24

asterite added this to the 0.20.0 milestone Nov 19, 2016

asterite merged commit 4e3eb11 into master Nov 19, 2016

asterite deleted the feature/unicode branch November 20, 2016 00:23

Sija mentioned this pull request Nov 22, 2016

Adds Char#swapcase and String#swapcase instance methods. #2805

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make String and Char methods like downcase and upcase unicode aware #3560

Make String and Char methods like downcase and upcase unicode aware #3560

asterite commented Nov 19, 2016

bcardiff commented Nov 19, 2016

asterite commented Nov 19, 2016

drhuffman12 commented Nov 19, 2016

bcardiff commented Nov 19, 2016

asterite commented Nov 19, 2016

asterite commented Nov 19, 2016

asterite commented Nov 19, 2016

RX14 left a comment

RX14 Nov 19, 2016

kostya Nov 19, 2016

asterite Nov 19, 2016

sdogruyol commented Nov 19, 2016

drhuffman12 Nov 19, 2016

Sija Nov 19, 2016

asterite Nov 19, 2016

Make String and Char methods like downcase and upcase unicode aware #3560

Make String and Char methods like downcase and upcase unicode aware #3560

Conversation

asterite commented Nov 19, 2016

bcardiff commented Nov 19, 2016

asterite commented Nov 19, 2016

drhuffman12 commented Nov 19, 2016

bcardiff commented Nov 19, 2016

asterite commented Nov 19, 2016

asterite commented Nov 19, 2016

asterite commented Nov 19, 2016

RX14 left a comment

Choose a reason for hiding this comment

RX14 Nov 19, 2016

Choose a reason for hiding this comment

kostya Nov 19, 2016

Choose a reason for hiding this comment

asterite Nov 19, 2016

Choose a reason for hiding this comment

sdogruyol commented Nov 19, 2016

drhuffman12 Nov 19, 2016

Choose a reason for hiding this comment

Sija Nov 19, 2016

Choose a reason for hiding this comment

asterite Nov 19, 2016

Choose a reason for hiding this comment