Modifiers are dropped in \X regular expression matches #4832

Closed
olleolleolle opened this issue Oct 30, 2017 · 5 comments

olleolleolle commented Oct 30, 2017

The \X regular expression escape matches one "extended grapheme cluster".

This issue is about JRuby returning an incomplete match for it.

Environment

Versions:

  • JRuby version: jruby 9.1.13.0 (2.3.3) 2017-09-06 8e1c115 Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
  • Operating system and platform: Darwin Olles-MacBook-Pro.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Expected Behavior

$ /usr/bin/ruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "å" 1:"å">

The ring above the a is a "modifier" (a separate combining character once the string is in NFD). Here, in MRI, it is included in the MatchData.

Actual Behavior

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "a" 1:"a">

Note the absence of the "modifier".
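To make the difference concrete, here is a minimal sketch of what the decomposed string contains. It uses `\u` escapes instead of a literal "å" so the result does not depend on how the snippet itself is encoded; the variable names are only for illustration.

```ruby
# U+00E5 is the precomposed "å".
precomposed = "\u00E5"
decomposed  = precomposed.unicode_normalize(:nfd)

# NFD splits the letter into a base "a" (U+0061) followed by
# COMBINING RING ABOVE (U+030A).
p decomposed.codepoints   # => [97, 778]

# \X should consume the base letter and its combining mark together.
# Per the report above, MRI returns both code points and JRuby 9.1.13.0
# returns only the first one.
cluster = decomposed[/\X/]
p cluster.codepoints
```

Comparing `cluster.codepoints` between the two runtimes shows the dropped mark directly, which is easier to read than the rendered glyphs.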

Read more

In order to know what I'm talking about, here are links.

StackOverflow answer about "What is even \X?"

Keywords: extended grapheme cluster


headius commented Oct 30, 2017

The unicode_normalize appears to be working properly here, expanding (in this case) all the diacritics to their combining character forms. The subsequent \X scan fails to consume those characters and only produces the non-combining 'a', 'o', 'A', 'O' characters.

This is likely missing or incorrect logic in joni. I'm reading up on how parsers and regex are expected to handle unicode normalized into combining characters.


headius commented Oct 30, 2017

@olleolleolle So one obvious workaround would be to not normalize, or to normalize to one of the composed forms, NFC or NFKC, rather than the decomposed forms:

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfc).match(/(\X)/)[1]"
"å"
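As a rough sketch of why this workaround helps for these particular letters: composing back to NFC leaves one code point per letter, so there is no separate combining mark for the \X match to drop. The variable names are illustrative only.

```ruby
# Three of the letters from the test string, written as escapes: å ä ö.
original   = "\u00E5\u00E4\u00F6"
decomposed = original.unicode_normalize(:nfd)
composed   = decomposed.unicode_normalize(:nfc)

# NFD: each letter is a base character plus one combining mark.
p decomposed.codepoints.length   # => 6

# NFC: each letter is a single precomposed code point again.
p composed.codepoints.length     # => 3
p composed == original           # => true
```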

headius added this to the JRuby 9.2.0.0 milestone Oct 30, 2017
@auroranockert

This only solves the test case. If you use the string "أُحِبُّ ٱلْقِرَاءَةَ كَثِيرًا‎" instead, it fails even in NFC, since there are no precomposed variants for Arabic with vowels.
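A small sketch of that point, using a single vocalized letter (BEH, U+0628, followed by DAMMA, U+064F) rather than the whole sentence; the choice of this pair is only for illustration.

```ruby
# BEH followed by the vowel mark DAMMA.
syllable = "\u0628\u064F"
nfc      = syllable.unicode_normalize(:nfc)

# No precomposed code point exists for this pair, so NFC leaves it
# as two code points.
p nfc.codepoints.length                  # => 2
p nfc.codepoints == syllable.codepoints  # => true

# The second code point is still a combining mark (general category M),
# so a \X implementation that drops marks would drop it here as well.
has_mark = !(nfc[1] =~ /\p{M}/).nil?
p has_mark                               # => true
```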


olleolleolle commented Oct 31, 2017

Oh, closed by mistake! Re-opened.

olleolleolle reopened this Oct 31, 2017

headius commented Nov 28, 2017

I'm going to close this in favor of #4568, which details the standard, the commits that added it to MRI, and so on.
