Modifiers are dropped in \X regular expression matches #4832
Comments
The unicode_normalize call appears to be working properly here, expanding (in this case) all the diacritics to their combining character forms. The subsequent This is likely missing or incorrect logic in joni. I'm reading up on how parsers and regexes are expected to handle Unicode text normalized into combining characters.
@olleolleolle So one obvious workaround would be to not normalize, or to normalize to one of the composed forms (NFC or NFKC) rather than the decomposed forms (NFD or NFKD):
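A minimal sketch of that workaround, assuming the test case is an "a" followed by a combining ring above (the sample string is an assumption, not copied from the issue). It only shows what each normalization form does to the code points and what `\X` returns for each:

```ruby
# Assumed sample: "a" followed by U+030A COMBINING RING ABOVE.
decomposed = "a\u030A"

# NFC composes the pair into the single code point U+00E5.
nfc = decomposed.unicode_normalize(:nfc)
# NFD keeps (or restores) the base letter plus the combining mark.
nfd = decomposed.unicode_normalize(:nfd)

# \X should yield one extended grapheme cluster in both cases.
nfc_clusters = nfc.scan(/\X/)
nfd_clusters = nfd.scan(/\X/)

puts nfc.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
puts nfd.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
puts "NFC clusters: #{nfc_clusters.length}, NFD clusters: #{nfd_clusters.length}"
```

With NFC the cluster is a single code point, so an engine that drops combining marks has nothing to drop. That sidesteps the bug rather than fixing it.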
This only solves the test case. If you use the string "أُحِبُّ ٱلْقِرَاءَةَ كَثِيرًا" instead, it fails even in NFC, since there are no precomposed variants for Arabic letters with vowel marks.
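To illustrate why NFC does not help here, a small sketch using one Arabic letter plus one vowel mark (an assumed sample, chosen to be shorter than the sentence above). Unicode has no precomposed code point for this pair, so NFC leaves both code points in place:

```ruby
# Assumed sample: U+0628 ARABIC LETTER BEH followed by U+064F ARABIC DAMMA.
syllable = "\u0628\u064F"

composed = syllable.unicode_normalize(:nfc)

# No precomposed form exists, so NFC still holds two code points.
puts composed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")

# The pair is still expected to form a single extended grapheme cluster.
clusters = composed.scan(/\X/)
puts "clusters: #{clusters.length}"
```

Any engine that drops the combining mark during a `\X` match would therefore still lose the vowel, whichever normalization form is used.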
Oh, closed by mistake! Re-opened.
I'm going to close this in favor of #4568, which details the standard, the commits that added it to MRI, and so on.
The \X regular expression matches an "extended grapheme cluster". This issue is about how that match goes wrong.
Environment
Versions:
Expected Behavior
The circle above the a is a "modifier". Here, in MRI, it's in the MatchData.
Actual Behavior
Note the absence of the "modifier".
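The original snippets did not survive on this page, so here is a hedged sketch of how the difference could be checked. It assumes the input is "a" plus a combining ring above, and it reports any code points that are in the input but missing from the match:

```ruby
# Assumed input: "a" plus U+030A COMBINING RING ABOVE (the "circle" modifier).
input = "a\u030A"

match = input.match(/\X/)
matched = match[0]

# Code points present in the input but absent from the match.
# Empty when the whole cluster, modifier included, was matched.
dropped = input.codepoints - matched.codepoints

puts "matched: #{matched.codepoints.inspect}"
puts "dropped: #{dropped.inspect}"
```

On an implementation that matches the full cluster, `dropped` is empty. On an affected one, it would presumably contain the combining mark.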
Read more
In order to know what I'm talking about, here are links.
StackOverflow answer about "What is even \X?"
Keywords: extended grapheme cluster