Bug: NKF.nkf failing to convert certain characters #4921
Comments
Thank you for the detailed report! NKF has indeed been a weak point for us, which we mostly ignored because of the move to the newer Encoding subsystem in Ruby 1.9. However, we'd always like to see compatibility improved. It may be as simple as duplicating what MRI uses for NKF, or there might be a pure-Ruby NKF out there built atop the Encoding subsystem. |
This looks like a promising library, should we choose to wrap an existing one: http://mariten.github.io/kanatools-java/en/kana-converter/ |
Thank you for responding! It reminded me to dig a little bit deeper into the situation myself, and I think that I've found the root of the issue. It lies not in NKF itself, but rather in JRuby's choice of definition for SHIFT_JIS. As you may well know, SHIFT_JIS is a blanket name for a variety of encodings. MRI Ruby (like the majority of applications) appears to use MS932 (Microsoft CP932/Windows-31J) as its definition of choice (across all operating systems), while JRuby does not. If we change my example code to this:
require 'nkf'
ex = [0x87, 0x5B].pack('C*').force_encoding('cp932') # 'Ⅷ' in SJIS
puts NKF.nkf('-X -Z -w', ex)
It no longer errors and, in fact, returns the output that I would expect! The obvious conclusion is that the The really curious thing is that, as JRuby directly copied Ruby's list of Encoding aliases, I never would have found this at all if I had used the name |
I did a lot of poking around, and there seem to be precious few libraries for explicitly dealing with full and half-width kana and such things. Many links just lead back to NKF, which seems to have little use outside the Ruby bindings. |
I can't say I'm terribly surprised -- MRI Ruby itself seems to rely on a bundled copy of iconv for most of its conversions. While effective, that does present something of a challenge for a JVM-based implementation. |
@aabryant That's very interesting! I'm pulling in @lopex here since he did the original port of the encoding logic and has continued to help us maintain it. I have not spent a lot of time in our NKF logic, but it does appear to be using the ported encoding subsystem. That makes me think that perhaps our version of that logic is improperly aliasing "shift_jis" to the wrong actual encoding. We have even run into Ruby compatibility tests that expected an encoding to be "Windows31J" instead of "Shift_JIS" and never quite knew why. |
Ok, trying to get this off my plate. So in jcodings EncodingList, I found this line: EncodingDB.declare("Shift_JIS", "SJIS"); But later in the same file I found this: EncodingDB.alias("SJIS", "Windows-31J"); This seems odd. |
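For comparison, here is a minimal sketch that prints how a Ruby runtime resolves the encoding names discussed above. It uses only `Encoding.find` and `Encoding#names` from the core library; the helper name `resolve_encoding_names` is made up for illustration. On MRI, "SJIS" and "CP932" are expected to resolve to Windows-31J while "Shift_JIS" stays a distinct encoding; other runtimes may differ, which is the point of running it.

```ruby
# Hypothetical helper (not part of Ruby, JRuby, or jcodings):
# maps each requested name to the canonical name of the encoding it resolves to.
def resolve_encoding_names(names)
  names.each_with_object({}) do |name, resolved|
    resolved[name] = Encoding.find(name).name
  end
end

resolve_encoding_names(%w[Shift_JIS SJIS CP932 Windows-31J]).each do |name, canonical|
  puts "#{name} -> #{canonical}"
end

# List every alias the runtime registers for Windows-31J.
puts Encoding::Windows_31J.names.inspect
```

Running the same script under both MRI and JRuby gives a quick way to spot a name that resolves to a different encoding on each.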
If I modify the logic to use Windows-31J all the time for "Shift_JIS", your script passes. |
A few more discoveries.
static const char*
get_guessed_code(void)
{
    if (input_codename && !*input_codename) {
        input_codename = "BINARY";
    } else {
        struct input_code *p = find_inputcode_byfunc(iconv);
        if (!input_codename) {
            input_codename = "ASCII";
        } else if (strcmp(input_codename, "Shift_JIS") == 0) {
            if (p->score & (SCORE_DEPEND|SCORE_CP932))
                input_codename = "CP932";
...
So if the NKF subsystem detects "Shift_JIS" is the encoding, and whatever "score" means matches up, it uses CP932 (Windows-31J) for the NKF processing.
|
There are a few other autodetection options listed here: https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream I think at this point the quick fix would be to force the use of Windows-31J when the incoming encoding is actually Shift_JIS, since that appears to be what MRI is doing (subject to the "score" flags there). |
Ok, this is interesting. The following script reproduces the same error in JRuby and CRuby:
require 'nkf'
ex = [0x87, 0x5B].pack('C*').force_encoding('Shift_JIS') # 'Ⅷ' in SJIS
puts ex.encode('UTF-8')
So what's happening in our NKF is that we're falling back on Shift_JIS (since our detection doesn't work), which does not have conversion paths to/from UTF-8 for these bytes. MRI attempts to detect the encoding, and upon detecting Shift_JIS it seems to use CP932 in some cases. When we force MRI to also use Shift_JIS, it errors like we do. Another interesting bit: by virtue of its detection, MRI can handle these bytes as "BINARY" and still work. |
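The difference can be shown without NKF at all. The sketch below transcodes the same two bytes to UTF-8 under each label, using only `String#force_encoding` and `String#encode`. The helper name `transcode_to_utf8` is made up for illustration, and it rescues the general `EncodingError` because the exact error subclass is not quoted in this thread.

```ruby
# Hypothetical helper (not part of NKF): interpret raw bytes as `encoding`
# and transcode to UTF-8, returning nil if the conversion fails.
def transcode_to_utf8(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
rescue EncodingError => e
  warn "#{encoding}: #{e.class}: #{e.message}"
  nil
end

bytes = [0x87, 0x5B].pack('C*') # the bytes from the report

as_cp932     = transcode_to_utf8(bytes, Encoding::Windows_31J)
as_shift_jis = transcode_to_utf8(bytes, Encoding::Shift_JIS)

puts "Windows-31J: #{as_cp932.inspect}"
puts "Shift_JIS:   #{as_shift_jis.inspect}"
```

On MRI the Windows-31J interpretation is expected to yield U+2167 and the Shift_JIS one to fail, matching the behavior described above.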
I have tried two of the top contenders for charset detection. Neither worked like I hoped. I think the problem is that they're too general and not focused on just JIS-related encodings. First I tried the "juniversalchardet" project, which appears to have been forked all over the place. It detected the incoming bytes as Windows-1252. I also tried ICU4J. It detected Big5. I have not found a JIS-only detector yet. I have also not confirmed whether MRI successfully detects the content, but given that it works with BINARY encoding, I assume it does. |
Another attempt using jgloss produced EUC-JP and a different error because this isn't valid in EUC-JP. I'm starting to think that perhaps MRI can't detect it either, but because the default mode is to assume "JIS", it uses that. |
Pinging @nurse to see if perhaps they can give us some tips here! |
Japan defines the character sets for its use as JIS (Japanese Industrial Standards).
nkf is a tool for reading net news. Your current solution, using the Windows-31J table instead of Shift_JIS, has a problem. Japanese text has used three encoding families: Shift-JIS, Japanese EUC, and ISO 2022. Both Shift-JIS and Japanese EUC are multibyte encodings. But "ネコ" in Shift-JIS is an alternative representation of "ネコ". The scoring idea is one such approach. If it is implemented in Java, just interpret the binary as each of several candidate encodings, convert it to a Unicode string, and then check the category of each character. |
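The suggestion above can be sketched roughly as follows, in Ruby rather than Java. This is a toy illustration, not NKF's actual scoring: the candidate list, the weights, and the names `score_as` and `guess_japanese_encoding` are all assumptions made for this example. Each candidate interpretation is converted to UTF-8, Japanese-script characters earn a point, and half-width katakana lose a point, since EUC-JP bytes misread as Shift-JIS tend to decode as runs of half-width kana.

```ruby
# Toy scorer; every name and weight here is hypothetical.
CANDIDATES      = [Encoding::Windows_31J, Encoding::EUC_JP].freeze
HALFWIDTH_KANA  = /[\uFF61-\uFF9F]/
JAPANESE_SCRIPT = /[\p{Hiragana}\p{Katakana}\p{Han}]/

# Score `bytes` interpreted as `encoding`; nil means the interpretation
# is invalid or cannot be converted to Unicode.
def score_as(bytes, encoding)
  text = bytes.dup.force_encoding(encoding)
  return nil unless text.valid_encoding?

  text.encode(Encoding::UTF_8).each_char.inject(0) do |total, ch|
    if ch =~ HALFWIDTH_KANA
      total - 1 # penalize half-width kana, a common sign of a misread
    elsif ch =~ JAPANESE_SCRIPT
      total + 1
    else
      total
    end
  end
rescue EncodingError
  nil
end

# Return the best-scoring candidate, or nil if none is usable.
def guess_japanese_encoding(bytes)
  scored = CANDIDATES.map { |enc| [enc, score_as(bytes, enc)] }
                     .reject { |_, score| score.nil? }
  best = scored.max_by { |_, score| score }
  best && best.first
end

puts guess_japanese_encoding("ネコ".encode('EUC-JP').b).inspect
puts guess_japanese_encoding([0x87, 0x5B].pack('C*')).inspect
```

A real detector would need more candidates (ISO-2022-JP is not handled here) and better tie-breaking; the sketch only shows the shape of the approach.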
ref: jruby/jruby#4921 Signed-off-by: Peter Boling <peter.boling@gmail.com>
Environment
JRuby Version:
jruby 9.1.15.0 (2.3.3) 2017-12-07 929fde8 Java HotSpot(TM) 64-Bit Server VM 25.152-b16 on 1.8.0_152-b16 +jit [linux-x86_64]
OS:
Linux Home 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Expected Behavior
Under MRI Ruby, this snippet of code produces the string Ⅷ, or U+2167. This is what I would expect it to do.
Under MRI Ruby, this snippet of code will properly convert the UTF-8 string representation of Ⅷ to the SHIFT_JIS one.
Actual Behavior
Under JRuby, the first snippet instead produces the following error:
The second snippet similarly fails to execute:
This issue seems to be common to all of the SHIFT_JIS representations of Roman numerals. I have yet to encounter it with any other character, though I regard it as likely that some others are affected.