Non-ascii method names getting kicked out #2698

headius · 2015-03-13T20:44:24Z

This test in MRI's test_m17n.rb fails because we kick the method name out. It appears that our logic for checking identifiers is not correct, and RubyLexer.identifier (and down stream) need updating.

TestM17N#test_nonascii_method_name:
Java::OrgJcodingsException::EncodingException: invalid code point value
    org.jcodings.specific.BaseEUCJPEncoding.codeToMbcLength(BaseEUCJPEncoding.java:55)
    org.jcodings.specific.EUCJPEncoding.codeToMbcLength(EUCJPEncoding.java:24)
    org.jruby.lexer.yacc.RubyLexer.isMultiByteChar(RubyLexer.java:638)
    org.jruby.lexer.yacc.RubyLexer.getIdentifier(RubyLexer.java:1365)
    org.jruby.lexer.yacc.RubyLexer.identifier(RubyLexer.java:1779)

I started a patch to do the re-porting, but I'm not familiar enough with the lexer to complete it.

diff --git a/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java b/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java
index fa26107..4db7e14 100644
--- a/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java
+++ b/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java
@@ -621,7 +621,7 @@ public class RubyLexer {
      * mri: is_identchar
      */
     public boolean isIdentifierChar(int c) {
-        return Character.isLetterOrDigit(c) || c == '_' || isMultiByteChar(c);
+        return encoding.isAlnum(c) || c == '_' || !Encoding.isAscii(c);
     }

     public boolean isASCII(int c) {
@@ -1768,6 +1768,20 @@ public class RubyLexer {
     }

     private int identifier(int c, boolean commandState) throws IOException {
+        ByteList aggregate = new ByteList();
+        do {
+            if (!Encoding.isAscii(c)) mb = ENC_CODERANGE_UNKNOWN;
+            if (tokenAddMBC(c, aggregate) == -1) return 0;
+            c = nextToken();
+        } while (isIdentifierChar(c));
+
+        int result = 0;
+
+        if (c == '!' || c == '?') {
+            result = Tokens.tFID;
+        } else {
+            this
+        }
         if (!isIdentifierChar(c)) {
             String badChar = "\\" + Integer.toOctalString(c & 0xff);
             throw new SyntaxException(PID.CHARACTER_BAD, getPosition(), getCurrentLine(),
@@ -1791,8 +1805,7 @@ public class RubyLexer {
         } else {
             src.unread(c);
         }
-        
-        int result = 0;
+

         last_state = lex_state;
         if (lastBangOrPredicate) {

The text was updated successfully, but these errors were encountered:

headius · 2015-03-13T20:46:10Z

There may be other tests in test_m17n.rb (or elsewhere in our suites) that fail due to this. Probably should be fixed since it breaks MBC identifiers for many cases.

enebo · 2015-07-03T15:44:34Z

I am declaring this fixed since we no long die in the lexer. In fact we pass some of these tests now. The newer issue is that we convert chars to a Java string for methods but then when we call .name on a ruby method object we try and convert that Java string back to a Ruby string and the encodings info is lost. I am nearly 100% certain we already have an issue on this .name problem.

headius added the JRuby 9000 label Mar 13, 2015

headius added this to the JRuby 9.0.0.0 milestone Mar 13, 2015

enebo closed this as completed Jul 3, 2015

enebo modified the milestones: JRuby 9.0.0.0.rc2, JRuby 9.0.0.0 Jul 3, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Non-ascii method names getting kicked out #2698

Non-ascii method names getting kicked out #2698

headius commented Mar 13, 2015

headius commented Mar 13, 2015

enebo commented Jul 3, 2015

Non-ascii method names getting kicked out #2698

Non-ascii method names getting kicked out #2698

Comments

headius commented Mar 13, 2015

headius commented Mar 13, 2015

enebo commented Jul 3, 2015