Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-ascii method names getting kicked out #2698

Closed
headius opened this issue Mar 13, 2015 · 2 comments
Closed

Non-ascii method names getting kicked out #2698

headius opened this issue Mar 13, 2015 · 2 comments

Comments

@headius
Copy link
Member

headius commented Mar 13, 2015

This test in MRI's test_m17n.rb fails because we kick the method name out. It appears that our logic for checking identifiers is not correct, and RubyLexer.identifier (and down stream) need updating.

TestM17N#test_nonascii_method_name:
Java::OrgJcodingsException::EncodingException: invalid code point value
    org.jcodings.specific.BaseEUCJPEncoding.codeToMbcLength(BaseEUCJPEncoding.java:55)
    org.jcodings.specific.EUCJPEncoding.codeToMbcLength(EUCJPEncoding.java:24)
    org.jruby.lexer.yacc.RubyLexer.isMultiByteChar(RubyLexer.java:638)
    org.jruby.lexer.yacc.RubyLexer.getIdentifier(RubyLexer.java:1365)
    org.jruby.lexer.yacc.RubyLexer.identifier(RubyLexer.java:1779)

I started a patch to do the re-porting, but I'm not familiar enough with the lexer to complete it.

diff --git a/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java b/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java
index fa26107..4db7e14 100644
--- a/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java
+++ b/core/src/main/java/org/jruby/lexer/yacc/RubyLexer.java
@@ -621,7 +621,7 @@ public class RubyLexer {
      * mri: is_identchar
      */
     public boolean isIdentifierChar(int c) {
-        return Character.isLetterOrDigit(c) || c == '_' || isMultiByteChar(c);
+        return encoding.isAlnum(c) || c == '_' || !Encoding.isAscii(c);
     }

     public boolean isASCII(int c) {
@@ -1768,6 +1768,20 @@ public class RubyLexer {
     }

     private int identifier(int c, boolean commandState) throws IOException {
+        ByteList aggregate = new ByteList();
+        do {
+            if (!Encoding.isAscii(c)) mb = ENC_CODERANGE_UNKNOWN;
+            if (tokenAddMBC(c, aggregate) == -1) return 0;
+            c = nextToken();
+        } while (isIdentifierChar(c));
+
+        int result = 0;
+
+        if (c == '!' || c == '?') {
+            result = Tokens.tFID;
+        } else {
+            this
+        }
         if (!isIdentifierChar(c)) {
             String badChar = "\\" + Integer.toOctalString(c & 0xff);
             throw new SyntaxException(PID.CHARACTER_BAD, getPosition(), getCurrentLine(),
@@ -1791,8 +1805,7 @@ public class RubyLexer {
         } else {
             src.unread(c);
         }
-        
-        int result = 0;
+

         last_state = lex_state;
         if (lastBangOrPredicate) {
@headius headius added this to the JRuby 9.0.0.0 milestone Mar 13, 2015
@headius
Copy link
Member Author

headius commented Mar 13, 2015

There may be other tests in test_m17n.rb (or elsewhere in our suites) that fail due to this. Probably should be fixed since it breaks MBC identifiers for many cases.

@enebo
Copy link
Member

enebo commented Jul 3, 2015

I am declaring this fixed since we no long die in the lexer. In fact we pass some of these tests now. The newer issue is that we convert chars to a Java string for methods but then when we call .name on a ruby method object we try and convert that Java string back to a Ruby string and the encodings info is lost. I am nearly 100% certain we already have an issue on this .name problem.

@enebo enebo closed this as completed Jul 3, 2015
@enebo enebo modified the milestones: JRuby 9.0.0.0.rc2, JRuby 9.0.0.0 Jul 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants