Bytelist love merge #5151

enebo · 2018-04-24T21:49:37Z

So the branch is a bit misnamed at this point but I will summarize this a tiny bit (have talked quite a lot to @headius about this already).

all identifiers are made as RubySymbols with exception of annotations (currently assumed as 7bit ASCII -- this can be changed later). RubySymbols have bytelist + 8859_1 identifier string. Parser and IR now use Symbol. rest of runtime continues to use the j.l.String as identifier/intern string. New vocabulary:

RubySymbol - name of the identifier
String - id of the identifier

RubySymbol.idString() is how you get an id for what you are looking up.

Error reporting has new static str(), names(), and ids() methods for making properly encoded Error strings. This is not 100% completed but many many errors have been converted.

This reverts commit a6fac6a.

This start changing all defined methods to use the raw (8859_1) string as the key in the method table. Symbol StringLiteral both now properly use proper ByteList and does not muck with String.

… twice.

…l name. Guard against it since it is a normal value.

…859_1 or a raw string. So I made toString() decode to actual charset based on encoding. This may need another field if we discover toString() on RubySymbol is called all the time but let's roll with this made on-demand for now.

Make and/or use isOr/isAnd and leverage AST being able to tell us this vs manually figuring it out.

```text jruby -e 'p method(def foo(あ); end).parameters' [[:req, :あ]] ``` to work. Up til now this would confuse our impl and we may see an error like: ```text ArgumentError: invalid byte sequence in US-ASCII inspect at org/jruby/RubySymbol.java:283 inspect at org/jruby/RubySymbol.java:268 inspect at org/jruby/RubyArray.java:1659 inspect at org/jruby/RubyArray.java:1659 p at org/jruby/RubyKernel.java:492 <main> at -e:1 ``` Pure ruby code should not properly be giving properly encoded values back. Java (e.g. native) methods mostly should not matter since they do not return names but if they do return names then they are all assumed to be UTF-8 encoded data.

…re number.

…field name but name changed from a String to a ByteList which make new frame read write code find nothing.

…nkle in our strategy. Errors are generated from the search deep in Module but it is a raw String. This is somewhat ok for most strings because they just happen to be utf-8 so the display. This is a big issue though since we want proper error message....

… day would come but I was hoping I could cheat with 8859_1 strings enough to merge for 9.2. Unfortunately, lack of encoding through method methods means we cannot accurately throw properly encoded error messages since he have long lost the encoding of the method. So with that said, most people will not be able to notice this change. All the String methods for methods still exist. These methods will assume all data is UTF-8. This could be a dicey gamble but since it is the default for Ruby it likely will not impact anyone. Also since UTF-8 is 7bit clean for ASCII we should totally not have an issue there. Some API signatures have changed. This means 9.2 might end up causing some issues if you are doing something very low-level with our code base (although changes up to this point have already broken some APIs within parsing and IR and that is unavoidable). The method reader/writer maps needed their return type changed from <String, DynamicMethod> to <ByteList, DynamicMethod>. It was an option to make a new name for methods which returned this but this is so far down in our APIs I think we should run with it. Also ProfiledMethod has changed from String to ByteList (e.g. getName() -> ByteList from String). Ultimately ProfiledMethod needs to be properly encoded to be able to report the method in a properly encoded way...so there you have it. Next phase is to change interpreters and the JIT to stop from going: ByteList -> String -> ByteList. I did not do that part here since this is pretty large already.

…one more ByteList path from IR Operands.

More error messages using str().

… master.

…evert typeAsString change.

strategy.

…red as FIXMEs like others but I am leaving some with bytelist_love to give a pedigree to when it was made.

kares · 2018-04-25T09:03:14Z

core/src/main/java/org/jruby/Ruby.java

-                throw newTypeError(parent.getName() + "::" + moduleObj.getMetaClass().getName() + " is not a module");
-            }
+            RubyString typeName = parentIsObject ?
+                    types(this, moduleObj.getMetaClass()) : types(this, parent, moduleObj.getMetaClass());


maybe a throwis lost here - unless its intentional

throw is immediately below that line. It was two throws and now is one line since both throws just were assembling a proper typeName.

kares · 2018-04-25T09:06:17Z

core/src/main/java/org/jruby/RubyBasicObject.java

-        RubyString str = RubyString.newString(getRuntime(), bytes, UTF8Encoding.INSTANCE);
-        str.setTaint(isTaint());
-        return str;
+        RubyString bytes = runtime.newString("#<");


just a q: so the direct ByteList concat wasn't good here, does it get problematic for any reason?
isn't the new way going to be doing a lot of "unnecessary" Java String-> byte[] conversion?

That is a good question I don't recall offhand. It is possible there was no issue but I learned that these methods are not quite a sensible as you may think. I will try it out quickly.

So I looked at this and I think I can reduce the cost substantially and use a bytelist but there is a specific reason why it works. First when we have m17n data getting combined into a string we cannot use bytelist since bytelist has no idea how to combine two different encoded things together. In this very particular case we only have 7bit ascii strings and something else. So I think I can just mark the new string as same encoding as the something else and it will work....almost anyways. If the encoding used is not isAsciicompat we will generate garbage. OTOH my current code was broken in the same way so we are no worse off.

Addendum since I am thinking about this. I should try the last broken part here. I suspect the types() method in RubyStringBuilder is actually what the non-ascii name will display like. So I need to try that.

kares · 2018-04-25T09:11:04Z

core/src/main/java/org/jruby/RubySymbol.java

+     * Make an instance variable out of this symbol (e.g. :foo will generate :foo=).
+     * @return the new symbol
+     */
+    public RubySymbol asAccessor() {


did you mean asWriter (not having asReader seems as just a coincidence since its the same symbol) ?

Ah nice. You're right I should have called this asWriter(). I will change that.

kares · 2018-04-25T09:13:19Z

core/src/main/java/org/jruby/RubySymbol.java

    @Override
    public final String toString() {
-        return symbol;


this could use an explanation (comment) on why not simply return symbol; ...for mere mortals :)

Ah yeah I will add an explanatory comment. Largely toString is doing it's best to make a j.l.String which displays properly but is not perfect since toString() should not be used internally for anything but debugging. Going full RubyString + decode seemed unnecesary here. Perhaps I should also add asJavaString() or using str() is more appropriate if you want to work with a string version of a RubySymbol?

overriding asJavaString explicitly (from IRubyObject) sounds good ... for the full decoding cycle (unless its already done so).

I also just found a few consumers of toString and changed them to idString or asJavaString depending on whether their context was used in an identifier or non-identifier way (for RubySymbol). I will likely be trying to audit more general toString() usage as we seem to be a bit inconsistent.

@kares oh I should point out asJavaString cannot be decoded version of the method since there are still over 100 users using it to lookup methods. We have this pattern:

public static IRubyObject get_inner_class(final ThreadContext context, final RubyModule self, final IRubyObject name) { // const_missing delegate final String constName = name.asJavaString();

All our internal tables for identifiers are keyed off of 8859_1. I think we may need to make a slightly more restrictive but helpful method to replace this pattern and ween off of asJavaString (idString on RubySymbol is more correct but only on RubySymbol -- probably need something on RubyString as well). This though need not be resolved immediately. Just letting you know why asJavaString cannot decode today.

As an aside it is a little weird IRubyObject has asJavaString since this means I could pass a fixnum in here and it would end up as "5" or something. From Ruby side of things I think people would be surprised that would work even if there was a "5" method.

kares · 2018-04-25T10:42:04Z

core/src/main/java/org/jruby/util/RubyStringBuilder.java

+    public static String str(Ruby runtime, IRubyObject value, String message) {
+        RubyString buf = (RubyString) value.asString().dup();
+
+        buf.cat19(runtime.newString(message));


do we still need to be using 19 legacy named methods? maybe cat + cat19 are worth some 'unification' ...

Unfortunately, they are very different and both should always exist. I would agree with you that we should refactor with better names as cat19 means nothing but I suspect cat19 is still used by multiple native extensions which use m17n strings in any way (although we can definitely come up with a new name and deprecate). Main difference is one could care less about what it is cat'ing and the other realizes it needs encoding negotiation.

ah yeah I recall them being impossible to unify now ... thanks

consumers to proper method instead of toString. This also addressed some more m17n errors we had.

enebo added 30 commits December 12, 2017 14:14

Bytelist more through IR

Verified

This commit was signed with the committer’s verified signature.

wyattjoh Wyatt Johnson

GPG key ID: C56A4C5DA36B212A

Verified
Learn about vigilant mode

c215a7a

Remove usage of getName() in ParserSupport from INameNode

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23
Expired

Verified
Learn about vigilant mode

b0260a3

More getName removal

Verified

This commit was signed with the committer’s verified signature.

wyattjoh Wyatt Johnson

GPG key ID: C56A4C5DA36B212A

Verified
Learn about vigilant mode

a6fac6a

Revert "More getName removal"

07224e1

This reverts commit a6fac6a.

A lot more ByteList usage in IR.

59e8a6a

Amazingly almost everything still passes (spec:ruby:fast totally does).

58557e8

This start changing all defined methods to use the raw (8859_1) string as the key in the method table. Symbol StringLiteral both now properly use proper ByteList and does not muck with String.

Minor: remove possible reassignment of name and calling normalizeSeps…

63193b5

… twice.

Unbreak AOT/persistence

80f1f6e

Simple regression in recent changes. Some trace instr will have a nul…

f39a9eb

…l name. Guard against it since it is a normal value.

Yay another missed IR persist/aot bug.

6d1577b

trivial. match MRI error string more

b071728

Moar mistakes in encode/decode using getName().

cf354ec

Change SymbolProc to use ByteList in IR and JIT/Interp.

7b3576e

Reduce more use of getName for getByteName in IRBuilder from AST.

53d3517

Merge branch 'master' into bytelist_love

f8e7e74

Whoops did not realize this did not get added to last merge commit.

c920b1f

Document what nodes are possible to undef.

6e25884

Make common place for common bytelists.

3ed229d

Make and/or use isOr/isAnd and leverage AST being able to tell us this vs manually figuring it out.

Was appending byte value instead of character representation of closu…

1ca3a82

…re number.

Unbreak spec:compiler from merge with master. It was calling the raw …

0f96a52

…field name but name changed from a String to a ByteList which make new frame read write code find nothing.

Merge branch 'master' into bytelist_love

ebb0f60

Merge branch 'master' into bytelist_love

49b9caa

Merge branch 'master' into bytelist_love

914742c

Symbols strings should be raw

ffb23fc

ProfilingService test has new method which needs to be implemented

eaf9b95

Update local_variables to use bytelistsi instead of Strings

9fe13cc

enebo added 17 commits April 16, 2018 10:22

Another case of String vs RubySymbol from AST. Also allowed removing …

88f6ceb

…one more ByteList path from IR Operands.

More toString to idString changes.

63fede2

More error messages using str().

more idString and make accessor methods work with all encodings.

61703bd

more cleanup of string vs symbol usage.

070b78c

Merge branch 'master' into bytelist_love

93fa541

Merge branch 'master' into bytelist_love

242f976

Fix one missed aspect from last nights merge. Now test:mri is same as…

90113f4

… master.

Merge branch 'master' into bytelist_love

00cecc9

Some FIXME changes.

15a604c

Merge branch 'master' into bytelist_love

edc6b79

Type conversion seems more focused on reporting obj vs type of obj. R…

d8f0989

…evert typeAsString change.

Wowsers. Fix two 2.5 m17n display issues!

2683c35

Merge branch 'master' into bytelist_love

135da97

Fix pushsymbolproc to use id vs bytelist to be consistent with new id

6f74142

strategy.

Go to type for argument descriptor from ByteList.

69dc167

More FIXME addressing and removal. Last ones I left should be conside…

3488484

…red as FIXMEs like others but I am leaving some with bytelist_love to give a pedigree to when it was made.

Merge branch 'master' into bytelist_love

5cc5f85

kares reviewed Apr 25, 2018

View reviewed changes

enebo added 4 commits April 25, 2018 10:33

Merge branch 'master' into bytelist_love

bbba978

Address a couple of comments from PR and change some obvious RubySymbol

ad79660

consumers to proper method instead of toString. This also addressed some more m17n errors we had.

Use bytelist since we technically only have one encoding at play here...

e1f6790

Merge branch 'master' into bytelist_love

52a1832

enebo merged commit 52a1832 into master Apr 25, 2018

enebo added this to the JRuby 9.2.0.0 milestone May 24, 2018

enebo deleted the bytelist_love branch November 9, 2018 20:23

headius mentioned this pull request Jul 18, 2020

Emojis in method names breaks alias/alias_method and __callee__/__method__ #4878

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Bytelist love merge #5151

Bytelist love merge #5151

enebo commented Apr 24, 2018

kares Apr 25, 2018

enebo Apr 25, 2018

kares Apr 25, 2018

enebo Apr 25, 2018

enebo Apr 25, 2018

enebo Apr 25, 2018

kares Apr 25, 2018

enebo Apr 25, 2018

kares Apr 25, 2018 •

edited

Loading

enebo Apr 25, 2018

kares Apr 25, 2018 •

edited

Loading

enebo Apr 25, 2018

enebo Apr 25, 2018

kares Apr 25, 2018

enebo Apr 25, 2018

kares Apr 25, 2018

Bytelist love merge #5151

Bytelist love merge #5151

Conversation

enebo commented Apr 24, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kares Apr 25, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kares Apr 25, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kares Apr 25, 2018 •

edited

Loading

kares Apr 25, 2018 •

edited

Loading