-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bytelist love merge #5151
Bytelist love merge #5151
Conversation
This reverts commit a6fac6a.
This start changing all defined methods to use the raw (8859_1) string as the key in the method table. Symbol StringLiteral both now properly use proper ByteList and does not muck with String.
…l name. Guard against it since it is a normal value.
…859_1 or a raw string. So I made toString() decode to actual charset based on encoding. This may need another field if we discover toString() on RubySymbol is called all the time but let's roll with this made on-demand for now.
Make and/or use isOr/isAnd and leverage AST being able to tell us this vs manually figuring it out.
```text jruby -e 'p method(def foo(あ); end).parameters' [[:req, :あ]] ``` to work. Up til now this would confuse our impl and we may see an error like: ```text ArgumentError: invalid byte sequence in US-ASCII inspect at org/jruby/RubySymbol.java:283 inspect at org/jruby/RubySymbol.java:268 inspect at org/jruby/RubyArray.java:1659 inspect at org/jruby/RubyArray.java:1659 p at org/jruby/RubyKernel.java:492 <main> at -e:1 ``` Pure ruby code should not properly be giving properly encoded values back. Java (e.g. native) methods mostly should not matter since they do not return names but if they do return names then they are all assumed to be UTF-8 encoded data.
…field name but name changed from a String to a ByteList which make new frame read write code find nothing.
…nkle in our strategy. Errors are generated from the search deep in Module but it is a raw String. This is somewhat ok for most strings because they just happen to be utf-8 so the display. This is a big issue though since we want proper error message....
… day would come but I was hoping I could cheat with 8859_1 strings enough to merge for 9.2. Unfortunately, lack of encoding through method methods means we cannot accurately throw properly encoded error messages since he have long lost the encoding of the method. So with that said, most people will not be able to notice this change. All the String methods for methods still exist. These methods will assume all data is UTF-8. This could be a dicey gamble but since it is the default for Ruby it likely will not impact anyone. Also since UTF-8 is 7bit clean for ASCII we should totally not have an issue there. Some API signatures have changed. This means 9.2 might end up causing some issues if you are doing something very low-level with our code base (although changes up to this point have already broken some APIs within parsing and IR and that is unavoidable). The method reader/writer maps needed their return type changed from <String, DynamicMethod> to <ByteList, DynamicMethod>. It was an option to make a new name for methods which returned this but this is so far down in our APIs I think we should run with it. Also ProfiledMethod has changed from String to ByteList (e.g. getName() -> ByteList from String). Ultimately ProfiledMethod needs to be properly encoded to be able to report the method in a properly encoded way...so there you have it. Next phase is to change interpreters and the JIT to stop from going: ByteList -> String -> ByteList. I did not do that part here since this is pretty large already.
…one more ByteList path from IR Operands.
More error messages using str().
…evert typeAsString change.
…red as FIXMEs like others but I am leaving some with bytelist_love to give a pedigree to when it was made.
throw newTypeError(parent.getName() + "::" + moduleObj.getMetaClass().getName() + " is not a module"); | ||
} | ||
RubyString typeName = parentIsObject ? | ||
types(this, moduleObj.getMetaClass()) : types(this, parent, moduleObj.getMetaClass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a throw
is lost here - unless its intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw is immediately below that line. It was two throws and now is one line since both throws just were assembling a proper typeName.
RubyString str = RubyString.newString(getRuntime(), bytes, UTF8Encoding.INSTANCE); | ||
str.setTaint(isTaint()); | ||
return str; | ||
RubyString bytes = runtime.newString("#<"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a q: so the direct ByteList
concat
wasn't good here, does it get problematic for any reason?
isn't the new way going to be doing a lot of "unnecessary" Java String
-> byte[]
conversion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good question I don't recall offhand. It is possible there was no issue but I learned that these methods are not quite a sensible as you may think. I will try it out quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I looked at this and I think I can reduce the cost substantially and use a bytelist but there is a specific reason why it works. First when we have m17n data getting combined into a string we cannot use bytelist since bytelist has no idea how to combine two different encoded things together. In this very particular case we only have 7bit ascii strings and something else. So I think I can just mark the new string as same encoding as the something else and it will work....almost anyways. If the encoding used is not isAsciicompat we will generate garbage. OTOH my current code was broken in the same way so we are no worse off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addendum since I am thinking about this. I should try the last broken part here. I suspect the types() method in RubyStringBuilder is actually what the non-ascii name will display like. So I need to try that.
* Make an instance variable out of this symbol (e.g. :foo will generate :foo=). | ||
* @return the new symbol | ||
*/ | ||
public RubySymbol asAccessor() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean asWriter
(not having asReader
seems as just a coincidence since its the same symbol) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah nice. You're right I should have called this asWriter(). I will change that.
@Override | ||
public final String toString() { | ||
return symbol; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could use an explanation (comment) on why not simply return symbol;
...for mere mortals :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah I will add an explanatory comment. Largely toString is doing it's best to make a j.l.String which displays properly but is not perfect since toString() should not be used internally for anything but debugging. Going full RubyString + decode seemed unnecesary here. Perhaps I should also add asJavaString() or using str() is more appropriate if you want to work with a string version of a RubySymbol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overriding asJavaString
explicitly (from IRubyObject) sounds good ... for the full decoding cycle (unless its already done so).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also just found a few consumers of toString and changed them to idString or asJavaString depending on whether their context was used in an identifier or non-identifier way (for RubySymbol). I will likely be trying to audit more general toString() usage as we seem to be a bit inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kares oh I should point out asJavaString cannot be decoded version of the method since there are still over 100 users using it to lookup methods. We have this pattern:
public static IRubyObject get_inner_class(final ThreadContext context,
final RubyModule self, final IRubyObject name) { // const_missing delegate
final String constName = name.asJavaString();
All our internal tables for identifiers are keyed off of 8859_1. I think we may need to make a slightly more restrictive but helpful method to replace this pattern and ween off of asJavaString (idString on RubySymbol is more correct but only on RubySymbol -- probably need something on RubyString as well). This though need not be resolved immediately. Just letting you know why asJavaString cannot decode today.
As an aside it is a little weird IRubyObject has asJavaString since this means I could pass a fixnum in here and it would end up as "5" or something. From Ruby side of things I think people would be surprised that would work even if there was a "5" method.
public static String str(Ruby runtime, IRubyObject value, String message) { | ||
RubyString buf = (RubyString) value.asString().dup(); | ||
|
||
buf.cat19(runtime.newString(message)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need to be using 19 legacy named methods? maybe cat
+ cat19
are worth some 'unification' ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, they are very different and both should always exist. I would agree with you that we should refactor with better names as cat19 means nothing but I suspect cat19 is still used by multiple native extensions which use m17n strings in any way (although we can definitely come up with a new name and deprecate). Main difference is one could care less about what it is cat'ing and the other realizes it needs encoding negotiation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yeah I recall them being impossible to unify now ... thanks
consumers to proper method instead of toString. This also addressed some more m17n errors we had.
So the branch is a bit misnamed at this point but I will summarize this a tiny bit (have talked quite a lot to @headius about this already).
all identifiers are made as RubySymbols with exception of annotations (currently assumed as 7bit ASCII -- this can be changed later). RubySymbols have bytelist + 8859_1 identifier string. Parser and IR now use Symbol. rest of runtime continues to use the j.l.String as identifier/intern string. New vocabulary:
RubySymbol.idString() is how you get an id for what you are looking up.
Error reporting has new static str(), names(), and ids() methods for making properly encoded Error strings. This is not 100% completed but many many errors have been converted.