Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bytelist love merge #5151

Merged
merged 91 commits into from Apr 25, 2018
Merged

Bytelist love merge #5151

merged 91 commits into from Apr 25, 2018

Conversation

enebo
Copy link
Member

@enebo enebo commented Apr 24, 2018

So the branch is a bit misnamed at this point but I will summarize this a tiny bit (have talked quite a lot to @headius about this already).

all identifiers are made as RubySymbols with exception of annotations (currently assumed as 7bit ASCII -- this can be changed later). RubySymbols have bytelist + 8859_1 identifier string. Parser and IR now use Symbol. rest of runtime continues to use the j.l.String as identifier/intern string. New vocabulary:

  • RubySymbol - name of the identifier
  • String - id of the identifier

RubySymbol.idString() is how you get an id for what you are looking up.

Error reporting has new static str(), names(), and ids() methods for making properly encoded Error strings. This is not 100% completed but many many errors have been converted.

This start changing all defined methods to use the raw (8859_1) string as the
key in the method table.  Symbol StringLiteral both now properly use proper
ByteList and does not muck with String.
…l name.

Guard against it since it is a normal value.
…859_1 or

a raw string.  So I made toString() decode to actual charset based on encoding.
This may need another field if we discover toString() on RubySymbol is called
all the time but let's roll with this made on-demand for now.
Make and/or use isOr/isAnd and leverage AST being able to tell us this vs
  manually figuring it out.
```text
jruby -e 'p method(def foo(あ); end).parameters'
[[:req, :あ]]
```

to work.  Up til now this would confuse our impl and we may see an error like:

```text
ArgumentError: invalid byte sequence in US-ASCII
  inspect at org/jruby/RubySymbol.java:283
  inspect at org/jruby/RubySymbol.java:268
  inspect at org/jruby/RubyArray.java:1659
  inspect at org/jruby/RubyArray.java:1659
        p at org/jruby/RubyKernel.java:492
   <main> at -e:1
```

Pure ruby code should not properly be giving properly encoded values back. Java
(e.g. native) methods mostly should not matter since they do not return names
but if they do return names then they are all assumed to be UTF-8 encoded data.
…field

name but name changed from a String to a ByteList which make new frame read
write code find nothing.
…nkle in

our strategy.  Errors are generated from the search deep in Module but it is a
raw String.  This is somewhat ok for most strings because they just happen to be
utf-8 so the display.  This is a big issue though since we want proper error
message....
… day

would come but I was hoping I could cheat with 8859_1 strings enough to merge
for 9.2.  Unfortunately, lack of encoding through method methods means we cannot
accurately throw properly encoded error messages since he have long lost the
encoding of the method.

So with that said, most people will not be able to notice this change.  All the
String methods for methods still exist.  These methods will assume all data is
UTF-8.  This could be a dicey gamble but since it is the default for Ruby it
likely will not impact anyone.  Also since UTF-8 is 7bit clean for ASCII we
should totally not have an issue there.

Some API signatures have changed.  This means 9.2 might end up causing some
issues if you are doing something very low-level with our code base (although
changes up to this point have already broken some APIs within parsing and IR
and that is unavoidable).  The method reader/writer maps needed their return
type changed from <String, DynamicMethod> to <ByteList, DynamicMethod>.  It
was an option to make a new name for methods which returned this but this is
so far down in our APIs I think we should run with it.  Also ProfiledMethod
has changed from String to ByteList (e.g. getName() -> ByteList from String).
Ultimately ProfiledMethod needs to be properly encoded to be able to report
the method in a properly encoded way...so there you have it.

Next phase is to change interpreters and the JIT to stop from going:
ByteList -> String -> ByteList.  I did not do that part here since this is
pretty large already.
throw newTypeError(parent.getName() + "::" + moduleObj.getMetaClass().getName() + " is not a module");
}
RubyString typeName = parentIsObject ?
types(this, moduleObj.getMetaClass()) : types(this, parent, moduleObj.getMetaClass());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a throwis lost here - unless its intentional

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw is immediately below that line. It was two throws and now is one line since both throws just were assembling a proper typeName.

RubyString str = RubyString.newString(getRuntime(), bytes, UTF8Encoding.INSTANCE);
str.setTaint(isTaint());
return str;
RubyString bytes = runtime.newString("#<");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a q: so the direct ByteList concat wasn't good here, does it get problematic for any reason?
isn't the new way going to be doing a lot of "unnecessary" Java String-> byte[] conversion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question I don't recall offhand. It is possible there was no issue but I learned that these methods are not quite a sensible as you may think. I will try it out quickly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I looked at this and I think I can reduce the cost substantially and use a bytelist but there is a specific reason why it works. First when we have m17n data getting combined into a string we cannot use bytelist since bytelist has no idea how to combine two different encoded things together. In this very particular case we only have 7bit ascii strings and something else. So I think I can just mark the new string as same encoding as the something else and it will work....almost anyways. If the encoding used is not isAsciicompat we will generate garbage. OTOH my current code was broken in the same way so we are no worse off.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addendum since I am thinking about this. I should try the last broken part here. I suspect the types() method in RubyStringBuilder is actually what the non-ascii name will display like. So I need to try that.

* Make an instance variable out of this symbol (e.g. :foo will generate :foo=).
* @return the new symbol
*/
public RubySymbol asAccessor() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean asWriter (not having asReader seems as just a coincidence since its the same symbol) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice. You're right I should have called this asWriter(). I will change that.

@Override
public final String toString() {
return symbol;
Copy link
Member

@kares kares Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could use an explanation (comment) on why not simply return symbol; ...for mere mortals :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I will add an explanatory comment. Largely toString is doing it's best to make a j.l.String which displays properly but is not perfect since toString() should not be used internally for anything but debugging. Going full RubyString + decode seemed unnecesary here. Perhaps I should also add asJavaString() or using str() is more appropriate if you want to work with a string version of a RubySymbol?

Copy link
Member

@kares kares Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overriding asJavaString explicitly (from IRubyObject) sounds good ... for the full decoding cycle (unless its already done so).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also just found a few consumers of toString and changed them to idString or asJavaString depending on whether their context was used in an identifier or non-identifier way (for RubySymbol). I will likely be trying to audit more general toString() usage as we seem to be a bit inconsistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kares oh I should point out asJavaString cannot be decoded version of the method since there are still over 100 users using it to lookup methods. We have this pattern:

    public static IRubyObject get_inner_class(final ThreadContext context,
        final RubyModule self, final IRubyObject name) { // const_missing delegate
        final String constName = name.asJavaString();

All our internal tables for identifiers are keyed off of 8859_1. I think we may need to make a slightly more restrictive but helpful method to replace this pattern and ween off of asJavaString (idString on RubySymbol is more correct but only on RubySymbol -- probably need something on RubyString as well). This though need not be resolved immediately. Just letting you know why asJavaString cannot decode today.

As an aside it is a little weird IRubyObject has asJavaString since this means I could pass a fixnum in here and it would end up as "5" or something. From Ruby side of things I think people would be surprised that would work even if there was a "5" method.

public static String str(Ruby runtime, IRubyObject value, String message) {
RubyString buf = (RubyString) value.asString().dup();

buf.cat19(runtime.newString(message));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need to be using 19 legacy named methods? maybe cat + cat19 are worth some 'unification' ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, they are very different and both should always exist. I would agree with you that we should refactor with better names as cat19 means nothing but I suspect cat19 is still used by multiple native extensions which use m17n strings in any way (although we can definitely come up with a new name and deprecate). Main difference is one could care less about what it is cat'ing and the other realizes it needs encoding negotiation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah I recall them being impossible to unify now ... thanks

@enebo enebo merged commit 52a1832 into master Apr 25, 2018
@enebo enebo added this to the JRuby 9.2.0.0 milestone May 24, 2018
@enebo enebo deleted the bytelist_love branch November 9, 2018 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants