Improve compatibility with MRI's Ripper #4898

grddev · 2017-12-16T12:51:27Z

After having discovered #4882 and then discovering that (block-)local variable references didn't work, it seemed like the whole process would be quicker if I looked for the solution myself, rather than wait for someone else to solve one problem before I could discover the next one.

And once I had figured out roughly how the code worked, I spent a little time to try to get all the MRI specs to pass. Apart from the tests that rely on the MRI directory layout, the only remaining issue is that JRuby's Ripper doesn't validate regular expressions during parsing. I looked into it briefly, but given that the Ripper parser effectively doesn't have access to a parse tree (as the dispatch callbacks replace the raw values), it wasn't obvious to me how to do that.

In addition to the MRI tests, this adds test/jruby/test_ripper.rb that tries to cover additional aspects of Ripper that the MRI specs do not cover. I guess these would ideally go into MRI to clarify that MRI's Ripper behavior is intentional rather than accidental.

Given that this is essentially a collection of a whole bunch of small fixes, I refer to the individual commit messages for additional information, instead of trying to summarize the changes here.

First, this ignores the entire file that relies on the MRI file structure, as it seems unlikely that any of those tests will ever be applicable. Second, it adds the test_sexp.rb file which was absent from the index for some unknown reason. It also introduces a couple of excludes for failing tests in that file. Third, it clears the excludes for a couple of tests that are currently passing, and updates the description for one of the excludes.

Otherwise, an invalid byte sequence inside an identifier causes Ripper to continuously parse the same invalid byte sequence over and over.

They are identical (after fixing tokadd_mbchar)

Otherwise, it doesn't recognize variables introduced in the block (parameters), resulting in vcall nodes instead of var_ref

The previous implementation called assignable on a bunch of things that were unlikely to be strings, while only supporting string arguments. This changes the logic so that it relies on getting the identifier value from the lexer instead. In addition, this inlines the logic from assignable in all cases where the implementation is known, and splits the cases for identifiers and constants, as they have very little to do with each other. Given that the parser has inlined the definition of MRI's user_varaible and keyword variable, it seems to make sense to also inline the assignable actions. While inlining the assignable implementation, it seamed to make sense to also inline is_id_var (and remove its unused argument).

Not 100% sure about the impact of this, but it seems unlikely that having two separate Ruby grammars for Ripper and the main parser would be a good thing.

The corresponding construct was in the main parser, and allowed it correctly parse "m x do; end", where Ripper would fail and assign the block to x rather than m.

It seems like MRI is doing this.

It might not be important, but MRI seems to do this.

Instead of ending the heredoc before emitting the final string content part, this instead just emits the final string content part and positions the cursor right before the end marker, such that the next call to the heredoc term should end the heredoc using the early return. The motivation for doing this was that with ripper the final string content of a heredoc wouldn't be dispatched to any token handler, but instead just propagated. The old behavior was clearly wrong for this corner-case, but this approach might have other problems.

As an added bonus, the parsing of magic comment key values no longer stops at the first unrecognized key, and thus writing # -*- warn_past_scope: true; coding: ISO-8859-1 -*- will now apply the encoding even if JRuby doesn't support the magic comment attribute warn_past_scope.

enebo · 2017-12-16T14:57:31Z

@grddev This is fantastic.

If you have interest in 9.1.x supporting these changes then you might maybe be up for the challenge of opening up another PR for jruby-9.1 branch so our current 9.1.x can get these changes. Note though that a few of these commits will not cherry-pick over (notably RipperParser.y) since ruby 2.4 added new productions 2.3 does not have like arg_rhs (ah also master changed Tokens to use keyword_XXX).

If you don't want to do that work I can try cp'ing as much as is reasonable and then fix the rest.

Another test case I frequently use is Yard although we pass their specs so as you can see ripper coverage is not fantastic.

For regexp checking did you see ParserSupport.regexpFragmentCheck? I have not spent any time looking at this but that method will check syntax but only require a ByteList. For named captures as lvars perhaps that is more involved than that?

grddev · 2017-12-16T23:19:44Z

For regexp checking did you see ParserSupport.regexpFragmentCheck? I have not spent any time looking at this but that method will check syntax but only require a ByteList. For named captures as lvars perhaps that is more involved than that?

I did see that, but the main part that the fragment validation does is ensure that the encoding is valid in each fragment, which isn't really possible to do before seeing the flags for the regexp, which in turn would require the Parser/Lexer to keep track of the regular expression fragments until getting to the options. Once the fragments are recorded, it would also be relatively simple to perform full validation/compilation of the regexp if the expression is static.

I didn't even think about variable captures, and haven't looked into how MRI Ripper deals with that, so that might require some investigation as well.

Its definitely solvable, but I believe it would be complicated enough to warrant its own PR (and I'm also not sure I have the time to look into it right now).

grddev added 13 commits December 16, 2017 13:21

Return false on tokadd_mbchar failure

62462ec

Otherwise, an invalid byte sequence inside an identifier causes Ripper to continuously parse the same invalid byte sequence over and over.

Call tokadd_mbchar from tokenAddMBC

32d2c20

They are identical (after fixing tokadd_mbchar)

Make opt_bv_decl consistent with MRI

bdc6d11

Use current scope in is_id_var

4a1b16d

Otherwise, it doesn't recognize variables introduced in the block (parameters), resulting in vcall nodes instead of var_ref

Align Ripper parser with main Ruby parser

1763e4e

Not 100% sure about the impact of this, but it seems unlikely that having two separate Ruby grammars for Ripper and the main parser would be a good thing.

Add command-args-state begin to Ripper

c0cfb5f

The corresponding construct was in the main parser, and allowed it correctly parse "m x do; end", where Ripper would fail and assign the block to x rather than m.

Reset the cmd-args-state from Ripper push-local-scope

5bd4137

It seems like MRI is doing this.

Clear command-args-state in brace blocks

56e54a7

It might not be important, but MRI seems to do this.

Dispatch on_operator_ambiguous for Ripper

b702e52

enebo added this to the JRuby 9.2.0.0 milestone Dec 16, 2017

enebo merged commit 0a4be2d into jruby:master Dec 16, 2017

grddev mentioned this pull request Dec 16, 2017

Improve compatibility with MRI's Ripper for JRuby 9.1 #4900

Merged

This was referenced Dec 17, 2017

Add support for validating Regexp in Ripper #4902

Merged

Miscellaneous parsing fixes #4911

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Improve compatibility with MRI's Ripper #4898

Improve compatibility with MRI's Ripper #4898

grddev commented Dec 16, 2017

enebo commented Dec 16, 2017

grddev commented Dec 16, 2017

Improve compatibility with MRI's Ripper #4898

Improve compatibility with MRI's Ripper #4898

Conversation

grddev commented Dec 16, 2017

enebo commented Dec 16, 2017

grddev commented Dec 16, 2017