Add support for validating Regexp in Ripper #4902

Merged: 7 commits into jruby:master on Jan 14, 2018

Conversation

@grddev (Contributor) commented on Dec 17, 2017

This represents a stab at implementing the remaining Ripper compatibility as mentioned in #4898.

This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation.
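As a rough sketch of that flow (placeholder class and method names, apart from checkRegexpFragment, which appears in the diff below):

```java
import java.util.ArrayList;
import java.util.List;

import org.jruby.Ruby;
import org.jruby.util.ByteList;
import org.jruby.util.RegexpOptions;

// Sketch only: StringTerm produces a run of tSTRING_CONTENT fragments and
// finishes with tREGEXP_END, which carries the regexp options, so the Ripper
// lexer can buffer the fragments and validate them once the options are known.
class RegexpFragmentCollector {
    // stand-in for the lexer method used in this PR (see the diff below);
    // the exact class it lives on is not shown here
    interface FragmentChecker {
        void checkRegexpFragment(Ruby runtime, ByteList fragment, RegexpOptions options);
    }

    private final List<ByteList> regexpFragments = new ArrayList<>();

    // called for each tSTRING_CONTENT fragment inside the regexp literal
    void addFragment(ByteList fragment) {
        regexpFragments.add(fragment);
    }

    // called when tREGEXP_END arrives with the options needed for validation
    void validate(Ruby runtime, FragmentChecker lexer, RegexpOptions options) {
        for (ByteList fragment : regexpFragments) {
            lexer.checkRegexpFragment(runtime, fragment, options);
        }
        regexpFragments.clear();
    }
}
```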

The validation itself is a copied and simplified version of what the main parser performs, as large parts of that validation depend on the AST structure, which we do not have here.

Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule.

I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.

In order to reduce copy-paste between the main parser and Ripper, I opted to shift some Regexp-related code into the lexer. In theory, the code doesn't belong there, but putting it in the lexer has some benefits. First, the lexer is a component that is shared in a reasonable way between the two parsers. Second, it is essentially required by the proposed implementation, as the new validation effectively takes place inside the lexer.

Unsurprisingly, it turns out that the coverage for Ripper parsing of Regexp isn't very extensive, and I haven't had time to put the code through any additional tests.

@grddev (Contributor, Author) commented on Dec 17, 2017

Forgot to mention: this does nothing to address the local variables mentioned in #4898, but some quick testing seemed to indicate that MRI doesn't handle this either (the output below has a vcall rather than a var_ref towards the end).

% RBENV_VERSION=2.4.2 ruby -rripper -e 'p Ripper.sexp("/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ \"$3.67\"\ndollars")'
[:program, [[:binary, [:regexp_literal, [[:@tstring_content, "$(?<dollars>d+).(?<cents>d+)", [1, 1]]], [:@regexp_end, "/", [1, 29]]], :=~, [:string_literal, [:string_content, [:@tstring_content, "$3.67", [1, 35]]]]], [:vcall, [:@ident, "dollars", [2, 0]]]]]
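For comparison, this is the local-variable behaviour that a var_ref node would correspond to in a plain (non-Ripper) parse; a small illustrative snippet, not taken from the PR:

```ruby
# Plain Ruby (not Ripper): when a regexp *literal* with named captures is on
# the left-hand side of =~, the parser defines a local variable per capture.
/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67"
dollars  # => "3"   (a local variable reference, i.e. var_ref, not a vcall)
cents    # => "67"
```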

@enebo (Member) left a comment:

Really I just want a comment on that null-fragment `if`, since it is not obvious in what scenario it happens.

lexer.checkRegexpFragment(runtime, fragment, options);
}
}
if (last != null && regexpFragments.size() == 1) {
@enebo (Member):
Is this only possible for /#{ffofofofo}/ and /@{gfogfogogog}/? If so can you add a comment here as to what scenario this is for? Without printing out the lex stream I would almost think this was not possible. A comment will help later on.

@grddev (Contributor, Author):
Rewrote the implementation with an additional regexpDynamic variable to make the logic simpler to follow, rather than commenting on the cryptic logic.
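Roughly, the simplified shape looks like this (a hedged sketch with assumed helper names, not the committed code):

```java
import java.util.List;

import org.jruby.util.ByteList;
import org.jruby.util.RegexpOptions;

// Sketch only: instead of inferring "static vs. dynamic regexp" from how many
// fragments were collected, an explicit flag is kept while lexing the literal.
class RegexpEndHandler {
    // regexpDynamic is set to true whenever an interpolation (#{...}) is seen
    // inside the regexp literal.
    void onRegexpEnd(List<ByteList> fragments, boolean regexpDynamic, RegexpOptions options) {
        for (ByteList fragment : fragments) {
            checkFragment(fragment, options);        // per-fragment escape/encoding check
        }
        if (!regexpDynamic) {
            // a purely static /literal/ can additionally be compiled as a whole,
            // catching errors that span the entire pattern
            checkWholePattern(fragments.get(0), options);
        }
    }

    // hypothetical stand-ins for the checks discussed in this PR
    void checkFragment(ByteList fragment, RegexpOptions options) {}
    void checkWholePattern(ByteList pattern, RegexpOptions options) {}
}
```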

}
}

private boolean is7BitASCII(ByteList value) {
@enebo (Member):
Note for myself: we should make sure the CR (code range) is calculated as we build up the string, so we do not need to rescan all ByteLists after creation.
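For reference, the check itself is just a scan for bytes with the high bit set; a minimal sketch using the usual ByteList accessors (assumed here, not taken from the diff):

```java
import org.jruby.util.ByteList;

// Sketch of a 7-bit ASCII scan over a ByteList; as the note above says, ideally
// the code range would be tracked while the string is built so a rescan like
// this becomes unnecessary.
class AsciiScan {
    static boolean is7BitASCII(ByteList value) {
        byte[] bytes = value.unsafeBytes();
        int begin = value.begin();
        int end = begin + value.length();
        for (int i = begin; i < end; i++) {
            if ((bytes[i] & 0x80) != 0) return false;   // high bit set: not 7-bit
        }
        return true;
    }
}
```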

@@ -1453,13 +1453,8 @@ public Node arg_append(Node node1, Node node2) {

// MRI: reg_fragment_check
public void regexpFragmentCheck(RegexpNode end, ByteList value) {
@enebo (Member):
Since 9.2 will be considered a major version, you can remove this from ParserSupport if you want (or not) and just make callers call the lexer directly. Optional.

@grddev (Contributor, Author):
Done

@enebo (Member) commented on Jan 10, 2018

@grddev sorry our last merge broke this. I did not notice you had addressed the comments so I spaced out landing this. Can you update it?

Commit messages:
- And use the same function from both Ripper and the main parser.
- While not really related to lexing, this is a component that is shared between Ripper and the main parser, and that seemed like the lesser evil.
- It seems to have been `!ENCODING_IS_ASCII8BIT(str)` from the beginning, so I'm not sure why it was the opposite here.
- The code was copied from ParserSupport, so it was clearly broken before, but it had to be fixed now, as parts of the Ripper test suite rely on $! rather than explicitly catching the exception.
- Use a separate variable to track whether things are dynamic or not, and use a List to avoid tracking the last element explicitly.
- The methods were only retained to provide the old interface, but by directly calling the new methods in the lexer we can remove the old methods, given that we don't really need to be backwards compatible here.

@grddev (Contributor, Author) commented on Jan 14, 2018

Rebased the changes on top of master

@enebo merged commit 2614c77 into jruby:master on Jan 14, 2018
@enebo (Member) commented on Jan 14, 2018

@grddev thanks. sorry I lost this one in the weeds :)

@enebo added this to the JRuby 9.2.0.0 milestone on Feb 21, 2018