-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for validating Regexp in Ripper #4902
Conversation
Forgot to mention, but this does nothing to address local variables as mentioned in #4898, but some quick testing seemed to indicate that MRI didn't handle this either (as the below output has a vcall rather than a var_ref towards the end.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really I just want a comment on that null fragment if since it is not obvious what scenario that happens in.
lexer.checkRegexpFragment(runtime, fragment, options); | ||
} | ||
} | ||
if (last != null && regexpFragments.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only possible for /#{ffofofofo}/ and /@{gfogfogogog}/? If so can you add a comment here as to what scenario this is for? Without printing out the lex stream I would almost think this was not possible. A comment will help later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrote the implementation with an additional regexpDynamic
variable instead to make the logic simpler to follow instead on commenting on the cryptic logic.
} | ||
} | ||
|
||
private boolean is7BitASCII(ByteList value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for myself. We should make sure CR is calculated as we build up string so we do not need to rescan all bytelists after creation.
@@ -1453,13 +1453,8 @@ public Node arg_append(Node node1, Node node2) { | |||
|
|||
// MRI: reg_fragment_check | |||
public void regexpFragmentCheck(RegexpNode end, ByteList value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since 9.2 will be considered a major version you can remove this from ParserSupport if you want (or not) and just make callers call lexer directly. Optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@grddev sorry our last merge broke this. I did not notice you had addressed the comments so I spaced out landing this. Can you update it? |
And use the same function from both Ripper and main parser.
While not really related to Lexing, this is a component that is shared between Ripper and the main parser, and that seemed like the lesser evil.
It seems to have been `!ENCODING_IS_ASCII8BIT(str)` from the beginning, so I'm not sure why it was the opposite here.
The code was copied from Parser support, so it was clearly broken before, but it had to be fixed now as parts of the Ripper test suite relies on $! rather than explicitly catching the exception.
This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation. The validation itself is a copied/simplified version of what is performed by the main parser, as large parts the validation depended on the AST structure, which we do not have here. Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule. I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.
Use a separate variable to track whether things are dynamic or not, and use a List to avoid tracking the last element explicitly.
The methods were only retained to provide the old interface, but by directly calling the new methods in the Lexer, we can remove the old methods, given that we don't really need to be backwards compatible here.
Rebased the changes on top of master |
@grddev thanks. sorry I lost this one in the weeds :) |
This represents a stab at implementing the remaining Ripper compatibility as mentioned in #4898.
This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation.
The validation itself is a copied/simplified version of what is performed by the main parser, as large parts the validation depended on the AST structure, which we do not have here.
Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule.
I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.
In order to reduce copy-paste between the main parser and Ripper, I opted to shift some Regexp-related code into the Lexer. In theory, the code doesn't belong in the lexer, but putting it in the Lexer has some benefits. First, it is a component that is shared in a reasonable way between the two parsers. Second, it is essentially required by the proposed implementation, as the new validation takes place effectively inside the Lexer.
Unsurprisingly, it turns out that the coverage for Ripper parsing of Regexp isn't very extensive, and I haven't had time to put the code through any additional tests.