Add support for validating Regexp in Ripper #4902

Merged: 7 commits into jruby:master on Jan 14, 2018

Conversation

@grddev (Contributor) commented on Dec 17, 2017

This represents a stab at implementing the remaining Ripper compatibility as mentioned in #4898.

This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation.
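As a rough sketch of that flow (placeholder class and method names, apart from checkRegexpFragment, which appears in the diff below):

```java
import java.util.ArrayList;
import java.util.List;

import org.jruby.Ruby;
import org.jruby.util.ByteList;
import org.jruby.util.RegexpOptions;

// Sketch only: StringTerm produces a run of tSTRING_CONTENT fragments and
// finishes with tREGEXP_END, which carries the regexp options, so the Ripper
// lexer can buffer the fragments and validate them once the options are known.
class RegexpFragmentCollector {
    // stand-in for the lexer method used in this PR (see the diff below);
    // the exact class it lives on is not shown here
    interface FragmentChecker {
        void checkRegexpFragment(Ruby runtime, ByteList fragment, RegexpOptions options);
    }

    private final List<ByteList> regexpFragments = new ArrayList<>();

    // called for each tSTRING_CONTENT fragment inside the regexp literal
    void addFragment(ByteList fragment) {
        regexpFragments.add(fragment);
    }

    // called when tREGEXP_END arrives with the options needed for validation
    void validate(Ruby runtime, FragmentChecker lexer, RegexpOptions options) {
        for (ByteList fragment : regexpFragments) {
            lexer.checkRegexpFragment(runtime, fragment, options);
        }
        regexpFragments.clear();
    }
}
```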

The validation itself is a copied and simplified version of what the main parser performs, as large parts of that validation depend on the AST structure, which we do not have here.

Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule.

I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.

In order to reduce copy-paste between the main parser and Ripper, I opted to shift some Regexp-related code into the lexer. In theory, the code doesn't belong there, but putting it in the lexer has some benefits. First, the lexer is a component that is shared in a reasonable way between the two parsers. Second, it is essentially required by the proposed implementation, as the new validation effectively takes place inside the lexer.

Unsurprisingly, it turns out that the coverage for Ripper parsing of Regexp isn't very extensive, and I haven't had time to put the code through any additional tests.

@grddev (Contributor, Author) commented on Dec 17, 2017

Forgot to mention: this does nothing to address the local variables mentioned in #4898, but some quick testing seemed to indicate that MRI doesn't handle this either (the output below has a vcall rather than a var_ref towards the end).

% RBENV_VERSION=2.4.2 ruby -rripper -e 'p Ripper.sexp("/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ \"$3.67\"\ndollars")'
[:program, [[:binary, [:regexp_literal, [[:@tstring_content, "$(?<dollars>d+).(?<cents>d+)", [1, 1]]], [:@regexp_end, "/", [1, 29]]], :=~, [:string_literal, [:string_content, [:@tstring_content, "$3.67", [1, 35]]]]], [:vcall, [:@ident, "dollars", [2, 0]]]]]
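For comparison, this is the local-variable behaviour that a var_ref node would correspond to in a plain (non-Ripper) parse; a small illustrative snippet, not taken from the PR:

```ruby
# Plain Ruby (not Ripper): when a regexp *literal* with named captures is on
# the left-hand side of =~, the parser defines a local variable per capture.
/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67"
dollars  # => "3"   (a local variable reference, i.e. var_ref, not a vcall)
cents    # => "67"
```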

@enebo (Member) left a comment:

Really I just want a comment on that null-fragment `if`, since it is not obvious in what scenario it happens.

lexer.checkRegexpFragment(runtime, fragment, options);
}
}
if (last != null && regexpFragments.size() == 1) {
@enebo (Member):
Is this only possible for /#{ffofofofo}/ and /@{gfogfogogog}/? If so can you add a comment here as to what scenario this is for? Without printing out the lex stream I would almost think this was not possible. A comment will help later on.

@grddev (Contributor, Author):
Rewrote the implementation with an additional regexpDynamic variable to make the logic simpler to follow, rather than commenting on the cryptic logic.
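Roughly, the simplified shape looks like this (a hedged sketch with assumed helper names, not the committed code):

```java
import java.util.List;

import org.jruby.util.ByteList;
import org.jruby.util.RegexpOptions;

// Sketch only: instead of inferring "static vs. dynamic regexp" from how many
// fragments were collected, an explicit flag is kept while lexing the literal.
class RegexpEndHandler {
    // regexpDynamic is set to true whenever an interpolation (#{...}) is seen
    // inside the regexp literal.
    void onRegexpEnd(List<ByteList> fragments, boolean regexpDynamic, RegexpOptions options) {
        for (ByteList fragment : fragments) {
            checkFragment(fragment, options);        // per-fragment escape/encoding check
        }
        if (!regexpDynamic) {
            // a purely static /literal/ can additionally be compiled as a whole,
            // catching errors that span the entire pattern
            checkWholePattern(fragments.get(0), options);
        }
    }

    // hypothetical stand-ins for the checks discussed in this PR
    void checkFragment(ByteList fragment, RegexpOptions options) {}
    void checkWholePattern(ByteList pattern, RegexpOptions options) {}
}
```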

}
}

private boolean is7BitASCII(ByteList value) {
@enebo (Member):
Note for myself: we should make sure the CR (code range) is calculated as we build up the string, so we do not need to rescan all ByteLists after creation.
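For reference, the check itself is just a scan for bytes with the high bit set; a minimal sketch using the usual ByteList accessors (assumed here, not taken from the diff):

```java
import org.jruby.util.ByteList;

// Sketch of a 7-bit ASCII scan over a ByteList; as the note above says, ideally
// the code range would be tracked while the string is built so a rescan like
// this becomes unnecessary.
class AsciiScan {
    static boolean is7BitASCII(ByteList value) {
        byte[] bytes = value.unsafeBytes();
        int begin = value.begin();
        int end = begin + value.length();
        for (int i = begin; i < end; i++) {
            if ((bytes[i] & 0x80) != 0) return false;   // high bit set: not 7-bit
        }
        return true;
    }
}
```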

@@ -1453,13 +1453,8 @@ public Node arg_append(Node node1, Node node2) {

// MRI: reg_fragment_check
public void regexpFragmentCheck(RegexpNode end, ByteList value) {
@enebo (Member):
Since 9.2 will be considered a major version, you can remove this from ParserSupport if you want (or not) and just make callers call the lexer directly. Optional.

@grddev (Contributor, Author):
Done

@enebo (Member) commented on Jan 10, 2018

@grddev sorry our last merge broke this. I did not notice you had addressed the comments so I spaced out landing this. Can you update it?

Commit messages:
- And use the same function from both Ripper and the main parser.
- While not really related to lexing, this is a component that is shared between Ripper and the main parser, and that seemed like the lesser evil.
- It seems to have been `!ENCODING_IS_ASCII8BIT(str)` from the beginning, so I'm not sure why it was the opposite here.
- The code was copied from ParserSupport, so it was clearly broken before, but it had to be fixed now, as parts of the Ripper test suite rely on $! rather than explicitly catching the exception.
- Use a separate variable to track whether things are dynamic or not, and use a List to avoid tracking the last element explicitly.
- The methods were only retained to provide the old interface, but by directly calling the new methods in the lexer we can remove the old methods, given that we don't really need to be backwards compatible here.

@grddev (Contributor, Author) commented on Jan 14, 2018

Rebased the changes on top of master

@enebo merged commit 2614c77 into jruby:master on Jan 14, 2018
@enebo (Member) commented on Jan 14, 2018

@grddev thanks. sorry I lost this one in the weeds :)

@enebo added this to the JRuby 9.2.0.0 milestone on Feb 21, 2018