Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork41
optimize positive lookahead at end of pattern#167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
optimize positive lookahead at end of pattern#167
Uh oh!
There was an error while loading.Please reload this page.
Conversation
- we achieve this by moving a trailing positive lookahead outside our wrapped capture group 0.- if the contents of the lookahead are easy, and the rest of the expression is also easy, we can then delegate the whole regexp to regex-automata and fixup the capture groups afterwards.- If the contents are not easy, no harm is done because the expression retains the same semantics in our VM, as we only perform the move if the whole pattern is not self-recursive. Also, as it is done after parsing, any relative backreferences etc. should be unaffected.- Because it is done before analysis, no re-analysis needs to take place, so it is quite efficient.This allows us to take full advantage of regex-automata's prefilters etc, and in general delegating to the NFA should always be quicker than using a backtracking engine and trying each character position in the haystack one at a time.This type of regex pattern is a quite common scenario inside `.sublime-syntax` definitions which are compatible with Sublime Text's `sregex` engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Nice idea and great work! I left some comments.
Also, I think it would be good to add some more tests exercising these new combinations (easy and optimized, hard and optimized) tomatching
/finding
.
@@ -1 +1 @@ | |||
github: [raphlinus, robinst] | |||
github: [raphlinus, robinst, keith-hall] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
🎉
Uh oh!
There was an error while loading.Please reload this page.
src/lib.rs Outdated
@@ -239,6 +241,10 @@ enum RegexImpl { | |||
inner: RaRegex, | |||
options: RegexOptions, | |||
}, | |||
WrapCaptureFixup { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you add some documentation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
done
src/lib.rs Outdated
@@ -409,6 +415,10 @@ enum CapturesImpl<'t> { | |||
text: &'t str, | |||
locations: RaCaptures, | |||
}, | |||
WrapCaptureFixup { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Docs here as well, i.e. something like "capture group 1 in the API maps to group 2 internally, etc"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
please let me know if what I added is clear enough
let raw_e = match tree.expr { | ||
Expr::Concat(ref v) => match v[1] { | ||
Expr::Group(ref child) => child, | ||
_ => unreachable!(), | ||
}, | ||
_ => unreachable!(), | ||
}, | ||
_ => unreachable!(), | ||
}; | ||
raw_e.to_str(&mut re_cooked, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I never liked this wrapping and unwrapping here, and this makes it a bit more complicated.
I wonder how complicated it would be to instead make the VM understand arbitrary start positions and capturing match bounds.. Do you know how the regex crate does this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I don't know, I haven't checked yet. Would you mind if we left it for a future PR if it is non-trivial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
the fix for#167 (comment) has perhaps made it even more complicated... so I will understand if you say "no" 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The regex crate code for this is e.g. here:
Note that it knows whether a pattern is anchored or not, something we could do as well (another optimization).
Apart from that, I haven't gone back to this yet. I do now think it would not be too difficult to change the VM to get rid of the wrapping. The added benefit as mentioned above is that anchored matches that happen to not match would also be optimized.
(An alternative that I typed at some point but not sure I sent, is if we delayed the wrapping to later when we know we need the VM. AFAICT it would mean we'd have to wrap usingInfos
instead when we callcompile(&info)?
, but the nice thing is that the delegation case wouldn't be affected. Let me know if that makes sense or not 😅.)
src/lib.rs Outdated
let inner = compile::compile_inner(&re_cooked, &options)?; | ||
return Ok(Regex { | ||
inner: RegexImpl::Wrap { inner, options }, | ||
inner: if requires_capture_group_fixup { | ||
RegexImpl::WrapCaptureFixup { inner, options } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Instead of this we could also add a boolean toWrap
, right? Some code below could be simplified with that I think, e.g. in captures_from_pos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
done
src/lib.rs Outdated
match regex.inner { | ||
RegexImpl::WrapCaptureFixup { .. } => {}, | ||
_ => panic!("trailing positive lookahead for an otherwise easy pattern should avoid going through the VM"), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Isn't there something to make this simpler, maybeassert!(matches!(...))
(but keep the nice assertion message)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
yes, there is, thanks for the suggestion - done :)
src/lib.rs Outdated
}; | ||
raw_e.to_str(&mut re_cooked, 0); | ||
} else { | ||
tree.expr.to_str(&mut re_cooked, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
So in the case where we optimized, we're now asking the regex crate to match on(?s:.)*?(pattern)
, instead of justpattern
.. Does that change anything about how the matching works?
Basically I'm wondering whether the(?s:.)*?
in the beginning has any bad effects (and we don't really need it for delegating, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I guess a bad effect could be extra memory usage on a large input haystack. We indeed don't need it for delegating... I've pushed a fix, but I'm not 100% sure I'm happy with it... Please see what you think...
src/optimize.rs Outdated
} | ||
} | ||
fn optimize_trailing_lookahead(tree: &mut ExprTree) -> Result<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Why theResult
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
good question, nice catch. No need for it - removed :)
// - if it was applied, capture group 0 is no longer implicit, but explicit | ||
// if/when the whole expression gets delegated to regex-automata | ||
// converts i.e. original pattern `a(?=b)` when wrapped in the capture group 0 | ||
// as `(?s:.)*?(a(?=b))` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Is(?s:.)*?(a(?=b|c))
handled correctly here? I.e. does the content of the lookaround need to be wrapped in a group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
great question! theto_str
method handles that for us, I added some tests to prove it. (We actually don't have anExpr
for a non-capture group)
Co-authored-by: Robin Stocker <robinst@canva.com>
Uh oh!
There was an error while loading.Please reload this page.
Introducing
fancy-regex
's first optimization! This is done purely on the parse tree, without touching the VM.We move a trailing positive lookahead outside our wrapped capture group 0.
If the contents of the lookahead are easy, and the rest of the expression is also easy, we can then delegate the whole regexp to regex-automata and fixup the capture groups afterwards.
If the contents are not easy, no harm is done because the expression retains the same semantics in our VM, as we only perform the move if the whole pattern is not self-recursive. Also, as it is done after parsing, any relative backreferences etc. should be unaffected.
Because it is done before analysis, no re-analysis needs to take place, so it is quite efficient.
This allows us to take full advantage of regex-automata's prefilters etc, and in general delegating to the NFA should always be quicker than using a backtracking engine and trying each character position in the haystack one at a time.
This type of regex pattern is a quite common scenario inside
.sublime-syntax
definitions which are compatible with Sublime Text'ssregex
engine.Examples - the new capture group becomes our explicit capture group 0:
(?=foo)
becomes()foo
\w+(?=[ ])
becomes(\w+)[ ]
a(b)c(?=def)
becomes(a(b)c)def
a(b)c(?=def|ghi)
becomes(a(b)c)(?:def|ghi)