Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

optimize positive lookahead at end of pattern#167

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
keith-hall wants to merge12 commits intofancy-regex:main
base:main
Choose a base branch
Loading
fromforkeith:trailing_positive_lookahead_optimization

Conversation

keith-hall
Copy link
Contributor

@keith-hallkeith-hall commentedJul 7, 2025
edited
Loading

Introducingfancy-regex's first optimization! This is done purely on the parse tree, without touching the VM.

We move a trailing positive lookahead outside our wrapped capture group 0.

If the contents of the lookahead are easy, and the rest of the expression is also easy, we can then delegate the whole regexp to regex-automata and fixup the capture groups afterwards.

If the contents are not easy, no harm is done because the expression retains the same semantics in our VM, as we only perform the move if the whole pattern is not self-recursive. Also, as it is done after parsing, any relative backreferences etc. should be unaffected.

Because it is done before analysis, no re-analysis needs to take place, so it is quite efficient.

This allows us to take full advantage of regex-automata's prefilters etc, and in general delegating to the NFA should always be quicker than using a backtracking engine and trying each character position in the haystack one at a time.

This type of regex pattern is a quite common scenario inside.sublime-syntax definitions which are compatible with Sublime Text'ssregex engine.

Examples - the new capture group becomes our explicit capture group 0:

  • (?=foo) becomes()foo
  • \w+(?=[ ]) becomes(\w+)[ ]
  • a(b)c(?=def) becomes(a(b)c)def
  • a(b)c(?=def|ghi) becomes(a(b)c)(?:def|ghi)

- we achieve this by moving a trailing positive lookahead outside our wrapped capture group 0.- if the contents of the lookahead are easy, and the rest of the expression is also easy, we can then delegate the whole regexp to regex-automata and fixup the capture groups afterwards.- If the contents are not easy, no harm is done because the expression retains the same semantics in our VM, as we only perform the move if the whole pattern is not self-recursive. Also, as it is done after parsing, any relative backreferences etc. should be unaffected.- Because it is done before analysis, no re-analysis needs to take place, so it is quite efficient.This allows us to take full advantage of regex-automata's prefilters etc, and in general delegating to the NFA should always be quicker than using a backtracking engine and trying each character position in the haystack one at a time.This type of regex pattern is a quite common scenario inside `.sublime-syntax` definitions which are compatible with Sublime Text's `sregex` engine.
Copy link
Contributor

@robinstrobinst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice idea and great work! I left some comments.

Also, I think it would be good to add some more tests exercising these new combinations (easy and optimized, hard and optimized) tomatching/finding.

keith-hall reacted with thumbs up emoji
@@ -1 +1 @@
github: [raphlinus, robinst]
github: [raphlinus, robinst, keith-hall]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

🎉

src/lib.rs Outdated
@@ -239,6 +241,10 @@ enum RegexImpl {
inner: RaRegex,
options: RegexOptions,
},
WrapCaptureFixup {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can you add some documentation here?

keith-hall reacted with thumbs up emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done

src/lib.rs Outdated
@@ -409,6 +415,10 @@ enum CapturesImpl<'t> {
text: &'t str,
locations: RaCaptures,
},
WrapCaptureFixup {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Docs here as well, i.e. something like "capture group 1 in the API maps to group 2 internally, etc"

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

please let me know if what I added is clear enough

Comment on lines +755 to +762
let raw_e = match tree.expr {
Expr::Concat(ref v) => match v[1] {
Expr::Group(ref child) => child,
_ => unreachable!(),
},
_ => unreachable!(),
},
_ => unreachable!(),
};
raw_e.to_str(&mut re_cooked, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I never liked this wrapping and unwrapping here, and this makes it a bit more complicated.

I wonder how complicated it would be to instead make the VM understand arbitrary start positions and capturing match bounds.. Do you know how the regex crate does this?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't know, I haven't checked yet. Would you mind if we left it for a future PR if it is non-trivial?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

the fix for#167 (comment) has perhaps made it even more complicated... so I will understand if you say "no" 😅

Copy link
Contributor

@robinstrobinstJul 10, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The regex crate code for this is e.g. here:

https://github.com/rust-lang/regex/blob/1a069b9232c607b34c4937122361aa075ef573fa/regex-automata/src/nfa/thompson/pikevm.rs#L1218-L1219

Note that it knows whether a pattern is anchored or not, something we could do as well (another optimization).

Apart from that, I haven't gone back to this yet. I do now think it would not be too difficult to change the VM to get rid of the wrapping. The added benefit as mentioned above is that anchored matches that happen to not match would also be optimized.

(An alternative that I typed at some point but not sure I sent, is if we delayed the wrapping to later when we know we need the VM. AFAICT it would mean we'd have to wrap usingInfos instead when we callcompile(&info)?, but the nice thing is that the delegation case wouldn't be affected. Let me know if that makes sense or not 😅.)

keith-hall reacted with thumbs up emoji
src/lib.rs Outdated
let inner = compile::compile_inner(&re_cooked, &options)?;
return Ok(Regex {
inner: RegexImpl::Wrap { inner, options },
inner: if requires_capture_group_fixup {
RegexImpl::WrapCaptureFixup { inner, options }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Instead of this we could also add a boolean toWrap, right? Some code below could be simplified with that I think, e.g. in captures_from_pos.

keith-hall reacted with thumbs up emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done

src/lib.rs Outdated
Comment on lines 2060 to 2063
match regex.inner {
RegexImpl::WrapCaptureFixup { .. } => {},
_ => panic!("trailing positive lookahead for an otherwise easy pattern should avoid going through the VM"),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Isn't there something to make this simpler, maybeassert!(matches!(...)) (but keep the nice assertion message)?

keith-hall reacted with thumbs up emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

yes, there is, thanks for the suggestion - done :)

src/lib.rs Outdated
};
raw_e.to_str(&mut re_cooked, 0);
} else {
tree.expr.to_str(&mut re_cooked, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So in the case where we optimized, we're now asking the regex crate to match on(?s:.)*?(pattern), instead of justpattern.. Does that change anything about how the matching works?

Basically I'm wondering whether the(?s:.)*? in the beginning has any bad effects (and we don't really need it for delegating, right?)

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I guess a bad effect could be extra memory usage on a large input haystack. We indeed don't need it for delegating... I've pushed a fix, but I'm not 100% sure I'm happy with it... Please see what you think...

}
}

fn optimize_trailing_lookahead(tree: &mut ExprTree) -> Result<bool> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why theResult?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

good question, nice catch. No need for it - removed :)

// - if it was applied, capture group 0 is no longer implicit, but explicit
// if/when the whole expression gets delegated to regex-automata
// converts i.e. original pattern `a(?=b)` when wrapped in the capture group 0
// as `(?s:.)*?(a(?=b))`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is(?s:.)*?(a(?=b|c)) handled correctly here? I.e. does the content of the lookaround need to be wrapped in a group?

keith-hall reacted with thumbs up emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

great question! theto_str method handles that for us, I added some tests to prove it. (We actually don't have anExpr for a non-capture group)

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@robinstrobinstrobinst left review comments

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

2 participants
@keith-hall@robinst

[8]ページ先頭

©2009-2025 Movatter.jp