- Notifications
You must be signed in to change notification settings - Fork1.6k
<regex>: Fix matching of loops with context-dependent empty matches and bounded number of repetitions#5820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Merged
StephanTLavavej merged 2 commits intomicrosoft:mainfrommuellerj2:regex-fix-context-dependent-empty-matches-in-loopsNov 5, 2025
Merged
Uh oh!
There was an error while loading.Please reload this page.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
… and bounded number of repetitions
Uh oh!
There was an error while loading.Please reload this page.
StephanTLavavej approved these changesNov 3, 2025
Member
StephanTLavavej commentedNov 4, 2025
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
3f29c70 intomicrosoft:main 41 checks passed
Uh oh!
There was an error while loading.Please reload this page.
Member
StephanTLavavej commentedNov 5, 2025
🐞 🛠️ 😸 |
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading.Please reload this page.
Fixes#5797.
Bug analysis
The matcher contains an optimization for empty matches in a loop that has a minimum number of repetitions set. It's based on the observation that when a repetition matched the empty string, the next repetition can just match the empty string again, so the minimum is essentially lifted. This is achieved by simulating minimum number of reps - k - 1 matches, if the empty match was the k-th one.
However, this means we also have fewer remaining repetitions if there is also a maximum number of repetitions set. If these aren't enough, the matching along this trajectory fails.
This isn't much of an issue, though, if the empty match can happen in any repetition: If we must have at most y empty matches, then we will find the solution in the trajectory where the first empty match happens in repetition (min number of reps - y). (A mathematician might say that repetitions matching an empty string must commute with all other repetitions.)
But it's possible to construct empty matches that can only appear when the repetition happens at specific positions in the input string using assertions: ECMAScript's lookahead assertions, but also caret and dollar anchors. Using these constructs and disjunctions inside the repeated expression, we can construct input-position-dependent empty matches. This means that the bug could be encountered in some ECMAScript, POSIX extended and awk regexes.
What about POSIX basic regexes?
I'm not sure. While such regexes have one more context-dependent construct compared to ECMAScript and POSIX extended, namely backreferences (ECMAScript backreferences arenot the right kind of context-dependent because they either stay the same throughout the whole evaluation of the loop or are reset at the start of each repetition), these regexes don't have any disjunctions. I have tried to come up with a problematic basic regex, but have failed.
What about simple loops?
Simple loops implement this optimization as well. But I think context-dependent constructs cannot appear in simple loops currently (though maybe in the future). And even if they could appear, it wouldn't matter: Since simple loops are guaranteed to match strings of the same length in each repetition, a simple loop can only ever match empty strings if it matched an empty string once.
Fix
The easiest fix is to just disable this optimization for non-simple loops if a maximum number of repetitions has been set on the loop.
Alternatives
We could separate the repetition counters for minimum and maximum number of repetitions and only increase the counter used for the minimum if an empty match is encountered. Alternatively, we could add some additional loop state variables that encode that the minimum number of repetitions have essentially been lifted. But this would mean that more loop state would have to be stored and managed even for loops that don't gain anything from this optimization. I doubt that all this work is worth it just to enable this optimization in a few more cases.
Doubts about this optimization
This is second time this optimization is a source of bugs in the matcher (see#5218 (comment) with fix in#5494 after it became clear that this can cause some wrong assignments to capturing groups in ECMAScript as well) because it is applied too aggressively. I can't guarantee that we will not discover another bug caused by it. It also seems to optimize a weird corner case: What's the point of setting the minimum number of repetitions on a loop if the repeated pattern can match the empty string in all repetitions? So I will probably just pull it out if we stumble on another bug that is caused by this optimization.