Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

<regex>: Speed up searches for regexes that start with assertions#5576

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation

@muellerj2
Copy link
Contributor

Towards#5468. This extends the skip heuristic to the remaining unhandled assertions: word boundaries and lookahead assertions.

  • The handling of word boundaries turns out to be much simpler in_Matcher2::_Skip() than in_Matcher2::_Is_wbound(), because we don't have to do any special handling for the start or end of the input string:_Skip() is never called for the start (or as the comment at the top says,--_First_arg is valid) and it cannot skip beyond the end of the input string (_Last) anyway. So we only have to handle the middle case and have to keep looking for the first position where the word character property changes (\b) or where it does not change (\B).
  • For negative lookahead assertions, the easiest thing to do is to just ignore the assertion body (i.e., to assume that the assertion succeeds) and keep looking for a suitable node that can be used for skipping in the remainder of the main regex.
    • Keep in mind that_Skip() is just a heuristic, so it's fine if we don't exclude some non-matches. There is only one thing we must not do here: Skip something that could match.
  • For positive lookahead assertions, we can look for the first place where the start of the assertion and the regex after the assertion both match.
    • Because this is implemented via recursion, I enforce a maximum recursion depth of 50 to prevent stack overflow. This limit is only reached by regexes that start with more than fifty positive lookahead assertions, so this should be more than enough for any practical regex. And the worst thing that happens if it is reached is that the optimization is (maybe only partially) disengaged.
    • While the code might look similar to the one that caused the quadratic slowdown in<regex>: Nonlinear slowdown with increasing string length #5452, this does not suffer from the same problem because we only move forward in the checks: If evaluating the start of the assertion body or the regex after the assertion body tells us that we should skip until position x, then we search next for the first possible position the other part of the regex matches starting from x. No position is evaluated more than once for the assertion and once for the remainder of the regex after the assertion, which implies a running time linear in the input.
  • The skip heuristic now also continues evaluation beyond_N_endif nodes. (Nothing relevant happens at such nodes. They don't imply anything about the input string. Their only feature is that all the branches of an_N_if node join here again, but the problematic case is when branching happens, not when branches end.)

Tests

I added some tests for lookahead assertions. The tests useregex_replace() because it performs several searches into a single test call.

There already was test coverage for skipping of word boundaries inVSO_0000000_regex_use'stest_DDB_153116_replacements() function. But the test coverage had a gap: It didn't check for correct behavior if two (or more) non-word characters are next to each other. I closed this gap by adding some spaces to the existing tests.

Benchmark

benchmarkbeforeafterspeedup
bm_lorem_search/"bibe"/229157.329157.31.00
bm_lorem_search/"bibe"/35937559988.80.99
bm_lorem_search/"bibe"/41171881171881.00
bm_lorem_search/"(bibe)"/247711.9474331.01
bm_lorem_search/"(bibe)"/397656.298349.40.99
bm_lorem_search/"(bibe)"/42009111967251.02
bm_lorem_search/"(bibe)+"/26277964174.10.98
bm_lorem_search/"(bibe)+"/31255581255581.00
bm_lorem_search/"(bibe)+"/42511162511161.00
bm_lorem_search/"(?:bibe)+"/251562.551562.51.00
bm_lorem_search/"(?:bibe)+"/31025341025391.00
bm_lorem_search/"(?:bibe)+"/42040412040411.00
bm_lorem_search/R"(\bbibe)"/224585785449.22.88
bm_lorem_search/R"(\bbibe)"/34709481726312.73
bm_lorem_search/R"(\bbibe)"/49412653530212.67
bm_lorem_search/R"(\Bibe)"/22563191995071.28
bm_lorem_search/R"(\Bibe)"/35312504098071.30
bm_lorem_search/R"(\Bibe)"/410458508370541.25
bm_lorem_search/R"((?=….)bibe)"/2452080531258.51
bm_lorem_search/R"((?=….)bibe)"/38893691025348.67
bm_lorem_search/R"((?=….)bibe)"/417593802448497.19
bm_lorem_search/R"((?=bibe)….)"/230692059988.85.12
bm_lorem_search/R"((?=bibe)….)"/359988898349.46.10
bm_lorem_search/R"((?=bibe)….)"/413392901949726.87
bm_lorem_search/R"((?!lorem)bibe)"/245509748828.19.32
bm_lorem_search/R"((?!lorem)bibe)"/390680896256.99.42
bm_lorem_search/R"((?!lorem)bibe)"/418431601949729.45

StephanTLavavej reacted with heart emoji
@muellerj2muellerj2 requested a review froma team as acode ownerJune 7, 2025 22:36
@StephanTLavavejStephanTLavavej added performanceMust go faster regexmeow is a substring of homeowner labelsJun 8, 2025
@StephanTLavavejStephanTLavavej self-assigned thisJun 8, 2025
@StephanTLavavejStephanTLavavej removed their assignmentJun 10, 2025
@StephanTLavavejStephanTLavavej moved this fromInitial Review toReady To Merge inSTL Code ReviewsJun 10, 2025
@StephanTLavavejStephanTLavavej moved this fromReady To Merge toMerging inSTL Code ReviewsJun 11, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Copy link
Member

I resolved a trivial adjacent-add conflict with#5535 in<regex>.

@StephanTLavavejStephanTLavavej merged commita4c5e3f intomicrosoft:mainJun 14, 2025
39 checks passed
@github-project-automationgithub-project-automationbot moved this fromMerging toDone inSTL Code ReviewsJun 14, 2025
@StephanTLavavej
Copy link
Member

Must go faster! 🚗 🦖 😻

@muellerj2muellerj2 deleted the regex-speedup-assertion-searches branchJune 16, 2025 21:31
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@StephanTLavavejStephanTLavavejStephanTLavavej approved these changes

Assignees

No one assigned

Labels

performanceMust go fasterregexmeow is a substring of homeowner

Projects

Archived in project

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

2 participants

@muellerj2@StephanTLavavej

[8]ページ先頭

©2009-2025 Movatter.jp