- Notifications
You must be signed in to change notification settings - Fork5.2k
Improve literal-after-loop regex optimization#93190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:- It currently gives up if the set loop is wrapped in an atomic and/or a capture.- It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.- If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.- If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.This fixes all of those issues, such that the optimization extends to more patterns.
ghost commentedOct 8, 2023
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsRegex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:
This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331. Based on benchmark fromhttps://github.com/BurntSushi/rebar#ruff-noqa: usingBenchmarkDotNet.Attributes;usingBenchmarkDotNet.Running;usingSystem.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("start","end","Error","StdDev","Job")]publicpartialclassTests{privatestaticreadonlyRegexs_regex=newRegex("(\\s*)((?i:# noqa)(?::\\s?(([A-Z]+[0-9]+(?:[,\\s]+)?)+))?)",RegexOptions.Compiled);privatestaticstrings_haystack=newHttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;[Benchmark]publicintLineByLine(){inttotal=0;foreach(ReadOnlySpan<char>lineins_haystack.AsSpan().EnumerateLines()){total+=s_regex.Count(line);}returntotal;}[Benchmark]publicintAll()=>s_regex.Count(s_haystack);}
|
buyaa-n left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Looks good at my level
src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexFindOptimizationsTests.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.csShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Groups.Tests.csShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.csShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.csShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:
This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331.
Based on benchmark fromhttps://github.com/BurntSushi/rebar#ruff-noqa: