Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Improve literal-after-loop regex optimization#93190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged

Conversation

@stephentoub
Copy link
Member

Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:

  • It currently gives up if the set loop is wrapped in an atomic and/or a capture.
  • It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.
  • If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.
  • If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.

This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331.

Based on benchmark fromhttps://github.com/BurntSushi/rebar#ruff-noqa:

usingBenchmarkDotNet.Attributes;usingBenchmarkDotNet.Running;usingSystem.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("start","end","Error","StdDev","Job")]publicpartialclassTests{privatestaticreadonlyRegexs_regex=newRegex("(\\s*)((?i:# noqa)(?::\\s?(([A-Z]+[0-9]+(?:[,\\s]+)?)+))?)",RegexOptions.Compiled);privatestaticstrings_haystack=newHttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;[Benchmark]publicintLineByLine(){inttotal=0;foreach(ReadOnlySpan<char>lineins_haystack.AsSpan().EnumerateLines()){total+=s_regex.Count(line);}returntotal;}[Benchmark]publicintAll()=>s_regex.Count(s_haystack);}
MethodToolchainMeanRatio
LineByLine\main\corerun.exe308.76 ms1.00
LineByLine\pr\corerun.exe66.82 ms0.22
All\main\corerun.exe278.16 ms1.00
All\pr\corerun.exe23.11 ms0.08

am11, MihaZupan, and buyaa-n reacted with thumbs up emoji
Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set.  However, it's missing a handful of cases we can easily support:- It currently gives up if the set loop is wrapped in an atomic and/or a capture.- It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.- If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.- If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.This fixes all of those issues, such that the optimization extends to more patterns.
@ghost
Copy link

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info inarea-owners.md if you want to be subscribed.

Issue Details

Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:

  • It currently gives up if the set loop is wrapped in an atomic and/or a capture.
  • It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.
  • If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.
  • If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.

This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331.

Based on benchmark fromhttps://github.com/BurntSushi/rebar#ruff-noqa:

usingBenchmarkDotNet.Attributes;usingBenchmarkDotNet.Running;usingSystem.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("start","end","Error","StdDev","Job")]publicpartialclassTests{privatestaticreadonlyRegexs_regex=newRegex("(\\s*)((?i:# noqa)(?::\\s?(([A-Z]+[0-9]+(?:[,\\s]+)?)+))?)",RegexOptions.Compiled);privatestaticstrings_haystack=newHttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;[Benchmark]publicintLineByLine(){inttotal=0;foreach(ReadOnlySpan<char>lineins_haystack.AsSpan().EnumerateLines()){total+=s_regex.Count(line);}returntotal;}[Benchmark]publicintAll()=>s_regex.Count(s_haystack);}
MethodToolchainMeanRatio
LineByLine\main\corerun.exe308.76 ms1.00
LineByLine\pr\corerun.exe66.82 ms0.22
All\main\corerun.exe278.16 ms1.00
All\pr\corerun.exe23.11 ms0.08
Author:stephentoub
Assignees:-
Labels:

area-System.Text.RegularExpressions,tenet-performance

Milestone:9.0.0

Copy link
Contributor

@buyaa-nbuyaa-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks good at my level

@stephentoubstephentoub merged commit0cd1774 intodotnet:mainOct 19, 2023
@stephentoubstephentoub deleted the improveliteralafterloopopt branchOctober 19, 2023 01:25
@ghostghost locked asresolvedand limited conversation to collaboratorsNov 18, 2023
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.

Reviewers

@danmoseleydanmoseleydanmoseley left review comments

@joperezrjoperezrAwaiting requested review from joperezr

@MihaZupanMihaZupanAwaiting requested review from MihaZupan

+1 more reviewer

@buyaa-nbuyaa-nbuyaa-n approved these changes

Reviewers whose approvals may not affect merge requirements

Assignees

@stephentoubstephentoub

Projects

None yet

Milestone

9.0.0

Development

Successfully merging this pull request may close these issues.

3 participants

@stephentoub@danmoseley@buyaa-n

[8]ページ先頭

©2009-2025 Movatter.jp