NonBacktracking Regex optimizations #102655
Conversation
ieviev commented May 24, 2024
@dotnet-policy-service agree
ieviev commented May 26, 2024
Resolved review thread: ...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/StateFlags.cs (outdated)
ieviev commented May 28, 2024
I forgot the zero-width non-joiner, so this is not correct; it would otherwise almost double the performance. Making the conditional nullability checks cheaper did help by around 20%, though. One thing I didn't figure out: where should tests for SymbolicRegexMatcher go?
danmoseley commented May 28, 2024
IIRC essentially all the tests for the non-backtracking engine are also run against the other engines, and involve passing the pattern and text in (i.e., they aren't testing the internals directly). Is that what you're looking to do? BTW, is any of this work potentially relevant to the other engines? (I don't own regex, just an observer here.)
ieviev commented May 28, 2024
@danmoseley Perhaps Compiled might benefit from the same alphabet compression and character kind lookup. A source-generated variant of the DFA would be very impressive, though; it'd run circles around all the other engines.
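For context, a minimal sketch of what alphabet compression does (illustrative Python, hypothetical names; the real engine works over symbolic "minterms" of BDD-backed character sets): characters that behave identically under every character class in the pattern share one small class ID, so transition tables can be indexed by class instead of by raw character.

```python
def build_char_classes(char_sets, alphabet_size=128):
    """Map each character to a small equivalence-class ID ("minterm").
    Characters with the same membership pattern across all of the
    pattern's character sets share an ID, so a DFA transition table can
    be |states| x |classes| instead of |states| x 65536."""
    signature_to_id = {}
    classes = {}
    for code in range(alphabet_size):
        ch = chr(code)
        sig = tuple(ch in cs for cs in char_sets)  # membership signature
        if sig not in signature_to_id:
            signature_to_id[sig] = len(signature_to_id)
        classes[ch] = signature_to_id[sig]
    return classes
```

For a pattern using the sets [a-c] and [b-d], this yields four classes: characters in neither set, only the first (a), both (b, c), and only the second (d).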
kasperk81 commented May 28, 2024
ieviev commented May 28, 2024
runtime/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/AttRegexTests.cs, lines 47 to 48 in e8e553b
Running thousands of regex instances with no GC in between, while still maintaining low memory constraints in wasm tests, is not easily solved; maybe some of the high-memory annotations would help.
ieviev commented May 29, 2024
Now the engine has less per-match overhead as well. The ending \Z anchor is a major pain, preventing lots of optimizations. I added a slight indirection to ASCII-only cases to save memory (128 B instead of 65536 B). Other than that, the results look pretty good.
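A sketch of the kind of two-level lookup described above (illustrative Python; the class and method names are made up): a dense 128-entry table covers the ASCII hot path, and everything else takes a slower fallback path.

```python
class CharKindLookup:
    """128-byte ASCII table plus a slow path for non-ASCII, instead of a
    dense 65536-entry table: one extra branch in exchange for ~512x less
    memory."""

    def __init__(self, classify):
        self.classify = classify  # slow classifier used to build the table
        self.ascii_table = bytes(classify(chr(c)) for c in range(128))

    def kind(self, ch):
        code = ord(ch)
        if code < 128:
            return self.ascii_table[code]  # hot path: one array load
        return self.classify(ch)           # rare non-ASCII path
```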
stephentoub commented May 30, 2024
Thanks, @ieviev! I will take a look soon.
ieviev commented Jun 5, 2024
There are still some things to do here.
stephentoub left a comment
Sorry for the delay in getting to this. Thanks very much for submitting it.
A couple of comments/questions...
In our previous conversations, you mentioned you thought inner vectorization would be possible, where we could vectorize within a match rather than just finding the next starting place. I don't see that in this PR. Is that possible? Seems like that'd be an opportunity for some significant wins.
I understand from your comments that these changes brought significant wins on some patterns, in particular those involving non-ASCII, which is great. I'm a bit concerned, though, that when running this on our own perf test suite, I'm seeing regressions in various places. You can find the tests in https://github.com/dotnet/performance/tree/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions. Some of the more concerning ones were \p{Sm} and .{2,4}(Tom, which regressed throughput by ~50%, and Holmes.{0,25}(...).{0,25}Holmes..., which regressed throughput by ~25%. Thoughts?
Thanks!
Resolved review threads:
- src/libraries/System.Text.RegularExpressions/src/System.Text.RegularExpressions.csproj (outdated)
- .../System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/MatchingState.cs (4 threads, mostly outdated)
- ....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs (2 threads, outdated)
- src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/NonBacktrackingTests.cs (outdated)
- ...em.Text.RegularExpressions/tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj (outdated)
- src/libraries/System.Text.RegularExpressions/tests/UnitTests/SymbolicRegexTests.cs (outdated)
ieviev commented Jun 15, 2024
Yes, this is possible. Any pattern that contains .* could be a lot faster with longer matches. It'd be best to start with inner vectorization without anchors; the presence of anchors makes it more complicated and expensive, but I still think it's possible with anchors as well when followed by an anchor-context lookup. It also needs a bit of testing to see where the line between match-time speedup and unwanted compile/construction-time overhead is.
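To illustrate the inner-vectorization idea (a rough Python sketch, not the engine's API): when the DFA sits in a state that loops back to itself on every character except a small exit set, a vectorized scan can jump straight to the next position that can change the state. Here the scan is approximated with str.find, which is C-backed; a real implementation would use SIMD searches such as IndexOfAny.

```python
def skip_self_loop(text, pos, exit_chars):
    """Return the first position >= pos holding a character that leaves
    the self-loop state, or len(text) if none does. Repeated find calls
    stand in for a single vectorized multi-character search."""
    best = len(text)
    for ch in exit_chars:
        i = text.find(ch, pos)
        if i != -1 and i < best:
            best = i
    return best
```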
I'll definitely profile these as well. There is some overhead from the way edge cases are currently handled. \p{Sm} in particular could be made to skip the reversal part entirely, along with other fixed-length patterns. I'll follow up about this once I've done some more testing.
ieviev commented Jun 18, 2024 (edited)
Now I've had a better look at the dotnet/performance benchmarks as well. It turns out I was wrong about the \p{Sm} benchmark. Try running the performance tests again; I think it should be faster across the board now.

I thought about inner vectorization as well, but there don't seem to be many benchmarks containing .* or other star loops, and I think creating a separate implementation of RegexFindOptimizations for the automata engine would be far more beneficial. Derivatives have a very concise and useful way to infer prefixes that works for any pattern, so this would replace a lot of custom logic:

```csharp
// ... prefix start; looping back to this or any visited state is redundant
SymbolicRegexNode currentState = // ...
var mergedPrefix = solver.Empty;
foreach (var minterm in minterms)
{
    var derivative = createDerivative(state, minterm);
    if (isRedundant(derivative)) continue;
    // O(1) bitwise merge of the transitions relevant for the prefix
    mergedPrefix = solver.Or(mergedPrefix, minterm);
}
// ... continue breadth-first search for longer prefixes
return mergedPrefix;
```

There are many optimizations that currently go undetected/unused which could enable AVX in more cases. The AVX and transitions could be combined as well, e.g. when optimizing for a longer prefix. I wonder about replacing the algorithm choices in the match with something like:

```csharp
while (position != end)
{
    // check for success
    // get next state
    position++;
}
```

An optimized DFA will give you a very consistent worst-case throughput of somewhere around 500 MB/s on any pattern, which also accounts for UTF-16 losses on UTF-8 input (why the English benchmarks are slower). AVX on benchmarks is a bit of a gamble: it sometimes helps a lot and sometimes does not, and high match-count overhead can sometimes dominate the benchmark as well. There are massive gains from AVX in some cases, but overall I'm not exactly sure where the line is, as it depends on the input text as well.
The worst-case performance should be very good on both ASCII and Unicode by now; which prefix optimizations to pursue depends on the benchmarks used, e.g. https://github.com/dotnet/runtime-assets/blob/main/src/System.Text.RegularExpressions.TestData/Regex_RealWorldPatterns.json. I'll refactor and clean up the PR and comments for a bit now.
ieviev commented Jun 18, 2024 (edited)
Ah, I left an exception to check whether any pattern has over 255 minterms as well; a pattern from RegexPcreTests did reach it, so a fallback mode is definitely necessary. An int array would be a 256 KB allocation but a couple of times faster than the BDD, so there's a decision there as well. Would pooling the array be an option?
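On the pooling question, a minimal sketch of what renting the large lookup array could look like (illustrative Python with made-up names; in .NET this would presumably be ArrayPool&lt;int&gt;.Shared):

```python
class IntArrayPool:
    """Rent/return large lookup arrays so each matcher construction does
    not allocate a fresh ~256 KB array."""

    def __init__(self, size=65536):
        self.size = size
        self._free = []

    def rent(self):
        # reuse a returned array if available, otherwise allocate one
        return self._free.pop() if self._free else [0] * self.size

    def give_back(self, arr):
        self._free.append(arr)
```

The trade-off is the usual one for pools: rented arrays may contain stale data and must be fully overwritten or cleared by the renter.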
Simplification, style consistency, dead code deletion, some bounds-check removal, etc.
stephentoub left a comment
I pushed a commit with a last round of changes.
Otherwise, LGTM.
Thanks!
stephentoub commented Jul 11, 2024
I've got another round of cleanup/simplification (some of the code in this PR, some pre-existing) that I'm working on and will submit as a separate PR.
veanes commented Jul 11, 2024
I will be happy to provide comments or feedback if needed.
@stephentoub @veanes
Here's the initial list of changes I'd make to the nonbacktracking regex engine. I haven't yet documented much, because I made some design choices that I need to discuss.
Rebar baseline before changes: [benchmark image omitted]
After changes: [benchmark image omitted]
Update: after more optimization: [benchmark image omitted]
Essentially, I fixed the things that hurt performance the most; there are many wins of 300% or more already.
Some fundamental design changes:
There's low-hanging fruit in the hot loop as well. I removed the IsDeadend flag entirely, since there is just one dead-end state, whose state ID I stored instead. The nullability and skipping checks got dedicated short-circuit lookups, since having a nullable position at all (a valid match) is usually rare; something like beq or bgt could be used as well. Some if/else branches in the hot loop should be removed and resolved somewhere above.
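A schematic of a hot loop with the dead-end check reduced to a single integer comparison (hypothetical Python sketch; the real matcher steps over minterm IDs rather than raw characters, and its nullability check is context-dependent):

```python
def match_loop(text, pos, transitions, state, dead_state, nullable_states):
    """Step a DFA one character at a time. A single stored dead-state ID
    replaces a per-state IsDeadend flag load, and the (rare) nullability
    check is a set lookup that could equally be a short-circuit table."""
    last_match = -1
    while pos < len(text):
        if state in nullable_states:  # rare: record a match position
            last_match = pos
        # a missing edge transitions to the dead state
        state = transitions[state].get(text[pos], dead_state)
        if state == dead_state:       # one compare, no flag lookup
            break
        pos += 1
    return last_match
```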
There's a lot of performance still on the table when compared to resharp; dedicated DFA prefilter optimizations could help a lot as well.
Let me know what you think of the changes. I think the nonbacktracking engine could be made a lot faster than Compiled in most scenarios, and the symbolic-automata benefits should be used a lot more aggressively.