- Notifications
You must be signed in to change notification settings - Fork5.1k
Improved handling of Antimirov mode in Nonbacktracking Regex engine#65637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
ghost commentedFeb 20, 2022
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsRefactored how current state is represented by encapsulating it in a new struct The update should improve Antimorov mode performance, but was also useful for code cleanup in general. Moved several methods from Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to cover Some code of TODO: Getting rid of
|
Why is it called |
The naming |
The failing tests are unrelated to the changes in this pr (or so it looks to me). |
The algorithm names and the distinction seems like the sort of thing that would live happily in a complex comment that explains the difference for users. The variable names being |
Point taken, we will probably end up doing a separate phase of renamings addressing the terminology issue in general, as a separate PR for this, e.g., renaming |
@veanes, I rebased your commits on main and fix up all the merge conflicts. I also added a commit that cleans up a few things, including avoiding the ToArray call in the NFA mode on every transition. Before doing further work locally in your branch, you'll want to sync to these changes, which you can do with:
If you've already done further work locally,don't do the above until you've made a backup of that work, e.g. by creating a new branch from it:
then doing the earlier commands, and then cherry-picking your last commit. |
FYI, in its current state this ends up really regressing perf. I see a few obvious things I'll fix... |
Thanks a lot for the rebase! I don't have any pending work other than a unit test case I was debugging the watchdog issue. I'll just stash this change and apply the first option above. I think perf could be addressed with |
That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.
There are a variety of TODOs in the code, e.g. for OrderedOr. What's the plan for those? We're at a point where I'd like to avoid checking in lots of new open issues. If they're of the form "we could consider using XYZ to improve perf", that's fine, and we should have an issue tracking it. But if they're of the form "I'm not sure if this is functionally correct; we should investigate whether we need to be doing something different here", I'd like to address that as part of the change. |
The plan for |
Regarding the continuous creation of |
stephentoub commentedFeb 22, 2022 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I pushed a commit to fixsome of the perf regressions. In particular, the change was allocating multiple lists/sets every time a new CurrentSet was created, even when in DFA mode, which was causing allocations to skyrocket and having a very measurable impact on throughput. Even with that, though, there are some substantial perf regressions. Here's what the "sherlock" perf tests:
I don't think any of those are exercising the NFA mode (?), so this is a hit to the DFA perf. |
It happens multiple times per match operation. |
I am puzzled by this, as |
Unless you are editing this right now I would suggest a fix this by passing the ref to |
I'm fixing it right now. |
That will still create the lists/sets once per match operation. The fix I have here creates them once per runner, allowing those lists/sets to be reused across matches, which is something we'll want to maintain. If we can go further and pass the current state by ref into those helpers instead of also passing in PerThreadState and creating a new CurrentState in each of those, great. But I'm not clear how that relates to the argument being passed in to CurrentState in each of those locations... it's already going to be in the right state? I'm not currently editing this (and don't plan to this evening). Please feel free to pull it down and tweak it according to your recommendation. Thanks. |
Well, I'll fix this and push the fix later this evening. |
Oh,, I forgot to pull your edits@stephentoub .. so now I'm getting just collisions, we were doing similar things in slightly different ways ... essentially perThreadData vs passing the ref to CurrentState (that is per thread obviously due to being created in the method call), not sure what the best option is here exactly, but I think using CurrentState as it was intended to (i.e., fixing my bug, just creating it once and passing it by ref is still the right thing to do). At the same time, I think that the Boolean |
I merged the conflicts, kept both edits, so passing the |
I don't understand how _antimirov is supposed to work. The code is flipping it back to false for phase 2/3; doesn't that mean the next match performed will start off creating new nodes as DFA even if it already exceeded the threshold? Why do we need this flag at all? e.g. why not just decide when lazily creating a new transition whether it should be DFA or NFA, based on the current size of the graph (or based on whether we've initialized the NFA state lists/sets), rather than based on a persisted bool? |
I've got the regression mostly addressed. But I'm trying to figure out now why the we're never ending up with antimirov set to true. I don't think it's from my changes, so maybe from something in Margus' initial changes here. I'll keep looking at it and push a new commit soon. |
Here's my idea fleshed out in code:https://github.com/olsaarik/runtime/blob/antimirov-fix/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L797-L900 . One change that seemed necessary was to not pass in the |
What I have in my local changes is a CurrentState struct that's just a discriminated union of DfaMatchingState and an NfaMatchingState. We pass that around by ref, and we do the same static interface constraint we currently do in main, with DfaStateHandler and NfaStateHandler types implementing those interfaces, which operate over the ref CurrentState passed in. |
I think I like the sound of your solution better. The recursive call into the same function with a different type that I had to do is a bit surprising. |
Funny, I was thinking the same about yours. :-) I'll experiment with a hybrid. |
By the way in my solution I think the check for antimirov should actually happen at the start of the outer loop rather than at the end to avoid new match calls temporarily being in DFA mode when they should be in NFA already. |
Thanks for helping out with this this! |
So here's a fun "alternative" approach:https://github.com/olsaarik/runtime/blob/0d51dc4174e0ef726ce6f75734af79719a5ef04f/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L866-L878 Not sure this is a serious proposal, but I love using exceptions for control flow :D |
a6fd2c5
to61805ba
Comparestephentoub commentedFeb 24, 2022 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
@olsaarik,@veanes,@joperezr, I pushed a new set of changes. Please take a look. From the commit description:
I still see some variance in the perf tests, but nothing like before. Most of the tests are now +/- 10% from where they were before. A few tests get better by upwards of 20% while a few get worse by a similar margin. There's also still room for microoptimization on the fast path, but I'd like to look at that separately. |
d829db0
to0c63092
Compare....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.csShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
I wanted to add that I went through all the new changes involving use of |
I was actually thinking of the List of states in the NFA transition function ... Still having a single state in the set itself may also occur, as the state becomes "deteriministic" temporarily, and could continue to be so for while, this obviously depends on a combination of circumstances triggered by a particular input that makes the behavior deterministic for that particular case (e.g. returning to the "initial" state that just loops around in it, if the initial state is special cased and not split into a set --- an optimization we discussed with@olsaarik ) |
veanes commentedFeb 25, 2022 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Concretely, one optimization would be in
This would allow the top-level loop to know that we are still in initial state and take advantage of the related optimizations used in DFA mode (initial prefix search etc) Obviously, this calls for a proper NFA mode set of benchmarks we currently don't have. |
Thanks. Let's look at that subsequently. I didn't apply it for now as that would also require changing the main loop that assumes NFA states are never initial and thus only looks to see whether the current state is a DFA initial state. |
Sounds good, just wanted to point this out, to keep this in mind. |
…changesThis does a few things:- The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states. Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode. This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing. The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.- Overhauled how we switch from DFA mode to NFA mode. Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode. We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.- Overhauled how we handle the outer/inner loop logic. Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size. Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.- Avoids inlining huge amounts of NFA code that can result in worse DFA code- Renamed Brzozowski and Antimirov to DFA and NFA.- Added a lot of comments to SymbolicRegexMatcher- Fixes "watchdog" issue with captures- A few other renames / cleanups.
Uh oh!
There was an error while loading.Please reload this page.
Refactored how current state is represented by encapsulating it in a new struct
CurrentState
that is passed by reference in theDelta
andTakeTransition
methods and hides whether the state isDfaMatchingState
or a set of NFA states that have now their own transition function_antimirovDelta
maintained by theSymbolicRegexBuilder
. Use ofCurrentState
should not incur any overhead over howDfaMatchingState
was used before. Further improvements in the implementation ofCurrentState
might be possible usingSparseIntMap
, currentlyList
andHashSet
are being used.The update should improve Antimorov mode performance, but was also useful for code cleanup in general.
Moved several methods from
SymbolicRegexMatcher
toSymbolicRegexBuilder
where they belong more naturally. RemovedITransition
interface and the associated Brzozowski and Antimirov structs as they are no longer needed.Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to cover
Watchdog
, that has become dormant for seemingly similar reasons.)Some code of
CreateNewNfaTransition
concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduceOr
any more in many cases but ratherOrderedOr
so the important caseif (coretarget.Node.Kind == SymbolicRegexKind.Or)
inCreateNewNfaTransition
is currently never triggered and all
OrderedOr
nodes are treated as single states thus sidestepping the Antimirov mode.TODO: Getting rid of
Or
, or rather renamingOrderedOr
toOr
(and getting rid of the unordered semantics, the current mixture ofOr
andOrderedOr
is a temporary headache). This will also make theSymbolicRegexNode
AST more lightweightFixes#60918