NotificationsYou must be signed in to change notification settings
Fork5.1k
Star16.6k

Improved handling of Antimirov mode in Nonbacktracking Regex engine#65637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

stephentoub merged 2 commits intodotnet:mainfromveanes:antimirov

Feb 26, 2022

Merged

Improved handling of Antimirov mode in Nonbacktracking Regex engine#65637

stephentoub merged 2 commits intodotnet:mainfromveanes:antimirov

Feb 26, 2022

Conversation

Copy link

Contributor

veanes commentedFeb 20, 2022•
edited by stephentoub
Loading

Refactored how current state is represented by encapsulating it in a new structCurrentState that is passed by reference in theDelta andTakeTransition methods and hides whether the state isDfaMatchingState or a set of NFA states that have now their own transition function_antimirovDelta maintained by theSymbolicRegexBuilder. Use ofCurrentState should not incur any overhead over howDfaMatchingState was used before. Further improvements in the implementation ofCurrentState might be possible usingSparseIntMap, currentlyList andHashSet are being used.

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Moved several methods fromSymbolicRegexMatcher toSymbolicRegexBuilder where they belong more naturally. RemovedITransition interface and the associated Brzozowski and Antimirov structs as they are no longer needed.

Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to coverWatchdog, that has become dormant for seemingly similar reasons.)

Some code ofCreateNewNfaTransition concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduceOr any more in many cases but ratherOrderedOr so the important case
if (coretarget.Node.Kind == SymbolicRegexKind.Or) inCreateNewNfaTransition
is currently never triggered and allOrderedOr nodes are treated as single states thus sidestepping the Antimirov mode.

TODO: Getting rid ofOr, or rather renamingOrderedOr toOr (and getting rid of the unordered semantics, the current mixture ofOr andOrderedOr is a temporary headache). This will also make theSymbolicRegexNode AST more lightweight

Fixes#60918

veanes requested review fromstephentoub,danmoseley,joperezr andolsaarik

February 20, 2022 21:20

ghost added the area-System.Text.RegularExpressions label

Feb 20, 2022

ghost assignedveanes

Feb 20, 2022

Copy link

ghost commentedFeb 20, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info inarea-owners.md if you want to be subscribed.

Issue Details

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Author:	veanes
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

Copy link

Contributor

Wraith2 commentedFeb 20, 2022

Why is it called_antimirovDelta and is there something more generic it could be called?

Copy link

ContributorAuthor

veanes commentedFeb 21, 2022

The naming_antimirovDelta,_antimirovStatearray,_antimirovStatearrayInverse as well as the boolen flag _antimirov reflect that these mirror the values of NFA in _antimirov mode (NFA mode) as opposed to the default DFA (_delta and_statearray) in Brzozowski mode. The names could also be _nfaDelta, _nfaStatearray, and _nfaStatearrayInverse, but the names reflect that the intent that the concerned algorithms use Antimirov-style derivatives (leading to NFA) vs Brzozowski-style derivatives (leading to DFA). We may end up renaming these names as some later point, not sure. In general the namedelta is often used for atransition function of an automaton. So long story short, the name_antimirovDelta is really_nfaTransitionFunction.

Copy link

ContributorAuthor

veanes commentedFeb 21, 2022

The failing tests are unrelated to the changes in this pr (or so it looks to me).

Copy link

Contributor

Wraith2 commentedFeb 21, 2022

The algorithm names and the distinction seems like the sort of thing that would live happily in a complex comment that explains the difference for users. The variable names beingnfa ordfa would be both clear and more approachable than using the algorithm authors' names to me.

Copy link

ContributorAuthor

veanes commentedFeb 21, 2022

Point taken, we will probably end up doing a separate phase of renamings addressing the terminology issue in general, as a separate PR for this, e.g., renaming_antimirov to_nfaMode, etc., there are other similar cases in the code. We have been discussing this on and off for a while.

veanes mentioned this pull request

Feb 21, 2022

Optimize symbolic regex Antimirov mode to avoid so much allocation#60918

Closed

stephentoub force-pushed theantimirov branch from9aa8771 to2426b0bCompare

February 22, 2022 00:51

Copy link

Member

stephentoub commentedFeb 22, 2022

@veanes, I rebased your commits on main and fix up all the merge conflicts. I also added a commit that cleans up a few things, including avoiding the ToArray call in the NFA mode on every transition.

Before doing further work locally in your branch, you'll want to sync to these changes, which you can do with:

git fetch --all
git reset --hard origin/antimirov

If you've already done further work locally,don't do the above until you've made a backup of that work, e.g. by creating a new branch from it:

git checkout -b antimirovtemp
git commit -a -m "fix todos"

then doing the earlier commands, and then cherry-picking your last commit.

Copy link

Member

stephentoub commentedFeb 22, 2022

FYI, in its current state this ends up really regressing perf. I see a few obvious things I'll fix...

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

Thanks a lot for the rebase! I don't have any pending work other than a unit test case I was debugging the watchdog issue. I'll just stash this change and apply the first option above. I think perf could be addressed withSparseIntMap (@olsaarik know best how to apply it)

Copy link

Member

stephentoub commentedFeb 22, 2022

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I don't have any pending work

There are a variety of TODOs in the code, e.g. for OrderedOr. What's the plan for those? We're at a point where I'd like to avoid checking in lots of new open issues. If they're of the form "we could consider using XYZ to improve perf", that's fine, and we should have an issue tracking it. But if they're of the form "I'm not sure if this is functionally correct; we should investigate whether we need to be doing something different here", I'd like to address that as part of the change.

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

The plan forOrderedOr:
We get rid ofOr and only haveOrderedOr. The unorderedOr case could be realized by ordering the elements in theOrderedOr collection by using hashcodes (if this would be needed, but I think not, since we want to now follow the exact (or best effort) same semantics for matches as in backtracking, in which caseOr needs to be ordered always). In particular this would essentially get rid ofSymbolicRegexSet used for_alts (conjunction case can be added back later as a form ofAnd-collection using binary representation as inOrderedOr but not relevant now)

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

Regarding the continuous creation ofCurrentState, this happens only when a shift happens from DFA to NFA mode, the intent was that this should not happen very often, but I may have been wrong here (which is why I did not think allowingnull initializing them tonull would make much difference)

Copy link

Member

stephentoub commentedFeb 22, 2022•
edited
Loading

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I pushed a commit to fixsome of the perf regressions. In particular, the change was allocating multiple lists/sets every time a new CurrentSet was created, even when in DFA mode, which was causing allocations to skyrocket and having a very measurable impact on throughput.

Even with that, though, there are some substantial perf regressions. Here's what the "sherlock" perf tests:
https://github.com/dotnet/performance/blob/996fb443bf081a976a2f92a0fc14d5f1c212f6f7/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L127-L169
look like for me when I run them locally:

Method	Toolchain	Pattern	Options	Mean	Ratio
Count	\main\corerun.exe	(?i)Holmes	NonBacktracking	659.04 us	1.00
Count	\pr\corerun.exe	(?i)Holmes	NonBacktracking	678.86 us	1.03

Count	\main\corerun.exe	(?i)Sher[a-z]+\|Hol[a-z]+	NonBacktracking	1,215.72 us	1.00
Count	\pr\corerun.exe	(?i)Sher[a-z]+\|Hol[a-z]+	NonBacktracking	1,252.77 us	1.03

Count	\main\corerun.exe	(?i)Sherlock	NonBacktracking	141.34 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock	NonBacktracking	154.75 us	1.10

Count	\main\corerun.exe	(?i)Sherlock Holmes	NonBacktracking	149.50 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock Holmes	NonBacktracking	161.83 us	1.08

Count	\main\corerun.exe	(?i)Sherlock\|Holmes\|Watson	NonBacktracking	3,546.32 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock\|Holmes\|Watson	NonBacktracking	3,690.29 us	1.04

Count	\main\corerun.exe	(?i)Sherlock\|(...)er\|John\|Baker [49]	NonBacktracking	4,896.03 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock\|(...)er\|John\|Baker [49]	NonBacktracking	5,061.19 us	1.03

Count	\main\corerun.exe	(?i)the	NonBacktracking	1,534.91 us	1.00
Count	\pr\corerun.exe	(?i)the	NonBacktracking	1,894.79 us	1.24

Count	\main\corerun.exe	(?m)^Sherlock(...)rlock Holmes$ [37]	NonBacktracking	66.29 us	1.00
Count	\pr\corerun.exe	(?m)^Sherlock(...)rlock Holmes$ [37]	NonBacktracking	72.95 us	1.10

Count	\main\corerun.exe	(?s).*	NonBacktracking	3,287.32 us	1.00
Count	\pr\corerun.exe	(?s).*	NonBacktracking	4,450.44 us	1.35

Count	\main\corerun.exe	.*	NonBacktracking	4,374.97 us	1.00
Count	\pr\corerun.exe	.*	NonBacktracking	6,046.40 us	1.38

Count	\main\corerun.exe	Holmes	NonBacktracking	105.21 us	1.00
Count	\pr\corerun.exe	Holmes	NonBacktracking	135.95 us	1.29

Count	\main\corerun.exe	Holmes.{0,25}(...).{0,25}Holmes [39]	NonBacktracking	187.00 us	1.00
Count	\pr\corerun.exe	Holmes.{0,25}(...).{0,25}Holmes [39]	NonBacktracking	241.87 us	1.29

Count	\main\corerun.exe	Sher[a-z]+\|Hol[a-z]+	NonBacktracking	187.05 us	1.00
Count	\pr\corerun.exe	Sher[a-z]+\|Hol[a-z]+	NonBacktracking	235.15 us	1.26

Count	\main\corerun.exe	Sherlock	NonBacktracking	56.30 us	1.00
Count	\pr\corerun.exe	Sherlock	NonBacktracking	65.92 us	1.17

Count	\main\corerun.exe	Sherlock Holmes	NonBacktracking	65.58 us	1.00
Count	\pr\corerun.exe	Sherlock Holmes	NonBacktracking	76.54 us	1.17

Count	\main\corerun.exe	Sherlock\s+Holmes	NonBacktracking	79.20 us	1.00
Count	\pr\corerun.exe	Sherlock\s+Holmes	NonBacktracking	93.03 us	1.18

Count	\main\corerun.exe	Sherlock\|Holmes	NonBacktracking	191.97 us	1.00
Count	\pr\corerun.exe	Sherlock\|Holmes	NonBacktracking	250.97 us	1.31

Count	\main\corerun.exe	Sherlock\|Holmes\|Watson	NonBacktracking	250.09 us	1.00
Count	\pr\corerun.exe	Sherlock\|Holmes\|Watson	NonBacktracking	315.37 us	1.26

Count	\main\corerun.exe	Sherlock\|Holm(...)er\|John\|Baker [45]	NonBacktracking	2,949.07 us	1.00
Count	\pr\corerun.exe	Sherlock\|Holm(...)er\|John\|Baker [45]	NonBacktracking	3,011.03 us	1.02

Count	\main\corerun.exe	Sherlock\|Street	NonBacktracking	89.18 us	1.00
Count	\pr\corerun.exe	Sherlock\|Street	NonBacktracking	108.13 us	1.21

Count	\main\corerun.exe	The	NonBacktracking	128.29 us	1.00
Count	\pr\corerun.exe	The	NonBacktracking	151.60 us	1.20

Count	\main\corerun.exe	[^\n]*	NonBacktracking	4,304.93 us	1.00
Count	\pr\corerun.exe	[^\n]*	NonBacktracking	6,143.43 us	1.43

Count	\main\corerun.exe	[a-q][^u-z]{13}x	NonBacktracking	103.27 us	1.00
Count	\pr\corerun.exe	[a-q][^u-z]{13}x	NonBacktracking	119.98 us	1.16

Count	\main\corerun.exe	[a-zA-Z]+ing	NonBacktracking	6,801.28 us	1.00
Count	\pr\corerun.exe	[a-zA-Z]+ing	NonBacktracking	9,346.86 us	1.37

Count	\main\corerun.exe	\b\w+n\b	NonBacktracking	8,425.09 us	1.00
Count	\pr\corerun.exe	\b\w+n\b	NonBacktracking	11,394.81 us	1.36

Count	\main\corerun.exe	\p{Ll}	NonBacktracking	29,176.29 us	1.00
Count	\pr\corerun.exe	\p{Ll}	NonBacktracking	44,094.94 us	1.51

Count	\main\corerun.exe	\p{Lu}	NonBacktracking	1,909.84 us	1.00
Count	\pr\corerun.exe	\p{Lu}	NonBacktracking	2,445.12 us	1.28

Count	\main\corerun.exe	\p{L}	NonBacktracking	30,398.30 us	1.00
Count	\pr\corerun.exe	\p{L}	NonBacktracking	45,196.70 us	1.49

Count	\main\corerun.exe	\s[a-zA-Z]{0,12}ing\s	NonBacktracking	4,454.26 us	1.00
Count	\pr\corerun.exe	\s[a-zA-Z]{0,12}ing\s	NonBacktracking	6,538.71 us	1.47

Count	\main\corerun.exe	\w+	NonBacktracking	13,048.35 us	1.00
Count	\pr\corerun.exe	\w+	NonBacktracking	19,288.41 us	1.48

Count	\main\corerun.exe	\w+\s+Holmes	NonBacktracking	3,883.77 us	1.00
Count	\pr\corerun.exe	\w+\s+Holmes	NonBacktracking	6,294.95 us	1.61

Count	\main\corerun.exe	\w+\s+Holmes\s+\w+	NonBacktracking	3,823.51 us	1.00
Count	\pr\corerun.exe	\w+\s+Holmes\s+\w+	NonBacktracking	6,066.32 us	1.59

Count	\main\corerun.exe	aei	NonBacktracking	51.11 us	1.00
Count	\pr\corerun.exe	aei	NonBacktracking	51.54 us	1.01

Count	\main\corerun.exe	aqj	NonBacktracking	39.11 us	1.00
Count	\pr\corerun.exe	aqj	NonBacktracking	38.40 us	0.99

Count	\main\corerun.exe	the	NonBacktracking	824.52 us	1.00
Count	\pr\corerun.exe	the	NonBacktracking	1,160.05 us	1.41

Count	\main\corerun.exe	the\s+\w+	NonBacktracking	1,320.44 us	1.00
Count	\pr\corerun.exe	the\s+\w+	NonBacktracking	1,879.80 us	1.42

Count	\main\corerun.exe	zqj	NonBacktracking	37.05 us	1.00
Count	\pr\corerun.exe	zqj	NonBacktracking	37.06 us	1.00

I don't think any of those are exercising the NFA mode (?), so this is a hit to the DFA perf.

Copy link

Member

stephentoub commentedFeb 22, 2022

Regarding the continuous creation of CurrentState, this happens only when a shift happens from DFA to NFA mode

It happens multiple times per match operation.

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

I am puzzled by this, asCurrentState is passed by reference inDelta andTakeTransition and should only be createdonce (unless NFA mode kicks in, at least this was the intent).
Ohhh, I see, the issue is that it should be reused when new initial (DfaMatchingState) is created, as opposed to be recreated inFindEndPosition,FindStartPosition etc.CurrentState should be passed into those methods asparameter. -- my bad.

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

Unless you are editing this right now I would suggest a fix this by passing the ref toCurrentState inFindEndPosition andFindStartPosition andFindFinalStatePosition and creating it inFindMatchonce.

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

I'm fixing it right now.

Copy link

Member

stephentoub commentedFeb 22, 2022

Unless you are editing this right now I would suggest a fix this by passing the ref to CurrentState in FindEndPosition and FindStartPosition and FindFinalStatePosition and creating it in FindMatch once.

That will still create the lists/sets once per match operation. The fix I have here creates them once per runner, allowing those lists/sets to be reused across matches, which is something we'll want to maintain.

If we can go further and pass the current state by ref into those helpers instead of also passing in PerThreadState and creating a new CurrentState in each of those, great. But I'm not clear how that relates to the argument being passed in to CurrentState in each of those locations... it's already going to be in the right state?

I'm not currently editing this (and don't plan to this evening). Please feel free to pull it down and tweak it according to your recommendation.

Thanks.

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

CurrentState is really a ref for holding the actual state, when passing it in it initially it will passed say as the first argument to
FindFinalStatePosition,

s =new CurrentState<TSetType>(_dotstarredInitialStates[GetCharKind(input, i - 1)])FindFinalStatePosition(s, ... old arguments)

Well, I'll fix this and push the fix later this evening.

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

Oh,, I forgot to pull your edits@stephentoub .. so now I'm getting just collisions, we were doing similar things in slightly different ways ... essentially perThreadData vs passing the ref to CurrentState (that is per thread obviously due to being created in the method call), not sure what the best option is here exactly, but I think using CurrentState as it was intended to (i.e., fixing my bug, just creating it once and passing it by ref is still the right thing to do). At the same time, I think that the Boolean_antimirov should be per thread data also, currently it is not.

Copy link

ContributorAuthor

veanes commentedFeb 22, 2022

I merged the conflicts, kept both edits, so passing theperThreadData, I think_antimirov should be a flag inperThreadData (currently it is not, it is in the builder).

Copy link

Member

stephentoub commentedFeb 22, 2022

I think _antimirov should be a flag in perThreadData (currently it is not, it is in the builder).

I don't understand how _antimirov is supposed to work. The code is flipping it back to false for phase 2/3; doesn't that mean the next match performed will start off creating new nodes as DFA even if it already exceeded the threshold?

Why do we need this flag at all? e.g. why not just decide when lazily creating a new transition whether it should be DFA or NFA, based on the current size of the graph (or based on whether we've initialized the NFA state lists/sets), rather than based on a persisted bool?

Copy link

Member

stephentoub commentedFeb 23, 2022

I've got the regression mostly addressed. But I'm trying to figure out now why the we're never ending up with antimirov set to true. I don't think it's from my changes, so maybe from something in Margus' initial changes here. I'll keep looking at it and push a new commit soon.

Copy link

Contributor

olsaarik commentedFeb 23, 2022

Here's my idea fleshed out in code:https://github.com/olsaarik/runtime/blob/antimirov-fix/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L797-L900 . One change that seemed necessary was to not pass in theICurrentState asref, since an assignment to that at the end would necessarily do the boxing (although only once per match, so might be fine).

Copy link

Member

stephentoub commentedFeb 23, 2022

What I have in my local changes is a CurrentState struct that's just a discriminated union of DfaMatchingState and an NfaMatchingState. We pass that around by ref, and we do the same static interface constraint we currently do in main, with DfaStateHandler and NfaStateHandler types implementing those interfaces, which operate over the ref CurrentState passed in.

Copy link

Contributor

olsaarik commentedFeb 23, 2022

I think I like the sound of your solution better. The recursive call into the same function with a different type that I had to do is a bit surprising.

Copy link

Member

stephentoub commentedFeb 23, 2022

Funny, I was thinking the same about yours. :-) I'll experiment with a hybrid.

Copy link

Contributor

olsaarik commentedFeb 23, 2022

By the way in my solution I think the check for antimirov should actually happen at the start of the outer loop rather than at the end to avoid new match calls temporarily being in DFA mode when they should be in NFA already.

Copy link

ContributorAuthor

veanes commentedFeb 23, 2022

Thanks for helping out with this this!
Looks like there is definite progress.
I'm busy today with a presentation I have to give tomorrow so had to switch context.

Copy link

Contributor

olsaarik commentedFeb 23, 2022

So here's a fun "alternative" approach:https://github.com/olsaarik/runtime/blob/0d51dc4174e0ef726ce6f75734af79719a5ef04f/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L866-L878DfaMatchingState.Next throws thatNoMoreDfaException when it would create a new state over the antimirov bound and then the whole thing switches over to NFA mode. It only happens once per match call and due to the "zero cost" nature of exceptions shouldn't slow down DFA mode at all.

Not sure this is a serious proposal, but I love using exceptions for control flow :D

stephentoub force-pushed theantimirov branch 2 times, most recently froma6fd2c5 to61805baCompare

February 24, 2022 03:55

Copy link

Member

stephentoub commentedFeb 24, 2022•
edited
Loading

@olsaarik,@veanes,@joperezr, I pushed a new set of changes. Please take a look. From the commit description:

Overhaul SymbolicRegexMatcher to avoid DFA perf regressions from NFA changes
This does a few things:
The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states. Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode. This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing. The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.
Overhauled how we switch from DFA mode to NFA mode. Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode. We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.
Overhauled how we handle the outer/inner loop logic. Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size. Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.
Avoids inlining huge amounts of NFA code that can result in worse DFA code
Renamed Brzozowski and Antimirov to DFA and NFA.
Added a lot of comments to SymbolicRegexMatcher
Fixes "watchdog" issue with captures
A few other renames / cleanups.

I still see some variance in the perf tests, but nothing like before. Most of the tests are now +/- 10% from where they were before. A few tests get better by upwards of 20% while a few get worse by a similar margin. There's also still room for microoptimization on the fast path, but I'd like to look at that separately.

stephentoub force-pushed theantimirov branch 2 times, most recently fromd829db0 to0c63092Compare

February 25, 2022 03:34

olsaarik reviewed

Feb 25, 2022

View reviewed changes

....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs OutdatedShow resolvedHide resolved

....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.csShow resolvedHide resolved

Copy link

ContributorAuthor

veanes commentedFeb 25, 2022

I wanted to add that I went through all the new changes involving use ofIStateHandler. It looks really clean to me and I'm happy to see the use ofSparseIntMap. Definitely looks very solid and clear, I'm really happy this is converging. I do not have very strong viewpoint on the special case of one state. But I know one thingfor sure. Having single state in NFA transitions is very common (e.g. the current test cases are like that, I debugged and checked), most if not all nondeterminism isbecause of the set not because of the NFA state also possibly being nondeterministic. So special casing could make sense if there is even a small gain.

Copy link

ContributorAuthor

veanes commentedFeb 25, 2022

I was actually thinking of the List of states in the NFA transition function ... Still having a single state in the set itself may also occur, as the state becomes "deteriministic" temporarily, and could continue to be so for while, this obviously depends on a combination of circumstances triggered by a particular input that makes the behavior deterministic for that particular case (e.g. returning to the "initial" state that just loops around in it, if the initial state is special cased and not split into a set --- an optimization we discussed with@olsaarik )

Copy link

ContributorAuthor

veanes commentedFeb 25, 2022•
edited
Loading

Concretely, one optimization would be inNfaStateHandler:
public static bool IsInitialState(ref CurrentState state) =>

            if  state.NfaState!.NfaStateSet.Count == 1 then             builder.GetCoreState(nfaState.NfaState!.NfaStateSet[0].IsInitialState)            else false

This would allow the top-level loop to know that we are still in initial state and take advantage of the related optimizations used in DFA mode (initial prefix search etc)

Obviously, this calls for a proper NFA mode set of benchmarks we currently don't have.

stephentoub force-pushed theantimirov branch from0c63092 to9c97d46Compare

February 25, 2022 21:34

Copy link

Member

stephentoub commentedFeb 25, 2022

Concretely, one optimization would be in NfaStateHandler

Thanks. Let's look at that subsequently. I didn't apply it for now as that would also require changing the main loop that assumes NFA states are never initial and thus only looks to see whether the current state is a DFA initial state.

Copy link

ContributorAuthor

veanes commentedFeb 25, 2022

Sounds good, just wanted to point this out, to keep this in mind.

olsaarik approved these changes

Feb 25, 2022

View reviewed changes

veanesand others added2 commits

February 25, 2022 19:38

Improvements to Antimirov mode and misc small formatting edits

770d300

Overhaul SymbolicRegexMatcher to avoid DFA perf regressions from NFA …

b086223

…changesThis does a few things:- The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states.  Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode.  This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing.  The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.- Overhauled how we switch from DFA mode to NFA mode.  Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode.  We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.- Overhauled how we handle the outer/inner loop logic.  Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size.  Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.- Avoids inlining huge amounts of NFA code that can result in worse DFA code- Renamed Brzozowski and Antimirov to DFA and NFA.- Added a lot of comments to SymbolicRegexMatcher- Fixes "watchdog" issue with captures- A few other renames / cleanups.

stephentoub force-pushed theantimirov branch from9c97d46 tob086223Compare

February 26, 2022 00:38

stephentoub merged commit83e5cbc intodotnet:main

Feb 26, 2022

This was referencedMar 1, 2022

Regressions in System.Text.RegularExpressions.Tests#66009

Closed

[Perf] Changes at 2/26/2022 2:38:27 AMdotnet/perf-autofiling-issues#3799

Closed

[Perf] Changes at 2/26/2022 2:38:27 AMdotnet/perf-autofiling-issues#3809

Closed

[Perf] Changes at 2/26/2022 10:27:55 AMdotnet/perf-autofiling-issues#3810

Closed

ghost locked asresolvedand limited conversation to collaborators

Mar 28, 2022

Labels

area-System.Text.RegularExpressions

4 participants

Movatterモバイル変換

Improved handling of Antimirov mode in Nonbacktracking Regex engine#65637

Improved handling of Antimirov mode in Nonbacktracking Regex engine#65637

Uh oh!

Conversation

veanes commentedFeb 20, 2022• edited by stephentoubLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

ghost commentedFeb 20, 2022

Uh oh!

Wraith2 commentedFeb 20, 2022

Uh oh!

veanes commentedFeb 21, 2022

Uh oh!

veanes commentedFeb 21, 2022

Uh oh!

Wraith2 commentedFeb 21, 2022

Uh oh!

veanes commentedFeb 21, 2022

Uh oh!

stephentoub commentedFeb 22, 2022

Uh oh!

stephentoub commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

stephentoub commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

stephentoub commentedFeb 22, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

stephentoub commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

stephentoub commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

veanes commentedFeb 22, 2022

Uh oh!

stephentoub commentedFeb 22, 2022

Uh oh!

stephentoub commentedFeb 23, 2022

Uh oh!

olsaarik commentedFeb 23, 2022

Uh oh!

stephentoub commentedFeb 23, 2022

Uh oh!

olsaarik commentedFeb 23, 2022

Uh oh!

stephentoub commentedFeb 23, 2022

Uh oh!

olsaarik commentedFeb 23, 2022

Uh oh!

veanes commentedFeb 23, 2022

Uh oh!

olsaarik commentedFeb 23, 2022

Uh oh!

stephentoub commentedFeb 24, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

veanes commentedFeb 25, 2022

Uh oh!

veanes commentedFeb 25, 2022

Uh oh!

veanes commentedFeb 25, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

veanes commentedFeb 20, 2022•
edited by stephentoub
Loading

stephentoub commentedFeb 22, 2022•
edited
Loading

stephentoub commentedFeb 24, 2022•
edited
Loading

veanes commentedFeb 25, 2022•
edited
Loading