Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Improved handling of Antimirov mode in Nonbacktracking Regex engine#65637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
stephentoub merged 2 commits intodotnet:mainfromveanes:antimirov
Feb 26, 2022

Conversation

veanes
Copy link
Contributor

@veanesveanes commentedFeb 20, 2022
edited by stephentoub
Loading

Refactored how current state is represented by encapsulating it in a new structCurrentState that is passed by reference in theDelta andTakeTransition methods and hides whether the state isDfaMatchingState or a set of NFA states that have now their own transition function_antimirovDelta maintained by theSymbolicRegexBuilder. Use ofCurrentState should not incur any overhead over howDfaMatchingState was used before. Further improvements in the implementation ofCurrentState might be possible usingSparseIntMap, currentlyList andHashSet are being used.

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Moved several methods fromSymbolicRegexMatcher toSymbolicRegexBuilder where they belong more naturally. RemovedITransition interface and the associated Brzozowski and Antimirov structs as they are no longer needed.

Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to coverWatchdog, that has become dormant for seemingly similar reasons.)

Some code ofCreateNewNfaTransition concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduceOr any more in many cases but ratherOrderedOr so the important case
if (coretarget.Node.Kind == SymbolicRegexKind.Or) inCreateNewNfaTransition
is currently never triggered and allOrderedOr nodes are treated as single states thus sidestepping the Antimirov mode.

TODO: Getting rid ofOr, or rather renamingOrderedOr toOr (and getting rid of the unordered semantics, the current mixture ofOr andOrderedOr is a temporary headache). This will also make theSymbolicRegexNode AST more lightweight

Fixes#60918

@ghost
Copy link

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info inarea-owners.md if you want to be subscribed.

Issue Details

Refactored how current state is represented by encapsulating it in a new structCurrentState that is passed by reference in theDelta andTakeTransition methods and hides whether the state isDfaMatchingState or a set of NFA states that have now their own transition function_antimirovDelta maintained by theSymbolicRegexBuilder. Use ofCurrentState should not incur any overhead over howDfaMatchingState was used before. Further improvements in the implementation ofCurrentState might be possible usingSparseIntMap, currentlyList andHashSet are being used.

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Moved several methods fromSymbolicRegexMatcher toSymbolicRegexBuilder where they belong more naturally. RemovedITransition interface and the associated Brzozowski and Antimirov structs as they are no longer needed.

Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to coverWatchdog, that has become dormant for seemingly similar reasons.)

Some code ofCreateNewNfaTransition concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduceOr any more in many cases but ratherOrderedOr so the important case
if (coretarget.Node.Kind == SymbolicRegexKind.Or) inCreateNewNfaTransition
is currently never triggered and allOrderedOr nodes are treated as single states thus sidestepping the Antimirov mode.

TODO: Getting rid ofOr, or rather renamingOrderedOr toOr (and getting rid of the unordered semantics, the current mixture ofOr andOrderedOr is a temporary headache). This will also make theSymbolicRegexNode AST more lightweight

Author:veanes
Assignees:-
Labels:

area-System.Text.RegularExpressions

Milestone:-

@Wraith2
Copy link
Contributor

Why is it called_antimirovDelta and is there something more generic it could be called?

@veanes
Copy link
ContributorAuthor

The naming_antimirovDelta,_antimirovStatearray,_antimirovStatearrayInverse as well as the boolen flag _antimirov reflect that these mirror the values of NFA in _antimirov mode (NFA mode) as opposed to the default DFA (_delta and_statearray) in Brzozowski mode. The names could also be _nfaDelta, _nfaStatearray, and _nfaStatearrayInverse, but the names reflect that the intent that the concerned algorithms use Antimirov-style derivatives (leading to NFA) vs Brzozowski-style derivatives (leading to DFA). We may end up renaming these names as some later point, not sure. In general the namedelta is often used for atransition function of an automaton. So long story short, the name_antimirovDelta is really_nfaTransitionFunction.

@veanes
Copy link
ContributorAuthor

The failing tests are unrelated to the changes in this pr (or so it looks to me).

@Wraith2
Copy link
Contributor

The algorithm names and the distinction seems like the sort of thing that would live happily in a complex comment that explains the difference for users. The variable names beingnfa ordfa would be both clear and more approachable than using the algorithm authors' names to me.

@veanes
Copy link
ContributorAuthor

Point taken, we will probably end up doing a separate phase of renamings addressing the terminology issue in general, as a separate PR for this, e.g., renaming_antimirov to_nfaMode, etc., there are other similar cases in the code. We have been discussing this on and off for a while.

Wraith2 and GSPP reacted with thumbs up emoji

@stephentoub
Copy link
Member

@veanes, I rebased your commits on main and fix up all the merge conflicts. I also added a commit that cleans up a few things, including avoiding the ToArray call in the NFA mode on every transition.

Before doing further work locally in your branch, you'll want to sync to these changes, which you can do with:

git fetch --all
git reset --hard origin/antimirov

If you've already done further work locally,don't do the above until you've made a backup of that work, e.g. by creating a new branch from it:

git checkout -b antimirovtemp
git commit -a -m "fix todos"

then doing the earlier commands, and then cherry-picking your last commit.

@stephentoub
Copy link
Member

FYI, in its current state this ends up really regressing perf. I see a few obvious things I'll fix...

@veanes
Copy link
ContributorAuthor

Thanks a lot for the rebase! I don't have any pending work other than a unit test case I was debugging the watchdog issue. I'll just stash this change and apply the first option above. I think perf could be addressed withSparseIntMap (@olsaarik know best how to apply it)

@stephentoub
Copy link
Member

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I don't have any pending work

There are a variety of TODOs in the code, e.g. for OrderedOr. What's the plan for those? We're at a point where I'd like to avoid checking in lots of new open issues. If they're of the form "we could consider using XYZ to improve perf", that's fine, and we should have an issue tracking it. But if they're of the form "I'm not sure if this is functionally correct; we should investigate whether we need to be doing something different here", I'd like to address that as part of the change.

@veanes
Copy link
ContributorAuthor

The plan forOrderedOr:
We get rid ofOr and only haveOrderedOr. The unorderedOr case could be realized by ordering the elements in theOrderedOr collection by using hashcodes (if this would be needed, but I think not, since we want to now follow the exact (or best effort) same semantics for matches as in backtracking, in which caseOr needs to be ordered always). In particular this would essentially get rid ofSymbolicRegexSet used for_alts (conjunction case can be added back later as a form ofAnd-collection using binary representation as inOrderedOr but not relevant now)

stephentoub reacted with thumbs up emoji

@veanes
Copy link
ContributorAuthor

Regarding the continuous creation ofCurrentState, this happens only when a shift happens from DFA to NFA mode, the intent was that this should not happen very often, but I may have been wrong here (which is why I did not think allowingnull initializing them tonull would make much difference)

@stephentoub
Copy link
Member

stephentoub commentedFeb 22, 2022
edited
Loading

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I pushed a commit to fixsome of the perf regressions. In particular, the change was allocating multiple lists/sets every time a new CurrentSet was created, even when in DFA mode, which was causing allocations to skyrocket and having a very measurable impact on throughput.

Even with that, though, there are some substantial perf regressions. Here's what the "sherlock" perf tests:
https://github.com/dotnet/performance/blob/996fb443bf081a976a2f92a0fc14d5f1c212f6f7/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L127-L169
look like for me when I run them locally:

MethodToolchainPatternOptionsMeanRatio
Count\main\corerun.exe(?i)HolmesNonBacktracking659.04 us1.00
Count\pr\corerun.exe(?i)HolmesNonBacktracking678.86 us1.03
Count\main\corerun.exe(?i)Sher[a-z]+|Hol[a-z]+NonBacktracking1,215.72 us1.00
Count\pr\corerun.exe(?i)Sher[a-z]+|Hol[a-z]+NonBacktracking1,252.77 us1.03
Count\main\corerun.exe(?i)SherlockNonBacktracking141.34 us1.00
Count\pr\corerun.exe(?i)SherlockNonBacktracking154.75 us1.10
Count\main\corerun.exe(?i)Sherlock HolmesNonBacktracking149.50 us1.00
Count\pr\corerun.exe(?i)Sherlock HolmesNonBacktracking161.83 us1.08
Count\main\corerun.exe(?i)Sherlock|Holmes|WatsonNonBacktracking3,546.32 us1.00
Count\pr\corerun.exe(?i)Sherlock|Holmes|WatsonNonBacktracking3,690.29 us1.04
Count\main\corerun.exe(?i)Sherlock|(...)er|John|Baker [49]NonBacktracking4,896.03 us1.00
Count\pr\corerun.exe(?i)Sherlock|(...)er|John|Baker [49]NonBacktracking5,061.19 us1.03
Count\main\corerun.exe(?i)theNonBacktracking1,534.91 us1.00
Count\pr\corerun.exe(?i)theNonBacktracking1,894.79 us1.24
Count\main\corerun.exe(?m)^Sherlock(...)rlock Holmes$ [37]NonBacktracking66.29 us1.00
Count\pr\corerun.exe(?m)^Sherlock(...)rlock Holmes$ [37]NonBacktracking72.95 us1.10
Count\main\corerun.exe(?s).*NonBacktracking3,287.32 us1.00
Count\pr\corerun.exe(?s).*NonBacktracking4,450.44 us1.35
Count\main\corerun.exe.*NonBacktracking4,374.97 us1.00
Count\pr\corerun.exe.*NonBacktracking6,046.40 us1.38
Count\main\corerun.exeHolmesNonBacktracking105.21 us1.00
Count\pr\corerun.exeHolmesNonBacktracking135.95 us1.29
Count\main\corerun.exeHolmes.{0,25}(...).{0,25}Holmes [39]NonBacktracking187.00 us1.00
Count\pr\corerun.exeHolmes.{0,25}(...).{0,25}Holmes [39]NonBacktracking241.87 us1.29
Count\main\corerun.exeSher[a-z]+|Hol[a-z]+NonBacktracking187.05 us1.00
Count\pr\corerun.exeSher[a-z]+|Hol[a-z]+NonBacktracking235.15 us1.26
Count\main\corerun.exeSherlockNonBacktracking56.30 us1.00
Count\pr\corerun.exeSherlockNonBacktracking65.92 us1.17
Count\main\corerun.exeSherlock HolmesNonBacktracking65.58 us1.00
Count\pr\corerun.exeSherlock HolmesNonBacktracking76.54 us1.17
Count\main\corerun.exeSherlock\s+HolmesNonBacktracking79.20 us1.00
Count\pr\corerun.exeSherlock\s+HolmesNonBacktracking93.03 us1.18
Count\main\corerun.exeSherlock|HolmesNonBacktracking191.97 us1.00
Count\pr\corerun.exeSherlock|HolmesNonBacktracking250.97 us1.31
Count\main\corerun.exeSherlock|Holmes|WatsonNonBacktracking250.09 us1.00
Count\pr\corerun.exeSherlock|Holmes|WatsonNonBacktracking315.37 us1.26
Count\main\corerun.exeSherlock|Holm(...)er|John|Baker [45]NonBacktracking2,949.07 us1.00
Count\pr\corerun.exeSherlock|Holm(...)er|John|Baker [45]NonBacktracking3,011.03 us1.02
Count\main\corerun.exeSherlock|StreetNonBacktracking89.18 us1.00
Count\pr\corerun.exeSherlock|StreetNonBacktracking108.13 us1.21
Count\main\corerun.exeTheNonBacktracking128.29 us1.00
Count\pr\corerun.exeTheNonBacktracking151.60 us1.20
Count\main\corerun.exe[^\n]*NonBacktracking4,304.93 us1.00
Count\pr\corerun.exe[^\n]*NonBacktracking6,143.43 us1.43
Count\main\corerun.exe[a-q][^u-z]{13}xNonBacktracking103.27 us1.00
Count\pr\corerun.exe[a-q][^u-z]{13}xNonBacktracking119.98 us1.16
Count\main\corerun.exe[a-zA-Z]+ingNonBacktracking6,801.28 us1.00
Count\pr\corerun.exe[a-zA-Z]+ingNonBacktracking9,346.86 us1.37
Count\main\corerun.exe\b\w+n\bNonBacktracking8,425.09 us1.00
Count\pr\corerun.exe\b\w+n\bNonBacktracking11,394.81 us1.36
Count\main\corerun.exe\p{Ll}NonBacktracking29,176.29 us1.00
Count\pr\corerun.exe\p{Ll}NonBacktracking44,094.94 us1.51
Count\main\corerun.exe\p{Lu}NonBacktracking1,909.84 us1.00
Count\pr\corerun.exe\p{Lu}NonBacktracking2,445.12 us1.28
Count\main\corerun.exe\p{L}NonBacktracking30,398.30 us1.00
Count\pr\corerun.exe\p{L}NonBacktracking45,196.70 us1.49
Count\main\corerun.exe\s[a-zA-Z]{0,12}ing\sNonBacktracking4,454.26 us1.00
Count\pr\corerun.exe\s[a-zA-Z]{0,12}ing\sNonBacktracking6,538.71 us1.47
Count\main\corerun.exe\w+NonBacktracking13,048.35 us1.00
Count\pr\corerun.exe\w+NonBacktracking19,288.41 us1.48
Count\main\corerun.exe\w+\s+HolmesNonBacktracking3,883.77 us1.00
Count\pr\corerun.exe\w+\s+HolmesNonBacktracking6,294.95 us1.61
Count\main\corerun.exe\w+\s+Holmes\s+\w+NonBacktracking3,823.51 us1.00
Count\pr\corerun.exe\w+\s+Holmes\s+\w+NonBacktracking6,066.32 us1.59
Count\main\corerun.exeaeiNonBacktracking51.11 us1.00
Count\pr\corerun.exeaeiNonBacktracking51.54 us1.01
Count\main\corerun.exeaqjNonBacktracking39.11 us1.00
Count\pr\corerun.exeaqjNonBacktracking38.40 us0.99
Count\main\corerun.exetheNonBacktracking824.52 us1.00
Count\pr\corerun.exetheNonBacktracking1,160.05 us1.41
Count\main\corerun.exethe\s+\w+NonBacktracking1,320.44 us1.00
Count\pr\corerun.exethe\s+\w+NonBacktracking1,879.80 us1.42
Count\main\corerun.exezqjNonBacktracking37.05 us1.00
Count\pr\corerun.exezqjNonBacktracking37.06 us1.00

I don't think any of those are exercising the NFA mode (?), so this is a hit to the DFA perf.

@stephentoub
Copy link
Member

Regarding the continuous creation of CurrentState, this happens only when a shift happens from DFA to NFA mode

It happens multiple times per match operation.

@veanes
Copy link
ContributorAuthor

I am puzzled by this, asCurrentState is passed by reference inDelta andTakeTransition and should only be createdonce (unless NFA mode kicks in, at least this was the intent).
Ohhh, I see, the issue is that it should be reused when new initial (DfaMatchingState) is created, as opposed to be recreated inFindEndPosition,FindStartPosition etc.CurrentState should be passed into those methods asparameter. -- my bad.

@veanes
Copy link
ContributorAuthor

Unless you are editing this right now I would suggest a fix this by passing the ref toCurrentState inFindEndPosition andFindStartPosition andFindFinalStatePosition and creating it inFindMatchonce.

@veanes
Copy link
ContributorAuthor

I'm fixing it right now.

@stephentoub
Copy link
Member

Unless you are editing this right now I would suggest a fix this by passing the ref to CurrentState in FindEndPosition and FindStartPosition and FindFinalStatePosition and creating it in FindMatch once.

That will still create the lists/sets once per match operation. The fix I have here creates them once per runner, allowing those lists/sets to be reused across matches, which is something we'll want to maintain.

If we can go further and pass the current state by ref into those helpers instead of also passing in PerThreadState and creating a new CurrentState in each of those, great. But I'm not clear how that relates to the argument being passed in to CurrentState in each of those locations... it's already going to be in the right state?

I'm not currently editing this (and don't plan to this evening). Please feel free to pull it down and tweak it according to your recommendation.

Thanks.

@veanes
Copy link
ContributorAuthor

CurrentState is really a ref for holding the actual state, when passing it in it initially it will passed say as the first argument to
FindFinalStatePosition,

s =new CurrentState<TSetType>(_dotstarredInitialStates[GetCharKind(input, i - 1)])FindFinalStatePosition(s, ... old arguments)

Well, I'll fix this and push the fix later this evening.

@veanes
Copy link
ContributorAuthor

Oh,, I forgot to pull your edits@stephentoub .. so now I'm getting just collisions, we were doing similar things in slightly different ways ... essentially perThreadData vs passing the ref to CurrentState (that is per thread obviously due to being created in the method call), not sure what the best option is here exactly, but I think using CurrentState as it was intended to (i.e., fixing my bug, just creating it once and passing it by ref is still the right thing to do). At the same time, I think that the Boolean_antimirov should be per thread data also, currently it is not.

@veanes
Copy link
ContributorAuthor

I merged the conflicts, kept both edits, so passing theperThreadData, I think_antimirov should be a flag inperThreadData (currently it is not, it is in the builder).

@stephentoub
Copy link
Member

I think _antimirov should be a flag in perThreadData (currently it is not, it is in the builder).

I don't understand how _antimirov is supposed to work. The code is flipping it back to false for phase 2/3; doesn't that mean the next match performed will start off creating new nodes as DFA even if it already exceeded the threshold?

Why do we need this flag at all? e.g. why not just decide when lazily creating a new transition whether it should be DFA or NFA, based on the current size of the graph (or based on whether we've initialized the NFA state lists/sets), rather than based on a persisted bool?

@stephentoub
Copy link
Member

I've got the regression mostly addressed. But I'm trying to figure out now why the we're never ending up with antimirov set to true. I don't think it's from my changes, so maybe from something in Margus' initial changes here. I'll keep looking at it and push a new commit soon.

@olsaarik
Copy link
Contributor

Here's my idea fleshed out in code:https://github.com/olsaarik/runtime/blob/antimirov-fix/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L797-L900 . One change that seemed necessary was to not pass in theICurrentState asref, since an assignment to that at the end would necessarily do the boxing (although only once per match, so might be fine).

@stephentoub
Copy link
Member

What I have in my local changes is a CurrentState struct that's just a discriminated union of DfaMatchingState and an NfaMatchingState. We pass that around by ref, and we do the same static interface constraint we currently do in main, with DfaStateHandler and NfaStateHandler types implementing those interfaces, which operate over the ref CurrentState passed in.

@olsaarik
Copy link
Contributor

I think I like the sound of your solution better. The recursive call into the same function with a different type that I had to do is a bit surprising.

@stephentoub
Copy link
Member

Funny, I was thinking the same about yours. :-) I'll experiment with a hybrid.

olsaarik reacted with laugh emoji

@olsaarik
Copy link
Contributor

By the way in my solution I think the check for antimirov should actually happen at the start of the outer loop rather than at the end to avoid new match calls temporarily being in DFA mode when they should be in NFA already.

@veanes
Copy link
ContributorAuthor

Thanks for helping out with this this!
Looks like there is definite progress.
I'm busy today with a presentation I have to give tomorrow so had to switch context.

@olsaarik
Copy link
Contributor

So here's a fun "alternative" approach:https://github.com/olsaarik/runtime/blob/0d51dc4174e0ef726ce6f75734af79719a5ef04f/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L866-L878DfaMatchingState.Next throws thatNoMoreDfaException when it would create a new state over the antimirov bound and then the whole thing switches over to NFA mode. It only happens once per match call and due to the "zero cost" nature of exceptions shouldn't slow down DFA mode at all.

Not sure this is a serious proposal, but I love using exceptions for control flow :D

stephentoub reacted with laugh emoji

@stephentoubstephentoubforce-pushed theantimirov branch 2 times, most recently froma6fd2c5 to61805baCompareFebruary 24, 2022 03:55
@stephentoub
Copy link
Member

stephentoub commentedFeb 24, 2022
edited
Loading

@olsaarik,@veanes,@joperezr, I pushed a new set of changes. Please take a look. From the commit description:

Overhaul SymbolicRegexMatcher to avoid DFA perf regressions from NFA changes
This does a few things:

  • The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states. Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode. This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing. The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.
  • Overhauled how we switch from DFA mode to NFA mode. Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode. We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.
  • Overhauled how we handle the outer/inner loop logic. Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size. Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.
  • Avoids inlining huge amounts of NFA code that can result in worse DFA code
  • Renamed Brzozowski and Antimirov to DFA and NFA.
  • Added a lot of comments to SymbolicRegexMatcher
  • Fixes "watchdog" issue with captures
  • A few other renames / cleanups.

I still see some variance in the perf tests, but nothing like before. Most of the tests are now +/- 10% from where they were before. A few tests get better by upwards of 20% while a few get worse by a similar margin. There's also still room for microoptimization on the fast path, but I'd like to look at that separately.

@stephentoubstephentoubforce-pushed theantimirov branch 2 times, most recently fromd829db0 to0c63092CompareFebruary 25, 2022 03:34
@veanes
Copy link
ContributorAuthor

I wanted to add that I went through all the new changes involving use ofIStateHandler. It looks really clean to me and I'm happy to see the use ofSparseIntMap. Definitely looks very solid and clear, I'm really happy this is converging. I do not have very strong viewpoint on the special case of one state. But I know one thingfor sure. Having single state in NFA transitions is very common (e.g. the current test cases are like that, I debugged and checked), most if not all nondeterminism isbecause of the set not because of the NFA state also possibly being nondeterministic. So special casing could make sense if there is even a small gain.

@veanes
Copy link
ContributorAuthor

I was actually thinking of the List of states in the NFA transition function ... Still having a single state in the set itself may also occur, as the state becomes "deteriministic" temporarily, and could continue to be so for while, this obviously depends on a combination of circumstances triggered by a particular input that makes the behavior deterministic for that particular case (e.g. returning to the "initial" state that just loops around in it, if the initial state is special cased and not split into a set --- an optimization we discussed with@olsaarik )

@veanes
Copy link
ContributorAuthor

veanes commentedFeb 25, 2022
edited
Loading

Concretely, one optimization would be inNfaStateHandler:
public static bool IsInitialState(ref CurrentState state) =>

            if  state.NfaState!.NfaStateSet.Count == 1 then             builder.GetCoreState(nfaState.NfaState!.NfaStateSet[0].IsInitialState)            else false

This would allow the top-level loop to know that we are still in initial state and take advantage of the related optimizations used in DFA mode (initial prefix search etc)

Obviously, this calls for a proper NFA mode set of benchmarks we currently don't have.

@stephentoub
Copy link
Member

Concretely, one optimization would be in NfaStateHandler

Thanks. Let's look at that subsequently. I didn't apply it for now as that would also require changing the main loop that assumes NFA states are never initial and thus only looks to see whether the current state is a DFA initial state.

@veanes
Copy link
ContributorAuthor

Sounds good, just wanted to point this out, to keep this in mind.

stephentoub reacted with thumbs up emoji

veanesand others added2 commitsFebruary 25, 2022 19:38
…changesThis does a few things:- The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states.  Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode.  This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing.  The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.- Overhauled how we switch from DFA mode to NFA mode.  Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode.  We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.- Overhauled how we handle the outer/inner loop logic.  Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size.  Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.- Avoids inlining huge amounts of NFA code that can result in worse DFA code- Renamed Brzozowski and Antimirov to DFA and NFA.- Added a lot of comments to SymbolicRegexMatcher- Fixes "watchdog" issue with captures- A few other renames / cleanups.
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.
Reviewers

@stephentoubstephentoubstephentoub left review comments

@olsaarikolsaarikolsaarik approved these changes

@danmoseleydanmoseleyAwaiting requested review from danmoseley

@joperezrjoperezrAwaiting requested review from joperezr

Assignees

@veanesveanes

Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

Optimize symbolic regex Antimirov mode to avoid so much allocation
4 participants
@veanes@Wraith2@stephentoub@olsaarik

[8]ページ先頭

©2009-2025 Movatter.jp