Centralize regex tree analysis for atomic/capture/backtracking detection #65734

Merged

stephentoub merged 1 commit into dotnet:main from stephentoub:regexanalysis

Feb 25, 2022

stephentoub (Member) commented Feb 22, 2022

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. The net effect of this is that in some cases where we were generating code to handle backtracking we'll no longer emit that code, we're not susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node), and things are a little bit cleaner.

Fixes #62451
Note that the issue also talks about tracking not just which nodes contain captures, but which nodes are followed by captures, as that would allow us to avoid emitting uncapturing code for nodes in expressions that contain captures but where the captures were before that node in the graph. However, having written the logic to track that, I realized it was both a little complicated and it doesn't really buy us all that much, so I decided not to go ahead with it.
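
As a rough illustration of the single analysis pass described above (this is not the code in RegexTreeAnalyzer.cs), here is a minimal Python sketch. The `Node` type, the node kinds, and the rules for how atomicity flows from a parent to its children are simplifying assumptions made for this example; only the three queries mirror the `IsAtomic`, `MayContainCapture`, and `MayBacktrack` names used in this PR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # eq=False keeps identity hashing, so nodes can be stored in sets
class Node:
    # Hypothetical node kinds: "char", "concat", "alternate", "loop", "capture", "atomic"
    kind: str
    children: List["Node"] = field(default_factory=list)


class TreeAnalysis:
    """Walks the tree once and records per-node facts for later lookups."""

    def __init__(self, root: Node) -> None:
        self._atomic = set()
        self._contains_capture = set()
        self._may_backtrack = set()
        self._visit(root, atomic_by_ancestor=False)

    def _visit(self, node: Node, atomic_by_ancestor: bool) -> None:
        if atomic_by_ancestor:
            self._atomic.add(node)
        elif node.kind in ("alternate", "loop"):
            # Assumption: only these constructs introduce backtracking on their own.
            self._may_backtrack.add(node)

        if node.kind == "capture":
            self._contains_capture.add(node)

        atomic_by_self = node.kind == "atomic"
        last = len(node.children) - 1
        for i, child in enumerate(node.children):
            # Assumption: atomicity reaches the child of an atomic or capture node,
            # every branch of an alternation, and the last child of a concatenation.
            child_atomic = (atomic_by_ancestor or atomic_by_self) and (
                node.kind in ("atomic", "alternate", "capture")
                or (node.kind == "concat" and i == last)
            )
            self._visit(child, child_atomic)

            # Propagate facts from the child up to this node.
            if child in self._contains_capture:
                self._contains_capture.add(node)
            if not atomic_by_self and child in self._may_backtrack:
                self._may_backtrack.add(node)

    def is_atomic(self, node: Node) -> bool:
        return node in self._atomic

    def may_contain_capture(self, node: Node) -> bool:
        return node in self._contains_capture

    def may_backtrack(self, node: Node) -> bool:
        return node in self._may_backtrack


# Hand-built tree for a pattern shaped like (?>a(b*))c*
inner_loop = Node("loop", [Node("char")])                      # b*
capture = Node("capture", [inner_loop])                        # (b*)
atomic_node = Node("atomic", [Node("concat", [Node("char"), capture])])
outer_loop = Node("loop", [Node("char")])                      # c*
root = Node("concat", [atomic_node, outer_loop])

analysis = TreeAnalysis(root)
print(analysis.is_atomic(inner_loop), analysis.may_backtrack(inner_loop))
print(analysis.is_atomic(outer_loop), analysis.may_backtrack(outer_loop))
```

After the pass, each query is a set lookup, so checking a loop nested deep inside an atomic node does not require walking its ancestors again.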

stephentoub added the area-System.Text.RegularExpressions and tenet-performance (Performance related issue) labels

Feb 22, 2022

stephentoub added this to the 7.0.0 milestone

Feb 22, 2022

stephentoub requested a review from joperezr

February 22, 2022 20:49

ghost assigned stephentoub

Feb 22, 2022

ghost commented Feb 22, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`,`tenet-performance`
Milestone:	7.0.0

joperezr reviewed

Feb 24, 2022

...aries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexTreeAnalyzer.cs

Centralize regex tree analysis for atomic/capture/backtracking detection

2380596

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic).  This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information.  This provides the code with faster and more accurate answers.

stephentoub force-pushed the regexanalysis branch from 995a617 to 2380596

February 24, 2022 13:05

stephentoub (Member, Author) commented Feb 25, 2022

@joperezr, did you have any more feedback on this? Thanks.

joperezr approved these changes

Feb 25, 2022

joperezr (Member) left a comment

Do we want to add a few unit tests for RegexTreeAnalyzer to parse a few expressions and ensure it successfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possibly one where _complete is false too, to ensure that MayContainCapture and MayBacktrack always return true?

I know that all of the new code is covered by existing unit tests already since CodeGen engines will be using it every time, but I wonder if it would be beneficial to have focused tests for this helper class.

Other than that, this LGTM, thanks @stephentoub
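
One possible shape for such focused tests, sketched in Python rather than written against the real RegexTreeAnalyzer. The `AnalysisResults` class below is a hypothetical stand-in, and nodes are represented by plain strings for brevity. The only behavior taken from this thread is that when `_complete` is false, `MayContainCapture` and `MayBacktrack` should always return true.

```python
class AnalysisResults:
    """Hypothetical stand-in for the analyzer's result object."""

    def __init__(self, complete, contains_capture=(), may_backtrack=()):
        self._complete = complete
        self._contains_capture = set(contains_capture)
        self._may_backtrack = set(may_backtrack)

    def may_contain_capture(self, node):
        # Without a complete analysis, answer conservatively.
        return not self._complete or node in self._contains_capture

    def may_backtrack(self, node):
        # Same conservative fallback as above.
        return not self._complete or node in self._may_backtrack


def test_complete_analysis_reports_only_recorded_nodes():
    results = AnalysisResults(
        complete=True, contains_capture={"capture"}, may_backtrack={"loop"}
    )
    assert results.may_contain_capture("capture")
    assert not results.may_contain_capture("loop")
    assert results.may_backtrack("loop")
    assert not results.may_backtrack("capture")


def test_incomplete_analysis_is_conservative():
    results = AnalysisResults(complete=False)
    for node in ("capture", "loop", "char"):
        assert results.may_contain_capture(node)
        assert results.may_backtrack(node)


test_complete_analysis_reports_only_recorded_nodes()
test_incomplete_analysis_is_conservative()
```

The conservative fallback means a code generator consulting incomplete results would still emit backtracking and uncapturing code, which costs some efficiency but stays correct.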

stephentoub (Member, Author) commented Feb 25, 2022

> Do we want to add a few unit tests for RegexTreeAnalyzer to parse a few expressions and ensure it successfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possibly one where _complete is false too, to ensure that MayContainCapture and MayBacktrack always return true?

Sure, we can add some. I'd like to do so separately, though.