Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Centralize regex tree analysis for atomic/capture/backtracking detection#65734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
stephentoub merged 1 commit intodotnet:mainfromstephentoub:regexanalysis
Feb 25, 2022

Conversation

stephentoub
Copy link
Member

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. The net effect of this is that in some cases where we were generating code to handle backtracking we'll no longer emit that code, we're not susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node), and things are a little bit cleaner.

Fixes#62451
Note that the issue also talks about tracking not just which nodes contain captures, but which nodes are followed by captures, as that would allow us to avoid emitting uncapturing code for nodes in expressions that contain captures but where the captures were before that node in the graph. However, having written the logic to track that, I realized it was both a little complicated and it doesn't really buy us all that much, so I decided not to go ahead with it.

@ghost
Copy link

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info inarea-owners.md if you want to be subscribed.

Issue Details

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic). This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information. The net effect of this is that in some cases where we were generating code to handle backtracking we'll no longer emit that code, we're not susceptible to O(N^2) behavior in some places we previously were for oddly shaped trees (e.g. a loop deeply nested inside of an atomic node), and things are a little bit cleaner.

Fixes#62451
Note that the issue also talks about tracking not just which nodes contain captures, but which nodes are followed by captures, as that would allow us to avoid emitting uncapturing code for nodes in expressions that contain captures but where the captures were before that node in the graph. However, having written the logic to track that, I realized it was both a little complicated and it doesn't really buy us all that much, so I decided not to go ahead with it.

Author:stephentoub
Assignees:-
Labels:

area-System.Text.RegularExpressions,tenet-performance

Milestone:7.0.0

We currently either guess at some of this state based on the immediate surrounding nodes (e.g. whether the immediate child backtracks) or we do potentially-expensive walks each time we need to check (e.g. walking all ancestors until root to determine whether a given node is to be considered atomic).  This changes the code to do a pass over the graph to compute the relevant information, which can then be used by the code generators any time they need to access that information.  This provides the code with faster and more accurate answers.
@stephentoub
Copy link
MemberAuthor

@joperezr, did you have any more feedback on this? Thanks.

Copy link
Member

@joperezrjoperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Do we want to add few unit tests for RegexTreeAnalyzer to parse a few expressions and ensure it succesfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possible one where _complete is false too that ensures MayContainCapture and MayBacktrack always return true?

I know that all of the new code is covered by existing unit tests already since CodeGen engines will be using it every time, but I wonder if it would be beneficial to have focused tests for this helper class.

Other than that, this LGTM, thanks@stephentoub

@stephentoub
Copy link
MemberAuthor

Do we want to add few unit tests for RegexTreeAnalyzer to parse a few expressions and ensure it succesfully calculates IsAtomic, MayContainCapture, and MayBacktrack? Possible one where _complete is false too that ensures MayContainCapture and MayBacktrack always return true?

Sure, we can add some. I'd like to do so separately, though.

@stephentoubstephentoub merged commit2ce0af0 intodotnet:mainFeb 25, 2022
@stephentoubstephentoub deleted the regexanalysis branchFebruary 25, 2022 21:43
@ghostghost locked asresolvedand limited conversation to collaboratorsMar 28, 2022
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.
Reviewers

@joperezrjoperezrjoperezr approved these changes

Assignees

@stephentoubstephentoub

Projects
None yet
Milestone
7.0.0
Development

Successfully merging this pull request may close these issues.

Improve Regex's knowledge of captures and atomicity in the tree
2 participants
@stephentoub@joperezr

[8]ページ先頭

©2009-2025 Movatter.jp