Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

<regex>: Makewregex correctly match negated character classes#5403

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation

@muellerj2
Copy link
Contributor

Fixes#992 in a binary-compatible way by repurposing some unused NFA node flag bits. Also makes a little bit of progress towards#995.

Even if the solution in this PR is born out of the need to maintain binary compatibility, I think now that it's probably the best one for these negated classes anyway (at least as long as we keep the representation the same forstd::regex_traits and custom traits classes -- we could save one more bit forstd::regex_traits specifically, because we know how the character classes \w, \s and \d intersect).

Alternative solutions

  1. Store a singlechar_class_type value in the NFA node to represent all specified negated character classes. Sounds like a good idea until you realize that De Morgan's law applies here: For example, we have(not d) or (not w) = not (w and d) for character class[\W\D], so we would have to store the value ofw and d. While the standard guarantees that or'ing character class types makes sense ([re.grammar]/9, there is no such wording for intersecting/and'ing character class types, so and'ing them is allowed to fail for custom traits classes. (One can still make it work forstd::regex_traits though because we know how \W, \S and \D relate and control the associatedchar_class_type values.) See[\W\D] fails to match alphabetic characters boostorg/regex#241 and[libc++]<regex>: Character class[\W\D] fails to match alphabetic characters llvm/llvm-project#131516 for bug reports related to this.
  2. Store achar_class_type in the NFA node for each specified negated character class. Works, but it's quite wasteful.

Note that we could choose these solutions as well: We could create a class derived from_Node_class that stores thesechar_class_type values and repurpose a flag bit to mark that this is the extended version of the_Node_class. But this is a more complicated approach with no clear advantage. (We would have to implement the second alternative if we had to support users adding their own character class escapes like Boost.)

What about#5243?

We could apply the same solution for#5243 as well, but it would have an unfortunate side effect: If new parser and old matcher were to be mixed, one buggy behavior would be replaced by another buggy behavior. Currently,[\w\s] matches alphanumeric characters but fails to match spaces at code points >= 256. If we applied this PR's solution to#5243 as well and the old matcher were to be picked up,[\w\s] would match spaces but fail to match alphanumeric characters at code points >= 256.

We can also keep the old buggy behavior in all cases by basically implementing alternative solution 1 above, so a more complicated alternative approach does have a clear advantage here.

So the follow-up question is: Do we go with this solution for#5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?

Why does this PR also set bits on the root node?

These bits aren't used yet, but the idea is that they tell the matcher to look up these character classes during initialization so that these lookups don't have to happen each time an attempt is made to match a squared character class with a negated character class. We should start making use of these flags when the matcher is renamed. (I think this isn't too far in the future because an efficient fix for#5365 also requires a change to some internal data structures of the matcher.)

Even if we don't use them yet, setting them now has the advantage that we won't have to handle the case in the future that the negated class flags are set on a character class node but are not on the root node.

Why did you leave a small gap between old and new node flags?

I just wanted to leave some room for new flags common to many node types. We will most likely need at least one such flag: One that marks extended versions of NFA nodes. I also defined the flag bits twice to make it clear that they are specific to two node types (_N_begin and _N_class). But feel free to change this if you prefer to do things differently.

AlexGuteniev reacted with eyes emoji
@StephanTLavavejStephanTLavavej added bugSomething isn't working regexmeow is a substring of homeowner labelsApr 14, 2025
@StephanTLavavejStephanTLavavej self-assigned thisApr 14, 2025
@StephanTLavavej
Copy link
Member

So the follow-up question is: Do we go with this solution for#5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?

I think we should go with the first solution. Simpler is less risky and more maintainable, and changing buggy behavior in the event of mixing is fine.

muellerj2 reacted with thumbs up emoji

@StephanTLavavej
Copy link
Member

Thanks, this is great! 😻 I pushed tiny changes and expanded the test coverage slightly.

@StephanTLavavejStephanTLavavej removed their assignmentApr 19, 2025
@StephanTLavavejStephanTLavavej moved this fromInitial Review toReady To Merge inSTL Code ReviewsApr 19, 2025
@StephanTLavavejStephanTLavavej moved this fromReady To Merge toMerging inSTL Code ReviewsApr 22, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Copy link
Member

I resolved adjacent-edit conflicts in<regex> where#5392 changed:

  • To_Add_equiv2 and_Add_coll2, while this PR changed the parameters of_Add_named_class.
  • _Do_ex_class2 logic commented with// process =, while this PR changed the second argument of_Nfa._Add_named_class(_Cls, false); to_Rx_char_class_kind::_Positive.

@StephanTLavavejStephanTLavavej merged commitc0f5f35 intomicrosoft:mainApr 22, 2025
39 checks passed
@github-project-automationgithub-project-automationbot moved this fromMerging toDone inSTL Code ReviewsApr 22, 2025
@StephanTLavavej
Copy link
Member

Thanks for figuring out how to fix the unfixable! 🧠 💡 😻

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@StephanTLavavejStephanTLavavejStephanTLavavej approved these changes

Assignees

No one assigned

Labels

bugSomething isn't workingregexmeow is a substring of homeowner

Projects

Archived in project

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

<regex> mishandles locale-based character classes outside of the char range

2 participants

@muellerj2@StephanTLavavej

[8]ページ先頭

©2009-2025 Movatter.jp