Even though the solution in this PR was born out of the need to maintain binary compatibility, I now think that it's probably the best one for these negated classes anyway (at least as long as we keep the representation the same for std::regex_traits and custom traits classes -- we could save one more bit for std::regex_traits specifically, because we know how the character classes \w, \s and \d intersect).

Alternative solutions

1. Store a single char_class_type value in the NFA node to represent all specified negated character classes. Sounds like a good idea until you realize that De Morgan's law applies here: For example, we have (not d) or (not w) = not (w and d) for character class [\W\D], so we would have to store the value of w and d. While the standard guarantees that or'ing character class types makes sense ([re.grammar]/9), there is no such wording for intersecting/and'ing character class types, so and'ing them is allowed to fail for custom traits classes. (One can still make it work for std::regex_traits though, because we know how \W, \S and \D relate and we control the associated char_class_type values.) See "[\W\D] fails to match alphabetic characters" (boostorg/regex#241) and "[libc++] <regex>: Character class [\W\D] fails to match alphabetic characters" (llvm/llvm-project#131516) for bug reports related to this.
2. Store a char_class_type in the NFA node for each specified negated character class. Works, but it's quite wasteful.
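The De Morgan argument in the first alternative can be checked with a small sketch. It assumes ASCII-only stand-ins for \w and \d (the helper names are hypothetical and not part of <regex>) and compares the per-class evaluation of [\W\D] with the intersection form and with the tempting but wrong "negate the union" shortcut:

```cpp
#include <cassert>
#include <cctype>

// Hypothetical ASCII-only stand-ins for the \w and \d character classes.
inline bool is_w(unsigned char c) { return std::isalnum(c) != 0 || c == '_'; }
inline bool is_d(unsigned char c) { return std::isdigit(c) != 0; }

// [\W\D] evaluated class by class: (not w) or (not d).
inline bool matches_neg_w_or_neg_d(unsigned char c) { return !is_w(c) || !is_d(c); }

// The same set via De Morgan: not (w and d). This needs the intersection of \w and \d.
inline bool matches_not_w_and_d(unsigned char c) { return !(is_w(c) && is_d(c)); }

// The wrong shortcut: not (w or d). This is what or'ing the masks and negating would give.
inline bool matches_not_w_or_d(unsigned char c) { return !(is_w(c) || is_d(c)); }

// Returns true if the per-class form and the intersection form agree on all of ASCII.
inline bool de_morgan_holds_for_ascii() {
    for (int i = 0; i < 128; ++i) {
        const unsigned char c = static_cast<unsigned char>(i);
        if (matches_neg_w_or_neg_d(c) != matches_not_w_and_d(c)) {
            return false;
        }
    }
    return true;
}
```

For the letter a, the per-class form matches (a is not a digit), while the negated union does not, which is the failure mode described in the linked bug reports.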

Note that we could choose these solutions as well: We could create a class derived from _Node_class that stores these char_class_type values and repurpose a flag bit to mark that this is the extended version of the _Node_class. But this is a more complicated approach with no clear advantage. (We would have to implement the second alternative if we had to support users adding their own character class escapes like Boost.)

What about #5243?

We could apply the same solution for #5243 as well, but it would have an unfortunate side effect: If the new parser and the old matcher were to be mixed, one buggy behavior would be replaced by another buggy behavior. Currently, [\w\s] matches alphanumeric characters but fails to match spaces at code points >= 256. If we applied this PR's solution to #5243 as well and the old matcher were to be picked up, [\w\s] would match spaces but fail to match alphanumeric characters at code points >= 256.

We can also keep the old buggy behavior in all cases by basically implementing alternative solution 1 above, so a more complicated alternative approach does have a clear advantage here.

So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?

Why does this PR also set bits on the root node?

These bits aren't used yet, but the idea is that they tell the matcher to look up these character classes during initialization, so that these lookups don't have to happen each time an attempt is made to match a bracketed character class that contains a negated character class. We should start making use of these flags when the matcher is renamed. (I think this isn't too far in the future, because an efficient fix for #5365 also requires a change to some internal data structures of the matcher.)

Even if we don't use them yet, setting them now has the advantage that we won't have to handle the case in the future that the negated class flags are set on a character class node but are not on the root node.
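A rough sketch of how a matcher could use such root-node flags: look up only the announced classes once at construction, then reuse the cached masks for every match attempt. The flag names and the cached_classes type are hypothetical; only std::regex_traits<char>::lookup_classname and isctype are real library calls, and this is not how the STL's matcher is actually written.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical root-node flag bits; the real names and values live in <regex>.
enum root_flags : unsigned {
    needs_w = 1u << 0,
    needs_s = 1u << 1,
    needs_d = 1u << 2,
};

// Looks up \w, \s and \d once, but only for the classes the root flags announce.
struct cached_classes {
    using traits_type = std::regex_traits<char>;
    using mask_type   = traits_type::char_class_type;

    mask_type w{};
    mask_type s{};
    mask_type d{};

    cached_classes(const traits_type& traits, unsigned flags) {
        if (flags & needs_w) { w = lookup(traits, "w"); }
        if (flags & needs_s) { s = lookup(traits, "s"); }
        if (flags & needs_d) { d = lookup(traits, "d"); }
    }

    static mask_type lookup(const traits_type& traits, const std::string& name) {
        return traits.lookup_classname(name.begin(), name.end());
    }
};

// Matching a negated class then reduces to one isctype call on the cached mask.
inline bool matches_negated(const std::regex_traits<char>& traits, char ch,
                            std::regex_traits<char>::char_class_type mask) {
    return !traits.isctype(ch, mask);
}
```

With this shape, a class node whose negated-class flag is set can rely on the corresponding mask having been looked up, which is why the flags on the root node and on the class nodes need to stay in sync.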

Why did you leave a small gap between old and new node flags?

I just wanted to leave some room for new flags common to many node types. We will most likely need at least one such flag: One that marks extended versions of NFA nodes. I also defined the flag bits twice to make it clear that they are specific to two node types (_N_begin and _N_class). But feel free to change this if you prefer to do things differently.
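The layout idea can be sketched as follows. All names and values are hypothetical and only illustrate the gap and the "defined twice" convention; the actual flag definitions are in the PR's changes to <regex>.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical values throughout; they only illustrate the layout idea.
enum node_flags : std::uint32_t {
    fl_last_old = 0x080, // stand-in for the highest flag that existed before

    // 0x100 and 0x200 stay unassigned, reserved for flags shared by many node types.

    // Flags on the root (begin) node: some class node uses a negated \w, \s or \d.
    fl_begin_needs_w = 0x400,
    fl_begin_needs_s = 0x800,
    fl_begin_needs_d = 0x1000,

    // The same bits, defined a second time for character class nodes.
    fl_class_negated_w = 0x400,
    fl_class_negated_s = 0x800,
    fl_class_negated_d = 0x1000,
};

constexpr std::uint32_t reserved_gap = 0x100 | 0x200;
constexpr std::uint32_t negated_bits = fl_class_negated_w | fl_class_negated_s | fl_class_negated_d;

static_assert((negated_bits & reserved_gap) == 0, "new flags must not use the reserved gap");
static_assert(fl_begin_needs_w == fl_class_negated_w, "both node types share the same bit");

// Copies the negated-class bits of a class node onto the root node's flags.
constexpr std::uint32_t propagate_to_root(std::uint32_t root, std::uint32_t cls) {
    return root | (cls & negated_bits);
}
```

Sharing the bit values between the two node types keeps the propagation to the root node a plain mask-and-or.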

<regex>: Make wregex correctly match negated character classes

1a714f5

muellerj2 requested a review from a team as a code owner

April 12, 2025 11:57

github-project-automation bot added this to STL Code Reviews

Apr 12, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews

Apr 12, 2025

muellerj2 commented

Apr 13, 2025


StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels

Apr 14, 2025

StephanTLavavej self-assigned this

Apr 14, 2025

StephanTLavavej reviewed

Apr 18, 2025


StephanTLavavej commented Apr 18, 2025

So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?

I think we should go with the first solution. Simpler is less risky and more maintainable, and changing buggy behavior in the event of mixing is fine.

StephanTLavavej added 3 commits

April 18, 2025 13:31

_Lookup_char_class can be a const member function.

63981c7

Properly name neg_w_regex_skip.

ea15dec

Add test coverage for [^\W], [^\S], [^\D], [\w\W], [\s\S], [\d\D].

8234a5c

StephanTLavavej reviewed

Apr 19, 2025


StephanTLavavej commented Apr 19, 2025

Thanks, this is great! 😻 I pushed tiny changes and expanded the test coverage slightly.

StephanTLavavej approved these changes

Apr 19, 2025


StephanTLavavej removed their assignment

Apr 19, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews

Apr 19, 2025

muellerj2 mentioned this pull request

Apr 20, 2025

<regex>: basic_regex wants regex_traits to provide things not required by [re.req] #995

Closed

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews

Apr 22, 2025


StephanTLavavej commented Apr 22, 2025

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

Merge branch 'main' into regex-final-negated-char-class-fix

c2f290f


StephanTLavavej commented Apr 22, 2025

I resolved adjacent-edit conflicts in <regex> where #5392 changed:

To _Add_equiv2 and _Add_coll2, while this PR changed the parameters of _Add_named_class.
_Do_ex_class2 logic commented with // process =, while this PR changed the second argument of _Nfa._Add_named_class(_Cls, false); to _Rx_char_class_kind::_Positive.