Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so it can be used when characters with those codes are used even when not escaped. This also resolves the ambuiguity that allowed the following:

PCRE2 version 10.34 2019-11-21
re> /\p{Cs}/match_invalid_utf
data> \x{dfff}=no_utf_check
No match

Several interrelated bug fixes:

An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be allowed with PCRE2_EXTRA_ALLOW_SURROGATES. Before this change non UTF-16 will not allow surrogates if encoded in the pattern directly.
In ALT_BSUX mode, any escaped character should represent itself if not "special", but that was not the case for not ASCII characters.
When match_invalid_utf is enabled a surrogate character in the subject was being skipped, even if no_utf_check was used together with allow_surrogates

Issues that would be addressed with a new version

the documentation around \o and \x, specially when bigger than 255 needs reviewing for clarification.
pcre2test support for testing escaped UTF characters is suboptimal, so most testing was done with custom code that should be integrated instead.
JIT is broken when match_invalid_utf and allow_surrogate.

Issues that are still open and had been punted from this version:

ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF.
PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might mean (\u00F1, assume binary matching like it seems to do with surrogates and match with an extended 0xf1 PCRE2_UCHAR, or error), if the later is something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?.
PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates.
PCRE2_EXTRA_ALT_BSUX should restrict the use of , so for example non ASCII or \U throw an error. It probably need to break with PCRE2_ALT_SUX.
Behaviour is different between UTF-16 and UTF-32 but tests are still unified so tests are missing.

bugfixes for UTF support

2f7db49

Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATESso it can be used when characters with those codes are used even when notescaped.  This also resolves the ambuiguity that allowed the following:  PCRE2 version 10.34 2019-11-21    re> /\p{Cs}/match_invalid_utf  data> \x{dfff}\=no_utf_check  No matchSeveral interrelated bug fixes:* An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be  allowed with PCRE2_EXTRA_ALLOW_SURROGATES.  Before this change non UTF-16  will not allow surrogates if encoded in the pattern directly.* In ALT_BSUX mode, any escaped character should represent itself if not  "special", but that was not the case for not ASCII characters.* When match_invalid_utf is enabled a surrogate character in the subject was  being skipped, even if no_utf_check was used together with allow_surrogatesIssues that would be addressed with a new version* the documentation around \o and \x, specially when bigger than 255 needs  reviewing for clarification.* pcre2test support for testing escaped UTF characters is suboptimal, so most  testing was done with custom code that should be integrated instead.* JIT is broken when match_invalid_utf and allow_surrogate.Issues that are still open and had been punted from this version:* ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF.* PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might  mean (\u00F1, assume binary matching like it seems to do with surrogates  and match with an extended  0xf1 PCRE2_UCHAR, or error), if the later is  something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?.* PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates.* PCRE2_EXTRA_ALT_BSUX should restrict the use of \, so for example non ASCII  or \U throw an error.  It probably need to break with PCRE2_ALT_SUX.* Behaviour is different between UTF-16 and UTF-32 but tests are still unified  so tests are missing.

carenas commented

Apr 15, 2023

View reviewed changes

testdata/testoutput5

		\x{09f}
		No match

		/^\p{Cs}/match_invalid_utf,allow_surrogates

Copy link

ContributorAuthor

carenasApr 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this replicates the behaviour where utf is used instead of match_invalid_utf, by using the "allow_surrogates" to redefine what is valid, but maybe it shouldn't match instead since it is invalid UTF-16?

carenas commented

Apr 15, 2023

View reviewed changes

src/pcre2.h.in

		/* An additional compile options word is available in the compile context. */

		#definePCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u/* C */
		#definePCRE2_EXTRA_ALLOW_SURROGATES 0x00000001u/* C */

Copy link

ContributorAuthor

carenasApr 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It might be better to keep the old flag and add PCRE2_EXTRA_ALLOW_SURROGATES, which will imply it and extend it?

Copy link

Collaborator

PhilipHazelApr 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, I was already going to say that making the non-compatible change of removing an option is not something I'd want to do, especially as there is a way round it, as you suggest.

Copy link

Collaborator

PhilipHazelApr 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

... and I note that the GitHub tests have failed...

Copy link

ContributorAuthor

carenasApr 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

correct, I posted this as a work in progress, to help gather feedback on the approach and caveats, and a roundabout way to ask Zoltan for help implementing the new behaviour (use a flag to redefine what is invalid utf), so that match_invalid_utf and utf behave similarly and resolve the ambiguity when matching \p{Cs} in the JIT, if it was sound and correct for the interpreter.

Will update the change and make the RFC more explicit.