Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

bugfixes for UTF support (WIP)#226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Draft
carenas wants to merge1 commit intoPCRE2Project:main
base:main
Choose a base branch
Loading
fromcarenas:sux

Conversation

@carenas
Copy link
Contributor

Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so it can be used when characters with those codes are used even when not escaped. This also resolves the ambuiguity that allowed the following:

PCRE2 version 10.34 2019-11-21
re> /\p{Cs}/match_invalid_utf
data> \x{dfff}=no_utf_check
No match

Several interrelated bug fixes:

  • An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be allowed with PCRE2_EXTRA_ALLOW_SURROGATES. Before this change non UTF-16 will not allow surrogates if encoded in the pattern directly.
  • In ALT_BSUX mode, any escaped character should represent itself if not "special", but that was not the case for not ASCII characters.
  • When match_invalid_utf is enabled a surrogate character in the subject was being skipped, even if no_utf_check was used together with allow_surrogates

Issues that would be addressed with a new version

  • the documentation around \o and \x, specially when bigger than 255 needs reviewing for clarification.
  • pcre2test support for testing escaped UTF characters is suboptimal, so most testing was done with custom code that should be integrated instead.
  • JIT is broken when match_invalid_utf and allow_surrogate.

Issues that are still open and had been punted from this version:

  • ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF.
  • PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might mean (\u00F1, assume binary matching like it seems to do with surrogates and match with an extended 0xf1 PCRE2_UCHAR, or error), if the later is something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?.
  • PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates.
  • PCRE2_EXTRA_ALT_BSUX should restrict the use of , so for example non ASCII or \U throw an error. It probably need to break with PCRE2_ALT_SUX.
  • Behaviour is different between UTF-16 and UTF-32 but tests are still unified so tests are missing.

Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATESso it can be used when characters with those codes are used even when notescaped.  This also resolves the ambuiguity that allowed the following:  PCRE2 version 10.34 2019-11-21    re> /\p{Cs}/match_invalid_utf  data> \x{dfff}\=no_utf_check  No matchSeveral interrelated bug fixes:* An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be  allowed with PCRE2_EXTRA_ALLOW_SURROGATES.  Before this change non UTF-16  will not allow surrogates if encoded in the pattern directly.* In ALT_BSUX mode, any escaped character should represent itself if not  "special", but that was not the case for not ASCII characters.* When match_invalid_utf is enabled a surrogate character in the subject was  being skipped, even if no_utf_check was used together with allow_surrogatesIssues that would be addressed with a new version* the documentation around \o and \x, specially when bigger than 255 needs  reviewing for clarification.* pcre2test support for testing escaped UTF characters is suboptimal, so most  testing was done with custom code that should be integrated instead.* JIT is broken when match_invalid_utf and allow_surrogate.Issues that are still open and had been punted from this version:* ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF.* PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might  mean (\u00F1, assume binary matching like it seems to do with surrogates  and match with an extended  0xf1 PCRE2_UCHAR, or error), if the later is  something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?.* PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates.* PCRE2_EXTRA_ALT_BSUX should restrict the use of \, so for example non ASCII  or \U throw an error.  It probably need to break with PCRE2_ALT_SUX.* Behaviour is different between UTF-16 and UTF-32 but tests are still unified  so tests are missing.
\x{09f}
No match

/^\p{Cs}/match_invalid_utf,allow_surrogates
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this replicates the behaviour where utf is used instead of match_invalid_utf, by using the "allow_surrogates" to redefine what is valid, but maybe it shouldn't match instead since it is invalid UTF-16?

/* An additional compile options word is available in the compile context. */

#definePCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u/* C */
#definePCRE2_EXTRA_ALLOW_SURROGATES 0x00000001u/* C */
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It might be better to keep the old flag and add PCRE2_EXTRA_ALLOW_SURROGATES, which will imply it and extend it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, I was already going to say that making the non-compatible change of removing an option is not something I'd want to do, especially as there is a way round it, as you suggest.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

... and I note that the GitHub tests have failed...

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

correct, I posted this as a work in progress, to help gather feedback on the approach and caveats, and a roundabout way to ask Zoltan for help implementing the new behaviour (use a flag to redefine what is invalid utf), so that match_invalid_utf and utf behave similarly and resolve the ambiguity when matching \p{Cs} in the JIT, if it was sound and correct for the interpreter.

Will update the change and make the RFC more explicit.

@carenascarenas changed the titlebugfixes for UTF supportbugfixes for UTF support (WIP)Apr 15, 2023
@NWilsonNWilson changed the base branch frommaster tomainNovember 5, 2025 15:08
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@PhilipHazelPhilipHazelPhilipHazel left review comments

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

2 participants

@carenas@PhilipHazel

[8]ページ先頭

©2009-2025 Movatter.jp