memcollxfrm: Handle above-Unicode code points #22989

Merged

khwilliamson merged 5 commits into Perl:blead from khwilliamson:locale_leak

Feb 20, 2025

Contributor

khwilliamson commented Feb 10, 2025

As stated in the comments added by this commit, it is undefined behavior to call strxfrm() on above-Unicode code points, and especially to call it on Perl's invented extended UTF-8. This commit changes all such input into a legal value, replacing every above-Unicode code point with the highest permanently unassigned code point, U+10FFFF.

This set of changes may require a perldelta entry; please state your opinion.
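To illustrate the replacement described above, here is a minimal C sketch. It is not the code from locale.c. The names clamp_above_unicode and starts_above_unicode are hypothetical, and the sketch assumes well-formed input on an ASCII platform, where a start byte of 0xF5 or higher, or 0xF4 followed by a byte of 0x90 or higher, begins a sequence above U+10FFFF.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+10FFFF, the highest Unicode code point. */
static const unsigned char MAX_UNICODE_UTF8[4] = { 0xF4, 0x8F, 0xBF, 0xBF };

/* Hypothetical helper: nonzero if the sequence starting at s encodes a
 * code point above U+10FFFF.  Assumes s points at a start byte. */
static int starts_above_unicode(const unsigned char *s,
                                const unsigned char *end)
{
    if (s[0] >= 0xF5)
        return 1;   /* 0xF5..0xF7, plus the extended start bytes 0xF8..0xFF */
    if (s[0] == 0xF4 && s + 1 < end && s[1] >= 0x90)
        return 1;   /* F4 90 80 80 is U+110000 */
    return 0;
}

/* Hypothetical helper: copy src to dst, replacing each above-Unicode
 * sequence with the bytes of U+10FFFF.  For well-formed input the output
 * is never longer than the input, because every above-Unicode sequence is
 * at least 4 bytes long.  Returns the number of bytes written. */
size_t clamp_above_unicode(const unsigned char *src, size_t len,
                           unsigned char *dst)
{
    const unsigned char *s = src;
    const unsigned char *end = src + len;
    unsigned char *d = dst;

    while (s < end) {
        if (starts_above_unicode(s, end)) {
            memcpy(d, MAX_UNICODE_UTF8, sizeof MAX_UNICODE_UTF8);
            d += sizeof MAX_UNICODE_UTF8;
            s++;                                   /* skip the start byte */
            while (s < end && (*s & 0xC0) == 0x80)
                s++;                               /* skip continuation bytes */
        }
        else {
            *d++ = *s++;                           /* copy everything else as-is */
        }
    }
    return (size_t)(d - dst);
}
```

With this sketch, the six input bytes 'a', F4 90 80 80 (U+110000), 'b' come out as 'a', F4 8F BF BF, 'b'. The result is then ordinary UTF-8 that could be handed to strxfrm().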

khwilliamson force-pushed the locale_leak branch from 9b053d8 to 7d9b578

February 11, 2025 12:55

tonycoz reviewed

Feb 17, 2025

locale.c (outdated, resolved)

khwilliamson force-pushed the locale_leak branch from 7d9b578 to eb7e387

February 17, 2025 18:56

Contributor

tonycoz commented Feb 17, 2025

That looks more reasonable, though I don't see why the i386 CI is failing; I couldn't reproduce it with a -m32 build on Debian.

Contributor, Author

khwilliamson commented Feb 18, 2025

I have started a smoke-me to see what other platforms may have problems.

I suspect it is something in strcollxfrm. Is there a way to turn on -DLv for that platform?

Contributor

tonycoz commented Feb 18, 2025

> Is there a way to turn on -DLv for that platform?

You could add that to switches for the fresh_perl() call, possibly repeating the call with that switch if it fails without the switch.

khwilliamson added 4 commits

February 18, 2025 10:30

locale.c: Remove useless ++ increment

e0f68da

This value is not going to be used again.  I put in the ++ out of habit.

utf8.h: Split a macro into components

4cdf8da

This creates an internal macro that skips some error checking for use when we don't care if it is completely well-formed or not.

run/locale.t: Add detail to test names

7b6f5fb

run/locale.t: Hoist code out of a block

8cc282a

The next commit will want to use the results later.

khwilliamson force-pushed the locale_leak branch from eb7e387 to bd52076

February 18, 2025 17:30

Contributor, Author

khwilliamson commented Feb 18, 2025

I looked over the code again and realized that it copied, as-is, the initial portion of the string before the first bytes that needed to be translated, but did not advance the destination pointer to account for that, so the translation overwrote the as-is portion. In the other string, no translation was needed, so that string's initial segment was intact and was getting compared with the U+10FFFF. Platforms could differ in how they lexically compare those.
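The pattern described above (copy the untouched prefix, then translate the rest) can be sketched in C as follows. This is a toy illustration, not the locale.c code. The function name copy_prefix_then_translate is hypothetical, and the single-byte substitution is a stand-in for the real translation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src to dst, replacing every byte equal to
 * `bad` with `repl`.  The part of src before the first `bad` byte is
 * copied in one go.  dst needs room for len bytes.  Returns the number
 * of bytes written. */
size_t copy_prefix_then_translate(const char *src, size_t len, char *dst,
                                  char bad, char repl)
{
    /* Locate the first byte that needs translating, if any. */
    const char *first = memchr(src, bad, len);
    size_t prefix_len = first ? (size_t)(first - src) : len;
    char *d = dst;

    /* Copy the prefix as-is ... */
    memcpy(d, src, prefix_len);

    /* ... and advance the destination past it.  Leaving this line out
     * would make the loop below start writing at dst[0] and overwrite
     * the prefix that was just copied. */
    d += prefix_len;

    /* Translate the remainder byte by byte. */
    for (size_t i = prefix_len; i < len; i++)
        *d++ = (src[i] == bad) ? repl : src[i];

    return (size_t)(d - dst);
}
```

In this sketch, translating "ab#cd#" with '#' replaced by '_' yields "ab_cd_". Without the advance, the translated tail would land at the start of the buffer and the "ab" prefix would be lost, which is the kind of overwrite the comment describes.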

Contributor

tonycoz commented Feb 18, 2025

> Platforms could differ in how they lexically compare those

Ideally we'd test the intermediate transformation from perl string to no-NULs-no-extended-UTF-8 form, since that doesn't depend on the underlying locale implementation.

To do that we'd need to split that out into a separate function and export it, but that's not something we've generally done in core perl.

Contributor

tonycoz commented Feb 18, 2025

It could change behaviour, so I think it could use a brief perldelta entry.

tonycoz approved these changes

Feb 18, 2025

mem_collxfrm: Handle above-Unicode code points

535c63e

As stated in the comments added by this commit, it is undefined behavior to call strxfrm() on above-Unicode code points, and especially calling it with Perl's invented extended UTF-8. This commit changes all such input into a legal value, replacing all above-Unicode with the highest permanently unassigned code point, U+10FFFF.

khwilliamson force-pushed the locale_leak branch from bd52076 to 535c63e

February 20, 2025 00:22

khwilliamson merged commit 9ddcbfa into Perl:blead

Feb 20, 2025