memcollxfrm: Handle above-Unicode code points #22989
That looks more reasonable, though I don't see why the i386 CI is failing, I couldn't reproduce it with a |
I have started a smoke-me to see what other platforms may have problems. I suspect it is something in strcollxfrm. Is there a way to turn on -DLv for that platform?
You could add that to |
This value is not going to be used again. I put in the ++ out of habit.
This creates an internal macro that skips some error checking, for use when we don't care if it is completely well-formed or not.
The next commit will want to use the results later.
I looked over the code again, and realized that it copied as-is the initial portion of the string before the first bytes that needed to be translated, but did not advance the destination pointer to account for that, so that the translation overwrote the as-is portion. In the other string, no translation was needed, so the string's initial segment was intact, and was getting compared with the 10FFFF. Platforms could differ in how they lexically compare those |
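The pointer bug described above can be illustrated with a minimal sketch. This is not the code from the PR: `sanitize_code_points` is a hypothetical helper, and it works on an array of already-decoded code points rather than on Perl's UTF-8 bytes, purely to keep the example short. The point is the step that was missing: after the untouched prefix is copied as-is, the destination pointer has to move past it before translation begins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_UNICODE 0x10FFFFu   /* highest Unicode code point */

/* Hypothetical helper (not from the PR): copy `len` code points from `src`
 * to `dst`, replacing anything above Unicode with U+10FFFF.
 * `dst` must have room for `len` elements. */
static void sanitize_code_points(const uint32_t *src, size_t len, uint32_t *dst)
{
    size_t prefix = 0;
    uint32_t *d;
    size_t i;

    assert(src != NULL && dst != NULL);

    /* Find the initial portion that needs no translation. */
    while (prefix < len && src[prefix] <= MAX_UNICODE)
        prefix++;

    /* Copy that portion as-is ... */
    memcpy(dst, src, prefix * sizeof *src);

    /* ... and advance the destination past it. Starting the translation
     * at `dst` instead would overwrite the prefix just copied. */
    d = dst + prefix;

    /* Translate the remainder. */
    for (i = prefix; i < len; i++)
        *d++ = (src[i] > MAX_UNICODE) ? MAX_UNICODE : src[i];
}
```

With input `{ 'A', 'B', 0x110000, 'C' }` this yields `{ 'A', 'B', 0x10FFFF, 'C' }`. Without the `dst + prefix` step the translated tail would land on top of `'A'` and `'B'`.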
Ideally we'd test the intermediate transformation from perl string to no-NULs-no-extended-UTF-8 form, since that doesn't depend on the underlying locale implementation. To do that we'd need to split that out into a separate function and export it, but that's not something we've generally done in core perl.
Since it could change behaviour, I think it could use a brief perldelta entry.
As stated in the comments added by this commit, it is undefined behavior to call strxfrm() on above-Unicode code points, and especially calling it with Perl's invented extended UTF-8. This commit changes all such input into a legal value, replacing all above-Unicode with the highest permanently unassigned code point, U+10FFFF.
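As a rough illustration of the replacement at the byte level, here is a minimal sketch, not the implementation in this commit. `clamp_above_unicode` is a hypothetical name. It assumes the input is well-formed (possibly extended) UTF-8, and it treats a character as above-Unicode when its lead byte is 0xF5 or higher, or when the lead byte is 0xF4 and the next byte is 0x90 or higher (that is, U+110000 and up). Instead of relying on a sequence-length table, it skips continuation bytes to find the end of each character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the commit): copy `len` bytes of UTF-8 from
 * `src` to `dst`, replacing every above-Unicode character with U+10FFFF.
 * Returns the number of bytes written. Well-formed input never grows,
 * because every above-Unicode character occupies at least 4 bytes; a
 * caller worried about malformed input can allocate 4 * len bytes. */
static size_t clamp_above_unicode(const unsigned char *src, size_t len,
                                  unsigned char *dst)
{
    /* UTF-8 encoding of U+10FFFF */
    static const unsigned char max_cp[4] = { 0xF4, 0x8F, 0xBF, 0xBF };
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);

    while (i < len) {
        size_t start = i;
        unsigned char lead = src[i++];
        int above;

        /* Consume the continuation bytes (10xxxxxx) of this character. */
        while (i < len && (src[i] & 0xC0) == 0x80)
            i++;

        above = lead >= 0xF5
             || (lead == 0xF4 && i - start > 1 && src[start + 1] >= 0x90);

        if (above) {
            memcpy(dst + out, max_cp, sizeof max_cp);
            out += sizeof max_cp;
        } else {
            memcpy(dst + out, src + start, i - start);
            out += i - start;
        }
    }
    return out;
}
```

The result contains only code points up to U+10FFFF, so it is at least a legal Unicode string by the time it would be handed to `strxfrm()`. Removing NULs, which the conversation also mentions, is outside the scope of this sketch.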