NotificationsYou must be signed in to change notification settings
Fork33.3k
Star69.7k

gh-89188: add`PyUnicode_Data` and`PyUnicode_GetKind`#115211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Closed

davidhewitt wants to merge2 commits intopython:mainfromdavidhewitt:gh-89188

Closed

gh-89188: add`PyUnicode_Data` and`PyUnicode_GetKind`#115211

davidhewitt wants to merge2 commits intopython:mainfromdavidhewitt:gh-89188

Conversation

Copy link

Contributor

davidhewitt commentedFeb 9, 2024•
edited by bedevere-appbot
Loading

This is inspired by#102589 (comment)

The idea is to addPyUnicode_Data andPyUnicode_GetKind to the version-specific API as exported symbols, so that we can use these downstream in PyO3 instead of having a dependency on the C bitfield.

We built this on stream today with help of my chat; it was suggested by@mejrs thatPyUnicode_GetData would be a better name for consistency.

I understood that@encukou didn't want these added to the stable ABI due to concerns about exposing PyUnicode encoding and also risk of dangling pointers. That's fine by me, the main aim here is just to get PyO3 off implementation details.

I guess we'll need tests and documentation too; I can add those if it looks like this API is acceptable to add.

Issue:Reliance on C bit fields in C API is undefined behavior #89188

pythongh-89188: addPyUnicode_DataandPyUnicode_GetKind

3279b86

bedevere-appbot added the awaiting review label

Feb 9, 2024

bedevere-appbot mentioned this pull request

Feb 9, 2024

Reliance on C bit fields in C API is undefined behavior#89188

Closed

📜🤖 Added by blurb_it.

0ed9e4d

Copy link

Member

encukou commentedFeb 9, 2024

@encukou didn't want these added to the stable ABI

Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

the main aim here is just to get PyO3 off implementation details.

But what you're exposing hereare implementation details. We don't want to expose any more of them than we already historically have.

A good API for this will need more design. Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

Let's talk privately about how to best do this. If you want to put in more of your time, I'll be happy to mentor or pair-program.

In case other core devs don't agree with my opinion:

Public API needs docs and tests.
Additions to the C API now go through the C API WG (which I'm part of; as above, I doubt it'll approve this API)

dave-shawley mentioned this pull request

Feb 9, 2024

gh-115165: Fix Annotated for immutable types#115213

Merged

davidhewitt mentioned this pull request

Feb 9, 2024

gh-89188: replace bitfield with struct fields in PyASCIIObject#102589

Closed

Copy link

ContributorAuthor

davidhewitt commentedFeb 9, 2024

Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

No fun spoiled at all! Regardless of the acceptance of this patch, it was nice to do this on stream so that others could see some of the steps to contributing a patch to CPython (with a bit of PyO3-based testing thrown in), and also so that folks could correct me where my C knowledge fell short. 😄

I had understood you that these are definitely not candidates for inclusion in the stable ABI, and that there are good reasons to not want these APIs as they are at all.

Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

My main worry is that it's fragile to the point of bordering on unsound, for reasons related to the bitfield discussed in#89188. The nuclear option is to completely deprecate this API on the PyO3 side and refuse to expose it to users, though it would be nice to find a more gentle middle ground, especially as that doesn't actually prevent users from trying this themselves, which might be worse.

For what it's worth, I did a little bit of digging to see what happens downstream of PyO3 in other Rust code related to these unicode objects. Here's a couple of use cases:

reading fromdata in tests in PyOxidizer - I guess@indygreg it would be helpful to understand why it was useful to check the encoded result, rather than trying to read out the data as utf8 again?
writing todata in orjson - this doesn't go viaPyUnicode_DATA but instead casts and looks past the end of the structure and then writes directly to the encoding. TBH this looks pretty extreme, I assume it was done for performance?
- I would have hoped thatPyUnicode_FromUtf8AndSize (which PyO3 uses for its string bindings) would be good enough.

Let's talk privately about how to best do this.

👍 let's aim to catch up soon and circle back here.

Eclips4 added the topic-C-API label

Feb 10, 2024

Copy link

Member

encukou commentedFeb 12, 2024

My main issue is that this API encourages users to rely on the detail that they can “borrow” data in one of the formats in the currentPyUnicode_KIND enum. But,future PyUnicode representations might not have any of them available for borrowing.

check the encoded result

The elements of Python strings are any numbers inrange(0x110000). Not all of these are valid Unicode codepoints. In particular, surrogates (0xD800-0xDFFF) are allowed instr but are not valid Unicode strings. AFAIK (do check this!), CPython's internal “UTF-8” buffer actually uses a (straightforward)extension of UTF-8. The user-facing Python methods do checks:

>>> chr(0xD800)'\ud800'>>> chr(0xD800).encode('utf-8')Traceback (most recent call last):  File "<stdin>", line 1, in <module>UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

(And yes, the docs aren't clear on what's supposed to be valid Unicode/UTF-8 and what's not. Generally, though, we don't check.)

orjson

I considerwriting the data an entirely separate use case fromreading it. It should have its own API.

Copy link

Member

vstinner commentedJan 27, 2025

See also withdrawnPEP 756 "Add PyUnicode_Export() and PyUnicode_Import() C functions".

Copy link

Member

vstinner commentedJan 28, 2025•
edited
Loading

I wrote an alternative implementation which keeps PyUnicode_KIND() and PyUnicode_DATA() names:#129412. Just implement them as function in addition the macros.

Copy link

Member

vstinner commentedJan 30, 2025

I wrote an alternative implementation which keeps PyUnicode_KIND() and PyUnicode_DATA() names:#129412. Just implement them as function in addition the macros.

I merged PRgh-129412 and so close this one.

vstinner closed this

Jan 30, 2025

Labels

awaiting review topic-C-API

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

gh-89188: add`PyUnicode_Data` and`PyUnicode_GetKind`#115211

gh-89188: add`PyUnicode_Data` and`PyUnicode_GetKind`#115211

Uh oh!

Conversation

davidhewitt commentedFeb 9, 2024•
edited by bedevere-appbot
Loading

Uh oh!