Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-89188: addPyUnicode_Data andPyUnicode_GetKind#115211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Closed
davidhewitt wants to merge2 commits intopython:mainfromdavidhewitt:gh-89188

Conversation

@davidhewitt
Copy link
Contributor

@davidhewittdavidhewitt commentedFeb 9, 2024
edited by bedevere-appbot
Loading

This is inspired by#102589 (comment)

The idea is to addPyUnicode_Data andPyUnicode_GetKind to the version-specific API as exported symbols, so that we can use these downstream in PyO3 instead of having a dependency on the C bitfield.

We built this on stream today with help of my chat; it was suggested by@mejrs thatPyUnicode_GetData would be a better name for consistency.

I understood that@encukou didn't want these added to the stable ABI due to concerns about exposing PyUnicode encoding and also risk of dangling pointers. That's fine by me, the main aim here is just to get PyO3 off implementation details.

I guess we'll need tests and documentation too; I can add those if it looks like this API is acceptable to add.

mejrs and soof-golan reacted with rocket emoji
@encukou
Copy link
Member

@encukou didn't want these added to the stable ABI

Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

the main aim here is just to get PyO3 off implementation details.

But what you're exposing hereare implementation details. We don't want to expose any more of them than we already historically have.

A good API for this will need more design. Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

Let's talk privately about how to best do this. If you want to put in more of your time, I'll be happy to mentor or pair-program.


In case other core devs don't agree with my opinion:

  • Public API needs docs and tests.
  • Additions to the C API now go through the C API WG (which I'm part of; as above, I doubt it'll approve this API)

@davidhewitt
Copy link
ContributorAuthor

Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

No fun spoiled at all! Regardless of the acceptance of this patch, it was nice to do this on stream so that others could see some of the steps to contributing a patch to CPython (with a bit of PyO3-based testing thrown in), and also so that folks could correct me where my C knowledge fell short. 😄

I had understood you that these are definitely not candidates for inclusion in the stable ABI, and that there are good reasons to not want these APIs as they are at all.

Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

My main worry is that it's fragile to the point of bordering on unsound, for reasons related to the bitfield discussed in#89188. The nuclear option is to completely deprecate this API on the PyO3 side and refuse to expose it to users, though it would be nice to find a more gentle middle ground, especially as that doesn't actually prevent users from trying this themselves, which might be worse.

For what it's worth, I did a little bit of digging to see what happens downstream of PyO3 in other Rust code related to these unicode objects. Here's a couple of use cases:

  • reading fromdata in tests in PyOxidizer - I guess@indygreg it would be helpful to understand why it was useful to check the encoded result, rather than trying to read out the data as utf8 again?
  • writing todata in orjson - this doesn't go viaPyUnicode_DATA but instead casts and looks past the end of the structure and then writes directly to the encoding. TBH this looks pretty extreme, I assume it was done for performance?
    • I would have hoped thatPyUnicode_FromUtf8AndSize (which PyO3 uses for its string bindings) would be good enough.

Let's talk privately about how to best do this.

👍 let's aim to catch up soon and circle back here.

@encukou
Copy link
Member

My main issue is that this API encourages users to rely on the detail that they can “borrow” data in one of the formats in the currentPyUnicode_KIND enum. But,future PyUnicode representations might not have any of them available for borrowing.

check the encoded result

The elements of Python strings are any numbers inrange(0x110000). Not all of these are valid Unicode codepoints. In particular, surrogates (0xD800-0xDFFF) are allowed instr but are not valid Unicode strings. AFAIK (do check this!), CPython's internal “UTF-8” buffer actually uses a (straightforward)extension of UTF-8. The user-facing Python methods do checks:

>>> chr(0xD800)'\ud800'>>> chr(0xD800).encode('utf-8')Traceback (most recent call last):  File "<stdin>", line 1, in <module>UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

(And yes, the docs aren't clear on what's supposed to be valid Unicode/UTF-8 and what's not. Generally, though, we don't check.)

orjson

I considerwriting the data an entirely separate use case fromreading it. It should have its own API.

@vstinner
Copy link
Member

See also withdrawnPEP 756 "Add PyUnicode_Export() and PyUnicode_Import() C functions".

@vstinner
Copy link
Member

vstinner commentedJan 28, 2025
edited
Loading

I wrote an alternative implementation which keeps PyUnicode_KIND() and PyUnicode_DATA() names:#129412. Just implement them as function in addition the macros.

@vstinner
Copy link
Member

I wrote an alternative implementation which keeps PyUnicode_KIND() and PyUnicode_DATA() names:#129412. Just implement them as function in addition the macros.

I merged PRgh-129412 and so close this one.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

No reviews

Assignees

No one assigned

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

4 participants

@davidhewitt@encukou@vstinner@Eclips4

[8]ページ先頭

©2009-2025 Movatter.jp