Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32.4k
gh-110289: C API: Add PyUnicode_EqualToUTF8() function#110297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
gh-110289: C API: Add PyUnicode_EqualToUTF8() function#110297
Uh oh!
There was an error while loading.Please reload this page.
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
PyUnicode_EqualToString() inline UTF-8 encoder is hard for review for me right now, I would feel more comfortable with tests, especially on corner cases:
- string not encoded to UTF-8
- Evil surrogate characters
Uh oh!
There was an error while loading.Please reload this page.
Objects/unicodeobject.c Outdated
assert(str); | ||
if (PyUnicode_IS_ASCII(unicode)) { | ||
size_t len = (size_t)PyUnicode_GET_LENGTH(unicode); | ||
return strlen(str) == len && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would prefer to test the length first, to make the code more readable.
Like:
if (strlen(str) == len) { return 1;}return memcmp(...);
Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It is the same in_PyUnicode_EqualToASCIIString()
.
How
if (!a) {return0;}returnb;
is more readable than simplereturn a && b;
? It is what the&&
operator for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
is more readable than simple return a && b;?
For me, it's easier to reason about a single test per line when I review code.
Keepa && b
if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The readability problem as I see it, is that your&&
use has side effects; it is not a pure logic expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
For me, it's easier to reason about a single test per line when I review code.
Fortunately, every condition here is already on a separate line.
The readability problem as I see it, is that your
&&
use has side effects; it is not a pure logic expression.
It is how&&
works in C. There is a lot of code likearg != NULL and PyDict_Check(arg) && PyDict_GET_SIZE(arg) > count
. I do not think rewriting it in threeif
s withgoto
s can improve readability.
Suggestion for a different function name to avoid any confusion... and make it shorter: |
I considered two variants: |
Uh oh!
There was an error while loading.Please reload this page.
Doc/c-api/unicode.rst Outdated
Compare a Unicode object with a UTF-8 encoded C string and return true | ||
if they are equal and false otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Compare a Unicode object with a UTF-8 encoded C string and returntrue | |
if they are equaland false otherwise. | |
Compare a Unicode object with a UTF-8 encoded C string and returnnon-zero | |
if they are equalor 0 otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It looks to me, that "return true" is more often used than "return non-zero". In this case it is more accurate, because it always returns 1, not other non-zero value. Perhaps other functions which return non-zero was a macro that returned not 1 (something like(arg->flags & FLAG)
)?
Doc/whatsnew/3.13.rst Outdated
a :c:expr:`const char*` UTF-8 encoded bytes string and return true if they | ||
are equal or false otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
a:c:expr:`const char*` UTF-8 encoded bytes string and returntrue if they | |
are equal orfalse otherwise. | |
a:c:expr:`const char*` UTF-8 encoded bytes string and returnnon-zero if they | |
are equal or0 otherwise. |
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Objects/unicodeobject.c Outdated
} | ||
else if (ch < 0x800) { | ||
if ((0xc0 | (ch >> 6)) != (unsigned char)*str++ || | ||
(0x80 | (ch & 0x3f)) != (unsigned char)*str++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
unsigned char byte1 = (0xc0 | (ch >> 6));unsigned char byte2 = (0x80 | (ch & 0x3f));if (str[0] != byte1 || str[1] != byte2) return 0;
And declare astr variable asunsigned char*
once to avoid casting str at each byte comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
If the first comparison fails, you do not need to calculate the second byte. The code looks more compact and uniform in the way it is written right now. All expressions I copied from the UTF-8 encoder which I wrote 11 years ago, so no need to recheck them. Casting to unsigned char is not a large burden, but if you prefer, I can introduce a newunsigned char*
variable.
Co-authored-by: Victor Stinner <vstinner@python.org>
Uh oh!
There was an error while loading.Please reload this page.
Doc/c-api/unicode.rst Outdated
@@ -1396,6 +1396,18 @@ They all return ``NULL`` or ``-1`` if an exception occurs. | |||
:c:func:`PyErr_Occurred` to check for errors. | |||
.. c:function:: int PyUnicode_EqualToUTF8(PyObject *unicode, const char *string) | |||
Compare a Unicode object with a UTF-8 encoded C string and return true (``1``) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Compare a Unicode object with a UTF-8 encoded C string and return true (``1``) | |
Compare a Unicode object with a UTF-8 encodedor ASCII encodingC string and return true (``1``) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Maybe "ASCII encoded"?
Objects/unicodeobject.c Outdated
assert(str); | ||
if (PyUnicode_IS_ASCII(unicode)) { | ||
size_t len = (size_t)PyUnicode_GET_LENGTH(unicode); | ||
return strlen(str) == len && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
is more readable than simple return a && b;?
For me, it's easier to reason about a single test per line when I review code.
Keepa && b
if you prefer.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Objects/unicodeobject.c Outdated
assert(str); | ||
if (PyUnicode_IS_ASCII(unicode)) { | ||
size_t len = (size_t)PyUnicode_GET_LENGTH(unicode); | ||
return strlen(str) == len && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The readability problem as I see it, is that your&&
use has side effects; it is not a pure logic expression.
I tried to rewrite the code in more vertically sparse way: intPyUnicode_EqualToUTF8(PyObject*unicode,constchar*str){assert(_PyUnicode_CHECK(unicode));assert(str);if (PyUnicode_IS_ASCII(unicode)) {size_tlen= (size_t)PyUnicode_GET_LENGTH(unicode);if (strlen(str)!=len) {return0; }if (memcmp(PyUnicode_1BYTE_DATA(unicode),str,len)!=0) {return0; }return1; }if (PyUnicode_UTF8(unicode)!=NULL) {size_tlen= (size_t)PyUnicode_UTF8_LENGTH(unicode);if (strlen(str)!=len) {return0; }if (memcmp(PyUnicode_UTF8(unicode),str,len)!=0) {return0; }return1; }constunsignedchar*s= (constunsignedchar*)str;Py_ssize_tlen=PyUnicode_GET_LENGTH(unicode);intkind=PyUnicode_KIND(unicode);constvoid*data=PyUnicode_DATA(unicode);/* Compare Unicode string and UTF-8 string */for (Py_ssize_ti=0;i<len;i++) {Py_UCS4ch=PyUnicode_READ(kind,data,i);if (ch==0) {return0; }elseif (ch<0x80) {if (s[0]!=ch) {return0; }s+=1; }elseif (ch<0x800) {if (s[0]!= (0xc0 | (ch >>6))) {return0; }if (s[1]!= (0x80 | (ch&0x3f))) {return0; }s+=2; }elseif (ch<0x10000) {if (Py_UNICODE_IS_SURROGATE(ch)) {return0; }if (s[0]!= (0xe0 | (ch >>12))) {return0; }if (s[1]!= (0x80 | ((ch >>6)&0x3f))) {return0; }if (s[2]!= (0x80 | (ch&0x3f))) {return0; }s+=3; }else {assert(ch <=MAX_UNICODE);if (s[0]!= (0xf0 | (ch >>18))) {return0; }if (s[1]!= (0x80 | ((ch >>12)&0x3f))) {return0; }if (s[2]!= (0x80 | ((ch >>6)&0x3f))) {return0; }if (s[3]!= (0x80 | (ch&0x3f))) {return0; }s+=4; } }return*s==0;} |
Uh oh!
There was an error while loading.Please reload this page.
Lib/test/test_capi/test_unicode.py Outdated
# CRASHES equaltoutf8(b'abc', b'abc') | ||
# CRASHES equaltoutf8([], b'abc') | ||
# CRASHES equaltoutf8(NULL, b'abc') | ||
# CRASHES equaltoutf8('abc') # NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
# CRASHES equaltoutf8('abc') #NULL | |
# CRASHES equaltoutf8('abc',NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
No, it does not work so.
NULL is defined as None, andequaltoutf8('abc', None)
is a TypeError.
Ifequaltoutf8()
is called with only one argument, it sets the second argument forPyUnicode_EqualToUTF8()
to NULL, so we can test it and ensure that it indeed crashes. It is a common approach used in other tests in this file forconst char *
argument. Some functions do not crash, but raise exception or return successfully for NULL, but this function simply crashes in debug build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Oh ok, I thought that they were just pseudo-code as comments. Sure, you can leave# NULL
if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Hmm, I copied this pattern from the test forPyUnicode_CompareWithASCIIString()
which was one of the first written tests. In newer tests I use "z#" which allows to pass None for NULL. Or perhaps I changed this everywhere except the test forPyUnicode_CompareWithASCIIString()
. So perhaps I can change this too.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Co-authored-by: Victor Stinner <vstinner@python.org>
I prepared a PR to add this function to Python 2.7-3.12 in the pythoncapi-compat project:python/pythoncapi-compat#78 I chose to write a simple implementation:
It's tempting to ask you to modify the API to return -1 on error, but on the other side I hate APIs with simple tasks like "compare two strings" which can fail :-( Most people simply... don't check for errors. So well, I like the propose API, function which cannot fail. |
Oh, other features of this implementation:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM. I just left a few more minor comments.
Uh oh!
There was an error while loading.Please reload this page.
Lib/test/test_capi/test_unicode.py Outdated
# CRASHES equaltoutf8(b'abc', b'abc') | ||
# CRASHES equaltoutf8([], b'abc') | ||
# CRASHES equaltoutf8(NULL, b'abc') | ||
# CRASHES equaltoutf8('abc') # NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Oh ok, I thought that they were just pseudo-code as comments. Sure, you can leave# NULL
if you prefer.
Uh oh!
There was an error while loading.Please reload this page.
encukou commentedOct 5, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
IMO we ned a general strategy around dealing with strings. Let's not solve just for PyUnicode_Equal, but design something that we'll also use for, say, dict and attribute lookup. Having two functions, for for both zero-terminated str and for separate length argument, sounds good to me. And we also want a third one that takes Which of those should be in what kind of C-API? Which should be in stable ABI, which can just be inline functions? What should the naming conventions be? Is the |
The 3 flavors should be exposed as regular function calls. |
Would you mind to elaborate why/how using null terminated C string became a bad thing in 2023? |
"Became a bad thing in 2023" is your interpretation. It hasalways been a design mistake, but it becomes even more glaring when interoperating with other languages which made the correct decision (that is, strings in those languages store their size explicitly). In the distant times when the CPython C API was only called from C software, expecting null-terminated strings was fine, but it's not anymore. |
I don't thin that it's worth to argue. We should just add an API without size, and an API with a size. That's all. The API without size is at least needed to upgrade all users of _PyUnicode_Equal() and _PyUnicode_EqualToASCIIId(), removed in Python 3.13. |
Ah... I'm reassured, thank you. |
Oh, apparently this PR is now discussed athttps://discuss.python.org/t/new-pyunicode-equaltoutf8-function/35377 |
Objects/unicodeobject.c Outdated
s += 4; | ||
} | ||
} | ||
return *s == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I suppose that if we return true at this point then we know thatstr
is the utf8 representation ofunicode
, does it make sense to copy the contents intounicode->utf8
so that future operations can fast-path without needing to encode again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It needs a separate research and discussion. The disadvantage is that it increases the consumed memory size, also it consumes some CPU time, so the benefit will be only if the UTF-8 cache is used in future.
If the idea turned out to be good, it can simply be implemented in the future.
davidhewittOct 5, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Makes total sense. I guess this also sits in an awkward place where it's likely that the user is best suited to know whether or not they want the utf-8 cache populated, but it's also an implementation detail that we don't really want to expose to users.For now I'll just mark this comment as resolved. Edit I can't, probably lack permissions I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
PyUnicode_EqualToUTF8() doesn't raise exception and cannot fail. Trying to allocate memory should not raise memory, but it sounds like a non-trivial side effect.
Worst case: 1 GB string, you call PyUnicode_EqualToUTF8() andsuddenly, Python allocates 1 GB more. I would besurprised by this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Maybe it's worth it to add a comment explaining why we don't cache the UTF-8 encoded string.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Uh oh!
There was an error while loading.Please reload this page.
@@ -1396,18 +1396,28 @@ They all return ``NULL`` or ``-1`` if an exception occurs. | |||
:c:func:`PyErr_Occurred` to check for errors. | |||
.. c:function:: intPyUnicode_EqualToUTF8(PyObject *unicode, const char *string) | |||
.. c:function:: intPyUnicode_EqualToUTF8AndSize(PyObject *unicode, const char *string, Py_ssize_t size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
What do you think about renamingstring
toutf8_str
? Theutf8_
would be another way to document that it's expected to be encoded to UTF-8 and also it's easier (for me) to distinguish that the second argument is a bytes string, sincestring
name is quite generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It is a part of the bigger issue. See#62897.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Co-authored-by: Victor Stinner <vstinner@python.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM the updated PR which now also adds PyUnicode_EqualToUTF8AndSize(). You just have to fix the merge conflict.
So far, I didn't see any real blocker issue in the healthy discussion.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Offending docstrings were removed; dismissing my request for changes
@pitrou, does it look good to you now? |
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
[function.PyUnicode_EqualToUTF8] | ||
added = '3.13' | ||
[function.PyUnicode_EqualToUTF8AndSize] | ||
added = '3.13' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Unrelated question, but is there a plan to generate this file fromDoc/data/stable_abi.dat
or the reverse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think thatDoc/data/stable_abi.dat
is generated fromMisc/stable_abi.toml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Ah, ok, thank you!
I added these functions to pythoncapi-compat:python/pythoncapi-compat@99ab0d3 |
…alToUTF8AndSize() functions (pythonGH-110297)
Uh oh!
There was an error while loading.Please reload this page.
📚 Documentation preview 📚:https://cpython-previews--110297.org.readthedocs.build/