Member

vstinner left a comment

The inline UTF-8 encoder in PyUnicode_EqualToString() is hard for me to review right now. I would feel more comfortable with tests, especially for corner cases:

- a string that is not encoded as UTF-8
- evil surrogate characters

Doc/c-api/unicode.rst (resolved)

Objects/unicodeobject.c Outdated

		assert(str);
		if (PyUnicode_IS_ASCII(unicode)) {
		    size_t len = (size_t)PyUnicode_GET_LENGTH(unicode);
		    return strlen(str) == len &&

Member

I would prefer to test the length first, to make the code more readable.

Like:

    if (strlen(str) != len) {
        return 0;
    }
    return memcmp(...) == 0;

Same below.

Member, Author

It is the same in _PyUnicode_EqualToASCIIString().

How is

    if (!a) {
        return 0;
    }
    return b;

more readable than a simple return a && b;? That is what the && operator is for.

Member

> is more readable than simple return a && b;?

For me, it's easier to reason about a single test per line when I review code.

Keep a && b if you prefer.

Contributor

The readability problem, as I see it, is that your && use has side effects; it is not a pure logic expression.

Member, Author

> For me, it's easier to reason about a single test per line when I review code.

Fortunately, every condition here is already on a separate line.

> The readability problem as I see it, is that your && use has side effects; it is not a pure logic expression.

That is how && works in C. There is a lot of code like arg != NULL && PyDict_Check(arg) && PyDict_GET_SIZE(arg) > count. I do not think rewriting it as three ifs with gotos can improve readability.
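The two styles under discussion behave identically, because && evaluates its right operand only when the left one is true. A minimal standalone sketch of both follows. It is not the CPython code: the function names and the (data, len) pair standing in for the Unicode object's ASCII buffer are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the ASCII fast path: `data` holds `len`
   bytes and is not necessarily NUL-terminated. */
int
equal_short_circuit(const char *data, size_t len, const char *str)
{
    assert(data != NULL && str != NULL);
    /* memcmp() only runs when the lengths match, so it never reads
       past the end of `str`. */
    return strlen(str) == len &&
           memcmp(data, str, len) == 0;
}

/* The same logic written with one early return per test. */
int
equal_early_return(const char *data, size_t len, const char *str)
{
    assert(data != NULL && str != NULL);
    if (strlen(str) != len) {
        return 0;
    }
    if (memcmp(data, str, len) != 0) {
        return 0;
    }
    return 1;
}
```

Both functions return exactly 0 or 1 for the same inputs; the difference is purely one of layout.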

Member

vstinner commented Oct 3, 2023

Suggestion for a different function name to avoid any confusion... and make it shorter: PyUnicode_EqualToUTF8().

serhiy-storchaka added 2 commits

October 3, 2023 21:20

Add tests and address review comments.

4793161

Merge branch 'main' into capi-PyUnicode_EqualToString

8b24911

Member, Author

serhiy-storchaka commented Oct 3, 2023

I considered two variants: PyUnicode_EqualToUTF8String() and PyUnicode_EqualToString().

serhiy-storchaka changed the title ~~gh-110289: C API: Add PyUnicode_EqualToString() function~~ gh-110289: C API: Add PyUnicode_EqualToUTF8() function

Oct 3, 2023

vstinner reviewed

Doc/c-api/unicode.rst Outdated (resolved)

Doc/c-api/unicode.rst Outdated

Comment on lines 1401 to 1402

		Compare a Unicode object with a UTF-8 encoded C string and return true
		if they are equal and false otherwise.

Member

vstinner, Oct 4, 2023

Suggested change

	- Compare a Unicode object with a UTF-8 encoded C string and return true
	- if they are equal and false otherwise.
	+ Compare a Unicode object with a UTF-8 encoded C string and return non-zero
	+ if they are equal or 0 otherwise.

Member, Author

serhiy-storchaka, Oct 4, 2023

It looks to me that "return true" is used more often than "return non-zero". In this case it is also more accurate, because the function always returns 1, not some other non-zero value. Perhaps the other functions documented as returning non-zero were once macros that could return a value other than 1 (something like (arg->flags & FLAG))?
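The distinction matters mostly for macro-style flag tests. As an illustrative sketch (the struct, the flag and all names below are hypothetical, not taken from CPython), a bit test yields the value of the flag bit, whereas a comparison yields exactly 0 or 1:

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_READY 0x04u              /* hypothetical flag bit */

struct obj { unsigned int flags; };   /* hypothetical object */

/* Macro-style test: non-zero when the flag is set, but the value is
   the flag bit itself (4 here), not 1. */
#define OBJ_IS_READY_RAW(o) ((o)->flags & FLAG_READY)

/* Function-style test: the comparison normalizes the result to 0 or 1. */
int
obj_is_ready(const struct obj *o)
{
    assert(o != NULL);
    return (o->flags & FLAG_READY) != 0;
}
```

Callers that write `if (result)` cannot tell the two apart; callers that write `result == 1` can, which is why the documented wording is worth getting right.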

Doc/whatsnew/3.13.rst Outdated

Comment on lines 1005 to 1006

		a :c:expr:`const char*` UTF-8 encoded bytes string and return true if they
		are equal or false otherwise.

Member

vstinner, Oct 4, 2023

Suggested change

	- a :c:expr:`const char*` UTF-8 encoded bytes string and return true if they
	- are equal or false otherwise.
	+ a :c:expr:`const char*` UTF-8 encoded bytes string and return non-zero if they
	+ are equal or 0 otherwise.

Lib/test/test_capi/test_unicode.py Outdated (resolved)

Doc/c-api/unicode.rst (resolved)

vstinner reviewed

Objects/unicodeobject.c Outdated

		}
		else if (ch < 0x800) {
		    if ((0xc0 | (ch >> 6)) != (unsigned char)*str++ ||
		        (0x80 | (ch & 0x3f)) != (unsigned char)*str++)

Member

vstinner, Oct 4, 2023

    unsigned char byte1 = (0xc0 | (ch >> 6));
    unsigned char byte2 = (0x80 | (ch & 0x3f));
    if (str[0] != byte1 || str[1] != byte2) return 0;

And declare a str variable as unsigned char* once to avoid casting str at each byte comparison.

Member, Author

serhiy-storchaka, Oct 4, 2023

If the first comparison fails, you do not need to calculate the second byte. The code looks more compact and uniform the way it is written right now. I copied all the expressions from the UTF-8 encoder which I wrote 11 years ago, so there is no need to recheck them. Casting to unsigned char is not a large burden, but if you prefer, I can introduce a new unsigned char* variable.
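As a minimal sketch of the pointer-conversion suggestion, assuming a standalone helper (two_byte_equal is a hypothetical name, not part of the PR): converting the pointer once keeps the short-circuit evaluation, so the second byte is still computed only when the first one matches, while dropping the per-byte casts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare a code point in the range U+0080..U+07FF with the two
   UTF-8 bytes found at `str`. */
int
two_byte_equal(uint32_t ch, const char *str)
{
    assert(ch >= 0x80 && ch < 0x800);
    assert(str != NULL);
    /* Convert once: plain `char` may be signed, in which case a byte
       such as 0xC3 reads back as a negative value and never compares
       equal to 0xC3. */
    const unsigned char *s = (const unsigned char *)str;
    /* If s[0] matches it is >= 0xC0, hence not the terminator, so
       reading s[1] stays inside a NUL-terminated string. */
    return s[0] == (0xc0 | (ch >> 6)) &&
           s[1] == (0x80 | (ch & 0x3f));
}
```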

serhiy-storchaka and others added 2 commits

October 4, 2023 10:53

Apply suggestions from code review

c55f9ac

Co-authored-by: Victor Stinner <vstinner@python.org>

Address some of review comments and test the UTF-8 cache.

bdf2f1e

serhiy-storchaka marked this pull request as ready for review

October 4, 2023 08:35

serhiy-storchaka requested review from a team and encukou as code owners

October 4, 2023 08:35

bedevere-appbot added the awaiting review label

vstinner reviewed

Doc/c-api/unicode.rst (resolved)

Doc/c-api/unicode.rst Outdated

		@@ -1396,6 +1396,18 @@ They all return ``NULL`` or ``-1`` if an exception occurs.
		:c:func:`PyErr_Occurred` to check for errors.


		.. c:function:: int PyUnicode_EqualToUTF8(PyObject *unicode, const char *string)

		Compare a Unicode object with a UTF-8 encoded C string and return true (``1``)

Member

vstinner, Oct 4, 2023

Suggested change

	- Compare a Unicode object with a UTF-8 encoded C string and return true (``1``)
	+ Compare a Unicode object with a UTF-8 encoded or ASCII encoding C string and return true (``1``)

Member, Author

serhiy-storchaka, Oct 4, 2023

Maybe "ASCII encoded"?

Lib/test/test_capi/test_unicode.py Outdated (resolved)

erlend-aasland reviewed

Modules/_testcapi/unicode.c Outdated (resolved)

Address review comments.

7223c14

Member, Author

serhiy-storchaka commented Oct 4, 2023

I tried to rewrite the code in a more vertically sparse way:

    int
    PyUnicode_EqualToUTF8(PyObject *unicode, const char *str)
    {
        assert(_PyUnicode_CHECK(unicode));
        assert(str);
        if (PyUnicode_IS_ASCII(unicode)) {
            size_t len = (size_t)PyUnicode_GET_LENGTH(unicode);
            if (strlen(str) != len) {
                return 0;
            }
            if (memcmp(PyUnicode_1BYTE_DATA(unicode), str, len) != 0) {
                return 0;
            }
            return 1;
        }
        if (PyUnicode_UTF8(unicode) != NULL) {
            size_t len = (size_t)PyUnicode_UTF8_LENGTH(unicode);
            if (strlen(str) != len) {
                return 0;
            }
            if (memcmp(PyUnicode_UTF8(unicode), str, len) != 0) {
                return 0;
            }
            return 1;
        }

        const unsigned char *s = (const unsigned char *)str;
        Py_ssize_t len = PyUnicode_GET_LENGTH(unicode);
        int kind = PyUnicode_KIND(unicode);
        const void *data = PyUnicode_DATA(unicode);
        /* Compare Unicode string and UTF-8 string */
        for (Py_ssize_t i = 0; i < len; i++) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            if (ch == 0) {
                return 0;
            }
            else if (ch < 0x80) {
                if (s[0] != ch) {
                    return 0;
                }
                s += 1;
            }
            else if (ch < 0x800) {
                if (s[0] != (0xc0 | (ch >> 6))) {
                    return 0;
                }
                if (s[1] != (0x80 | (ch & 0x3f))) {
                    return 0;
                }
                s += 2;
            }
            else if (ch < 0x10000) {
                if (Py_UNICODE_IS_SURROGATE(ch)) {
                    return 0;
                }
                if (s[0] != (0xe0 | (ch >> 12))) {
                    return 0;
                }
                if (s[1] != (0x80 | ((ch >> 6) & 0x3f))) {
                    return 0;
                }
                if (s[2] != (0x80 | (ch & 0x3f))) {
                    return 0;
                }
                s += 3;
            }
            else {
                assert(ch <= MAX_UNICODE);
                if (s[0] != (0xf0 | (ch >> 18))) {
                    return 0;
                }
                if (s[1] != (0x80 | ((ch >> 12) & 0x3f))) {
                    return 0;
                }
                if (s[2] != (0x80 | ((ch >> 6) & 0x3f))) {
                    return 0;
                }
                if (s[3] != (0x80 | (ch & 0x3f))) {
                    return 0;
                }
                s += 4;
            }
        }
        return *s == 0;
    }

and it causes me dizziness and eye pain. It is physically difficult for me to read.
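For experimenting with this comparison outside CPython, the same idea can be sketched on a plain array of code points. This is an assumption-laden illustration, not the PR's implementation: codepoints_equal_to_utf8 and encode_one are hypothetical names, and for brevity the helper computes every byte of a code point up front instead of short-circuiting per byte as the PR does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the UTF-8 bytes of `ch` into `buf` and
   return how many were written. Returns 0 when `ch` has no UTF-8
   encoding (a surrogate, or a value beyond U+10FFFF). */
static size_t
encode_one(uint32_t ch, unsigned char buf[4])
{
    if (ch < 0x80) {
        buf[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        buf[0] = (unsigned char)(0xc0 | (ch >> 6));
        buf[1] = (unsigned char)(0x80 | (ch & 0x3f));
        return 2;
    }
    if (ch < 0x10000) {
        if (ch >= 0xd800 && ch <= 0xdfff) {
            return 0;
        }
        buf[0] = (unsigned char)(0xe0 | (ch >> 12));
        buf[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
        buf[2] = (unsigned char)(0x80 | (ch & 0x3f));
        return 3;
    }
    if (ch > 0x10ffff) {
        return 0;
    }
    buf[0] = (unsigned char)(0xf0 | (ch >> 18));
    buf[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3f));
    buf[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
    buf[3] = (unsigned char)(0x80 | (ch & 0x3f));
    return 4;
}

/* Return 1 if the `n` code points in `cps` encode to exactly the
   NUL-terminated UTF-8 string `str`, and 0 otherwise. No heap use. */
int
codepoints_equal_to_utf8(const uint32_t *cps, size_t n, const char *str)
{
    assert(cps != NULL || n == 0);
    assert(str != NULL);
    const unsigned char *s = (const unsigned char *)str;
    for (size_t i = 0; i < n; i++) {
        unsigned char buf[4];
        size_t nbytes = encode_one(cps[i], buf);
        /* An embedded NUL can never match a NUL-terminated string. */
        if (cps[i] == 0 || nbytes == 0) {
            return 0;
        }
        /* Every encoded byte is non-zero here, so hitting the
           terminator is a mismatch and we stop before reading past it. */
        for (size_t j = 0; j < nbytes; j++) {
            if (s[j] != buf[j]) {
                return 0;
            }
        }
        s += nbytes;
    }
    return *s == 0;
}
```

The final `*s == 0` check rejects a `str` that merely starts with the encoded text, mirroring the last line of the function quoted above.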

vstinner reviewed

Doc/c-api/unicode.rst (resolved)

Lib/test/test_capi/test_unicode.py Outdated

		# CRASHES equaltoutf8(b'abc', b'abc')
		# CRASHES equaltoutf8([], b'abc')
		# CRASHES equaltoutf8(NULL, b'abc')
		# CRASHES equaltoutf8('abc') # NULL

Member

vstinner, Oct 4, 2023

Suggested change

	- # CRASHES equaltoutf8('abc') # NULL
	+ # CRASHES equaltoutf8('abc', NULL)

Member, Author

serhiy-storchaka, Oct 4, 2023

No, it does not work that way.

NULL is defined as None, and equaltoutf8('abc', None) is a TypeError.

If equaltoutf8() is called with only one argument, it sets the second argument for PyUnicode_EqualToUTF8() to NULL, so we can test it and ensure that it indeed crashes. It is a common approach used in other tests in this file for a const char * argument. Some functions do not crash for NULL, but raise an exception or return successfully; this function simply crashes in a debug build.

Member

vstinner, Oct 4, 2023

Oh ok, I thought that they were just pseudo-code as comments. Sure, you can leave # NULL if you prefer.

Member, Author

serhiy-storchaka, Oct 4, 2023

Hmm, I copied this pattern from the test for PyUnicode_CompareWithASCIIString(), which was one of the first tests written. In newer tests I use "z#", which allows passing None for NULL. Or perhaps I changed this everywhere except the test for PyUnicode_CompareWithASCIIString(). So perhaps I can change this too.

Lib/test/test_capi/test_unicode.py (resolved)

Modules/_testcapi/unicode.c Outdated (resolved)

Lib/test/test_capi/test_unicode.py Outdated (resolved)

Doc/whatsnew/3.13.rst Outdated (resolved)

Doc/c-api/unicode.rst Outdated (resolved)

vstinner mentioned this pull request

Add PyUnicode_EqualToUTF8() function (python/pythoncapi-compat#78)

Merged

Apply suggestions from code review

b271327

Co-authored-by: Victor Stinner <vstinner@python.org>

Member

vstinner commented Oct 4, 2023

I prepared a PR to add this function to Python 2.7-3.12 in the pythoncapi-compat project: python/pythoncapi-compat#78

I chose to write a simple implementation:

    utf8 = PyUnicode_AsUTF8AndSize(unicode, &len);
    if (utf8 == NULL) {
        // Memory allocation failure. The API cannot report error,
        // so clear the exception and return 0.
        PyErr_Clear();
        return 0;
    }

It's tempting to ask you to modify the API to return -1 on error, but on the other hand I hate APIs where simple tasks like "compare two strings" can fail :-( Most people simply... don't check for errors.

So, well, I like the proposed API: a function which cannot fail.

Remove trailing spaces.

6f26ad6

Member, Author

serhiy-storchaka commented Oct 4, 2023

Oh, other features of this implementation:

- It can be called when an error is set, and it preserves that error.
- It does not use the heap, so it can be used when MemoryError has been raised.

vstinner approved these changes

Member

vstinner left a comment

LGTM. I just left a few more minor comments.

Include/unicodeobject.h Outdated (resolved)

Modules/_testcapi/unicode.c Outdated (resolved)

bedevere-appbot added awaiting core review and removed awaiting review labels

Member

encukou commented Oct 5, 2023 (edited)

IMO we need a general strategy around dealing with strings. Let's not solve just for PyUnicode_Equal, but design something that we'll also use for, say, dict and attribute lookup.

Having two functions, one for a zero-terminated str and one with a separate length argument, sounds good to me. And we also want a third one that takes PyUnicode. (Yes, in this case we have it already).

Which of those should be in what kind of C-API? Which should be in stable ABI, which can just be inline functions? What should the naming conventions be? Is the char* const? What's the thread safety story?

Please delay merging until after the sprint -- I hope to come up with a proposal for how to answer questions like that, consistently.

Member

vstinner commented Oct 5, 2023

> Which of those should be in what kind of C-API?

The 3 flavors should be exposed as regular function calls.

Member

vstinner commented Oct 5, 2023

> This is 2023 and null-encoded C strings are definitely not a good idea for new C APIs.

Would you mind to elaborate why/how using null terminated C string became a bad thing in 2023?

Member

pitrou commented Oct 5, 2023

> Would you mind to elaborate why/how using null terminated C string became a bad thing in 2023?

"Became a bad thing in 2023" is your interpretation. It has always been a design mistake, but it becomes even more glaring when interoperating with other languages which made the correct decision (that is, strings in those languages store their size explicitly).

In the distant times when the CPython C API was only called from C software, expecting null-terminated strings was fine, but it's not anymore.

Member

vstinner commented Oct 5, 2023

> In the distant times when the CPython C API was only called from C software, expecting null-terminated strings was fine, but it's not anymore.

I don't think that it's worth to argue. We should just add an API without size, and an API with a size. That's all.

The API without size is at least needed to upgrade all users of _PyUnicode_Equal() and _PyUnicode_EqualToASCIIId(), removed in Python 3.13.
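A rough sketch of how the two flavours can relate, using plain byte buffers instead of Unicode objects (bytes_equal and bytes_equal_and_size are hypothetical names chosen for this illustration): the sized variant can compare data containing embedded NUL bytes, and the NUL-terminated variant is a thin strlen() wrapper over it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sized variant: both lengths are explicit, so embedded NUL bytes
   are ordinary data. */
int
bytes_equal_and_size(const char *data, size_t data_len,
                     const char *str, size_t str_len)
{
    assert(data != NULL && str != NULL);
    return data_len == str_len &&
           memcmp(data, str, str_len) == 0;
}

/* NUL-terminated variant: the length of `str` is whatever strlen()
   finds, so `str` cannot represent text with an embedded NUL. */
int
bytes_equal(const char *data, size_t data_len, const char *str)
{
    assert(str != NULL);
    return bytes_equal_and_size(data, data_len, str, strlen(str));
}
```

Callers coming from languages whose strings carry their own length can use the sized variant directly, without first building a NUL-terminated copy.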

Member

pitrou commented Oct 5, 2023

> I don't think that it's worth to argue.

Ah... I'm reassured, thank you.

Member

vstinner commented Oct 5, 2023

Oh, apparently this PR is now discussed at https://discuss.python.org/t/new-pyunicode-equaltoutf8-function/35377

davidhewitt reviewed

Oct 5, 2023

Objects/unicodeobject.c Outdated

		s += 4;
		}
		}
		return *s == 0;

Contributor

davidhewitt, Oct 5, 2023

I suppose that if we return true at this point, then we know that str is the UTF-8 representation of unicode. Does it make sense to copy the contents into unicode->utf8 so that future operations can take the fast path without needing to encode again?

Member, Author

serhiy-storchaka, Oct 5, 2023

That needs separate research and discussion. The disadvantage is that it increases memory consumption and also costs some CPU time, so there is a benefit only if the UTF-8 cache is used later.

If the idea turns out to be good, it can simply be implemented in the future.

Contributor

davidhewitt, Oct 5, 2023 (edited)

Makes total sense. I guess this also sits in an awkward place where the user is likely best placed to know whether or not they want the UTF-8 cache populated, but it's also an implementation detail that we don't really want to expose to users. ~~For now I'll just mark this comment as resolved.~~ Edit: I can't; I probably lack permissions.

Member

vstinner, Oct 5, 2023

PyUnicode_EqualToUTF8() doesn't raise an exception and cannot fail. Trying to allocate memory should not raise MemoryError, but it sounds like a non-trivial side effect.

Worst case: 1 GB string, you call PyUnicode_EqualToUTF8() and suddenly, Python allocates 1 GB more. I would be surprised by this behavior.

Member

vstinner, Oct 5, 2023

Maybe it's worth it to add a comment explaining why we don't cache the UTF-8 encoded string.

serhiy-storchaka and others added 2 commits

October 5, 2023 22:31

Apply suggestions from code review

ee5781d

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

Add PyUnicode_EqualToUTF8AndSize().

1a4eb7b

vstinner reviewed

Oct 6, 2023

Doc/c-api/unicode.rst Outdated (resolved)

Doc/c-api/unicode.rst

		@@ -1396,18 +1396,28 @@ They all return ``NULL`` or ``-1`` if an exception occurs.
		:c:func:`PyErr_Occurred` to check for errors.


		.. c:function:: int PyUnicode_EqualToUTF8(PyObject *unicode, const char *string)
		.. c:function:: int PyUnicode_EqualToUTF8AndSize(PyObject *unicode, const char *string, Py_ssize_t size)

Member

vstinner, Oct 6, 2023

What do you think about renaming string to utf8_str? The utf8_ prefix would be another way to document that it's expected to be encoded to UTF-8, and it's also easier (for me) to see that the second argument is a bytes string, since the name string is quite generic.

Member, Author

serhiy-storchaka, Oct 7, 2023

It is a part of the bigger issue. See #62897.

Doc/c-api/unicode.rst (resolved)

Lib/test/test_capi/test_unicode.py Outdated (resolved)

Objects/unicodeobject.c Outdated (resolved)

serhiy-storchaka and others added 4 commits

October 7, 2023 15:43

Apply suggestions from code review

b124377

Co-authored-by: Victor Stinner <vstinner@python.org>

Add more parentheses.

029f1a0

Remove redundant arguments.

be2ffe8

Merge branch 'main' into capi-PyUnicode_EqualToString

29b26f7

vstinner approved these changes

Oct 7, 2023

Member

vstinner left a comment

LGTM the updated PR which now also adds PyUnicode_EqualToUTF8AndSize(). You just have to fix the merge conflict.

So far, I didn't see any real blocker issue in the healthy discussion.

erlend-aasland previously requested changes

Oct 10, 2023

Lib/test/test_capi/test_unicode.py Outdated (resolved)

serhiy-storchaka added 2 commits

October 10, 2023 23:34

Turn docstrings into comments.

78de49d

Merge branch 'main' into capi-PyUnicode_EqualToString

fc79d5e

erlend-aasland dismissed their stale review

October 10, 2023 20:46

Offending docstrings were removed; dismissing my request for changes

Member, Author

serhiy-storchaka commented Oct 11, 2023

@pitrou, does it look good to you now?

pitrou reviewed

Lib/test/test_capi/test_unicode.py (resolved)

pitrou reviewed

Lib/test/test_capi/test_unicode.py (resolved)

pitrou reviewed

Misc/stable_abi.toml

		[function.PyUnicode_EqualToUTF8]
		added = '3.13'
		[function.PyUnicode_EqualToUTF8AndSize]
		added = '3.13'

Member

pitrou, Oct 11, 2023

Unrelated question, but is there a plan to generate this file from Doc/data/stable_abi.dat or the reverse?

Member, Author

serhiy-storchaka, Oct 11, 2023

I think that Doc/data/stable_abi.dat is generated from Misc/stable_abi.toml.

Member

pitrou, Oct 11, 2023

Ah, ok, thank you!

Add tests for empty strings.

19ad126

serhiy-storchaka merged commit eb50cd3 into python:main

bedevere-appbot removed the awaiting core review label