gh-133968: Add PyUnicodeWriter_WriteASCII() function #133973
Conversation
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().

JSON benchmark: #133832 (comment)

Benchmark hidden because not significant (11): encode 100 floats, encode ascii string len=100, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode escaped string len=896, encode 10000 integers, encode 10000 floats, encode 10000 "ascii" strings.

Up to 1.20x faster to encode booleans is interesting, knowing that these strings are very short: "true" (4 characters) and "false" (5 characters).
@serhiy-storchaka: What do you think of this function?
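For context, a hedged sketch of how the proposed function might be used (based on the public PyUnicodeWriter API added in CPython 3.14; the JSON-flavored output string is purely illustrative and error handling is abbreviated):

```c
#include <Python.h>

/* Sketch: build the string "[true]" with the writer, using
 * PyUnicodeWriter_WriteASCII() for literals that are known to be
 * pure ASCII, so no non-ASCII scan is needed. */
static PyObject *
build_example(void)
{
    PyUnicodeWriter *writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
    if (PyUnicodeWriter_WriteASCII(writer, "[", 1) < 0
        || PyUnicodeWriter_WriteASCII(writer, "true", 4) < 0
        || PyUnicodeWriter_WriteASCII(writer, "]", 1) < 0)
    {
        PyUnicodeWriter_Discard(writer);
        return NULL;
    }
    return PyUnicodeWriter_Finish(writer);
}
```

The contract is the caller's promise that the bytes really are ASCII; passing non-ASCII input here would be undefined behavior, which is exactly what lets the scan be skipped.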
Some nits
Well, we had […]. But […]. We can add private […].
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
I don't think that it can become as fast as or faster than a function which takes an ASCII string as argument. If we know that the input string is ASCII, there is no need to scan the string for non-ASCII characters, and we can take the fast path. You're right that the UTF-8 decoder is already highly optimized.
In short: it's hard to beat […].
Yes, although it was close, at least for moderately large strings. Could it be optimized even more? I don't know. But the decision about […]
I created the capi-workgroup/decisions#65 issue.
Benchmark:

On long strings (10,000 bytes), PyUnicodeWriter_WriteASCII() is up to 2x faster (1.36 us => 690 ns) than PyUnicodeWriter_WriteUTF8().

```python
from _testcapi import PyUnicodeWriter
import pyperf

range_100 = range(100)

def bench_write_utf8(text, size):
    writer = PyUnicodeWriter(0)
    for _ in range_100:
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)
        writer.write_utf8(text, size)

def bench_write_ascii(text, size):
    writer = PyUnicodeWriter(0)
    for _ in range_100:
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)
        writer.write_ascii(text, size)

runner = pyperf.Runner()
sizes = (10, 100, 1_000, 10_000)
for size in sizes:
    text = b'x' * size
    runner.bench_func(f'write_utf8 size={size:,}', bench_write_utf8,
                      text, size, inner_loops=1_000)
for size in sizes:
    text = b'x' * size
    runner.bench_func(f'write_ascii size={size:,}', bench_write_ascii,
                      text, size, inner_loops=1_000)
```
Do we know where the bottleneck is for long strings?
WriteUTF8() has to check for non-ASCII characters: this check has a cost. That's the bottleneck.
Maybe, I don't know if it would be faster.
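To illustrate the cost being discussed, here is a hedged, scalar sketch of the extra work a UTF-8 path must do before it can take the all-ASCII fast path (this is not CPython's actual loop, which processes multiple bytes at a time; `is_ascii` is a hypothetical helper):

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if every byte in s[0..n) is < 0x80, else 0.
 * WriteUTF8() must do a scan like this on every call before it can
 * treat the input as Latin-1-compatible ASCII; WriteASCII() skips it
 * entirely because the caller guarantees the property. */
static int
is_ascii(const uint8_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0x80) {
            return 0;  /* non-ASCII byte: full UTF-8 decoding needed */
        }
    }
    return 1;  /* all ASCII: fast path possible */
}
```

The scan is O(n) in the input length, which matches the observation that the gap between the two functions grows with string size.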
I tried, but failed, to modify the code to copy while reading (checking whether the string is ASCII while copying it). The code is quite complicated.
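The "copy while reading" idea mentioned above could look something like this hedged sketch (a hypothetical helper, not CPython's code): fuse the ASCII check and the copy into one pass, and report how far it got so the caller can fall back to the full UTF-8 path for the remainder.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy bytes from src to dst while they are ASCII (< 0x80).
 * Returns the number of bytes copied; a return value smaller than n
 * means a non-ASCII byte was hit at that index and the caller must
 * switch to the slower UTF-8 path for src[result..n). */
static size_t
copy_while_ascii(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (src[i] >= 0x80) {
            break;  /* stop at the first non-ASCII byte */
        }
        dst[i] = src[i];
    }
    return i;
}
```

This saves one pass over the data in the common all-ASCII case, at the price of threading the partial-progress bookkeeping through the existing writer code, which is presumably where the complexity mentioned above comes from.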
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
I'm happy to have this function public. I always preferred using the faster versions of the writer API when I hardcoded strings, but they were private.
Sorry for the late review, LGTM as well.
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().
📚 Documentation preview 📚: https://cpython-previews--133973.org.readthedocs.build/