Use packtab for Unicode table packing #145463
Conversation
python-cla-bot bot commented Mar 3, 2026 • edited
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
behdad commented Mar 3, 2026
Change has little impact on Python users. Thanks.
behdad commented Mar 3, 2026
This basically changes the Unicode data tables' packing from a two-level to a three-level structure. The perf impact should be minimal and is offset by the smaller data size.
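To make the two-level vs. three-level distinction concrete, here is a toy sketch of multi-level table packing. This is not packtab itself (which also searches over block sizes and integer widths); it only illustrates the idea: one helper dedupes fixed-size blocks into an (index, data) pair, and applying it to the index table again yields the three-level structure described above. All names and block sizes here are illustrative.

```python
def pack(values, block_size):
    """One packing level: dedupe fixed-size blocks of `values` into
    (index, data) so that
    values[i] == data[index[i // block_size] * block_size + i % block_size]."""
    seen = {}            # block contents -> block number
    index, data = [], []
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(seen)
            data.extend(block)
        index.append(seen[block])
    return index, data

# Flat per-codepoint table over the full Unicode range, mostly zero.
flat = [0] * 0x110000
for cp in range(0x0041, 0x005B):
    flat[cp] = 1          # stand-in for "uppercase letter" records
for cp in range(0x4E00, 0x9FFF):
    flat[cp] = 2          # stand-in for a large contiguous CJK range

index2, data = pack(flat, 128)    # two-level: one index over the data
index1, mid = pack(index2, 64)    # three-level: pack the index table too

def lookup3(cp):
    """Three-level lookup: two index hops, then the data table."""
    b = cp >> 7                              # block number in the flat table
    i = mid[index1[b >> 6] * 64 + (b & 63)]  # recovers index2[b]
    return data[i * 128 + (cp & 127)]

assert all(lookup3(cp) == flat[cp] for cp in (0x41, 0x61, 0x4E01, 0x10FFFF))
print("flat:", len(flat),
      "two-level:", len(index2) + len(data),
      "three-level:", len(index1) + len(mid) + len(data))
```

Because real Unicode property tables are highly repetitive across the full 0x110000-codepoint range, the index table produced by the first level is itself repetitive, so packing it again trades two extra indexed loads per lookup for a much smaller total footprint.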
Vendor harfbuzz/packtab under Tools/unicode/packtab and use it in Tools/unicode/makeunicodedata.py to generate packed lookup helpers for the main codepoint->record/type maps. Switch unicodedata and unicodectype runtime lookups to those generated helpers.

Measured on macOS arm64 builds:
- python.exe: 6163528 -> 6092632 bytes (-70896, -1.15%)
- unicodedata.so: 772352 -> 722912 bytes (-49440, -6.40%)
- combined shipped: 6935880 -> 6815544 bytes (-120336, -1.73%)
Replace the remaining split-bin Unicode lookup tables in the unicodedata path with packtab-generated helpers for:
- decomposition indexes
- NFC composition pairs
- Unicode name inverse codepoint lookup
- legacy 3.2.0 change indexes

Measured on macOS arm64 builds versus clean HEAD:
- python.exe: 6163528 -> 6092632 bytes (-70896, -1.15%)
- unicodedata.so: 772352 -> 673344 bytes (-99008, -12.82%)
- combined shipped: 6935880 -> 6765976 bytes (-169904, -2.45%)
All Unicode table lookups in this generator now emit packtab-based helpers, so the old splitbins compressor is no longer used. Validated by regenerating Unicode data, rebuilding python.exe and unicodedata.so, and running test_unicodedata and test_tools.
StanFromIreland commented Mar 3, 2026 • edited
Do we even need to vendor it? It is a tool, after all; can we just install it for regeneration? What is the size difference of the files, and do you have benchmarks?
behdad commented Mar 3, 2026
Sure, I vendored it so that the artifacts can be exactly reproduced.

The binary sizes are reported in the opening comment; sloc is a net shrinkage:
behdad commented Mar 3, 2026
I'll get some benchmarks.
Add a small Python-level benchmark under Tools/unicode for comparing unicodedata.category() lookup speed across builds on three fixed workloads: all code points, BMP only, and ASCII only.

Current results from optimized non-debug builds (-O3 -DNDEBUG), comparing clean HEAD vs the packtab branch:
- all: baseline 98.98 ns median, packtab 108.44 ns median
- bmp: baseline 97.44 ns median, packtab 105.01 ns median
- ascii: baseline 83.80 ns median, packtab 82.53 ns median
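A hedged sketch of the kind of microbenchmark described above (the actual script lives under Tools/unicode in the PR; the workload names and function names here are illustrative, not the PR's code):

```python
import statistics
import unicodedata
from time import perf_counter_ns

# Three fixed workloads: all code points, BMP only, ASCII only
# (surrogates excluded, since chr() on them is not a valid scalar value).
WORKLOADS = {
    "ascii": [chr(cp) for cp in range(0x80)],
    "bmp": [chr(cp) for cp in range(0x10000) if not 0xD800 <= cp <= 0xDFFF],
    "all": [chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF],
}

def median_ns_per_lookup(chars, repeats=5):
    """Time unicodedata.category() over one workload; report the median
    per-lookup cost in nanoseconds across `repeats` passes."""
    category = unicodedata.category   # avoid attribute lookup in the loop
    samples = []
    for _ in range(repeats):
        start = perf_counter_ns()
        for ch in chars:
            category(ch)
        samples.append((perf_counter_ns() - start) / len(chars))
    return statistics.median(samples)

for name, chars in WORKLOADS.items():
    print(f"{name}: {median_ns_per_lookup(chars):.2f} ns/lookup")
```

Run it once under each build (clean HEAD and the packtab branch, both optimized non-debug) and compare the per-workload medians; the median is used rather than the mean so one noisy pass does not skew the result.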
behdad commented Mar 3, 2026 • edited
I added a small Python-level unicodedata.category() benchmark under Tools/unicode. Median results on my machine:
So in this Python-level benchmark, the packtab version is slightly faster for ASCII, but slower for BMP/full-Unicode lookups. My current hypothesis is:
That said, many real unicodedata workloads have strong codepoint locality, since Unicode scripts are generally encoded in contiguous ranges. A uniform full-space scan is therefore useful as a stress test, but not necessarily representative of typical text-processing access patterns. The space win is real, but at least in this benchmark it appears to trade some non-ASCII lookup speed for reduced binary size and somewhat better hot-cache behavior on tiny working sets.
StanFromIreland commented Mar 3, 2026 • edited
So, we save ~12% of the file's size, but slow down lookups in some cases by 10%. Additionally, we greatly increase the maintenance burden (I also see the vendored files are already two releases behind?). I'm not convinced this is worth it; it seems to be costing us more than it is saving. -0.5 on this currently.
behdad commented Mar 3, 2026
Thanks for looking.
This vendors packtab under Tools/unicode/packtab and uses it to regenerate CPython's Unicode lookup tables. packtab is a small table-packing generator that emits compact lookup code for large static tables.

This switches the generated unicodectype and unicodedata lookup paths away from the old split-bin tables and over to packtab-generated helpers, including the decomposition, NFC composition, Unicode name inverse, and UCD 3.2.0 change tables.
On a macOS arm64 build, this reduced:
- python.exe: 6163528 -> 6092632 bytes (-70896, -1.15%)
- unicodedata.so: 772352 -> 673344 bytes (-99008, -12.82%)

Tests run:
./python.exe -m test -j0 test_unicodedata test_tools