Use packtab for Unicode table packing #145463
Conversation
python-cla-bot bot commented Mar 3, 2026 • edited
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
behdad commented Mar 3, 2026
Change has little impact on Python users. Thanks.
behdad commented Mar 3, 2026
This basically changes the Unicode data tables' packing from a two-level to a three-level structure. The perf impact should be minimal and is offset by the smaller data size.
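To make the two-level vs. three-level distinction concrete, here is a toy sketch of multi-level table packing. This is not packtab itself (which also searches over block sizes and integer widths); it only illustrates the idea: one helper dedupes fixed-size blocks into an (index, data) pair, and applying it to the index table again yields the three-level structure described above. All names and block sizes here are illustrative.

```python
def pack(values, block_size):
    """One packing level: dedupe fixed-size blocks of `values` into
    (index, data) so that
    values[i] == data[index[i // block_size] * block_size + i % block_size]."""
    seen = {}            # block contents -> block number
    index, data = [], []
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(seen)
            data.extend(block)
        index.append(seen[block])
    return index, data

# Flat per-codepoint table over the full Unicode range, mostly zero.
flat = [0] * 0x110000
for cp in range(0x0041, 0x005B):
    flat[cp] = 1          # stand-in for "uppercase letter" records
for cp in range(0x4E00, 0x9FFF):
    flat[cp] = 2          # stand-in for a large contiguous CJK range

index2, data = pack(flat, 128)    # two-level: one index over the data
index1, mid = pack(index2, 64)    # three-level: pack the index table too

def lookup3(cp):
    """Three-level lookup: two index hops, then the data table."""
    b = cp >> 7                              # block number in the flat table
    i = mid[index1[b >> 6] * 64 + (b & 63)]  # recovers index2[b]
    return data[i * 128 + (cp & 127)]

assert all(lookup3(cp) == flat[cp] for cp in (0x41, 0x61, 0x4E01, 0x10FFFF))
print("flat:", len(flat),
      "two-level:", len(index2) + len(data),
      "three-level:", len(index1) + len(mid) + len(data))
```

Because real Unicode property tables are highly repetitive across the full 0x110000-codepoint range, the index table produced by the first level is itself repetitive, so packing it again trades two extra indexed loads per lookup for a much smaller total footprint.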
Vendor harfbuzz/packtab under Tools/unicode/packtab and use it in Tools/unicode/makeunicodedata.py to generate packed lookup helpers for the main codepoint->record/type maps. Switch unicodedata and unicodectype runtime lookups to those generated helpers.

Measured on macOS arm64 builds:
- python.exe: 6163528 -> 6092632 bytes (-70896, -1.15%)
- unicodedata.so: 772352 -> 722912 bytes (-49440, -6.40%)
- combined shipped: 6935880 -> 6815544 bytes (-120336, -1.73%)
Replace the remaining split-bin Unicode lookup tables in the unicodedata path with packtab-generated helpers for:
- decomposition indexes
- NFC composition pairs
- Unicode name inverse codepoint lookup
- legacy 3.2.0 change indexes

Measured on macOS arm64 builds versus clean HEAD:
- python.exe: 6163528 -> 6092632 bytes (-70896, -1.15%)
- unicodedata.so: 772352 -> 673344 bytes (-99008, -12.82%)
- combined shipped: 6935880 -> 6765976 bytes (-169904, -2.45%)
All Unicode table lookups in this generator now emit packtab-based helpers, so the old splitbins compressor is no longer used. Validated by regenerating Unicode data, rebuilding python.exe and unicodedata.so, and running test_unicodedata and test_tools.
StanFromIreland commented Mar 3, 2026 • edited
Do we even need to vendor it? It is a tool, after all; can we just install it for regeneration? What is the size difference of the files, and do you have benchmarks?
behdad commented Mar 3, 2026
Sure, I vendored it so that the artifacts can be exactly reproduced.

The binary sizes are reported in the opening comment; sloc is a net shrinkage:
behdad commented Mar 3, 2026
I'll get some benchmarks.
Add a small Python-level benchmark under Tools/unicode for comparing unicodedata.category() lookup speed across builds on three fixed workloads: all code points, BMP only, and ASCII only.

Current results from optimized non-debug builds (-O3 -DNDEBUG), comparing clean HEAD vs the packtab branch:
- all: baseline 98.98 ns median, packtab 108.44 ns median
- bmp: baseline 97.44 ns median, packtab 105.01 ns median
- ascii: baseline 83.80 ns median, packtab 82.53 ns median
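A hedged sketch of the kind of microbenchmark described above (the actual script lives under Tools/unicode in the PR; the workload names and function names here are illustrative, not the PR's code):

```python
import statistics
import unicodedata
from time import perf_counter_ns

# Three fixed workloads: all code points, BMP only, ASCII only
# (surrogates excluded, since chr() on them is not a valid scalar value).
WORKLOADS = {
    "ascii": [chr(cp) for cp in range(0x80)],
    "bmp": [chr(cp) for cp in range(0x10000) if not 0xD800 <= cp <= 0xDFFF],
    "all": [chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF],
}

def median_ns_per_lookup(chars, repeats=5):
    """Time unicodedata.category() over one workload; report the median
    per-lookup cost in nanoseconds across `repeats` passes."""
    category = unicodedata.category   # avoid attribute lookup in the loop
    samples = []
    for _ in range(repeats):
        start = perf_counter_ns()
        for ch in chars:
            category(ch)
        samples.append((perf_counter_ns() - start) / len(chars))
    return statistics.median(samples)

for name, chars in WORKLOADS.items():
    print(f"{name}: {median_ns_per_lookup(chars):.2f} ns/lookup")
```

Run it once under each build (clean HEAD and the packtab branch, both optimized non-debug) and compare the per-workload medians; the median is used rather than the mean so one noisy pass does not skew the result.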
behdad commented Mar 3, 2026 • edited
I added a small Python-level unicodedata.category() benchmark under Tools/unicode. Median results on my machine:
So in this Python-level benchmark, the packtab version is slightly faster for ASCII, but slower for BMP/full-Unicode lookups. My current hypothesis is:
That said, many real unicodedata workloads have strong codepoint locality, since Unicode scripts are generally encoded in contiguous ranges. A uniform full-space scan is therefore useful as a stress test, but not necessarily representative of typical text-processing access patterns. The space win is real, but at least in this benchmark it appears to trade some non-ASCII lookup speed for reduced binary size and somewhat better hot-cache behavior on tiny working sets.
StanFromIreland commented Mar 3, 2026 • edited
So, we save ~12% of the file's size, but slow down lookups in some cases by 10%. Additionally, we greatly increase the maintenance burden (I also see the vendored files are already two releases behind?). I'm not convinced this is worth it; it seems to be costing us more than it is saving. -0.5 on this currently.
behdad commented Mar 3, 2026
Thanks for looking.
This vendors packtab under Tools/unicode/packtab and uses it to regenerate CPython's Unicode lookup tables. packtab is a small table-packing generator that emits compact lookup code for large static tables.

This switches the generated unicodectype and unicodedata lookup paths away from the old split-bin tables and over to packtab-generated helpers, including the decomposition, NFC composition, Unicode name inverse, and UCD 3.2.0 change tables.
On a macOS arm64 build, this reduced:
- python.exe: 6163528 -> 6092632 bytes (-70896, -1.15%)
- unicodedata.so: 772352 -> 673344 bytes (-99008, -12.82%)

Tests run:
./python.exe -m test -j0 test_unicodedata test_tools