Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-133968: Use private unicode writer for json#133832

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
vstinner merged 11 commits intopython:mainfromnineteendo:json-private-unicode-writer
May 31, 2025

Conversation

nineteendo
Copy link
Contributor

@nineteendonineteendo commentedMay 10, 2025
edited
Loading

pyperformance (with--enable-optimizations and--with-lto)

main.json=========Performance version: 1.11.0Python version: 3.15.0a0 (64-bit) revision c600310663Report on macOS-13.7.6-x86_64-i386-64bit-Mach-ONumber of logical CPUs: 8Start date: 2025-05-30 16:28:48.633929End date: 2025-05-30 16:29:21.986698feature.json============Performance version: 1.11.0Python version: 3.15.0a0 (64-bit) revision 566637c24aReport on macOS-13.7.6-x86_64-i386-64bit-Mach-ONumber of logical CPUs: 8Start date: 2025-05-30 16:29:28.723283End date: 2025-05-30 16:29:59.110914### json_loads ###Mean +- std dev: 34.2 us +- 7.3 us -> 30.7 us +- 0.9 us: 1.11x fasterSignificant (t=5.22)

jsonyx-performance-tests (with--enable-optimizations and--with-lto)

decodemainfeaturedifference
Dict with 65,536 booleans12295.20 μs12303.52 μsno difference
List of 65,536 empty strings2220.59 μs1891.46 μs1.17x faster
List of 65,536 ASCII strings7524.63 μs7094.07 μs1.06x faster
List of 65,536 strings168058.21 μs179960.84 μs1.07x slower

@ZeroIntensity
Copy link
Member

What's the point? This just adds more maintenance if we make changes to howPyUnicodeWriter works.

@nineteendo
Copy link
ContributorAuthor

They might have caused a performance regression compared to 3.13:faster-cpython/ideas#726
I'm still benchmarking, but wanted to already run the tests.

@nineteendo
Copy link
ContributorAuthor

cc@vstinner,@mdboom

@nineteendonineteendo marked this pull request as ready for reviewMay 11, 2025 15:38
Copy link
Member

@ZeroIntensityZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think this is a good idea.

  • json optimizations have been rejected in the past--useujson or something like that if performance is critical.
  • If we change howPyUnicodeWriter works, it adds more maintenance, especially if we use this as precedent for doing this elsewhere.
  • We should aim for optimizingPyUnicodeWriter as a broader change, not speed up each individual case by using the private API.

Copy link
Member

@vstinnervstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?

ZeroIntensity reacted with thumbs up emoji
@vstinner
Copy link
Member

See also#133186.

@ZeroIntensity
Copy link
Member

If_PyUnicodeWriter_WriteASCIIString is significantly faster thanPyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

@vstinner
Copy link
Member

If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

I chose to not expose it since it generates an invalid string if the input string contains non-ASCII characters. But yeah, maybe we should expose it. The function only validates the input string in debug mode for best performance.

@nineteendo
Copy link
ContributorAuthor

nineteendo commentedMay 11, 2025
edited
Loading

Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?

Maybe, but my current benchmark has too much overhead to measure this accurately. I'll have to rewrite it first.

I hope we can figure out how to get the performance of the public API very close to the private one, such that everyone feels comfortable using it.

@nineteendo
Copy link
ContributorAuthor

I updated the benchmark, but I don't understand why:

  • writing integers is now 20% faster
  • reading and writing unicode strings is now 2-5% slower (shouldn't be caused by noise)

Does this have something to do with theoverallocate parameter?

@vstinner
Copy link
Member

Does this have something to do with the overallocate parameter?

The private API doesn't enable overallocation by default.

@vstinner
Copy link
Member

I replaced PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString() in Modules/_json.c and ran a benchmark:

Benchmarkrefwrite_ascii
encode 100 booleans9.54 us8.83 us: 1.08x faster
encode 1000 booleans60.8 us53.1 us: 1.15x faster
encode escaped string len=8964.11 us4.10 us: 1.00x faster
encode 10000 booleans569 us487 us: 1.17x faster
encode 10000 integers1.03 ms1.03 ms: 1.00x slower
encode 10000 floats2.11 ms2.13 ms: 1.01x slower
Geometric mean(ref)1.02x faster

Benchmark hidden because not significant (15): encode 100 integers, encode 100 floats, encode 100 "ascii" strings, encode ascii string len=100, encode escaped string len=128, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode Unicode string len=1000, encode 10000 "ascii" strings, encode ascii string len=10000, encode escaped string len=9984, encode Unicode string len=10000

I built Python with./configure && make and used CPU Isolation on Linux.

Benchmark code:

importjsonimportpyperfrunner=pyperf.Runner()forcountin (100,1_000,10_000):runner.bench_func(f'encode{count} booleans',json.dumps, [True,False]* (count//2))runner.bench_func(f'encode{count} integers',json.dumps,list(range(count)))runner.bench_func(f'encode{count} floats',json.dumps, [1.0]*count)runner.bench_func(f'encode{count} "ascii" strings',json.dumps, ['ascii']*count)text='ascii'text*= (count//len(text)or1)runner.bench_func(f'encode ascii string len={len(text)}',json.dumps,text)text=''.join(chr(ch)forchinrange(128))text*= (count//len(text)or1)runner.bench_func(f'encode escaped string len={len(text)}',json.dumps,text)text='abcd€'text*= (count//len(text)or1)runner.bench_func(f'encode Unicode string len={len(text)}',json.dumps,text)

@vstinner
Copy link
Member

I also ran my benchmark on this PR:

Benchmarkrefchange
encode 100 booleans9.53 us6.26 us: 1.52x faster
encode 100 integers13.8 us11.6 us: 1.20x faster
encode 100 floats24.8 us19.0 us: 1.30x faster
encode 100 "ascii" strings17.1 us12.4 us: 1.37x faster
encode ascii string len=100902 ns877 ns: 1.03x faster
encode escaped string len=1281.10 us1.07 us: 1.03x faster
encode Unicode string len=1001.07 us1.04 us: 1.03x faster
encode 1000 booleans58.9 us29.6 us: 1.99x faster
encode 1000 integers103 us81.5 us: 1.26x faster
encode 1000 floats209 us152 us: 1.37x faster
encode 1000 "ascii" strings131 us86.9 us: 1.51x faster
encode ascii string len=10003.48 us3.46 us: 1.00x faster
encode escaped string len=8964.12 us3.96 us: 1.04x faster
encode 10000 booleans546 us257 us: 2.12x faster
encode 10000 integers1.00 ms788 us: 1.27x faster
encode 10000 floats2.04 ms1.46 ms: 1.39x faster
encode 10000 "ascii" strings1.27 ms806 us: 1.57x faster
encode ascii string len=1000028.4 us28.4 us: 1.00x slower
encode escaped string len=998438.5 us36.3 us: 1.06x faster
encode Unicode string len=1000042.4 us43.2 us: 1.02x slower
Geometric mean(ref)1.26x faster

Benchmark hidden because not significant (1): encode Unicode string len=1000

nineteendo reacted with thumbs up emoji

@nineteendo
Copy link
ContributorAuthor

The private API doesn't enable overallocation by default.

Yeah, but both the old code and the public API do, so it's not that. (And you really don't want to turn it off)

encodeoverallocatenormalslow down
List of 65,536 booleans1222.61 μs10456.60 μs8.55x slower
List of 65,536 ints3174.27 μs12961.39 μs4.08x slower
Dict with 65,536 booleans10011.93 μs29166.14 μs2.91x slower
List of 65,536 ASCII strings13303.33 μs23714.00 μs1.78x slower
List of 65,536 floats37103.57 μs48716.57 μs1.31x slower
List of 65,536 strings91757.30 μs113949.94 μs1.24x slower
decodeoverallocatenormalslow down
Dict with 65,536 booleans12194.03 μs12098.16 μs1.01x faster
List of 65,536 ASCII strings7011.86 μs7101.61 μs1.01x slower
List of 65,536 strings36049.66 μs36412.55 μs1.01x slower

@nineteendo
Copy link
ContributorAuthor

nineteendo commentedMay 12, 2025
edited
Loading

This line is inefficient for exact string instances (Py_INCREF is enough):

PyObject*str=PyObject_Str(obj);

encodeprivatepublicslow down
List of 65,536 booleans1217.10 μs1854.86 μs1.52x slower
List of 65,536 ints3190.74 μs3701.18 μs1.16x slower
Dict with 65,536 booleans8783.92 μs11459.45 μs1.30x slower
List of 65,536 ASCII strings12502.92 μs14842.52 μs1.19x slower
List of 65,536 floats37008.47 μs39790.32 μs1.08x slower
List of 65,536 strings90841.32 μs94637.59 μs1.04x slower
decodeprivatepublicslow down
List of 65,536 ASCII strings7064.05 μs7186.32 μs1.02x slower
List of 65,536 strings36033.15 μs36904.06 μs1.02x slower

@nineteendo
Copy link
ContributorAuthor

nineteendo commentedMay 13, 2025
edited
Loading

Here's the comparison with a minimal PR:

encodefullminimalimprovement
List of 65,536 booleans1222.01 μs1201.54 μs1.02x faster
List of 65,536 ints3156.75 μs3047.81 μs1.04x faster
Dict with 65,536 booleans9442.35 μs8518.33 μs1.11x faster
List of 65,536 ASCII strings13405.99 μs12027.66 μs1.11x faster
List of 65,536 floats37037.70 μs36684.89 μs1.01x faster
List of 65,536 strings92007.41 μs87721.40 μs1.05x faster
decodefullminimalimprovement
List of 65,536 ASCII strings7029.30 μs7431.43 μs1.06x slower
List of 65,536 strings36401.90 μs34232.03 μs1.06x faster

@nineteendonineteendo marked this pull request as draftMay 13, 2025 07:31
@nineteendonineteendo marked this pull request as ready for reviewMay 13, 2025 09:13
@vstinner
Copy link
Member

I created issue#133968 to track this work.

@vstinner, could you add a fast path for exact string instances in PyUnicodeWriter_WriteStr()?

I wrote#133969 to add a fast path.

nineteendo reacted with thumbs up emoji

@vstinner
Copy link
Member

I wrote#133969 to add a fast path.

Merged. I confirmed with two benchmarks that this small optimization makes a big difference on some use cases such as encoding short strings in JSON.

@vstinner
Copy link
Member

@ZeroIntensity:

If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

Ok, I created#133973 to add PyUnicodeWriter_WriteASCII().

@nineteendo
Copy link
ContributorAuthor

Ok, I created#133973 to add PyUnicodeWriter_WriteASCII().

If that's merged we would use this aproach in 3.14, right?

@nineteendo
Copy link
ContributorAuthor

@vstinner, it looks like the regression injson.loads() is caused by the heap allocation inPyUnicodeWriter_Create().
I've now delayed the allocation until it's necessary. Thoughts?

@vstinner
Copy link
Member

@vstinner, it looks like the regression in json.loads() is caused by the heap allocation in PyUnicodeWriter_Create().

Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.

@vstinner
Copy link
Member

I've now delayed the allocation until it's necessary. Thoughts?

Would you mind to create a separated PR just for that?

@ZeroIntensity
Copy link
Member

Are the benchmarks creating an unrealistic number of concurrent writers? That would starve the freelist and create some allocation overhead, but only on the benchmarks.

@nineteendo
Copy link
ContributorAuthor

nineteendo commentedMay 16, 2025
edited
Loading

Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.

You're right, it does seem to be using the only entry of the freelist. (I disabled the malloc to check)
There might be some overhead compared to using the stack though.

@vstinner
Copy link
Member

I suggest closing this PR. It's not worth it anymore (according to the benchmark below) and I prefer to stick to the public C API.


I made two small optimizations in the publicPyUnicodeWriter API:

With these optimizations, it seems like this PR is less appealing. I ran a benchmark to compare this PR to the current main branch:

Benchmarkmainpr133832
encode 100 booleans6.52 us6.61 us: 1.01x slower
encode 100 integers11.9 us11.7 us: 1.01x faster
encode 100 floats19.9 us20.7 us: 1.04x slower
encode 100 "ascii" strings13.3 us13.5 us: 1.01x slower
encode ascii string len=100901 ns884 ns: 1.02x faster
encode escaped string len=1281.11 us1.07 us: 1.03x faster
encode 1000 booleans32.6 us31.8 us: 1.03x faster
encode 1000 integers88.3 us83.2 us: 1.06x faster
encode 1000 floats161 us168 us: 1.04x slower
encode 1000 "ascii" strings96.6 us94.1 us: 1.03x faster
encode ascii string len=10003.49 us3.50 us: 1.00x slower
encode escaped string len=8964.14 us3.95 us: 1.05x faster
encode Unicode string len=10004.92 us5.35 us: 1.09x slower
encode 10000 booleans284 us272 us: 1.05x faster
encode 10000 integers850 us797 us: 1.07x faster
encode 10000 floats1.56 ms1.59 ms: 1.02x slower
encode 10000 "ascii" strings897 us857 us: 1.05x faster
encode ascii string len=1000028.5 us29.2 us: 1.02x slower
encode escaped string len=998438.5 us37.2 us: 1.03x faster
encode Unicode string len=1000042.4 us46.8 us: 1.10x slower
Geometric mean(ref)1.00x faster

The best speedup is 1.07x faster for "encode 10000 integers".

The worst slowdown is 1.10x slower for "encode Unicode string len=10000".

Overall, the impact is "1.00x faster" which is not impressive.

ZeroIntensity reacted with thumbs up emoji

@vstinner
Copy link
Member

Hum. I would be interested by a change which would just remove _PyUnicodeWriter_IsEmpty(), without touching WriteUTF8/WriteASCII calls.

@nineteendo
Copy link
ContributorAuthor

I suggest closing this PR. It's not worth it anymore (according to the benchmark below) and I prefer to stick to the public C API.

According to the pyperformance benchmark, json_loads is still 10% slower because of the freelist. And after I've updated the PR, it will only be using the public API.

@nineteendo
Copy link
ContributorAuthor

Done. Decoding empty strings is now 17% faster. Annoyingly, decoding strings with escapes is 7% slower.

@vstinnervstinner merged commitc81446a intopython:mainMay 31, 2025
40 checks passed
@vstinner
Copy link
Member

I merged your change, thanks.

nineteendo reacted with rocket emoji

@nineteendonineteendo deleted the json-private-unicode-writer branchMay 31, 2025 12:07
@nineteendo
Copy link
ContributorAuthor

This still needs to be backported.

@ZeroIntensity
Copy link
Member

Hm, do we backport performance improvements?

@nineteendo
Copy link
ContributorAuthor

We backported the other fixes for the performance regression. See the issue.

@ZeroIntensity
Copy link
Member

I thought the introduction ofWriteASCII fixed the regression.

@vstinner
Copy link
Member

I would prefer to not backport this change.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@vstinnervstinnervstinner left review comments

@ZeroIntensityZeroIntensityZeroIntensity requested changes

Assignees
No one assigned
Labels
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

3 participants
@nineteendo@ZeroIntensity@vstinner

[8]ページ先頭

©2009-2025 Movatter.jp