gh-143196: fix heap-buffer-overflow in JSON encoder indent cache #143246
Conversation
- Validate that `_current_indent_level` is 0 when indent is set, preventing uninitialized cache access and re-entrant `__mul__` attacks
- Convert integer indent to a string of spaces in `encoder_new`, matching the Python-level behavior
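For reference, the Python-level behavior being matched is roughly the following (a paraphrased sketch of the indent normalization in `Lib/json/encoder.py`, not a verbatim copy):

```python
def normalize_indent(indent):
    # Sketch: an int indent becomes a string of that many spaces;
    # a str (and None) passes through unchanged.
    if indent is not None and not isinstance(indent, str):
        indent = ' ' * indent
    return indent
```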
python-cla-bot commented Dec 28, 2025 • edited
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
picnixz commented Dec 28, 2025 • edited
I suspect this PR has been generated by an LLM, or at least the description has. Can you confirm this? Because it's not a security issue. In addition, please read our policy on when using an LLM is acceptable: https://devguide.python.org/getting-started/generative-ai/ (if English isn't your first language, it's fine, but please indicate whether an LLM has been used and how).
caverac commented Dec 28, 2025
@picnixz I'm sorry, I always considered a heap-buffer-overflow an exploitable issue. But I changed the language to "bug", which is perhaps more aligned with the reported issue. Is that ok?
picnixz commented Dec 28, 2025
Bugs like that only become security issues when they are exploitable, and we need to assess how feasible it is for an adversary to exploit them. In our case, the attack surface is very small and relies on invoking "semi-public" functions (and even then it needs to be exploitable! Usually there is just a hard crash, which could be considered a DoS in some sense if it's serving a web app with auto-reload, for instance). Anyway, now I'd like you to confirm whether this has been generated by an LLM or not (and if so, which part). I'm sorry to be doubtful, but because of the emergence of LLMs we have many new contributors that simply use an LLM to generate their PRs, and that is unacceptable.
caverac commented Dec 28, 2025
@picnixz The PR description was proofread with ChatGPT, but if that represents a violation I'm ok with closing the PR.
picnixz commented Dec 28, 2025 • edited
If it's only the description, it's fine. But in the future, you don't need to have an LLM generate the description. We don't really care which files were modified, for instance (though it's good to know if breaking changes were introduced) or whether tests were added (we expect them to be added!). Affected versions are already mentioned on the issue as well.
Reviewed diff in `Modules/_json.c`:

```c
}
/* Convert indent to str if it's not None or already a string */
if (indent != Py_None && !PyUnicode_Check(indent)) {
```
Instead of these complex checks (argument 1 is also checked), I'd suggest that we convert `__new__` to Argument Clinic. However, I'm not entirely sure that we won't lose performance. How many times do we call this function?
Performance considerations
For context, my original proposal was to use a simpler change:
```c
static PyObject *
encoder_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    ...
    if (indent != Py_None && !PyUnicode_Check(indent)) {
        PyErr_Format(PyExc_TypeError,
                     "make_encoder() argument 4 must be str or None, "
                     "not %.200s", Py_TYPE(indent)->tp_name);
        return NULL;
    }
    ...
}
```
Let me call that `fix-v1`.
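Under `fix-v1`, misuse at the Python level fails fast. A hypothetical session (the positional argument order follows the `c_make_encoder` call in `encoder.py`; the lambdas are placeholders):

```python
import _json

try:
    # markers, default, encoder, indent, key_separator, item_separator,
    # sort_keys, skipkeys, allow_nan -- with an int indent at position 4
    _json.make_encoder(None, lambda o: o, lambda o: o, 2,
                       ":", ", ", False, False, False)
except TypeError as exc:
    print(exc)  # make_encoder() argument 4 must be str or None, not int
```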
After a comment/suggestion in the reported issue I implemented the unicode check with the additional logical statements (that is the current branch); let me call that `fix`.
And finally we have `main`. What I did was run some tests against all three branches: basically a warm-up of the following snippet, and then 100 runs to measure average execution time.
```python
start = time.perf_counter()
json.dumps(data, indent=2)
end = time.perf_counter()
```
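Spelled out, the harness looks roughly like this (a reconstruction from the description above; the payload `data` comes from the JSON files in the table below):

```python
import json
import time

def bench(data, runs=100):
    json.dumps(data, indent=2)                # warm-up
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        json.dumps(data, indent=2)
        total += time.perf_counter() - start
    return total / runs                       # average seconds per run
```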
These are the results:
| Test | fix | fix-v1 | main |
|---|---|---|---|
| 128KB.json | 1.97 ms | 2.46 ms | 1.95 ms |
| 256KB.json | 2.04 ms | 2.03 ms | 2.03 ms |
| 512KB.json | 4.04 ms | 4.05 ms | 4.07 ms |
| 1MB.json | 8.83 ms | 8.00 ms | 8.14 ms |
| 5MB.json | 42.02 ms | 41.48 ms | 42.02 ms |
My local machine is a bit outdated, and I'm pretty confident these numbers would be lower on a faster CPU, but I do believe that performance is not impacted, or at least that my changes are statistically consistent with `main`.
Using the Argument Clinic
I'm rabbit-holing a bit into that approach, and I'm not entirely sure yet how it works. But my take on it: it would require some major changes to the module that I don't feel comfortable messing with. If that is the desired path, I will be happy to document my findings in the reported issue and leave it to more expert/capable hands to complete.
"How many times do we call this function?"
I am not entirely sure what you mean by this question, sorry 😢. `encoder_new` is called once per `JSONEncoder` instance creation. In typical usage (`json.dumps()`):
```text
json.dumps(data, indent=2)
|-- JSONEncoder.encode()
    |-- JSONEncoder.iterencode()
        |-- c_make_encoder()
            |-- encoder_new()          <= called only once
        |-- _iterencode(obj, 0)
            |-- encoder_listencode_obj()
                |-- encoder_listencode_list()
                |-- encoder_listencode_dict()
```

It's called once per encoding operation. "We" don't call it, the user does, and, as I showed above, they are not going to experience any measurable change in performance. My apologies again if I misunderstood your question.
The "we" was a generic "we" (academical one). But you answered my question well. I wondered whetherc_make_encoder was called multiple times when doing a singlejson.dumps().
Conversion to AC may introduce an overhead in how the arguments are pre-processed, but if you're not comfortable with this, we can first fix this (and backport it) and then switch to AC for `main`.
Reviewed diff in `Modules/_json.c`:

```c
if (indent_level > 0) {
    memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
}
```
Suggested change:

```diff
-if (indent_level > 0) {
-    memset(PyUnicode_1BYTE_DATA(indent), ' ', indent_level);
-}
+memset(PyUnicode_1BYTE_DATA(indent), ' ', sizeof(Py_UCS1) * indent_level);
```
`memset` allows a size of 0 (and does nothing) as long as the pointer is valid. Alternatively, we could use `PyUnicode_Fill`, but I think we usually bypass this and use `memset` directly.
eendebakpt commented Dec 28, 2025
@picnixz … To @caverac: I agree with you that performance is not really impacted (for any of the solutions proposed so far). But if you do want to test it: use https://github.com/psf/pyperf or the `timeit` module with much smaller JSON files.
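A minimal sketch of what that could look like with the stdlib `timeit` module (the payload here is a made-up placeholder):

```python
import timeit

setup = ("import json; "
         "data = [{'k': i, 'v': list(range(10))} for i in range(100)]")
# Many iterations over a small payload give more stable numbers than
# one-shot perf_counter() measurements on large files.
times = timeit.repeat("json.dumps(data, indent=2)",
                      setup=setup, number=1000, repeat=5)
print(f"{min(times) / 1000 * 1e6:.1f} us per call (best of 5)")
```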
caverac commented Dec 29, 2025
@eendebakpt Making a bit of an update to your suggestion will definitely work:

```python
_iterencode = c_make_encoder(
    markers, self.default, _encoder,
    None if indent is None else str(indent),  # <== this line
    self.key_separator, self.item_separator, self.sort_keys,
    self.skipkeys, self.allow_nan)
```

And it both drastically reduces the footprint of my changes and removes any concerns about performance. But your other example still fails:

```python
import json

indent = ' '
encoder = json.encoder.c_make_encoder(
    None,
    default=lambda obj: obj,
    encoder=lambda obj: obj,
    indent=indent,
    key_separator=":",
    item_separator=", ",
    skipkeys=False,
    allow_nan=False,
    sort_keys=False,
)
print('start')
encoder([None], 1)
print('end')
```

even if we change the import. I'm looking for guidance here on which version of this we want: "This is internal, misuse is your own fault" vs. "Even if you misuse it, we won't crash". Perhaps adding this check

```c
if (indent_level != 0) {
    PyErr_SetString(PyExc_ValueError,
                    "_current_indent_level must be 0 when indent is set");
    PyUnicodeWriter_Discard(writer);
    return NULL;
}
```

is a good compromise?
serhiy-storchaka commented Jan 12, 2026
#143618 is a simpler solution.
Summary
This PR fixes two related bugs in the JSON encoder's indentation cache (introduced in gh-95382):
1. **Heap-buffer-overflow via uninitialized cache**: When `c_make_encoder` is called with `_current_indent_level > 0`, the cache is created with only 1 element, but `update_indent_cache` expects the cache to be built incrementally starting from level 0, causing out-of-bounds memory access.
2. **Use-after-free via re-entrant `__mul__`**: `PySequence_Repeat(s->indent, indent_level)` in `create_indent_cache` can execute arbitrary Python code through a custom `__mul__` method, potentially causing a use-after-free.
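To illustrate the second bug, a sketch of the re-entrancy vector (the class and payload are made up; the exact failure mode on unpatched builds depends on interpreter internals):

```python
import json

class EvilIndent(str):
    def __mul__(self, n):
        # Arbitrary Python runs here, mid-way through the C encoder's
        # create_indent_cache(), before the cache is fully built.
        print("re-entered while building the indent cache")
        return str.__mul__(self, n)

# A str subclass passes the PyUnicode_Check() in encoder_new, so
# PySequence_Repeat(s->indent, indent_level) dispatches to __mul__.
json.dumps({"a": {"b": [1, 2]}}, indent=EvilIndent(" "))
```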
Changes

`Modules/_json.c`
- Indent validation and conversion in `encoder_new`:
  - accepts `str`, `int`, or `None` for indent (converting `int` to a string of spaces, matching `encoder.py`)
  - raises `TypeError` or `ValueError` otherwise
- Indent level validation in `encoder_call`:
  - requires `_current_indent_level == 0` when indent is set
  - closes the `__mul__` attack vector, since `PySequence_Repeat` is only called when `indent_level != 0`

`Lib/test/test_json/test_speedups.py`
- `test_indent_argument_to_encoder`: tests indent type validation and conversion (see the sketch below)
- `test_nonzero_indent_level_with_indent`: tests indent level validation
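As a rough sketch of the kind of assertion the first test could make (the method name is from the list above; the body is an assumption, not the PR's actual test):

```python
import json
import unittest

class TestIndentValidation(unittest.TestCase):
    def test_indent_argument_to_encoder(self):
        # An int indent must produce the same output as the
        # equivalent string of spaces.
        self.assertEqual(json.dumps([1, [2]], indent=2),
                         json.dumps([1, [2]], indent='  '))

if __name__ == "__main__":
    unittest.main()
```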
Performance

No performance impact on the hot path. The indentation cache optimization from gh-95382 remains fully intact. Our changes only add:
- a one-time indent check and conversion in `encoder_new` (`int` → `str`)
- a single `indent_level != 0` check in `encoder_call`

The performance-critical recursive encoding path (`encoder_listencode_list`, `encoder_listencode_dict`) is completely unchanged. The cache is still built incrementally and reused across recursive calls exactly as before.
Affected Versions

The fix applies cleanly to both `main` and `3.14` branches.

Heap-buffer-overflow in `json.encoder` indentation cache via re-entrant `__mul__` #143196