NotificationsYou must be signed in to change notification settings
Fork34k
Star71.2k

gh-143196: fix heap-buffer-overflow in JSON encoder indent cache#143246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Closed

caverac wants to merge1 commit intopython:mainfromcaverac:gh-143196-fix-json-encoder-indent-cache

Closed

gh-143196: fix heap-buffer-overflow in JSON encoder indent cache#143246

caverac wants to merge1 commit intopython:mainfromcaverac:gh-143196-fix-json-encoder-indent-cache

Conversation

Copy link

caverac commentedDec 28, 2025•
edited
Loading

Summary

This PR fixes two related bugs in the JSON encoder's indentation cache (introduced ingh-95382):

Heap-buffer-overflow via uninitialized cache: Whenc_make_encoder is called with_current_indent_level > 0, the cache is created with only 1 element, butupdate_indent_cache expects the cache to be incrementally built starting from level 0, causing out-of-bounds memory access.
Use-after-free via re-entrant__mul__:PySequence_Repeat(s->indent, indent_level) increate_indent_cache can execute arbitrary Python code through a custom__mul__ method, potentially causing use-after-free.

Changes

`Modules/_json.c`

Indent validation and conversion inencoder_new:
- Acceptstr,int, orNone for indent
- Convert integer indent to string of spaces (matching Python-level behavior inencoder.py)
- Reject other types withTypeError
- Reject negative integers withValueError
Indent level validation inencoder_call:
- Require_current_indent_level == 0 when indent is set
- This prevents the uninitialized cache access
- This also eliminates the__mul__ attack vector sincePySequence_Repeat is only called whenindent_level != 0

`Lib/test/test_json/test_speedups.py`

Addedtest_indent_argument_to_encoder: Tests indent type validation and conversion
Addedtest_nonzero_indent_level_with_indent: Tests indent level validation

Performance

No performance impact on the hot path. The indentation cache optimization fromgh-95382 remains fully intact.

Our changes only add:

Addition	When it runs	Cost
Indent conversion (`int` →`str`)	Once, at encoder creation	O(indent_level), typically 2-4
`indent_level != 0` check	Once, at start of`encoder_call`	O(1), single integer comparison

The performance-critical recursive encoding path (encoder_listencode_list,encoder_listencode_dict) is completely unchanged. The cache is still built incrementally and reused across recursive calls exactly as before.

Affected Versions

Python Version	Affected
3.9 - 3.13	No (cache feature not present)
3.14+	Yes
3.15 (main)	Yes

The fix applies cleanly to bothmain and3.14 branches.

Issue:Heap-buffer-overflow injson.encoder indentation cache via re-entrant__mul__ #143196

pythongh-143196: fix heap-buffer-overflow in JSON encoder indent cache

ff88078

- Validate that _current_indent_level is 0 when indent is set, preventing uninitialized cache access and re-entrant __mul__ attacks- Convert integer indent to string of spaces in encoder_new, matching the Python-level behavior

Copy link

python-cla-botbot commentedDec 28, 2025•
edited
Loading

All commit authors signed the Contributor License Agreement.

Copy link

bedevere-appbot commentedDec 28, 2025

Most changes to Pythonrequire a NEWS entry. Add one using theblurb_it web app or theblurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply theskip news label instead.

bedevere-appbot mentioned this pull request

Dec 28, 2025

Heap-buffer-overflow injson.encoder indentation cache via re-entrant__mul__#143196

Closed

bedevere-appbot added the awaiting review label

Dec 28, 2025

Copy link

Member

picnixz commentedDec 28, 2025•
edited
Loading

I suspect this PR has been generated by an LLM, or at least the description had. Can you confirm this? Because it's not a security issue. In addition, please read our policy on what when using LLM is acceptable:https://devguide.python.org/getting-started/generative-ai/ (if English isn't your first language, it's fine, but please indicate whether an LLM has been used and how).

Copy link

Author

caverac commentedDec 28, 2025

I suspect this PR has been generated by an LLM, or at least the description had. Can you confirm this? Because it's not a security issue. In addition, please read our policy on what when using LLM is acceptable:https://devguide.python.org/getting-started/generative-ai/ (if English isn't your first language, it's fine, but please indicate whether an LLM has been used and how).

@picnixz I'm sorry, I always considered heap-buffer-overflow an exploitable issue. But I changed the language to "bug", perhaps more aligned with the reported issue. Is that ok?

Copy link

Member

picnixz commentedDec 28, 2025

Bugs like that only become security issues when they are exploitable and we need to assess how feasible for an adversary it is to exploit it. In our case, the attack surface is very small and relies on invoking "semi-public" functions (and then it should be exploitable! usually there is just a hard crash, which could be considered a DoS in some sense if it's serving a web app with auto-reload for instance).

Anyway, now, I'd like you to confirm whether this has been generated by a LLM or not (and if so, which part). I'm sorry to be doubtful but because of the emergence of LLM, we have many new contributors that simply use LLM to generate their PRs and that is unacceptable.

Copy link

Author

caverac commentedDec 28, 2025

Bugs like that only become security issues when they are exploitable and we need to assess how feasible for an adversary it is to exploit it. In our case, the attack surface is very small and relies on invoking "semi-public" functions (and then it should be exploitable! usually there is just a hard crash, which could be considered a DoS in some sense if it's serving a web app with auto-reload for instance).
Anyway, now, I'd like you to confirm whether this has been generated by a LLM or not (and if so, which part). I'm sorry to be doubtful but because of the emergence of LLM, we have many new contributors that simply use LLM to generate their PRs and that is unacceptable.

@picnixz PR description proofread with ChatGTP, but if that represents a violation I'm ok with closing the PR

Copy link

Member

picnixz commentedDec 28, 2025•
edited
Loading

If it's only the description, it's fine. But in the future, you don't need to have an LLM generate the description. We don't really care which files were modified for instance (though it's good to know if breaking changes were introduced) or whether the tests were added (we expect them to be added!) Affected versions are already mentioned on the issue as well.

picnixz reviewed

Dec 28, 2025

View reviewed changes

Modules/_json.c

		}

		/* Convert indent to str if it's not None or already a string */
		if (indent!=Py_None&& !PyUnicode_Check(indent)) {

Copy link

Member

picnixzDec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Instead of this complex checks (argument 1 also is checked) I'd suggest that we convert__new__ to argument clinic. However, I'm not entirely sure that we won't lose perfs. How many times do we call this function?

Copy link

Author

caveracDec 28, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Performance considerations

For context, my original proposal was to use a simpler change

staticPyObject*encoder_new(PyTypeObject*type,PyObject*args,PyObject*kwds){    ...if (indent!=Py_None&& !PyUnicode_Check(indent)) {PyErr_Format(PyExc_TypeError,"make_encoder() argument 4 must be str or None, ""not %.200s",Py_TYPE(indent)->tp_name);returnNULL;    }    ...}

Let me call thatfix-v1.

After comment/suggestion in the reported issue I implemented the unidecode check with the additional logical statements (that is the current branch), let me call thatfix.

And finally we havemain. What I did was runsome tests against all these three branches. Basically ran a warm-up of the following snippet and then 100 times to measure average executing time

start=time.perf_counter()json.dumps(data,indent=2)end=time.perf_counter()

These are the results

Test	fix	fix-v1	main
128KB.json	1.97 ms	2.46 ms	1.95 ms
256KB.json	2.04 ms	2.03 ms	2.03 ms
512KB.json	4.04 ms	4.05 ms	4.07 ms
1MB.json	8.83 ms	8.00 ms	8.14 ms
5MB.json	42.02 ms	41.48 ms	42.02 ms

My local is a bit outdated, and I'm pretty confident these numbers can reduced in a faster cpu, but I do believe that performance is not impacted, or at least my changes are statistically consistent withmain

Using the argument clinic

I'm rabbit-holing a bit into that approach, not entirely sure yet how it works. But my take on it: it would require some major changes to the module, that I don't feel comfortable messing with. If that is the desired path, I will be happy to document my findings in the reported issue, and leave it to more expert/capable hands to complete.

"How many times do we call this function?"

I am not entirely sure what do you mean with this question, sorry 😢encoder_new is called once perJSONEncoder instance creation. In typical usage (json.dumps()),

json.dumps(data, indent=2)    |-- JSONEncoder.encode()          |-- JSONEncoder.iterencode()                |-- c_make_encoder()                      |-- encoder_new()  <= called only once                |-- _iterencode(obj, 0)                      |-- encoder_listencode_obj()                            |-- encoder_listencode_list()                            |-- encoder_listencode_dict()

it's called once per encoding operation. "We" don't call it, the user does and, as I showed above, they are not going to experience any measurable changes in performance. My apologies again if I misunderstood your question

Copy link

Member

picnixzDec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The "we" was a generic "we" (academical one). But you answered my question well. I wondered whetherc_make_encoder was called multiple times when doing a singlejson.dumps().

Conversion to AC may introduce an overhead in how the arguments are pre-processed but if you're not comfortable with this, we can first fix this (and backport it) and then switch to AC formain.

Modules/_json.c

Comment on lines +1345 to +1347

		if (indent_level>0) {
		memset(PyUnicode_1BYTE_DATA(indent),' ',indent_level);
		}

Copy link

Member

picnixzDec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	if (indent_level>0) {
	memset(PyUnicode_1BYTE_DATA(indent),' ',indent_level);
	}
	memset(PyUnicode_1BYTE_DATA(indent),' ',sizeof(Py_UCS1)*indent_level);

memset allows for a 0 size (and does nothing) as long as the pointer is valid. Alternatively, we could usePyUnicode_Fill but I think we usually bypass this and usememset directly.

Copy link

Contributor

eendebakpt commentedDec 28, 2025

@picnixz Thec_make_encoder is only called once (here:https://github.com/python/cpython/blob/main/Lib/json/encoder.py#L254) and is semi-private. We could make it private and change the invocation from:

            _iterencode = c_make_encoder(                markers, self.default, _encoder, indent,                self.key_separator, self.item_separator, self.sort_keys,                self.skipkeys, self.allow_nan)

            _iterencode = c_make_encoder(                markers, self.default, _encoder, str(indent),                self.key_separator, self.item_separator, self.sort_keys,                self.skipkeys, self.allow_nan)

@caverac I agree with you that performance is not really impacted (for any of the solutions proposed so far). But if you do want to test it: usehttps://github.com/psf/pyperf or the timeit module with much smaller json files.

Copy link

Author

caverac commentedDec 29, 2025

@picnixz Thec_make_encoder is only called once (here:https://github.com/python/cpython/blob/main/Lib/json/encoder.py#L254) and is semi-private. We could make it private and change the invocation from:
            _iterencode = c_make_encoder(                markers, self.default, _encoder, indent,                self.key_separator, self.item_separator, self.sort_keys,                self.skipkeys, self.allow_nan)
to
            _iterencode = c_make_encoder(                markers, self.default, _encoder, str(indent),                self.key_separator, self.item_separator, self.sort_keys,                self.skipkeys, self.allow_nan)
@caverac I agree with you that performance is not really impacted (for any of the solutions proposed so far). But if you do want to test it: usehttps://github.com/psf/pyperf or the timeit module with much smaller json files.

@eendebakpt Making a bit of an update to your suggestion will definitely work

_iterencode=c_make_encoder(markers,self.default,_encoder,NoneifindentisNoneelsestr(indent),# <== this lineself.key_separator,self.item_separator,self.sort_keys,self.skipkeys,self.allow_nan)

And it both: drastically reduces the footprint of my changes and removes any concerns about performance. But your other example still fails

importjsonindent='  'encoder=json.encoder.c_make_encoder(None,default=lambdaobj:obj,encoder=lambdaobj:obj,indent=indent,key_separator=":",item_separator=", ",skipkeys=False,allow_nan=False,sort_keys=False,                   )print('start')encoder([None],1)print('end')

even if we change the import. I'm looking for guidance here, which version of this do we want "This is internal, misuse is your own fault" v. "Even if you misuse it, we won't crash"?

Perhaps adding this check

if (indent_level!=0) {PyErr_SetString(PyExc_ValueError,"_current_indent_level must be 0 when indent is set");PyUnicodeWriter_Discard(writer);returnNULL;        }

is a good compromise?

serhiy-storchaka self-requested a review

January 9, 2026 14:37

Copy link

Member

serhiy-storchaka commentedJan 12, 2026

#143618 is a simpler solution.

serhiy-storchaka closed this

Jan 12, 2026

Labels

awaiting review

Movatterモバイル変換

Uh oh!

gh-143196: fix heap-buffer-overflow in JSON encoder indent cache#143246

gh-143196: fix heap-buffer-overflow in JSON encoder indent cache#143246

Uh oh!

Conversation

caverac commentedDec 28, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Summary

Changes

Modules/_json.c

Lib/test/test_json/test_speedups.py

Performance

Affected Versions

Uh oh!

python-cla-botbot commentedDec 28, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

bedevere-appbot commentedDec 28, 2025

Uh oh!

picnixz commentedDec 28, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

caverac commentedDec 28, 2025

Uh oh!

picnixz commentedDec 28, 2025

Uh oh!

caverac commentedDec 28, 2025

Uh oh!

picnixz commentedDec 28, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

picnixzDec 28, 2025

Choose a reason for hiding this comment

Uh oh!

caveracDec 28, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Performance considerations

Using the argument clinic

"How many times do we call this function?"

Uh oh!

picnixzDec 28, 2025

Choose a reason for hiding this comment

Uh oh!

picnixzDec 28, 2025

Choose a reason for hiding this comment

Uh oh!

eendebakpt commentedDec 28, 2025

Uh oh!

caverac commentedDec 29, 2025

Uh oh!

serhiy-storchaka commentedJan 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

caverac commentedDec 28, 2025•
edited
Loading

`Modules/_json.c`

`Lib/test/test_json/test_speedups.py`

python-cla-botbot commentedDec 28, 2025•
edited
Loading

picnixz commentedDec 28, 2025•
edited
Loading

picnixz commentedDec 28, 2025•
edited
Loading

caveracDec 28, 2025•
edited
Loading