GH-115512: Optimize peak memory usage and runtime for large emails #132709


Open
JAJames wants to merge 8 commits into python:main from JAJames:gh-115512

Conversation

@JAJames (Author):

GH-115512: email.message_from_bytes heavy memory use

Note: In the rest of this, time taken, peak overhead, and overhead ratio refer to the similarly named variables in the snippet below:

```python
import email
import time
import tracemalloc
from email import message_from_bytes

# Call message_from_bytes, gathering some memory usage stats in the process
tracemalloc.start()
start_time = time.perf_counter()
msg = message_from_bytes(msg_bytes, policy=email.policy.default)
time_taken = time.perf_counter() - start_time
after_bytes, after_peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

# "How many bytes did we allocate, that were ultimately discarded?"
peak_overhead = after_peak_bytes - after_bytes

# "How large was that overhead, relative to the size of the message?"
overhead_ratio = peak_overhead / len(msg_bytes) if len(msg_bytes) > 0 else None
```

Changes:

  • Removed the full copy caused by `text.decode` in `parser.BytesParser.parsebytes`, by decoding one chunk of bytes at a time instead (see the sketch after this list).
  • Removed the full copy (with 4x expansion on ASCII text) caused by `StringIO` use in `parser.Parser.parsestr`, by slicing into the string instead. This also impacted `parser.BytesParser.parsebytes`.
  • Removed a circumstantial full copy (with 4x expansion on ASCII text) caused by `StringIO` use in `feedparser.BufferedSubFile`, by reverting back to using a list, while still retaining the universal newline behavior exhibited by `StringIO`.
  • Added a runtime optimization to avoid expensive line-by-line processing in circumstances where we are consuming a blob (i.e. a message/attachment body) rather than lines. This optimization accounts for both single-part messages (i.e. blob until end-of-file) and multi-part messages (i.e. blob until end-of-part). For single-part messages, it dumps every chunk fed into `BufferedSubFile`. For multi-part messages, it dumps every chunk lacking any potential boundaries (i.e. no `-` character), as well as every line lacking any boundary, up until the boundary we are looking for as indicated by `_eofstack`. Without this change, the above changes would have introduced a noticeable runtime performance regression. With this change, runtime performance is significantly improved.
  • Added tests to assert that memory overhead (that is, wasted peak memory) does not exceed 1.05x the size of the underlying email for large (10 MiB) emails.
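As an illustration of the first item, here is a minimal sketch of chunk-at-a-time decoding using an incremental decoder. This is not the PR's actual code; the chunk size, codec, and error handler are assumptions for illustration only.

```python
import codecs

def decoded_chunks(msg_bytes, chunk_size=8192):
    # Decode one chunk at a time rather than calling msg_bytes.decode()
    # up front, so a second full-size decoded copy is never held in memory.
    decoder = codecs.getincrementaldecoder('ascii')(errors='surrogateescape')
    for i in range(0, len(msg_bytes), chunk_size):
        chunk = decoder.decode(msg_bytes[i:i + chunk_size])
        if chunk:
            yield chunk  # feed each decoded chunk to the parser as it arrives
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail
```

Each decoded chunk can then be fed to the feed parser, so peak memory stays close to one chunk above the size of the input.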

Benchmarking

As a part of internal testing, I performed some benchmarking by directly measuring the time to parse ~907k email files using `message_from_bytes`. For each blob, a script called `email.message_from_bytes`, measured the memory usage using `tracemalloc` as well as the time taken using `time.perf_counter()`, and then did the same function call and measurements using a fork of the email library which at the time included only these changes. It then deep-compared the output of each, to validate that they're exactly equal.
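A harness along the following lines would reproduce the per-message measurements (a sketch only; the corpus layout and aggregation are assumptions, not the internal script described above):

```python
import email
import glob
import statistics
import time
import tracemalloc
from email import message_from_bytes

def measure(msg_bytes):
    # Same measurement as the earlier snippet, for a single message.
    tracemalloc.start()
    start = time.perf_counter()
    message_from_bytes(msg_bytes, policy=email.policy.default)
    elapsed = time.perf_counter() - start
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, (peak - after) / len(msg_bytes)

results = []
for path in glob.glob('corpus/*.eml'):  # hypothetical corpus location
    with open(path, 'rb') as f:
        results.append(measure(f.read()))

times, ratios = zip(*results)
print(f'avg time: {statistics.fmean(times):.5f}s, '
      f'avg overhead ratio: {statistics.fmean(ratios):.3f}x')
```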

General information:

  • Number of emails benchmarked against: 907,274
  • Total bytes parsed: 44,643,207,581 bytes
  • Average bytes: 49,205.87 bytes

Without the changes, these were the stats of the Python 3.12.9 email parser:

  • Total overhead: 322,356,018,510 bytes
  • Minimum overhead: 2,768 bytes
  • Maximum overhead: 3,244,859,393 bytes
  • Average overhead: 355,301.726 bytes
  • Average overhead ratio: 7.22x
    Time stats:
  • Total time taken: 5,120.472s
  • Min time taken: 0.000174s
  • Max time taken: 36.738s
  • Average time taken: 0.00564s

With the changes, these were the stats of this email parser when using Python 3.12.9:

  • Total overhead: 74,988,772,312 bytes (76.737% decrease)
  • Minimum overhead: 3,464 bytes (25.144% increase, but this seems negligible since it's a minimum)
  • Maximum overhead: 816,899,342 bytes (74.824% decrease)
  • Average overhead: 82,651.839 bytes (76.737% decrease)
  • Average overhead ratio: 1.679x (76.745% decrease)
    Time stats:
  • Total time taken: 1,780.947s (65.219% decrease)
  • Min time taken: 0.000134s (22.988% decrease)
  • Max time taken: 10.3979s (71.697% decrease)
  • Average time taken: 0.00196s (65.248% decrease)

Focusing on the totals, this represents:

  • A 76.737% decrease in memory overhead
  • A 65.219% decrease in time taken

@JAJames requested a review from a team as a code owner (April 18, 2025 21:33)
@python-cla-bot commented Apr 18, 2025 (edited):

All commit authors signed the Contributor License Agreement.

CLA signed

@bedevere-app:

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@picnixz (Member) previously requested changes Apr 18, 2025 (edited):

Before reviewing this PR, please

  • Remove all type annotations. They are left to https://github.com/python/typeshed.
  • Wrap all lines under 80 characters.
  • Avoid comments that state what the code does. There are some trivial comments here and there, and some are not very useful.
  • Do not check statistical overheads. They depend entirely on the host machine and other parameters that are hard to guarantee. We only test functionality; we don't necessarily want to test that X or Y takes more or less time than some threshold.
  • If possible, make smaller PRs, targeting either time or memory improvements, and if possible, only one function or method at a time.

Note: it'd be a good idea to provide the full benchmarking script so that others can also verify the results.

@bedevere-app:

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests, along with any other requests in other reviews from core developers, that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@JAJames (Author):

I have made the requested changes; please review again

Re: Failing build check: This looks unrelated to my changes, and was not failing in previous commits.

Re: Testing for memory usage:

I removed the memory usage tests, but I do think there's some value in testing something of that nature in an automated way. Memory usage tests are easier to get consistent results out of than time-based tests. Maybe some subset of tests could be run in a controlled environment (i.e. the Ubuntu tests check) and skipped otherwise; maybe it merits its own separate test suite. In general, though, the way it was being tested was intentionally as close to deterministic as possible, and repeatedly running the same tests on the same machine seemed to consistently produce the same results, as best as I could tell. I understand if that's entirely out of the scope of this issue & PR, though.
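For reference, the removed tests were shaped roughly like the following (a reconstruction from the description above, not the exact test code that was removed):

```python
import email
import tracemalloc
import unittest
from email import message_from_bytes

class LargeEmailMemoryTest(unittest.TestCase):
    def test_peak_overhead_ratio(self):
        # Build a ~10 MiB single-part message and assert that wasted peak
        # memory stays under 1.05x the size of the raw email bytes.
        msg_bytes = (b'From: a@example.com\r\n'
                     b'Subject: big\r\n\r\n' + b'x' * (10 * 1024 * 1024))
        tracemalloc.start()
        msg = message_from_bytes(msg_bytes, policy=email.policy.default)
        after, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.assertIsNotNone(msg)  # keep msg alive across the measurement
        self.assertLessEqual((peak - after) / len(msg_bytes), 1.05)
```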

Re: Splitting the PR:

The parser.py changes are entirely separable, but I don't think the changes within each file are easily separable. If that's sufficient, I can split the PR into two PRs, with the parser.py changes standing on their own.

Re: Benchmark script:

I'll have to look into releasing some variation of the benchmark script. It may take a fair bit of time (at least a week), and it's unfortunately not a zero-effort endeavor. It hadn't occurred to me that it might be helpful to include. Let me follow up on this.

@bedevere-app:

Thanks for making the requested changes!

@picnixz: please review the changes made to this pull request.

@bedevere-app requested a review from @picnixz (April 19, 2025 18:35).
Review thread on the epilogue assignment:

```diff
 for line in self._input:
     if line is NeedMoreData:
         yield NeedMoreData
         continue
-self._cur.epilogue = EMPTYSTRING.join(epilogue)
+self._cur.epilogue = ''
```
@JAJames (Author):

Note: I kept the behavior here the same, but I'm not actually certain whether it's correct. The previous code appeared as though it intended to assign the remainder of the message to the epilogue, but then did not.

So, this data is discarded. This is the only place where we discard the rest of the message like this. It's out of the scope of this PR, and it's very narrow in scope, but it's interesting.

@picnixz (Member):

I think it's just to make sure that the attribute is initialized.

@picnixz (Member):

> I removed the memory usage tests, but I do think there's some value in testing something of that nature in an automated way.

We have the bigmem resource for tests with large memory, but we should carefully craft the payloads so that we don't just randomly test large emails. I would, however, appreciate a separate PR if we were to do it. That said, using tracemalloc to track the memory usage is too fragile IMO, as other things may happen in between due to other interpreter changes, and we wouldn't be catching only the peak memory usage of encoding emails.

> If that's sufficient, I can split the PR into two PRs, with the parser.py changes standing on their own.

No, it's fine now. Splitting the PR is only worthwhile if the split parts can be used on their own and are not just dead code.


I still want to know: is it possible to improve only the time complexity without touching the memory complexity, and then improve the memory complexity in a second phase without too many changes, or are the improvements entangled?

@JAJames (Author):

> We have the bigmem resource for tests with large memory, but we should carefully craft the payloads so that we don't just randomly test large emails. I would, however, appreciate a separate PR if we were to do it.

Thanks, I wasn't aware of this, and I'm happy to poke at this in a separate PR. The three main test cases I was focusing on were "1 large non-multipart plaintext email" (the simplest case), "1 large multipart email with 1 properly body-formatted large attachment" (the ideal case for attachments), and "1 large multipart email with 1 non-body-formatted large attachment on a single line" (the worst case), since each of those had slightly different memory usage characteristics.

> I still want to know: is it possible to improve only the time complexity without touching the memory complexity, and then improve the memory complexity in a second phase without too many changes, or are the improvements entangled?

Short answer:
It's unfortunately pretty tightly coupled, especially if the goal is to improve time complexity first and memory complexity second. The order in which these changes were implemented was to focus on memory usage first (as it was causing OOM issues), and then to ensure no major time performance regression was introduced (which led to the dumping logic). It's easier to implement the time optimizations on top of the memory optimizations than the other way around.

Long answer:
The main time improvement in feedparser.py comes from pushing data directly to `_dump_destination`, avoiding the line-by-line slicing once we've reached the body and allowing the parser to cross newline boundaries or push entire chunks (8 KiB each when using `message_from_bytes`) directly. The nature of this change, though, depends on the line parsing logic being rewritten, which was originally intended solely to reduce memory overhead. If the line parsing logic isn't rewritten, meaning `_partial` remains a `StringIO` and the line-breaking logic continues to be invoked via `_partial.readlines()`, it should still be theoretically possible to implement this in a way that benefits at least a subset of email payloads, but I'm not certain it would avoid at least some of the unnecessary slicing. Essentially, by continuing to use `StringIO` to break the lines here, we would have to break on lines any time there's any potential boundary (so any time `-` is in the chunk), which I fear would significantly undermine the performance benefit, at least in certain contexts. Without those changes, I'm not certain there's a practical way to implement that dumping scheme without somewhat significant effort and maybe introducing some edge cases.

If these absolutely had to be separated, it would be far less work to have a PR for the memory optimizations first (which introduces a small time performance regression, as splitting on universal newlines in Python is slower than using `StringIO.splitlines()`), and to PR the time optimizations on top of those changes, than the other way around. However, splitting the changes there would still not be zero-effort, especially since the changes would have to be retested, and I'd be very grateful if I did not have to split those up.
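To make the trade-off concrete, this is roughly the kind of list-based universal-newline splitting being described (illustrative only; the real BufferedSubFile logic tracks more state, and str.splitlines() splits on more separators than the \r/\n/\r\n handled by StringIO, so the actual change has to be more careful than this):

```python
def split_chunk(partial, chunk):
    # Rejoin any carried-over partial line, then split keeping line endings.
    lines = (partial + chunk).splitlines(keepends=True)
    partial = ''
    if lines and not lines[-1].endswith('\n'):
        # The last line is either unterminated or ends in a bare '\r'
        # (a '\n' may still arrive in the next chunk); carry it over.
        partial = lines.pop()
    return lines, partial
```

For example, split_chunk('', 'a\r\nb\r') returns (['a\r\n'], 'b\r'), holding the dangling '\r' back until the next chunk shows whether a '\n' follows.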

That said, I can point out the functions which exist primarily for those time optimizations:

  • `_can_dump_data`: Helper method to check whether a chunk of data can be safely dumped.
  • `_can_dump_partial`: Helper method to check whether a partial line of data can be safely dumped.
  • `_is_dump_midline`: Helper method to check whether we previously dumped a partial (in which case, we can dump the rest of the line, because the whole line is dumpable).
  • `_get_dump`: Invokes the dumping logic, so it can cross line boundaries or push entire chunks to the result, wherever possible.
  • `_pop_dump`: Retrieves and removes the result from the `_get_dump` call.
  • All of the changes to the `FeedParser` class itself are pretty much just to utilize `_get_dump` and `_pop_dump`.
  • Any branch which requires `_dump_destination` to be set is related to the time optimization.
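As a rough illustration of the dumping criterion behind the first two helpers (a simplified sketch, not the PR's actual `_can_dump_data`):

```python
def can_dump_chunk(chunk, in_multipart):
    # A MIME boundary line always contains '-' (it starts with "--"),
    # so a chunk containing no '-' at all cannot hide an end-of-part
    # marker and may be appended to the body wholesale, skipping the
    # line-by-line boundary scan entirely.
    return not in_multipart or '-' not in chunk
```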

@picnixz (Member):

I would be happy if we can manage to land this before beta, but I'm away for the next 10 days so I won't be able to review this. Maybe @bitdancer has time to review this, or @vadmium / @serhiy-storchaka, but otherwise it will need to wait until 3.15.

@picnixz (Member) left a comment:

It's still hard to review as there are many things happening. I can keep track of one or two booleans, but maybe we could have a real state machine with named states to remember what we are doing, without having to add comments everywhere.

This PR will need more eyes, and I'm sorry in advance that we'll likely have a lot of back and forth just for me to correctly understand the implementation, though I have my idea of it now.

Ok, avoid having blank lines. We add blank lines to separate logical sections in a function, but the standard library likes to be more compact. I also stopped correcting this, but if your comment starts with a capital letter, add a period at the end (or just do everything in lower case). It may seem silly, but without a period I'm actually expecting something to follow on the next line, so keep in mind the possible accessibility issues.

Now, you can also add a note in `whatsnew/3.15.rst`, because this is a substantial improvement.

Comment on lines +97 to +103:

```python
def _check_eofstack(self, data, start=0, end=sys.maxsize):
    for ateof in reversed(self._eofstack):
        if ateof(data, start, end):
            # We're at the false EOF.
            return True

    return False
```
@picnixz (Member):

Suggested change:

```diff
 def _check_eofstack(self, data, start=0, end=sys.maxsize):
-    for ateof in reversed(self._eofstack):
-        if ateof(data, start, end):
-            # We're at the false EOF.
-            return True
-    return False
+    # Check if we can find a dummy EOF.
+    return any(
+        ateof(data, start, end)
+        for ateof in reversed(self._eofstack)
+    )
```

Comment on lines +191 to +194:

```python
if not self._dump_destination:
    return False

return self._dump_destination[-1][-1] not in ('\n', '\r')
```

@picnixz (Member):

Suggested change (drop the blank line):

```diff
 if not self._dump_destination:
     return False
-
 return self._dump_destination[-1][-1] not in ('\n', '\r')
```

```python
if not line:
    pass
elif self._dump_destination is None:
    # We're not dumping data. Just flush the partial to lines
```

@picnixz (Member):

Suggested change:

```diff
-    # We're not dumping data. Just flush the partial to lines
+    # We're not dumping data. Just flush the partial to lines.
```

Comment on lines +130 to +137:

```python
if not data:
    return

if self._can_dump_data(data):
    self._dump_destination.append(data)
    return

self._push_data(data)
```

@picnixz (Member):

Suggested change:

```diff
-if not data:
-    return
-
-if self._can_dump_data(data):
-    self._dump_destination.append(data)
-    return
-
-self._push_data(data)
+if not data:
+    pass
+elif self._can_dump_data(data):
+    self._dump_destination.append(data)
+else:
+    self._push_data(data)
```

```python
if self._dump_destination is None:
    return False

# We're dumping; check for easy optimizations
```

@picnixz (Member):

Suggested change:

```diff
-# We're dumping; check for easy optimizations
+# We're dumping; check for easy optimizations.
```

```python
# We previously pushed a dangling line expecting \n to follow,
# however we received other data instead. Therefore, that \r
# does actually terminate a line. Go ahead and push it.
self._flush_partial()
```

@picnixz (Member):

One way to reduce indentation, and perhaps make it easier to read, is to handle `self._dangling_partial` in `_flush_partial` directly, or in another method, say `self._try_flush_partial()`.

Comment on lines +239 to +240:

```python
self._partial.append(data)
self._dangling_partial = True
```

@picnixz (Member):

I would suggest having a method that does the `partial.append()` and the `dangling_partial = True` together (AFAICT, doing the append should always set the flag).

@JAJames (Author), May 9, 2025 (edited):

This would make sense; the thought also later occurred to me that we could just have a property which checks "is there a partial, and does it end in '\r'?", but I didn't want to re-benchmark at the time to see if there was any measurable difference. I'm a little less worried about that now, though, and tracking less explicit state may be more maintainable. It is admittedly error-prone.

We only set the flag when the line ends in '\r' specifically, though: not on every append(), but on every append ending in '\r'. Here, we know the last character is '\r' because a universal newline is present but lacks an end (so we need to see whether a '\n' follows).
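The property being floated here might look something like this (hypothetical, not code from the PR):

```python
@property
def _dangling_partial(self):
    # True when the buffered partial line ends in a bare '\r', meaning a
    # following '\n' could still complete a '\r\n' universal newline.
    return bool(self._partial) and self._partial[-1].endswith('\r')
```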

Comment on lines +286 to +287:

```python
if data[-1] == '\r':
    self._dangling_partial = True
```

@picnixz (Member):

This is a recurrent check, so how about having a `self._check_if_dangling_partial()`?

Comment on lines +364 to +369:

```python
needs_more_data = False
for line in self:
    if line is NeedMoreData:
        needs_more_data = True
        break
    _dump_destination.append(line)
```

@picnixz (Member):

You can use itertools.takewhile:

```python
iself = iter(self)
lines = itertools.takewhile(lambda line: line is not NeedMoreData, iself)
_dump_destination.extend(lines)
if next(iself, None) is None:
    ...  # self was consumed
else:
    ...  # we can pull more data
```

```python
if needs_more_data:
    # Flush our partial, if we can
    if self._partial and self._can_dump_partial(self._partial[0]):
        _dump_destination.extend(self._partial)
```

@picnixz (Member):

Should it be `self._partial` or `''.join(self._partial)` in this case? Also, why are we only checking the first `partial[0]` content but still extending `_dump_destination` with the entire partial?

@JAJames (Author):

Since we're only checking the partial, we only need to check the very first character of the partial (to see if it's a `-`), as well as the state of the parser. If it starts with a potential boundary (`-`) and we care about boundary matches, we assume we can't dump it at all, since we don't know the content of the rest of the line and so can't do the boundary matches. The rest of the line's content isn't looked at. It's worth noting that the partial will never contain a complete line here.

The extend() here just ensures we're definitely grabbing everything, since the partial could be many parts if it's a very long partial line.

@JAJames (Author):

> This PR will need more eyes, and I'm sorry in advance that we'll likely have a lot of back and forth just for me to correctly understand the implementation, though I have my idea of it now.

This is fine; the more eyes, the better 👍

> Ok, avoid having blank lines. We add blank lines to separate logical sections in a function, but the standard library likes to be more compact. I also stopped correcting this, but if your comment starts with a capital letter, add a period at the end (or just do everything in lower case). It may seem silly, but without a period I'm actually expecting something to follow on the next line, so keep in mind the possible accessibility issues.

This sounds fine. I will do a pass to apply all of the stylistic comments soon, but I've gone ahead and made sure to at least reply to the functional comments. At some point I picked up "don't leave trailing periods in comments" as part of my personal style, but I'm not really sure when or why. I think it might've had something to do with inconsistent trailing periods in comments. In any case, it's easy to change these here to match the project.

> Now, you can also add a note in `whatsnew/3.15.rst`, because this is a substantial improvement.

Ack; will do alongside the other comments.

@picnixz (Member):

> At some point I picked up "don't leave trailing periods in comments" as part of my personal style, but I'm not really sure when or why.

I can understand, and I also debated with myself whether to be pedantic or not. However, with a period, I'm sure that people will read "sentences" and not just half-sentences. I've just observed that sometimes, without a final period, I'm expecting more comment to follow, which can break the reading flow.

For now, you can just forget about the style comments and we can focus on functionality only.
