gh-88500: Reduce memory use of `urllib.unquote` #96763
Conversation
gpshead commented Sep 12, 2022 • edited
`urllib.unquote_to_bytes` and `urllib.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.
This switches the implementation to using an expanding `bytearray` and a generator internally, instead of precomputed `split()`-style operations.
Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or lack much unicode or % escaping. But the functions are already quite fast anyway, so this is not a big deal. The slowdown scales consistently and linearly with input size, as expected.
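The shape of that comparison can be reproduced along these lines (illustrative only; absolute numbers depend on the machine and build):

```python
import timeit

# Antagonistic input: dense %-escapes mixed with non-ASCII characters.
setup = (
    "from urllib.parse import unquote, unquote_to_bytes\n"
    'mess = "\\u0141%%%20a%fe" * 1000'
)
for stmt in ("unquote(mess)", "unquote_to_bytes(mess)"):
    per_call = min(timeit.repeat(stmt, setup=setup, number=2000, repeat=5)) / 2000
    print(f"{stmt}: {per_call * 1e6:.1f} us per call")
```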
Memory usage was observed manually by running `/usr/bin/time -v` on `python -m timeit` runs with larger inputs; unit-testing memory consumption is difficult and does not seem worthwhile.
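An in-process alternative for eyeballing the Python-level allocations is `tracemalloc`, though it will not match `/usr/bin/time -v` exactly since it only tracks allocations made through Python's allocator:

```python
import tracemalloc
from urllib.parse import unquote_to_bytes

v = "\u0141%01\u0161%20" * 500_000

tracemalloc.start()
unquote_to_bytes(v)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced allocations: {peak / 2**20:.1f} MiB")
```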
Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()`, using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.

Closes #88500.

any thoughts from reviewers?