gh-101178: refactor base64.b85encode to be memory friendly #112248
Conversation
Force-pushed eb34187 to 54793e7

I've changed the algorithm to use a dedicated generator, using …
Force-pushed 54793e7 to 39b1d7e

Tested on a Linux VM, the new code is actually faster with the 5Mb dataset 🤔 Memory gain is as follows:

branch: …
@pitrou any chance you could spare a little time to review this?
Could you post timeit numbers for both small and large inputs? (please be sure to compile in non-debug mode)
Lib/base64.py Outdated
```python
for c in unpack512(b[offset:offset+512]):
    yield c
```
This can be simplified to:
```diff
-for c in unpack512(b[offset:offset+512]):
-    yield c
+yield from unpack512(b[offset:offset+512])
```
Changed (I'm not yet used to `yield from` ^^)
Lib/test/test_base64.py Outdated
```python
# since the result is too large to fit inside a test,
# use a hash method to validate the test
self.assertEqual(len(result), 784)
self.assertEqual(hashlib.md5(result).hexdigest(),
```
I'm not sure md5 is always available. WDYT @gpshead?
Good catch, I forgot about this.

According to https://docs.python.org/3/library/hashlib.html#hashlib.algorithms_guaranteed, md5 may not be present; I'll switch to sha1.
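A quick illustration of the concern (a sketch of mine, not from the PR; the 784-byte `result` here is a stand-in for the encoded output under test):

```python
import hashlib

# md5 is formally listed in hashlib.algorithms_guaranteed, but some
# FIPS-oriented vendor builds strip it anyway; sha1 sidesteps that.
result = b"x" * 784  # stand-in for the encoded output under test

assert len(result) == 784
print(hashlib.sha1(result).hexdigest())
```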
Force-pushed 899c88d to d0e7691

@pitrou here is a timeit script / results on a MacBook M1 (I don't have access to a Linux version right now):

```python
import timeit
import random
from base64 import b85encode

REPEAT = 5
COUNT = 10_000

SMALL_INPUT: bytes = b"hello world"
MEDIUM_INPUT: bytes  # 200 bytes
BIG_INPUT: bytes  # 5000 bytes

SMALL_COUNT = 500_000
MEDIUM_COUNT = 100_000
BIG_COUNT = 20_000


def init():
    global MEDIUM_INPUT, BIG_INPUT
    rnd = random.Random()
    rnd.seed(42)
    MEDIUM_INPUT = rnd.randbytes(200)
    BIG_INPUT = rnd.randbytes(5000)


def main():
    init()
    for name in "SMALL", "MEDIUM", "BIG":
        timer = timeit.Timer(f"b85encode({name}_INPUT)", globals=globals())
        count = globals()[f"{name}_COUNT"]
        values = timer.repeat(REPEAT, count)
        values = ", ".join("%.3fs" % x for x in values)
        print(f"Timeit {name} ({count} iterations): {values}")


if __name__ == '__main__':
    main()
```

Results: …
pitrou commented Mar 3, 2024 (edited)
The performance decrease is a bit unfortunate. Instead of defining a separate generator, something like this could work:

```python
def _85encode(b, chars, chars2, pad=False, foldnuls=False, foldspaces=False):
    # Helper function for a85encode and b85encode
    if not isinstance(b, bytes_types):
        # TODO can this be `memoryview(b).cast('B')` instead?
        b = memoryview(b).tobytes()

    def encode_words(words):
        # Encode a sequence of 32-bit words, excluding padding
        chunks = [b'z' if foldnuls and not word else
                  b'y' if foldspaces and word == 0x20202020 else
                  (chars2[word // 614125] +
                   chars2[word // 85 % 7225] +
                   chars[word % 85])
                  for word in words]
        return b''.join(chunks)

    n1 = len(b) // 512  # number of 512 bytes unpack
    n2 = (len(b) - n1 * 512) // 4  # number of 4 bytes unpack
    padding = (-len(b)) % 4
    unpack512 = struct.Struct("!128I").unpack
    unpack4 = struct.Struct("!I").unpack

    offset = 0
    blocks = []
    for _ in range(n1):
        blocks.append(encode_words(unpack512(b[offset:offset+512])))
        offset += 512
    for _ in range(n2):
        blocks.append(encode_words(unpack4(b[offset:offset+4])))
        offset += 4
    if padding:
        # TODO deal with last bytes and padding
        ...
    return b''.join(blocks)
```
Performance regression also on Linux amd64: …
I find it a bit strange, because in my initial tests a few months back the execution was slower on macOS but slightly faster on Linux. I attributed that to the memory allocation being costly enough to be a factor.
@pitrou sorry for the huge lag. I gave up after spending a lot of time trying to find a way to be both CPU- and memory-friendly for small and large datasets. Until last week, when I realized that I could simply rewrite the function in C (which I have done), to get both performance and memory improvements. My question is: should I create a new PR, or …
sobolevn commented Feb 14, 2025 (edited)
You surely can do both :)
Initially done to reduce the huge memory consumption of the previous implementation for large inputs, and because no memory-friendly Python way was found that did not include a performance regression.

This implementation also greatly improves performance in all cases.

Signed-off-by: Romuald Brunet <romuald@chivil.com>
Regression was found while testing the new C implementation, when foldspaces was used with b85encode (since a chunk could end in z without having been folded)
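A sketch of the folding behavior this commit protects (shown via the public `a85encode`, which exposes `foldspaces`; the fold flags live in the shared private `_85encode` helper):

```python
from base64 import a85encode, a85decode

# Four NUL bytes fold to "z"; with foldspaces=True, four spaces fold
# to "y" -- and this must hold even for the final group of the input.
assert a85encode(b"\x00" * 4) == b"z"
assert a85encode(b" " * 4, foldspaces=True) == b"y"

# The regression scenario: input whose last 4-byte group is foldable.
data = b"spam" + b" " * 4
encoded = a85encode(data, foldspaces=True)
assert encoded.endswith(b"y")
assert a85decode(encoded, foldspaces=True) == data
```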
Force-pushed d0e7691 to 74fc245

@pitrou / @sobolevn I rewrote this PR from scratch to use a C implementation instead of a Python one.

Note that I do not consider myself a seasoned C developer, so the implementation may be lacking. I've tried to maximize compatibility with the previous implementation; even if the _85encode method is private, we could drop …

I've also added a test to check for a regression I found while testing the new implementation with random data.
Modules/binascii.c Outdated

```diff
@@ -1239,13 +1239,101 @@ binascii_b2a_qp_impl(PyObject *module, Py_buffer *data, int quotetabs,
     return rv;
 }

+/*[clinic input]
+binascii.b2a_base85
```
It would feel weird not to have `binascii.a2b_base85`, so I would suggest keeping it private for now. Ideally, `base64` should have its own C accelerator module, but maybe it's overkill.
Renamed to private.

That was what delayed me initially, because I had no idea how to add a module dedicated to base64; I only found out last week that it uses the binascii one.
By the way, I didn't look at the implementation in detail yet, but if you want to compare your implementation with a popular one, you can have a look at https://github.com/git/git/blob/master/base85.c#L40.
Apply suggestions

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
romuald commented Feb 16, 2025 (edited)
Thanks, I didn't know where to look when I started (the base algorithm originally came from an LLM :/). Taking inspiration from git's code, this could be used instead:
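(The code block from this comment didn't survive extraction. Below is a Python rendering of the git-style idea, for illustration only; the PR's actual version is C inside binascii.c, and names like `b85encode_sketch` are mine.)

```python
from base64 import b85encode  # used only to cross-check the sketch

# RFC 1924 / git base85 alphabet (the same one the stdlib uses)
B85_CHARS = (b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             b"abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")

def _encode_group(acc: int) -> bytearray:
    # Emit 5 base85 digits for a 32-bit accumulator, right to left
    group = bytearray(5)
    for j in range(4, -1, -1):
        acc, rem = divmod(acc, 85)
        group[j] = B85_CHARS[rem]
    return group

def b85encode_sketch(data: bytes) -> bytes:
    out = bytearray()
    full, tail = divmod(len(data), 4)

    # Hot path: full 4-byte groups, no padding arithmetic per chunk
    for i in range(0, full * 4, 4):
        out += _encode_group(int.from_bytes(data[i:i + 4], "big"))

    # Padding is computed exactly once, for the final partial group
    if tail:
        padded = data[full * 4:] + b"\x00" * (4 - tail)
        out += _encode_group(int.from_bytes(padded, "big"))[:tail + 1]
    return bytes(out)

if __name__ == "__main__":
    import os
    for n in range(64):
        blob = os.urandom(n)
        assert b85encode_sketch(blob) == b85encode(blob)
    print("sketch matches base64.b85encode")
```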
There is a 20% performance gain for large inputs (starting at 5kb), since the potential padding does not need to be computed for each chunk. Shall I use this method instead?
picnixz commented Feb 16, 2025 (edited)
Yes, since base85 can be used for encoding large inputs, it's worth it I think.
Inspired from git source https://github.com/git/git/blob/03944513488db4a81fdb4c21c3b515e4cb260b05/base85.c#L79

This avoids checking the chunk size on every iteration and thus improves performance.
Since `j` is not unsigned anymore, we can reverse the table lookup loop.
```c
size_t i = 0;
int padding = 0;
while (i < bin_len) {
```
I would also credit the git implementation for this one as it's heavily based on it.
Credit added. I don't know if I should phrase it differently?
I've also added some other comments to try to explain the logic more
Current description
Rewrote the `base64._85encode` method logic in C, by plugging into the `binascii` module (which already takes care of the base64 methods). By using C and a single buffer, the memory use is reduced to a minimum, addressing the initial issue.
It also greatly improves performance as a bonus:

main: …
branch: …
Script used to test: https://gist.github.com/romuald/7aeba5f40693bb351da4abe62ad7321d
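A round-trip sanity check of the unchanged public surface (a sketch, not part of the PR's test suite):

```python
import base64

# b85encode/b85decode keep their Python signatures; only the private
# _85encode helper is backed by C now.
payload = b"hello world" * 1000
encoded = base64.b85encode(payload)
assert base64.b85decode(encoded) == payload
```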
Previous description (Python refactor)

(not up to date with the current PR)
Refactor code to make use of generators instead of allocating 2 potentially huge lists for large datasets
Memory gain only measured using macOS and a 5Mb input.
Using main: …

With refactor: …

The execution time is more than doubled, which may not be acceptable. However, the memory used is reduced by more than 90%.

edit: changed the algorithm to be more efficient; the performance decrease now seems to be negligible
I also have no idea how (and if) I should test this
Here is the script I've used to measure the execution time; the memdebug part can probably be adapted to read /proc/{pid} on Linux.

edit: updated to work on Linux too
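Not the author's memdebug script (see the gist linked in the description); a minimal alternative sketch that observes peak allocations with tracemalloc instead of reading /proc/{pid}:

```python
import tracemalloc
from base64 import b85encode

data = bytes(5 * 1024 * 1024)  # 5 MiB stand-in input

tracemalloc.start()
b85encode(data)
_, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
tracemalloc.stop()

print(f"peak allocations: {peak / (1024 * 1024):.1f} MiB")
```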