
gh-101178: refactor base64.b85encode to be memory friendly #112248


Open

romuald wants to merge 7 commits into python:main from romuald:gh-101178-b58encode-memuse

Conversation

@romuald (Contributor) commented Nov 18, 2023 (edited)

Current description

Rewrote the base64._85encode method logic in C, by plugging into the binascii module (which already takes care of the base64 methods).

By using C and a single buffer, the memory use is reduced to a minimum, addressing the initial issue.

It also greatly improves performance as a bonus:

main:

SMALL (11 bytes): 1575427 iterations (1.27 µs per call, 115.41 ns per byte)
MEDIUM (200 bytes): 204909 iterations (9.76 µs per call, 48.80 ns per byte)
BIG (5000 bytes): 8623 iterations (231.94 µs per call, 46.39 ns per byte)
VERYBIG (500000 bytes): 81 iterations (24.69 ms per call, 49.38 ns per byte)

branch:

SMALL (11 bytes): 11230718 iterations (178.08 ns per call, 16.19 ns per byte)
MEDIUM (200 bytes): 6004721 iterations (333.07 ns per call, 1.67 ns per byte)
BIG (5000 bytes): 458005 iterations (4.37 µs per call, 873.35 ps per byte)
VERYBIG (500000 bytes): 4772 iterations (419.11 µs per call, 838.22 ps per byte)

Script used to test: https://gist.github.com/romuald/7aeba5f40693bb351da4abe62ad7321d

Previous description (python refactor)

(not up to date with the current PR)

Refactored the code to make use of generators instead of allocating two potentially huge lists for large datasets.
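A minimal sketch of that idea, assuming input whose length is a multiple of 4 (words_list and words_gen are illustrative names, not the PR's code):

    import struct

    def words_list(b):
        # before: the whole input is unpacked at once into one huge tuple
        # of 32-bit words (a second list of encoded 5-byte chunks is then
        # built from it)
        return struct.unpack('!%dI' % (len(b) // 4), b)

    def words_gen(b):
        # after: yield words one at a time, so neither sequence is ever
        # fully materialized
        unpack4 = struct.Struct('!I').unpack
        for offset in range(0, len(b), 4):
            yield from unpack4(b[offset:offset + 4])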

Memory gain was only measured on macOS, with a 5 MB input.

Using main:

Before encoding
Physical footprint:         16.3M
Physical footprint (peak):  21.3M
After encoding
Physical footprint:         45.0M
Physical footprint (peak):  244.1M

With refactor:

Before encoding
Physical footprint:         14.6M
Physical footprint (peak):  19.6M
After encoding
Physical footprint:         28.5M
Physical footprint (peak):  34.4M

The execution time is more than doubled, which may not be acceptable. However, the memory used is reduced by more than 90%.
edit: changed the algorithm to be more efficient; the performance decrease now seems to be negligible

I also have no idea how (and if) I should test this

Here is the script I've used to measure the execution time; the memdebug function can probably be adapted to read /proc/{pid} on Linux.
edit: updated to work on Linux too

    import os
    import random
    import hashlib
    import platform
    import subprocess
    from time import time
    from base64 import b85encode


    def memdebug():
        if platform.system() == "Darwin":
            if not os.environ.get("MallocStackLogging"):
                return
            res = subprocess.check_output(
                ["malloc_history", str(os.getpid()),
                 "-highWaterMark", "-allBySize"])
            for line in res.splitlines():
                if line.startswith(b"Physical"):
                    print(line.decode())
        elif platform.system() == "Linux":
            with open(f"/proc/{os.getpid()}/status") as reader:
                for line in reader:
                    if line.startswith("VmPeak:"):
                        print(line, end="")


    def main():
        # use a stable input
        rnd = random.Random()
        rnd.seed(42)
        data = rnd.randbytes(5_000_000)

        memdebug()
        start = time()
        res = b85encode(data)
        end = time()
        memdebug()

        print("Data length:", len(data))
        print("Output length:", len(res))
        print(f"Encode time: {end - start:.3f}s")

        h = hashlib.md5(res).hexdigest()
        print("Hashed result", h)
        assert h == "ad97e45ba085865e70f7aa05c9a31388"


    if __name__ == '__main__':
        main()

@romuald (Contributor, Author)

I've changed the algorithm to use a dedicated generator; using unpack("!512I") instead of 128 separate unpack("!I") calls seems to be far more efficient. The issue is now that this may not be as readable, due to the additional generator.
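The batching idea as a minimal sketch (unpack512 and iter_words are illustrative names, not necessarily the PR's exact code):

    import struct

    # one call decoding 128 big-endian 32-bit words (512 bytes) amortizes
    # the per-call overhead that many separate unpack("!I") calls would pay
    unpack512 = struct.Struct("!128I").unpack

    def iter_words(data):
        # assumes len(data) is a multiple of 512; the real code must also
        # handle the trailing 4-byte words and final padding
        for offset in range(0, len(data), 512):
            yield from unpack512(data[offset:offset + 512])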

@romuald force-pushed the gh-101178-b58encode-memuse branch from 54793e7 to 39b1d7e on November 19, 2023.
@arhadthedev added the performance (Performance or resource usage) label on Nov 19, 2023.
@arhadthedev (Member)

cc @sobolevn as a commenter in the issue, @pitrou as the author of the original _85encode.

@romuald (Contributor, Author)

Tested on a Linux VM: the new code is actually faster with the 5 MB dataset 🤔

Memory gain is as follows:

branch: VmPeak: 31532 kB -> 45344 kB
main: VmPeak: 33608 kB -> 253752 kB

@romuald (Contributor, Author)

@pitrou any chance you could spare a little time to review this?

@pitrou (Member)

Could you post timeit numbers for both small and large inputs? (please be sure to compile in non-debug mode)

Review on Lib/base64.py (outdated), lines 315 to 316:

    for c in unpack512(b[offset:offset+512]):
        yield c

@pitrou (Member)

This can be simplified to:

Suggested change:

    yield from unpack512(b[offset:offset+512])

@romuald (Contributor, Author)

Changed (I'm not yet used to yield from ^^)

Review on the test code:

    # since the result is too large to fit inside a test,
    # use a hash method to validate the test
    self.assertEqual(len(result), 784)
    self.assertEqual(hashlib.md5(result).hexdigest(),
@pitrou (Member)

I'm not sure md5 is always available. WDYT @gpshead?

@romuald (Contributor, Author)

Good catch, I forgot about this.
According to https://docs.python.org/3/library/hashlib.html#hashlib.algorithms_guaranteed, md5 may not be present; I'll switch to sha1.
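For reference, a minimal sketch of the availability question (my example, not the PR's test code):

    import hashlib

    # sha1 is in hashlib.algorithms_guaranteed; md5 is nominally listed
    # there too, but the docs note that some FIPS-compliant builds exclude
    # it. usedforsecurity=False (available since Python 3.9) flags
    # non-security uses such as test checksums.
    assert "sha1" in hashlib.algorithms_guaranteed
    digest = hashlib.sha1(b"example", usedforsecurity=False).hexdigest()
    print(digest)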

@romuald force-pushed the gh-101178-b58encode-memuse branch from 899c88d to d0e7691 on March 3, 2024.
@romuald (Contributor, Author)

@pitrou here is a timeit script and results on a MacBook M1 (I don't have access to a Linux machine right now):

    import timeit
    import random
    from base64 import b85encode

    REPEAT = 5
    COUNT = 10_000

    SMALL_INPUT: bytes = b"hello world"
    MEDIUM_INPUT: bytes  # 200 bytes
    BIG_INPUT: bytes  # 5000 bytes

    SMALL_COUNT = 500_000
    MEDIUM_COUNT = 100_000
    BIG_COUNT = 20_000


    def init():
        global MEDIUM_INPUT, BIG_INPUT
        rnd = random.Random()
        rnd.seed(42)
        MEDIUM_INPUT = rnd.randbytes(200)
        BIG_INPUT = rnd.randbytes(5000)


    def main():
        init()
        for name in "SMALL", "MEDIUM", "BIG":
            timer = timeit.Timer(f"b85encode({name}_INPUT)", globals=globals())
            count = globals()[f"{name}_COUNT"]
            values = timer.repeat(REPEAT, count)
            values = ", ".join("%.3fs" % x for x in values)
            print(f"Timeit {name} ({count} iterations): {values}")


    if __name__ == '__main__':
        main()

Results:

main branch:
Timeit SMALL (500000 iterations): 0.617s, 0.607s, 0.605s, 0.605s, 0.604s
Timeit MEDIUM (100000 iterations): 1.000s, 0.999s, 0.999s, 0.999s, 0.999s
Timeit BIG (20000 iterations): 4.789s, 4.788s, 4.794s, 4.800s, 4.782s

gh-101178-b58encode-memuse branch:
Timeit SMALL (500000 iterations): 1.193s, 1.186s, 1.184s, 1.190s, 1.174s
Timeit MEDIUM (100000 iterations): 1.748s, 1.701s, 1.701s, 1.701s, 1.700s
Timeit BIG (20000 iterations): 5.705s, 5.675s, 5.673s, 5.668s, 5.672s

@pitrou (Member) commented Mar 3, 2024 (edited)

The performance decrease is a bit unfortunate. Instead of defining a separate _85buffer_iter_words generator, perhaps we can look for a hybrid approach, something like (just a sketch):

    def _85encode(b, chars, chars2, pad=False, foldnuls=False, foldspaces=False):
        # Helper function for a85encode and b85encode
        if not isinstance(b, bytes_types):
            # TODO can this be `memoryview(b).cast('B')` instead?
            b = memoryview(b).tobytes()

        def encode_words(words):
            # Encode a sequence of 32-bit words, excluding padding
            chunks = [b'z' if foldnuls and not word else
                      b'y' if foldspaces and word == 0x20202020 else
                      (chars2[word // 614125] +
                       chars2[word // 85 % 7225] +
                       chars[word % 85])
                      for word in words]
            return b''.join(chunks)

        n1 = len(b) // 512  # number of 512 bytes unpack
        n2 = (len(b) - n1 * 512) // 4  # number of 4 bytes unpack
        padding = (-len(b)) % 4
        unpack512 = struct.Struct("!128I").unpack
        unpack4 = struct.Struct("!I").unpack

        offset = 0
        blocks = []
        for _ in range(n1):
            blocks.append(encode_words(unpack512(b[offset:offset+512])))
            offset += 512
        for _ in range(n2):
            blocks.append(encode_words(unpack4(b[offset:offset+4])))
            offset += 4
        if padding:
            # TODO deal with last bytes and padding
            ...
        return b''.join(blocks)
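One way the trailing-bytes TODO could be filled in, keeping the variables from the sketch above (my assumption, not part of the original comment):

    if padding:
        # zero-pad the final partial word, encode it without the z/y
        # folding shortcuts, and trim the padding-derived characters
        # unless pad=True; the real helper must also reconcile this
        # with folding (see the foldspaces regression noted below)
        word, = unpack4(b[offset:] + b'\0' * padding)
        final = (chars2[word // 614125] +
                 chars2[word // 85 % 7225] +
                 chars[word % 85])
        blocks.append(final if pad else final[:-padding])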

@romuald (Contributor, Author)

Performance regression also on Linux amd64:

main:
Timeit SMALL (500000 iterations): 0.874s, 0.880s, 0.913s, 0.868s, 0.869s
Timeit MEDIUM (100000 iterations): 1.208s, 1.201s, 1.211s, 1.211s, 1.212s
Timeit BIG (20000 iterations): 5.753s, 5.737s, 5.736s, 5.773s, 5.954s

branch:
Timeit SMALL (500000 iterations): 1.581s, 1.600s, 1.596s, 1.583s, 1.542s
Timeit MEDIUM (100000 iterations): 2.200s, 2.243s, 2.293s, 2.303s, 2.340s
Timeit BIG (20000 iterations): 7.443s, 7.478s, 7.678s, 7.502s, 7.457s

I find it a bit strange, because in my initial test a few months back the execution was slower on macOS but slightly faster on Linux. I attributed that to the memory allocation being costly enough to be a factor.

@romuald (Contributor, Author)

@pitrou sorry for the huge lag.

I gave up after spending a lot of time trying to find an approach that would be both CPU- and memory-friendly for small and large datasets.

That changed last week, when I realized that I could simply rewrite the function in C (which I have done), gaining both performance and memory improvements.

My question is: should I create a new PR or push -f on this one?

@sobolevn (Member) commented Feb 14, 2025 (edited)

You surely can do both :)
But I prefer to re-use PRs.

Commit: Initially done to reduce the huge memory consumption of the previous implementation for large inputs, after no memory-friendly Python approach was found that did not include a performance regression. This implementation also greatly improves performance in all cases.
Signed-off-by: Romuald Brunet <romuald@chivil.com>

Commit: Regression was found while testing the new C implementation, when foldspaces was used with b85encode (since a chunk could end in z without having been folded).
@romuald force-pushed the gh-101178-b58encode-memuse branch from d0e7691 to 74fc245 on February 16, 2025.
@romuald (Contributor, Author)

@pitrou / @sobolevn I rewrote this PR from scratch to use a C implementation instead of a Python one.

Note that I do not consider myself a seasoned C developer, so the implementation may be lacking.

I've tried to maximize compatibility with the previous implementation even though the _85encode method is private; we could drop chars2, for example, and possibly change the type of XXchars to bytes.

I've also added a test to check for a regression I found while testing the new implementation with random data.

Review on Modules/binascii.c:

    @@ -1239,13 +1239,101 @@ binascii_b2a_qp_impl(PyObject *module, Py_buffer *data, int quotetabs,
            return rv;
    }

    /*[clinic input]
    binascii.b2a_base85
@picnixz (Member)

It would feel weird not to have binascii.a2b_base85, so I would suggest keeping it private for now. Ideally, base64 should have its own C accelerator module, but maybe that's overkill.
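For context, the accelerator-module pattern being alluded to usually looks like the sketch below (hypothetical: there is no _base64 module in CPython):

    # in Lib/base64.py (hypothetical sketch)
    try:
        from _base64 import _85encode  # C accelerator, if built
    except ImportError:
        def _85encode(b, chars, chars2, pad=False,
                      foldnuls=False, foldspaces=False):
            ...  # pure-Python fallback implementation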

@romuald (Contributor, Author)

Renamed to private.

That was what delayed me initially, because I had no idea how to add a module dedicated to base64; I only found out last week that it uses the binascii one.

@picnixz (Member)

By the way, I didn't look at the implementation in detail yet, but if you want to compare your implementation with a popular one, you can have a look at https://github.com/git/git/blob/master/base85.c#L40.

Commit: Apply suggestions.
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@romuald (Contributor, Author) commented Feb 16, 2025 (edited)

> By the way, I didn't look at the implementation in detail yet, but if you want to compare your implementation with a popular one, you can have a look at https://github.com/git/git/blob/master/base85.c#L40.

Thanks, I didn't know where to look when I started (the base algorithm originally came from an LLM :/).

Taking inspiration from git's code, this could be used instead:

    size_t i = 0;
    size_t underflow = 0;  // was overflow, but underflow may be a better name
                           // since `i` will not go over bin_len
    while (i < bin_len) {
        // translate each 4-byte chunk to a 32-bit integer
        uint32_t value = 0;
        for (int cnt = 24; cnt >= 0; cnt -= 8) {
            value |= bin_data[i] << cnt;
            if (++i == bin_len) {
                // number of bytes under the 4-byte rounded value
                underflow = cnt / 8;
                break;
            }
        }
        // ...
    }

There is a 20% performance gain for large inputs (starting at 5 kB), since the potential padding does not need to be computed for each chunk.

Shall I use this method instead?
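For context, the elided // ... step would then turn each 32-bit value into five base-85 digits; a hedged sketch of what it might look like, continuing the loop body above (table and out are illustrative names, not from the PR):

    /* hypothetical continuation of the loop body above */
    char digits[5];
    for (int j = 4; j >= 0; j--) {
        // fill from the least significant digit, so the most
        // significant one ends up first, as base85 output requires
        digits[j] = table[value % 85];
        value /= 85;
    }
    /* with `underflow` input bytes missing from the final chunk, only
       the first 5 - underflow characters are kept (ignoring padding
       and folding, which the real implementation must handle) */
    memcpy(out, digits, 5 - underflow);
    out += 5 - underflow;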

@picnixz (Member) commented Feb 16, 2025 (edited)

> There is a 20% performance gain for large inputs [...] Shall I use this method instead?

Yes; since base85 can be used for encoding large inputs, I think it's worth it.

Commit: Inspired from git source https://github.com/git/git/blob/03944513488db4a81fdb4c21c3b515e4cb260b05/base85.c#L79. This avoids checking the chunk size on every iteration and thus improves performance.
Commit: Since j is not unsigned anymore, we can reverse the table lookup loop.
Review on the chunk loop:

    size_t i = 0;
    int padding = 0;

    while (i < bin_len) {
@picnixz (Member)

I would also credit the git implementation for this one, as it's heavily based on it.

@romuald (Contributor, Author)

Credit added. I don't know if I should phrase it differently?

I've also added some other comments to try to explain the logic further.

Reviewers: @pitrou (left review comments), @picnixz (left review comments)
Assignees: no one assigned
Labels: awaiting review, performance (Performance or resource usage)
Projects: none yet
Milestone: no milestone
Development: successfully merging this pull request may close these issues
Participants (5): @romuald, @arhadthedev, @pitrou, @sobolevn, @picnixz
