python/cpythonPublic

NotificationsYou must be signed in to change notification settings
Fork34k
Star71.4k

gh-140557: Force alignment of empty`bytearray` and`array.array` buffers#140559

Merged

encukou merged 11 commits intopython:mainfrom

jakelishman:max-align-buffers

Jan 26, 2026

Merged

gh-140557: Force alignment of empty`bytearray` and`array.array` buffers#140559
encukou merged 11 commits intopython:mainfrom
jakelishman:max-align-buffers

Conversation

Copy link

Contributor

jakelishman commentedOct 24, 2025•
edited by bedevere-appbot
Loading

This ensures the buffers used by the emptybytearray andarray.array are aligned the same as a pointer returned by the allocator. This is a more convenient default for interop with other languages that have stricter requirements of type-safe buffers (e.g. Rust's&[T] type) even when empty.

I tried to do the same forbytes, but I think its default buffer is only forcibly aligned on an 8 because of theuint64_t member inPyBytesObject, and it ends up dependent on wherebytes_empty gets laid out. If that's desirable too, I might need some help figuring out a strategy for it.

I'm not sure where's appropriate to put a test for this, or if it can/should be documented as reliable.

Issue:#140557

Issue:Align default empty buffers ofbytearray andarray.array #140557

Force alignment of emptybytearray andarray.array buffers

6b37e66

This ensures the buffers used by the empty `bytearray` and `array.array`are aligned the same as a pointer returned by the allocator.  This is amore convenient default for interop with other languages that havestricter requirements of type-safe buffers (e.g. Rust's `&[T]` type)even when empty.

Copy link

python-cla-botbot commentedOct 24, 2025•
edited
Loading

All commit authors signed the Contributor License Agreement.

bedevere-appbot mentioned this pull request

Oct 24, 2025

Align default empty buffers ofbytearray andarray.array#140557

Closed

bedevere-appbot added the awaiting review label

Oct 24, 2025

jakelishman mentioned this pull request

Oct 24, 2025

BUG: Data pointer is unaligned for empty arrays after pickle roundtripnumpy/numpy#30062

Closed

serhiy-storchaka approved these changes

Oct 24, 2025

View reviewed changes

Copy link

Member

serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM. 👍

bedevere-appbot added awaiting merge and removed awaiting review labels

Oct 24, 2025

serhiy-storchaka requested a review fromencukou

October 24, 2025 16:58

Copy link

Contributor

cmaloney commentedOct 26, 2025•
edited
Loading

I'm 👎 on this forbytearray; I've been looking at making the emptybytearray point to an emptybytes object where this wouldn't hold true.

see:gh-139871 for ways having abytes inside can make things faster (less copies of data often)

Copy link

Contributor

cmaloney commentedOct 26, 2025

re:bytearray, it also supports a "fast" start-delete which moves the start offset (ob_start) inside the allocated space by an arbitrary count of bytes which, to me, implies that unaligned data access from other languages needs to be supported in accessing/manipulating its internal buffers/data.

Copy link

ContributorAuthor

jakelishman commentedOct 26, 2025•
edited
Loading

This doesn't enforce that every buffer always has aligned access - you can easily take out a view of a bytes or bytearray that you offset by a byte - so other languages have to still handle the case of an unaligned pointer. It just makes the default empty object aligned at zero cost, and that object is common.

The emptybytes internal pointer still ends up aligned on an 8, which would in practice still make the pointer aligned formost data types you might be casting to, so swapping to that would still be an improvement over the status quo.

Copy link

ContributorAuthor

jakelishman commentedOct 26, 2025

We do still have to handle unaligned access everywhere, I agree there's no escaping it. The goal here is only around making unaligned objects less common -bytearray being aligned on a 1 (and actually turning up on a 1) is a lot more commonly visible now in Python 3.14 that pickle 5 uses it in the data stream to representPickleBuffer in band, so more libraries (e.g. Numpy) want to zero-copt view onto the recreated buffer.

Copy link

Contributor

cmaloney commentedOct 26, 2025

I'm still concerned here:

Botharray andbytearray externalize their item storage. ThePyObject* that is thearray object (orbytearray) is a separate allocation from the storage buffer for the elements. As an optimization, CPython doesn't allocate space if the length is 0, and these two values are used to handle a couple otherwise hard to handle return cases. Thearray one is only used inarray_buffer_getbuf so thatmemoryview() / buffer protocol works on anarray with no storage allocated itself. Thebytearray one shows up in the C APIPyByteArray_AS_STRING only whenbytearray has no internal storage.

These are just placeholders to fill corner cases which seem like they shouldn't be common in code (to use the returned pointer without doing out of bounds access you'd need to look at thelen() / size).
This adds a new CPython "guarantee" that it sounds like want to depend on in code but no test that would break if changes effect it. It should be possible to write one for alignment withhttps://docs.python.org/3/library/stdtypes.html#memoryview
The default "buffer" forbytearray (andarray.array) are both immutable and shouldn't really be indexed into, read from, or written to; they're 0 bytes long, code can't set / modify data in them, and there is no data to read from them.

Copy link

ContributorAuthor

jakelishman commentedOct 26, 2025

I'm not aiming to make this a guarantee at all, just that by default the zero allocation case is already aligned. In Rust and other languages we do still have to handle unaligned buffers, just like Numpy alread has to in pure C.

The default "buffer" for bytearray (and array.array) are both immutable and shouldn't really be indexed into, read from, or written to; they're 0 bytes long, code can't set / modify data in them, and there is no data to read from them.

Right, this is where other languages have stronger guarantees - I mentioned it ingh-140557 as the motivation that in Rust, creating a&[T] primitive slice requires an aligned non-null pointer even if it's invalid for any reads. We can't convert every buffer to a slice with zero copies, so we already have alternative handling, but this patch makes the emptybytearray object go down a common path rather than a colder path by default. For similar fast-path alignment reasons, Numpy internally sets its "aligned" flag on empty buffers even if the pointer isn't actually aligned for the data type it purportedly points to, to avoid triggering copying code / extra handling in ufuncs that require alignment.

Copy link

ContributorAuthor

jakelishman commentedOct 26, 2025

About rarity of appearance: both of these pointers show through the buffer protocol like you mentioned, and that's where I come across them in language interop - that's the defined way to get zero-copy access to a data buffer owned by Python space.

We can't have zero-copy access to misaligned buffers, but in practice, the vast majority of buffers thatare back by an allocation end up aligned anyway, so requiring copies is rare (since you have to deliberately offset an allocated pointer by a sub-unit amount). We can add special handling to produce an empty slice in Rust that doesn't refer to the same pointer we actually get from the Python buffer protocol, if it's misaligned as an extra optimisation to avoid a copy, so we don'trequire CPython support, but if the default is actually aligned, then there's less point propagating this hypothetical optimising special-handling code through a lot of downstream packages and just using the slower "copy to force alignment" paths that (should) already exist for it.

When I wrote this patch, it was zero cost to CPython to achieve that for the defaults. If you have additional work that would seriously raise the cost to CPython then the calculus is different, though even reliably having the empty buffer aligned on an 8, like the emptybytes buffer is in practice, would be enough for the majority of cases I care about.

Copy link

Contributor

cmaloney commentedOct 26, 2025

Can you link to the code or provide a rust sample which needs to special-case handling a zero-length bytearray? I think that would help me understand here.

Copy link

Member

serhiy-storchaka commentedOct 27, 2025

It is not guaranteed that the start of the bytearray buffer has some alignment. This is a CPython implementation detail. But some code depends on this, and it may not work on non-x86 platforms if this is not aligned. There are exceptions: if there was a deletion from the beginning of the bytearray (we cannot do anything with this, but the user can guarantee that this did not happen), and when the bytearray is empty. The latter case is easy to fix for us, it costs nothing.

Copy link

ContributorAuthor

jakelishman commentedOct 27, 2025

cmaloney: Let's say I've got FFI got that wraps thePy_buffer interface, first by making (not precise - I'm just including the illustrative stuff):

structPyBuffer{buf:*mut(),itemsize:isize,ndim:::std::ffi::c_int,shape:*constisize,strides:*constisize,}implPyBuffer{/// Is the slice contiguous in memory?fnis_contiguous(&self) ->bool{/* ... */}/// How many bytes can be read from it?fnlen_bytes(&self) ->usize{/* ... */}}

Let's say I've got one of these structs that I then initialised withPyObject_GetBuffer, and now I want to expose a Rust-native slice view for a specific Rust type, which might not match the "native" type of the buffer, since I might have been given an array created from abytearray object whoseitemsize is 1 but actually represents storage ofuint64_t (u64 in Rust). I return anResult here to signify to the caller that the function is fallible¹:

enumSliceError{Unaligned,Noncontiguous,Nullptr,}fnslice_from_buffer<T>(buf:&PyBuffer) ->Result<&[T],SliceError>{if buf.buf.is_null(){returnErr(SliceError::Nullptr);}if !buf.is_contiguous(){returnErr(SliceError::Noncontiguous);}if !buf.buf.is_aligned::<T>(){returnErr(SliceError::Unaligned);}// SAFETY: pointer is non-null, aligned, and valid for this many contiguous reads:Ok(unsafe{ std::slice::from_raw_parts(    buf.buf,    buf.len_bytes() / std::mem::size_of::<T>()})}

I have to do the alignment and contiguous checks before I callstd::slice::from_raw_parts, because it's undefined behaviour in Rust to create a slice backed by an unaligned or null pointer, even if the length is zero. (The reason is to enable specific type-niche optimisations in the compiler to save space in compound types that contain&[T].)

A Rust caller of this function is still responsible for handling the case of an unaligned pointer, which might cause them to do something like

matchslice_from_buffer::<u64>(&buf){Ok(slice) =>use_slice(slice),Err(SliceError::Noncontiguous |SliceError::Unaligned) =>{let aligned_contiguous =/* copy buffer to somewhere aligned */;use_slice(aligned_contiguous.as_slice())}Err(SliceError::Nullptr) =>panic!(),}

The idea of this PR is just to make it so that slightly more stuff by default gets to go down the happy path, in a way that doesn't cost CPython anything. It's still the Rust user's responsibility to handle the unhappy path since that's totally valid Python code still, and the Rust library's responsibility to make sure everything is safe for FFI use in Rust². This PR isn't intending to add any restrictions on what CPython is allowed to do or what other Python implementations may do.

Your#139871 looks to me to also achieve the same goal I was going for here, just as a side effect (since the emptybytes buffer happens to be 8-byte aligned in CPython), so if that merged, it'd improve thebytearray situation as well. This PR isslightly stronger forbytearray, but that's largely immaterial (it most likely only affects SIMD types).

Footnotes

this code is usually in Python interface libraries likePyO3, whose implementation is here, but I wrote the example manually because PyO3's has additional complications and spreads the required checks into a few separate places. Also, the same sort of code comes up in lots of specialised places too, likerust-numpy that provides FFI access to Numpy arrays, etc.↩
there are safe Rust-side optimisations that can be done around empty slices (which I'm planning to contribute in several Rust projects) such as creating a slice out of a dangling pointer magicked from thin air, which helps these cases as well,↩

Copy link

Contributor

cmaloney commentedOct 27, 2025•
edited
Loading

I'm 👍 for doingarray here as it supports larger than 1-byte element objects. Does still need a test so if the code is broken/removed accidentally CPython devs will notice. That should be implementable withhttps://docs.python.org/3/library/ctypes.html#ctypes.alignment.

bytearray should be covered by my CPython internals change which will help your case. For bytes-like and slicing in general across rust and other languages I foundhttps://davidben.net/2024/01/15/empty-slices.html really helpful to improve my knowledge. I don't think it makes sense for CPython to adopt the rust semantic for its "bytes-like" generally; and "buffer protocol" /memoryview across the ecosystem is based on C "bytes" for better and worse.

cmaloney previously requested changes

Oct 29, 2025

View reviewed changes

Copy link

Contributor

cmaloney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Reduce scope to justarray.array + add a test; empty buffer + non-emptyarray.array storage being aligned I think is a nice improvement

Copy link

Member

serhiy-storchaka commentedOct 29, 2025

I do not think that bytearray should be excluded. If not a special optimization of using NULL for empty array, it would have standard alignment (as returned bymalloc()). The fact that currently it can have worse alignment is the result of optimization. Optimization should not cause regression.

Copy link

Contributor

cmaloney commentedOct 29, 2025

re:bytearraygh-140557 /GH-140128 makes it aligned as a side effect becausebytes is aligned (and removes usage of _PyByteArray_empty_string; ob_start is always set)

Copy link

Member

serhiy-storchaka commentedOct 29, 2025

Then I do not see reasons to object this change.

Copy link

Contributor

cmaloney commentedOct 29, 2025

I don't like having implementation details relied on by external code/systems without a test / validation. To me that creates a potential future release blocker if the assertion gets broken. I don't see adding a test here as a really onerous requirement.
Aligning_PyByteArray_empty_string is going to be a no op / just waste space shortly, so isn't a lot of value in doing in this PR as this will just show up in 3.15

I have slight nit in wording of the NEWS entry; I think it would be good to mention Rust and to make it more concise.

Copy link

Member

serhiy-storchaka commentedOct 29, 2025

I think there is a good chance that the code that relies on this already exist. It works only because the Intel plathform is tolerable to unaligned access (and there is no actually access to memory here, because it is an empty buffer). But on other platforms pointers of different type can use different registers.

This is not limited to Rust. In C, casting a pointer with wrong alignment can be an undefined behavior. If bytesarray is used instead of array as a collection of 16-, 32- or 64-bit integers, there may be problems.

jakelishmanand others added3 commits

October 30, 2025 00:36

Add tests of buffer pointer alignment

1fea8e5

Make NEWS more concise

b7eed2b

Merge remote-tracking branch 'python/main' into max-align-buffers

dd9e50b

jakelishman force-pushed themax-align-buffers branch from719e0e2 todd9e50bCompare

October 30, 2025 00:38

serhiy-storchaka reviewed

Dec 31, 2025

View reviewed changes

Copy link

Member

serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think the cost of the tests exceeds their value. They are complex and fragile.

If you insist on adding tests, they can be implemented by adding a simple helper in_testcapi. The tests should be decorated as CPython-only. They should also test non-empty arrays of different types.

jakelishman added3 commits

January 6, 2026 16:29

Merge remote-tracking branch 'python/main' into max-align-buffers

de09bbe

Revert Python-spacectypes tests

ad966dc

This reverts commit9d454de.This reverts commit1fea8e5.

Add alignment tests with_testcapi helper

b036215

Copy link

ContributorAuthor

jakelishman commentedJan 6, 2026

Thanks for the ping, and happy holidays everybody. I reverted the old fragile tests and added (my best attempt at) the new_testcapi-based versions. I'm back at work properly now after the break, so should be more responsive again.

Copy link

ContributorAuthor

jakelishman commentedJan 6, 2026

Ah, the test failure here is because the handling of thebytearray type aligns the empty array on 8s, likebytes does, whereas on some platforms, the max align is 16.

Shall I attempt to modify the basebytes/bytearray object to be aligned on the max alignment, or only commit as far as aligning on 8s?

Slacken test to match 'size_t' only

2dec490

Copy link

ContributorAuthor

jakelishman commentedJan 6, 2026

I don't think it's practical/possible to force alignment of the emptybytes buffer up to 16s - doing so would require the offset of.ob_sval inPyBytesObject to shift by 8 bytes on 16-byteMAX_ALIGN_T systems, which would just be wasted memory in the vast majority of cases, andbytesallocations already weren't reliably onMAX_ALIGN_T for the same reason, so there's no "optimisations shouldn't be observable" argument to make there.

I've changed the test case to explicit check forsize_t compatible alignment only, but happy to go a different approach if there's a preferred one.

cmaloney reviewed

Jan 7, 2026

View reviewed changes

Lib/test/test_buffer.py OutdatedShow resolvedHide resolved

cmaloney reviewed

Jan 7, 2026

View reviewed changes

Lib/test/test_buffer.py OutdatedShow resolvedHide resolved

cmaloney reviewed

Jan 7, 2026

View reviewed changes

Misc/NEWS.d/next/Core_and_Builtins/2025-10-24-17-30-51.gh-issue-140557.X2GETk.rst OutdatedShow resolvedHide resolved

Misc/NEWS.d/next/Core_and_Builtins/2025-10-24-17-30-51.gh-issue-140557.X2GETk.rstShow resolvedHide resolved

Modules/_testcapi/buffer.c OutdatedShow resolvedHide resolved

cmaloney dismissed theirstale review

January 7, 2026 06:28

stale; new state addresses core concerns

Fixup review comments

92a5fff

cmaloney approved these changes

Jan 8, 2026

View reviewed changes

cmaloney added the 🔨 test-with-buildbotsTest PR w/ buildbots; report in status section label

Jan 8, 2026

Copy link

bedevere-bot commentedJan 8, 2026

🤖 New build scheduled with the buildbot fleet by@cmaloney for commit92a5fff 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F140559%2Fmerge

If you want to schedule another build, you need to add the🔨 test-with-buildbots label again.

bedevere-bot removed the 🔨 test-with-buildbotsTest PR w/ buildbots; report in status section label

Jan 8, 2026

cmaloney reviewed

Jan 8, 2026

View reviewed changes

Lib/test/test_buffer.py Outdated

		def test_array_alignment(self):
		# gh-140557: pointer alignment of buffers including empty allocation
		# should be at least to `size_t`.
		align = struct.calcsize("N")

Copy link

Contributor

cmaloneyJan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Chatted with a Rust export on the "Rust for CPython Discord". Came to the conclusion with them that we should alignarray.array to 8 bytes since it can store 8 byte things (typecodes: "qQd",https://docs.python.org/3/library/array.html#array.array). On 32 bit platformssize_t will only require 4 byte alignment.

Rather thanstruct.calcsize("N") I think we can just use the constant 8 here + have a comment pointing to the item size table in thearray.arraydoc?

Copy link

joshtriplettJan 8, 2026•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Chiming in to confirm that we had this discussion. 👍 for aligning to 8 bytes on all platforms; it seems simpler and less surprising than trying to marginally optimize for thesubset of 4-byte platforms wherelong long anddouble are 4-byte aligned. Might also have benefits for other purposes, such as SIMD algorithms.

Perhttps://en.wikipedia.org/wiki/Data_structure_alignment:

when compiling for 32-bit x86:
[...]
A double (eight bytes) will be 8-byte aligned on Windows and 4-byte aligned on Linux (8-byte with -malign-double compile time option).
A long long (eight bytes) will be 8-byte aligned on Windows and 4-byte aligned on Linux (8-byte with -malign-double compile time option).

Copy link

ContributorAuthor

jakelishmanJan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Oh yeah thanks - I was overly focused on how the buffer insidePyBytes ends up aligned and not thinking properly. I changed the test ofarray.array to take the max item size from all available formats for the platform, though I can change it to be hard-coded 8 if that's preferred.

Copy link

Contributor

cmaloneyJan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Max ofarray.array's elements I like / means both new element adding or buffer allocation changes would result in the test needing an intentional update / helps ensure changes are intentional

Copy link

cosmicexplorerJan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I haven't fully paged in to this (awesome!) workstream yet, but I wanted to note that I'mincredibly inspired by the unstable rustcore::simd feature, and I maintain both thehttps://docs rs/re2 andhttps://docs.rs/vectrorscan-async (hyperscan) rust wrapper crates.

With regards to SIMD, I have two ideas:

Reflecting the FFI constraints into SIMD-compatible memory for rustcore::simd could be a great investigation for rust's external types proposal:Tracking issue for RFC 1861: Extern types rust-lang/rust#43467 (comment)
I have begun work that searches for SIMD instructions in the cpython configure script:https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164/70
- I would be very interested to discuss this sort of thing further. In that post I mention two specific use cases (one of them being url quoting in pip) which would produce very noticeable real-world speedups in foundational python tooling like pip.

I'm also right now doing a cleanroom implementation of zstd in rust (I can explain my concerns with ruzstd) and making use of alignment for SIMD to improve over the zstd C implementation (currently working on a ring buffer component), which I very much intend to plug into a python native module.

This comment is in full support of this PR and this workstream. Please let me know if I can follow up and learn more about this! Thanks! ^_^

Require max observable alignment in array test

d3ca57f

Copy link

ContributorAuthor

jakelishman commentedJan 13, 2026

I think I've addressed all the extant comments in this PR, by the way. Happy to make more changes if there's more needed to land it.

Copy link

cosmicexplorer commentedJan 15, 2026

very non-blocking nitpick@jakelishman re#140559 (comment):

I have to do the alignment and contiguous checks before I callstd::slice::from_raw_parts, because it's undefined behaviour in Rust to create a slice backed by an unaligned or null pointer, even if the length is zero. (The reason is to enable specific type-niche optimisations in the compiler to save space in compound types that contain&[T].)

I have been handling this precise situation in my rust code withslice::align_to{,_mut}() instead of erroring out. That method is implemented usingpointer::align_offset() and then aconst! { ... } evaluation of the gcd in the stdlib:https://github.com/rust-lang/rust/blob/a6acf0f07f0ed1c12e26dc0db3b9bf1d0504a0bb/library/core/src/slice/mod.rs#L4220

Is that approach not applicable here? Am I missing some reason why we knowpointer::align_offset() would be out of range, for example?

I would love to address this in a followup change if not! Thanks!