
Improve write performance of shards #2977


Draft
balbasty wants to merge 6 commits into zarr-developers:main from balbasty:perf-write-shard

Conversation

balbasty

The poor write performance of sharded zarrs in the zarr-python implementation is currently a major limiting factor to its adoption by our group. We found that writing shard-by-shard into an empty sharded array is an order of magnitude slower than writing into an unsharded zarr. This is surprising, as writing full shards should be only marginally slower than writing unsharded chunks.
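For context, a benchmark along these lines reproduces the effect (a minimal sketch assuming the zarr-python 3 `create_array` API; the store path and dtype are illustrative, not the script actually used):

```python
import time

import numpy as np
import zarr

# Illustrative setup: one shard covering the whole array, 128**3 inner chunks.
z = zarr.create_array(
    store=zarr.storage.LocalStore("bench.zarr"),
    shape=(512, 512, 512),
    shards=(512, 512, 512),
    chunks=(128, 128, 128),
    dtype="f4",
)
data = np.random.rand(512, 512, 512).astype("f4")

t0 = time.perf_counter()
z[:] = data  # write one full shard
print(f"one shard: {time.perf_counter() - t0:.2f} s")
```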

While this 2023 discussion suggests that the latency is caused by the re-generation of the index table, this does not seem to be the case in the latest implementation, which keeps all chunks in memory and (properly) waits for all chunks to be available before generating the index table (see `_encode_partial_single`).

Instead, I found that a major cause of the slowdown is the implementation of the `Buffer` class, which calls `np.concatenate` every time bytes are added to the buffer. As a proof of concept, I have implemented an alternative `DelayedBuffer` class that keeps individual byte chunks in a list and only concatenates them when needed. On a simple benchmark that uses `512**3` shards and `128**3` chunks with a local store, it reduces the time to write one shard from ~10 s to ~1 s, which is on par with the time taken to write the same `512**3` array to an unsharded zarr (~0.9 s).
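To illustrate the idea (a simplified sketch, not the actual `DelayedBuffer` from this PR; the method names here are illustrative):

```python
import numpy as np


class DelayedBufferSketch:
    """Accumulate byte chunks in a list; concatenate only when asked.

    Appending to a list is O(1), so n appends cost O(total_bytes) overall,
    versus O(n * total_bytes) copied when np.concatenate runs on every append.
    """

    def __init__(self) -> None:
        self._parts: list[np.ndarray] = []
        self._nbytes: int = 0

    def append(self, data: bytes) -> None:
        self._parts.append(np.frombuffer(data, dtype="uint8"))
        self._nbytes += len(data)

    def __len__(self) -> int:
        return self._nbytes

    def materialize(self) -> np.ndarray:
        # The single concatenation, deferred until the bytes are needed.
        if not self._parts:
            return np.empty(0, dtype="uint8")
        if len(self._parts) > 1:
            self._parts = [np.concatenate(self._parts)]
        return self._parts[0]
```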

I am keeping this as a draft for now, as it is a hacky proof-of-concept implementation, but I am happy to clean it up if this is found to be a good solution (with guidance on how to implement the delayed buffer in a way that is compatible with the buffer-prototype logic, which I don't fully understand). All tests pass except one that checks whether a store receives a `TestBuffer` (it instead receives a `DelayedBuffer`).

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in `docs/user-guide/*.rst`
  • Changes documented as a new file in `changes/`
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the needs release notes label Apr 11, 2025
@d-v-b
Contributor

@balbasty thank you so much for this work. I think your detective work here will be very much appreciated.

general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of chunks can be known prior to fetching anything.

@balbasty
Author

> general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of chunks can be known prior to fetching anything.

I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table); otherwise, compressed chunks would take more space than needed. The format is either `index_table + stack([encoded_chunks])` or `stack([encoded_chunks]) + index_table`.
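As a rough sketch of the chunks-then-index layout (simplified; the real Zarr v3 shard index is written by a codec pipeline and includes a checksum by default):

```python
import struct


def assemble_shard(encoded_chunks: list[bytes]) -> bytes:
    """Concatenate variable-size encoded chunks, then append an index of
    (offset, nbytes) pairs, one per chunk, as little-endian uint64."""
    payload = bytearray()
    index = bytearray()
    for chunk in encoded_chunks:
        index += struct.pack("<QQ", len(payload), len(chunk))
        payload += chunk
    return bytes(payload) + bytes(index)  # index table at the end
```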

@d-v-b
Contributor

> > general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of chunks can be known prior to fetching anything.
>
> I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table); otherwise, compressed chunks would take more space than needed. The format is either `index_table + stack([encoded_chunks])` or `stack([encoded_chunks]) + index_table`.

What I mean is that, when we get the index table, we also get the size of each compressed chunk. And when we are fetching chunks from a shard, we always know in advance which chunks we need. So it seems like the combination of the shard index + the set of requested chunks is sufficient to specify the required memory for compressed chunks exactly. Does this check out?
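If that reasoning holds, the required allocation could be computed up front from the index alone, e.g. something like this sketch (the `MISSING` sentinel follows the v3 sharding spec, which marks empty chunks with all-ones offsets; the function name is illustrative):

```python
MISSING = 2**64 - 1  # sentinel for empty chunks in the shard index


def bytes_needed(index: list[tuple[int, int]], requested: set[int]) -> int:
    """Total compressed size of the requested chunks, computable from the
    shard index ((offset, nbytes) per chunk) before any chunk is fetched."""
    return sum(
        nbytes
        for i, (offset, nbytes) in enumerate(index)
        if i in requested and offset != MISSING
    )
```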

@d-v-b added the performance label May 14, 2025
