
Improve write performance of shards #2977


Draft
balbasty wants to merge 6 commits into zarr-developers:main from balbasty:perf-write-shard

Conversation

balbasty

The poor write performance of sharded zarrs in the zarr-python implementation is currently a major limiting factor to its adoption by our group. We found that writing shard-by-shard into an empty sharded array is an order of magnitude slower than writing into an unsharded zarr. This is surprising, as writing full shards should be only marginally slower than writing unsharded chunks.
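For context, a benchmark along these lines reproduces the effect (a minimal sketch assuming the zarr-python 3 `create_array` API; the store path and dtype are illustrative, not the script actually used):

```python
import time

import numpy as np
import zarr

# Illustrative setup: one shard covering the whole array, 128**3 inner chunks.
z = zarr.create_array(
    store=zarr.storage.LocalStore("bench.zarr"),
    shape=(512, 512, 512),
    shards=(512, 512, 512),
    chunks=(128, 128, 128),
    dtype="f4",
)
data = np.random.rand(512, 512, 512).astype("f4")

t0 = time.perf_counter()
z[:] = data  # write one full shard
print(f"one shard: {time.perf_counter() - t0:.2f} s")
```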

While this 2023 discussion suggests that the latency is caused by the re-generation of the index table, this does not seem to be the case in the latest implementation, which keeps all chunks in memory and (properly) waits for all chunks to be available before generating the index table (see `_encode_partial_single`).

Instead, I found that a major cause of the slowdown is the implementation of the `Buffer` class, which calls `np.concatenate` every time bytes are added to the buffer. As a proof of concept, I have implemented an alternative `DelayedBuffer` class that keeps individual byte chunks in a list and only concatenates them when needed. On a simple benchmark that uses `512**3` shards and `128**3` chunks with a local store, it reduces the time to write one shard from ~10 s to ~1 s, which is on par with the time taken to write the same `512**3` array to an unsharded zarr (~0.9 s).
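To illustrate the idea (a simplified sketch, not the actual `DelayedBuffer` from this PR; the method names here are illustrative):

```python
import numpy as np


class DelayedBufferSketch:
    """Accumulate byte chunks in a list; concatenate only when asked.

    Appending to a list is O(1), so n appends cost O(total_bytes) overall,
    versus O(n * total_bytes) copied when np.concatenate runs on every append.
    """

    def __init__(self) -> None:
        self._parts: list[np.ndarray] = []
        self._nbytes: int = 0

    def append(self, data: bytes) -> None:
        self._parts.append(np.frombuffer(data, dtype="uint8"))
        self._nbytes += len(data)

    def __len__(self) -> int:
        return self._nbytes

    def materialize(self) -> np.ndarray:
        # The single concatenation, deferred until the bytes are needed.
        if not self._parts:
            return np.empty(0, dtype="uint8")
        if len(self._parts) > 1:
            self._parts = [np.concatenate(self._parts)]
        return self._parts[0]
```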

I am keeping this as a draft for now, as it is a hacky proof-of-concept implementation, but I am happy to clean it up if this is found to be a good solution (with guidance on how to implement the delayed buffer in a way that is compatible with the buffer-prototype logic, which I don't fully understand). All tests pass except one that checks whether a store receives a `TestBuffer` (it instead receives a `DelayedBuffer`).

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in `docs/user-guide/*.rst`
  • Changes documented as a new file in `changes/`
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the needs release notes label Apr 11, 2025
@d-v-b
Contributor

@balbasty thank you so much for this work. I think your detective work here will be very much appreciated.

general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of chunks can be known prior to fetching anything.

@balbasty
Author

> general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of chunks can be known prior to fetching anything.

I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table); otherwise, compressed chunks would take more space than needed. The format is either `index_table + stack([encoded_chunks])` or `stack([encoded_chunks]) + index_table`.
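As a rough sketch of the chunks-then-index layout (simplified; the real Zarr v3 shard index is written by a codec pipeline and includes a checksum by default):

```python
import struct


def assemble_shard(encoded_chunks: list[bytes]) -> bytes:
    """Concatenate variable-size encoded chunks, then append an index of
    (offset, nbytes) pairs, one per chunk, as little-endian uint64."""
    payload = bytearray()
    index = bytearray()
    for chunk in encoded_chunks:
        index += struct.pack("<QQ", len(payload), len(chunk))
        payload += chunk
    return bytes(payload) + bytes(index)  # index table at the end
```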

@d-v-b
Contributor

> > general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of chunks can be known prior to fetching anything.
>
> I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table); otherwise, compressed chunks would take more space than needed. The format is either `index_table + stack([encoded_chunks])` or `stack([encoded_chunks]) + index_table`.

What I mean is that, when we get the index table, we also get the size of each compressed chunk. And when we are fetching chunks from a shard, we always know in advance which chunks we need. So it seems like the combination of the shard index + the set of requested chunks is sufficient to specify the required memory for compressed chunks exactly. Does this check out?
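If that reasoning holds, the required allocation could be computed up front from the index alone, e.g. something like this sketch (the `MISSING` sentinel follows the v3 sharding spec, which marks empty chunks with all-ones offsets; the function name is illustrative):

```python
MISSING = 2**64 - 1  # sentinel for empty chunks in the shard index


def bytes_needed(index: list[tuple[int, int]], requested: set[int]) -> int:
    """Total compressed size of the requested chunks, computable from the
    shard index ((offset, nbytes) per chunk) before any chunk is fetched."""
    return sum(
        nbytes
        for i, (offset, nbytes) in enumerate(index)
        if i in requested and offset != MISSING
    )
```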

@d-v-b added the performance label May 14, 2025
