
RFC: Zarr-Python 3.0 Design Doc #1569

Nov 14, 2023 · 2 comments · 9 replies

As part of the Zarr-Python Refactoring working group (#1480) I have developed a draft proposal for a major block of work in the Zarr-Python project, leading to version 3.0 and including support for the V3 spec. This post is both a Request for Comment and a forum for discussion.

Goals (read the doc for details)

  • Provide a complete implementation of Zarr V3 through the Zarr-Python API
  • Align the Zarr-Python array API with the array API Standard
  • Clear the way for exciting extensions / ZEPs (e.g. sharding, variable chunking)
  • Provide a developer API that can be used to implement and register V3 extensions
  • Improve the performance of Zarr-Python by streamlining the interface between the Store layer and higher level APIs (e.g. Groups and Arrays)
  • Clean up the internal and user facing APIs
  • Improve code quality and robustness (e.g. achieve 100% type hint coverage)

Thanks to @d-v-b, @normanrz, @rabernat, @dcherian, @monodeldiablo, @olimcc, @martindurant, and @JackKelly for their input so far. cc @zarr-developers/python-core-devs.

Note: The next Zarr-Python Refactor working group is slated to meet at 9am PT on November 22, 2023. We'll be discussing this topic in detail then, so please feel free to join (#1480).



@JackKelly

Looks awesome!

Working with batches of chunks when doing IO and/or (de)compression from sharded Zarrs

TL;DR: The current design looks great for handling something like 1,000 chunks per second [1]. But, if we want to push Zarr to handle more like one million chunks per second (from a sharded Zarr), then it'd be great to discuss adding support for processing batches of chunks in one go.

I'd be keen to discuss whether this draft proposal is the right place to consider adding support for processing batches of chunks (in a single function call). For example, allowing Array to ask the lower layers to load and decompress a million chunks, spread across a small number of shards, all in a single function call. There are several possible advantages, most of which have been discussed before:

Yes, in principle, the async get(key: str) -> bytes function proposed in the Store API in the current version of the Zarr-Python 3.0 doc does allow for parallel IO (by calling store.get(key2) before get(key1) has completed). But this makes it harder (or, in some cases, impossible) for the Store implementation to apply relevant optimisations [3].
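For concreteness, here's a minimal sketch of that overlapping-requests pattern, assuming a store object that implements the proposed async get(key: str) -> bytes:

```python
import asyncio

async def fetch_two(store, key1: str, key2: str) -> list[bytes]:
    # Issue both requests up front so their I/O latencies overlap;
    # gather() awaits them together and returns results in order.
    return await asyncio.gather(store.get(key1), store.get(key2))
```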

And, if we're trying to load one million chunks per second, then we'll bump into the issue that function calls are surprisingly slow in Python: each function call (with arguments) takes about 350 ns [4] (just for the function call, not doing any actual work!). A million function calls would take 0.35 seconds! So it should be more efficient to make a single get_items function call, with a million chunk keys as the function argument.
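A rough way to reproduce that ballpark on your own machine (a hypothetical micro-benchmark; absolute numbers vary with Python version and hardware):

```python
import timeit

def noop(key):
    # Stand-in for a per-chunk call that does no actual work.
    return None

n = 1_000_000
total = timeit.timeit(lambda: noop("chunk/0.0"), number=n)
print(f"~{total / n * 1e9:.0f} ns per call; ~{total:.2f} s for a million calls")
```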

Some reasons why we might want to ignore this suggestion for Zarr-Python:

  • Maybe Zarr-Python isn't the right place to try to implement "extreme Zarr performance"? Pure-Python code will struggle to max out the hardware (for IO). Although it might be interesting to consider a world in which a Zarr "front-end" is implemented in Python, but some Stores are implemented in a compiled language (with a Python API).
  • Maybe I'm the only one who cares about (or thinks it's sane to consider!) loading one million chunks per second?!?
  • Perhaps adding support for batched processing should be a topic of discussion for a future update to the Zarr spec, not this update to Zarr-Python.

Footnotes

  1. I've picked the number "1,000" from thin air, TBH!

  2. Although I haven't been able to find any benchmarks to show that vectored I/O is actually faster, especially when using Linux io_uring or Windows I/O Ring. (Because io_uring and I/O Ring allow multiple read requests to be submitted in a single system call, which perhaps negates the obvious advantage of vectored I/O.)

  3. A Store could collect multiple get() requests by waiting a while, to see if it receives any more get() requests, before performing any IO. But requiring the Store to wait feels like it defeats the objective of going fast! And it adds complexity, plus the time overhead of performing a million function calls in Python.

  4. The figure of 350 ns is from an old (2015) blog post. Since then, Python 3.10 and 3.11 sped up function calls a bit (20%??). But I couldn't find any more modern benchmarks.

9 replies
@martindurant

I would say it's not really the loop or the function call that will make the difference, but (as other benchmarking already found) memcopies, CPython API calls and syscalls. For instance:

```python
def g(n):
    for i in range(n):
        pass
```

10.2 ms

```python
def g(n):
    for i in range(n):
        b[10:20]
```

(where b is a bytes object)

61 ms
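(A harness along these lines reproduces the comparison; n isn't stated above, but n = 1_000_000 is consistent with the quoted timings:)

```python
import timeit

b = bytes(1000)

def loop_only(n):
    for i in range(n):
        pass

def loop_slice(n):
    for i in range(n):
        b[10:20]  # each slice allocates a fresh bytes object (a memcopy)

# n is an assumption; 1_000_000 fits the ~10 ms vs ~61 ms figures.
print(timeit.timeit(lambda: loop_only(1_000_000), number=1))
print(timeit.timeit(lambda: loop_slice(1_000_000), number=1))
```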

@JackKelly

Hi @jni, ha! Sorry I didn't "connect the dots" between your blog and your GitHub ID! It's a small world, huh?!? Thanks loads for re-doing the benchmarks!

@martindurant, I completely agree that memcopies are surprisingly slow. As you've mentioned in the past, the ideal case would be that - for uncompressed Zarr chunks - we can do exactly one copy: from storage to the final numpy array (ideally using DMA). If we have batched processing of chunks, then there's at least a chance that, even for chunks that need to be "scattered" throughout the final numpy array, it'd still be possible to do that using DMA (e.g. usingreadv to read different parts of a single chunk into different memory locations, with a single system call).
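To make the readv idea concrete, a sketch (Linux/Unix only; the shard filename and offsets are hypothetical) of scattering one contiguous on-disk chunk into two non-adjacent regions of the final array with a single system call:

```python
import os
import numpy as np

final = np.empty(1024, dtype=np.uint8)  # destination array
fd = os.open("shard.bin", os.O_RDONLY)  # hypothetical shard file
try:
    # os.preadv() fills a sequence of buffers from one contiguous file
    # region starting at `offset`, in a single syscall.
    dest_a = memoryview(final)[0:256]
    dest_b = memoryview(final)[512:768]
    offset = 4096  # hypothetical byte offset of the chunk within the shard
    nread = os.preadv(fd, [dest_a, dest_b], offset)
finally:
    os.close(fd)
```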

@martindurant

I was not really advocating for a low-level super-fix like that, but by all means try it. I think the Python readinto and decompress_into are probably enough; the point is that I don't think the "batch" concept will make much difference so long as high-latency reads are awaited together. I consider this the low-hanging fruit, together with not reading bytes we don't need.

All contingent on contiguous memory blocks. Where striding is at issue, we probably can't do better than numpy copies, unless we write our own decompression wrapper capable of "getting every fourth uncompressed byte" or similar pattern.
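As a tiny illustration of the readinto idea (the filename is hypothetical): reading straight into a preallocated numpy buffer avoids the intermediate bytes object, and hence one memcopy:

```python
import numpy as np

final = np.empty(1_000_000, dtype=np.uint8)
with open("chunk.bin", "rb") as f:  # hypothetical uncompressed chunk file
    nread = f.readinto(memoryview(final))
```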

@JackKelly

On the topic of where to put this in the stack...

I feel I'm too new to the Zarr-Python world to have a particularly strong grasp of the pros and cons of where to place this. But I will give it more thought.

One option that I'm considering for my Zarr Rust experiments is to have a method which, ported to Python, would be something like:

```python
from typing import Callable, Iterable, Optional

class Store:
    async def get_items(
        self,
        keys: Iterable[str],
        transform: Optional[Callable] = None,
    ) -> list[bytes]:
        """Get multiple chunks.

        Args:
            keys: List of chunk keys.
            transform: A function that will be applied to every chunk.
                The function must take two arguments:
                    - chunk_id: int
                    - chunk_data: bytes
                And it must return bytes or None (if, for example, the
                processing function moves data to a final array).
        """
```

Then the Array can ask the Store to perform any arbitrary transform on each chunk (e.g. decompress and copy to the final numpy array). My motivation is twofold: I want processing to start as soon as each chunk is available, using a threadpool (one thread per CPU core). And I want all processing (decompression, maybe some numerical transform, and copying to the final array) to happen in very quick succession, to maximise the chance that the chunk stays in CPU cache.
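For example, a hypothetical sketch of the Array side, using numcodecs for decompression (store, keys, and the chunk layout are all assumptions):

```python
import asyncio
import numpy as np
import numcodecs

CHUNK_LEN = 1024  # hypothetical uncompressed chunk length (bytes)
N_CHUNKS = 4      # hypothetical number of chunks in the selection
codec = numcodecs.Zstd()
final = np.empty(N_CHUNKS * CHUNK_LEN, dtype=np.uint8)

def scatter(chunk_id: int, chunk_data: bytes) -> None:
    # Decompress and copy straight into the final array while the chunk
    # is (hopefully) still hot in CPU cache. Returning None signals that
    # the data has already reached its destination.
    out = np.frombuffer(codec.decode(chunk_data), dtype=np.uint8)
    final[chunk_id * CHUNK_LEN:(chunk_id + 1) * CHUNK_LEN] = out

# `store` implements get_items() above; `keys` lists the chunk keys:
# asyncio.run(store.get_items(keys, transform=scatter))
```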

So, this kind of dodges the question of "where to put decompression". The Store can perform any arbitrary computation. But the decompression algorithm(s) wouldn't be defined in the Store.

But I don't know if that's appropriate for Zarr-Python. Nor have I actually implemented this in Rust yet, so I don't know if it's a good idea for Rust, either!

@martindurant

Were you following rfsspec? That does the concurrent reads; since cramjam is Rust with Python bindings, I thought it could nicely "decompress into" numpy buffers in CPU threads driven by tokio.


Hi all, just wanted to offer support for the roadmap and design doc (#1583); kudos for plotting a clear course through some very tricky terrain.

0 replies
Labels
enhancement (New features or improvements)
6 participants
@jhamman @normanrz @JackKelly @jni @alimanfoo @martindurant
