
RFC: Zarr-Python 3.0 Design Doc #1569

Nov 14, 2023 · 2 comments · 9 replies

As part of the Zarr-Python Refactoring working group (#1480) I have developed a draft proposal for a major block of work in the Zarr-Python project, leading to version 3.0 and including support for the V3 spec. This post is both a Request for Comment and a forum for discussion.

Goals (read the doc for details)

  • Provide a complete implementation of Zarr V3 through the Zarr-Python API
  • Align the Zarr-Python array API with the array API Standard
  • Clear the way for exciting extensions / ZEPs (e.g. sharding, variable chunking)
  • Provide a developer API that can be used to implement and register V3 extensions
  • Improve the performance of Zarr-Python by streamlining the interface between the Store layer and higher level APIs (e.g. Groups and Arrays)
  • Clean up the internal and user facing APIs
  • Improve code quality and robustness (e.g. achieve 100% type hint coverage)

Thanks to @d-v-b, @normanrz, @rabernat, @dcherian, @monodeldiablo, @olimcc, @martindurant, and @JackKelly for their input so far. cc @zarr-developers/python-core-devs.

Note: The next Zarr-Python Refactor working group is slated to meet at 9am PT on November 22, 2023. We'll be discussing this topic in detail then, so please feel free to join (#1480).



@JackKelly

Looks awesome!

Working with batches of chunks when doing IO and/or (de)compression from sharded Zarrs

TL;DR: The current design looks great for handling something like 1,000 chunks per second [1]. But, if we want to push Zarr to handle more like one million chunks per second (from a sharded Zarr), then it'd be great to discuss adding support for processing batches of chunks in one go.

I'd be keen to discuss whether this draft proposal is the right place to consider adding support for processing batches of chunks (in a single function call). For example, allowing Array to ask the lower layers to load and decompress a million chunks, spread across a small number of shards, all in a single function call. There are several possible advantages, most of which have been discussed before:

Yes, in principle, the async get(key: str) -> bytes function proposed in the Store API in the current version of the Zarr-Python 3.0 doc does allow for parallel IO (by calling store.get(key2) before get(key1) has completed). But this makes it harder (or, in some cases, impossible) for the Store implementation to apply relevant optimisations [3].
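For concreteness, here's a minimal sketch of that overlapping-requests pattern, assuming a store object that implements the proposed async get(key: str) -> bytes:

```python
import asyncio

async def fetch_two(store, key1: str, key2: str) -> list[bytes]:
    # Issue both requests up front so their I/O latencies overlap;
    # gather() awaits them together and returns results in order.
    return await asyncio.gather(store.get(key1), store.get(key2))
```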

And, if we're trying to load one million chunks per second, then we'll bump into the issue that function calls are surprisingly slow in Python: each function call (with arguments) takes about 350 ns [4] (just for the function call, not doing any actual work!). A million function calls would take 0.35 seconds! So it should be more efficient to make a single get_items function call, with a million chunk keys as the function argument.
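A rough way to reproduce that ballpark on your own machine (a hypothetical micro-benchmark; absolute numbers vary with Python version and hardware):

```python
import timeit

def noop(key):
    # Stand-in for a per-chunk call that does no actual work.
    return None

n = 1_000_000
total = timeit.timeit(lambda: noop("chunk/0.0"), number=n)
print(f"~{total / n * 1e9:.0f} ns per call; ~{total:.2f} s for a million calls")
```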

Some reasons why we might want to ignore this suggestion for Zarr-Python:

  • Maybe Zarr-Python isn't the right place to try to implement "extreme Zarr performance"? Pure-Python code will struggle to max out the hardware (for IO). Although it might be interesting to consider a world in which a Zarr "front-end" is implemented in Python, but some Stores are implemented in a compiled language (with a Python API).
  • Maybe I'm the only one who cares about (or thinks it's sane to consider!) loading one million chunks per second?!?
  • Perhaps adding support for batched processing should be a topic of discussion for a future update to the Zarr spec, not this update to Zarr-Python.

Footnotes

  1. I've picked the number "1,000" from thin air, TBH!

  2. Although I haven't been able to find any benchmarks to show that vectored I/O is actually faster, especially when using Linux io_uring or Windows I/O Ring. (Because io_uring and I/O Ring allow multiple read requests to be submitted in a single system call, which perhaps negates the obvious advantage of vectored I/O.)

  3. A Store could collect multiple get() requests by waiting a while, to see if it receives any more get() requests, before performing any IO. But requiring the Store to wait feels like it defeats the objective of going fast! And it adds complexity, plus the time overhead of performing a million function calls in Python.

  4. The figure of 350 ns is from an old (2015) blog post. Since then, Python 3.10 and 3.11 sped up function calls a bit (20%??). But I couldn't find any more modern benchmarks.

9 replies
@martindurant

I would say it's not really the loop or the function call that will make the difference, but (as other benchmarking already found) memcopies, CPython API calls and syscalls. For instance:

```python
def g(n):
    for i in range(n):
        pass
```

10.2 ms

```python
def g(n):
    for i in range(n):
        b[10:20]
```

(where b is a bytes object)

61 ms
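(A harness along these lines reproduces the comparison; n isn't stated above, but n = 1_000_000 is consistent with the quoted timings:)

```python
import timeit

b = bytes(1000)

def loop_only(n):
    for i in range(n):
        pass

def loop_slice(n):
    for i in range(n):
        b[10:20]  # each slice allocates a fresh bytes object (a memcopy)

# n is an assumption; 1_000_000 fits the ~10 ms vs ~61 ms figures.
print(timeit.timeit(lambda: loop_only(1_000_000), number=1))
print(timeit.timeit(lambda: loop_slice(1_000_000), number=1))
```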

@JackKelly

Hi @jni, ha! Sorry I didn't "connect the dots" between your blog and your GitHub ID! It's a small world, huh?!? Thanks loads for re-doing the benchmarks!

@martindurant, I completely agree that memcopies are surprisingly slow. As you've mentioned in the past, the ideal case would be that - for uncompressed Zarr chunks - we can do exactly one copy: from storage to the final numpy array (ideally using DMA). If we have batched processing of chunks, then there's at least a chance that, even for chunks that need to be "scattered" throughout the final numpy array, it'd still be possible to do that using DMA (e.g. usingreadv to read different parts of a single chunk into different memory locations, with a single system call).
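To make the readv idea concrete, a sketch (Linux/Unix only; the shard filename and offsets are hypothetical) of scattering one contiguous on-disk chunk into two non-adjacent regions of the final array with a single system call:

```python
import os
import numpy as np

final = np.empty(1024, dtype=np.uint8)  # destination array
fd = os.open("shard.bin", os.O_RDONLY)  # hypothetical shard file
try:
    # os.preadv() fills a sequence of buffers from one contiguous file
    # region starting at `offset`, in a single syscall.
    dest_a = memoryview(final)[0:256]
    dest_b = memoryview(final)[512:768]
    offset = 4096  # hypothetical byte offset of the chunk within the shard
    nread = os.preadv(fd, [dest_a, dest_b], offset)
finally:
    os.close(fd)
```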

@martindurant

I was not really advocating for a low-level super-fix like that, but by all means try it. I think the Python readinto and decompress_into are probably enough; the point is that I don't think the "batch" concept will make much difference so long as high-latency reads are awaited together. I consider this the low-hanging fruit, together with not reading bytes we don't need.

All contingent on contiguous memory blocks. Where striding is at issue, we probably can't do better than numpy copies, unless we write our own decompression wrapper capable of "getting every fourth uncompressed byte" or similar pattern.
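As a tiny illustration of the readinto idea (the filename is hypothetical): reading straight into a preallocated numpy buffer avoids the intermediate bytes object, and hence one memcopy:

```python
import numpy as np

final = np.empty(1_000_000, dtype=np.uint8)
with open("chunk.bin", "rb") as f:  # hypothetical uncompressed chunk file
    nread = f.readinto(memoryview(final))
```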

@JackKelly

On the topic of where to put this in the stack...

I feel I'm too new to the Zarr-Python world to have a particularly strong grasp of the pros and cons of where to place this. But I will give it more thought.

One option that I'm considering for my Zarr Rust experiments is to have a method which, ported to Python, would be something like:

```python
from typing import Callable, Iterable, Optional

class Store:
    async def get_items(
        self,
        keys: Iterable[str],
        transform: Optional[Callable] = None,
    ) -> list[bytes]:
        """Get multiple chunks.

        Args:
            keys: List of chunk keys.
            transform: A function that will be applied to every chunk.
                The function must take two arguments:
                    - chunk_id: int
                    - chunk_data: bytes
                And it must return bytes or None (if, for example, the
                processing function moves data to a final array).
        """
```

Then the Array can ask the Store to perform any arbitrary transform on each chunk (e.g. decompress and copy to the final numpy array). My motivation is twofold: I want processing to start as soon as each chunk is available, using a threadpool (one thread per CPU core). And I want all processing (decompression, maybe some numerical transform, and copying to the final array) to happen in very quick succession, to maximise the chance that the chunk stays in CPU cache.
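For example, a hypothetical sketch of the Array side, using numcodecs for decompression (store, keys, and the chunk layout are all assumptions):

```python
import asyncio
import numpy as np
import numcodecs

CHUNK_LEN = 1024  # hypothetical uncompressed chunk length (bytes)
N_CHUNKS = 4      # hypothetical number of chunks in the selection
codec = numcodecs.Zstd()
final = np.empty(N_CHUNKS * CHUNK_LEN, dtype=np.uint8)

def scatter(chunk_id: int, chunk_data: bytes) -> None:
    # Decompress and copy straight into the final array while the chunk
    # is (hopefully) still hot in CPU cache. Returning None signals that
    # the data has already reached its destination.
    out = np.frombuffer(codec.decode(chunk_data), dtype=np.uint8)
    final[chunk_id * CHUNK_LEN:(chunk_id + 1) * CHUNK_LEN] = out

# `store` implements get_items() above; `keys` lists the chunk keys:
# asyncio.run(store.get_items(keys, transform=scatter))
```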

So, this kind of dodges the question of "where to put decompression". The Store can perform any arbitrary computation. But the decompression algorithm(s) wouldn't be defined in the Store.

But I don't know if that's appropriate for Zarr-Python. Nor have I actually implemented this in Rust yet, so I don't know if it's a good idea for Rust, either!

@martindurant

Were you following rfsspec? That does the concurrent reads; since cramjam is Rust with Python bindings, I thought it could nicely "decompress into" numpy buffers in CPU threads driven by tokio.


Hi all, just wanted to offer support for the roadmap and design doc (#1583); kudos for plotting a clear course through some very tricky terrain.

0 replies
Labels
enhancement (New features or improvements)
6 participants
@jhamman @normanrz @JackKelly @jni @alimanfoo @martindurant
