Supporting the array API in zarr-python #1614

d-v-b started this conversation in General

The Python array API standard is an effort to standardize the API for various NDArray objects across the Python ecosystem. In the v3 roadmap, one of the goals is to "Align the Zarr-Python array API with the array API Standard". I would like to use this discussion to consider how we can achieve this goal.

Array Attributes

Here are the attributes defined in the array API standard:

| Name | Description |
| --- | --- |
| `array.dtype` | Data type of the array elements. |
| `array.device` | Hardware device the array data resides on. |
| `array.mT` | Transpose of a matrix (or a stack of matrices). |
| `array.ndim` | Number of array dimensions (axes). |
| `array.shape` | Array dimensions. |
| `array.size` | Number of elements in an array. |
| `array.T` | Transpose of the array. |

Some of these (`size`, `shape`) are extremely simple to support, but I'm curious about what `array.device` should be for a zarr array. I don't know much about computing on GPUs, but I am guessing that's where `array.device` is most relevant (the docs for this attribute in the array API say as much). One interpretation for zarr-python could be that the device represents the particular storage backend for the chunks of the array, so for data stored on AWS S3, the device would be some Python object that represents "data stored on S3", and so on for data on other storage backends. I'm curious to hear any other thoughts on this -- since I never use the `.device` attribute of a numpy array, my intuitions for what could work here might be way off.
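To make the storage-backend interpretation concrete, here is a minimal sketch of what such a device object might look like. `StoreDevice` is purely hypothetical -- zarr-python defines no `device` attribute today, and the identity-based equality rule is just one possible choice:

```python
# Hypothetical sketch only: zarr-python does not define a ``device`` today.
# Interpretation: the "device" of a zarr array is the store holding its chunks.


class StoreDevice:
    """Wrap a zarr store so it can serve as an array-API ``device`` object."""

    def __init__(self, store):
        self.store = store

    def __eq__(self, other):
        # One possible rule: two arrays share a device iff they share a store.
        return isinstance(other, StoreDevice) and self.store is other.store

    def __repr__(self):
        return f"StoreDevice({type(self.store).__name__})"


# z.device could then return StoreDevice(z.store), so two arrays backed by
# the same S3 bucket (or the same local directory) compare as same-device.
```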

Array methods / functions

The array API defines a LOT of functions and methods that transform arrays into new arrays or scalars. Besides indexing, I think implementing these routines in zarr-python would be a lot of work -- to implement something like `x: zarr.Array = zarray.mean(0).std(0)` we would need to use or create a lazy graph-based computation system, and I don't see that happening any time soon (although it would be super cool).

So I would suggest that we support operations that select data (i.e., indexing), but not operations that transform data. `astype` could be an exception here, since calling `.astype` on some chunks after loading them is pretty cheap. But this would still require breaking ground on a lazy evaluation system (or depending on one, which seems undesirable right now).
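To illustrate the select-vs-transform boundary with today's zarr-python API (the numpy fallback at the end is just what a user would write now, not a proposed interface):

```python
import numpy as np
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f8")

# Selection: supportable. Only the chunks intersecting the request are read,
# so indexing maps naturally onto zarr's storage model.
block = z[0:100, 0:100]  # one chunk read, returns an in-memory numpy array

# Transformation: returning z.mean(axis=0) as an unevaluated array would
# require a deferred-execution system; today the user materializes a
# selection and computes with numpy instead.
col_means = np.asarray(z[:, :5]).mean(axis=0)
```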

How much of the API standard can we support?

Without data-transforming functions and methods, not much, in percentage terms! I couldn't find guidelines for libraries that only support a subset of the standard, but maybe this describes most array libraries in use today other than numpy. However, I think this is fine. As long as zarr arrays can be coerced to numpy / cupy / ... arrays as needed, users should be able to compute what they need using the numpy / cupy / ... APIs.
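That coercion path already works, since `zarr.Array` implements `__array__`; a quick illustration:

```python
import numpy as np
import zarr

z = zarr.array(np.arange(12).reshape(3, 4))

# The cheap array-API attributes are already available:
assert z.shape == (3, 4) and z.ndim == 2 and z.size == 12

# Everything else: coerce to numpy (this reads the whole array into memory)
# and compute with numpy's own API from there.
result = np.asarray(z).mean(axis=0)
```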

I'm curious to hear what other people think about this approach.

Replies: 1 comment

rabernat commented:

Today, in practice, we rely on dask.array pretty heavily to defer execution against Zarr-backed arrays. Dask already implements a "lazy graph-based computation system", and we should definitely not try to create a new alternative to it here. We could, however, aim to integrate with other similar libraries, such as cubed. There may be room for a more lightweight deferred-execution array library (similar to Xarray's duck arrays), but I don't think that belongs in Zarr. We should document better how users can wrap their Zarr arrays in Dask / Cubed arrays in order to obtain deferred execution.
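For example, with `example.zarr` standing in for any Zarr store or path:

```python
import dask.array as da

# Wrap Zarr storage in a dask array: this builds a lazy representation and
# reads no chunk data until compute time.
d = da.from_zarr("example.zarr")

# Deferred execution: mean/std become a task graph over the Zarr chunks,
# evaluated only when .compute() is called.
result = d.mean(axis=0).std(axis=0).compute()
```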

In Zarr, we should focus on implementing operations that can be pushed down to the storage layer to optimize computational pipelines. This is primarily indexing...

> So I would suggest that we support operations that select data (i.e., indexing), but not operations that transform data.

...I think this is a very good rule of thumb.

> `astype` could be an exception here, since calling `.astype` on some chunks after loading them is pretty cheap

This is an interesting example and hints at one ambiguity behind the idea that we don't support "operations that transform data". The fact is, this is exactly what codecs do. We already have a dtype codec. You could also imagine a generalized arithmetic codec that operates elementwise on each item -- kind of like the FixedScaleOffset filter (which basically implements `a*x + b`).

So one idea might be: if we know how to express an array-API operation as a codec, we could push it into the codec pipeline. This is something we could explore incrementally, one operation at a time.
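A rough sketch of what such a push-down could look like, using numcodecs' `Codec` interface. The codec itself is hypothetical and hard-codes a float64 dtype for brevity; a real version would carry dtype metadata the way `FixedScaleOffset` does:

```python
import numpy as np
from numcodecs.abc import Codec


class ElementwiseLinear(Codec):
    """Hypothetical codec applying ``a * x + b`` elementwise on decode.

    Loosely modeled on numcodecs' FixedScaleOffset; illustrative only.
    """

    codec_id = "elementwise_linear"  # not a registered numcodecs id

    def __init__(self, a=1.0, b=0.0):
        self.a = a
        self.b = b

    def encode(self, buf):
        # Invert the transform on write so round-trips are lossless.
        x = np.asarray(buf, dtype="f8")
        return ((x - self.b) / self.a).tobytes()

    def decode(self, buf, out=None):
        # Apply the transform as chunks are read: the array-API operation
        # ``a * x + b`` is "pushed down" into the codec pipeline.
        x = np.frombuffer(buf, dtype="f8")
        result = self.a * x + self.b
        if out is not None:
            out[...] = result.reshape(out.shape)
            return out
        return result
```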

> One interpretation for zarr-python could be that the device represents the particular storage backend for the chunks of the array

I think this is a perfectly reasonable idea. My reading of the API is that there is no expectation of cross-library understanding of device, so we are free to define it however we want. Could we use this information somehow? For example, if we know two arrays are on the same device, can we use that for any sort of optimization?
