Supporting the array API in zarr-python #1614

d-v-b started this conversation in General

The Python array API standard is an effort to standardize the API for various NDArray objects across the Python ecosystem. In the v3 roadmap, one of the goals is to "Align the Zarr-Python array API with the array API Standard". I would like to use this discussion to consider how we can achieve this goal.

Array Attributes

Here are the attributes defined in the array API standard:

| Name | Description |
| --- | --- |
| `array.dtype` | Data type of the array elements. |
| `array.device` | Hardware device the array data resides on. |
| `array.mT` | Transpose of a matrix (or a stack of matrices). |
| `array.ndim` | Number of array dimensions (axes). |
| `array.shape` | Array dimensions. |
| `array.size` | Number of elements in an array. |
| `array.T` | Transpose of the array. |

Some of these (`size`, `shape`) are extremely simple to support, but I'm curious about what `array.device` should be for a zarr array. I don't know much about computing on GPUs, but I am guessing that's where `array.device` is most relevant (the docs for this attribute in the array API say as much). One interpretation for zarr-python could be that the device represents the particular storage backend for the chunks of the array, so for data stored on AWS S3, the device would be some Python object that represents "data stored on S3", and so on for data on other storage backends. I'm curious to hear any other thoughts on this -- since I never use the `.device` attribute of a numpy array, my intuitions for what could work here might be way off.
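To make the storage-backend interpretation concrete, here is a minimal sketch of what such a device object might look like. `StoreDevice` is purely hypothetical -- zarr-python defines no `device` attribute today, and the identity-based equality rule is just one possible choice:

```python
# Hypothetical sketch only: zarr-python does not define a ``device`` today.
# Interpretation: the "device" of a zarr array is the store holding its chunks.


class StoreDevice:
    """Wrap a zarr store so it can serve as an array-API ``device`` object."""

    def __init__(self, store):
        self.store = store

    def __eq__(self, other):
        # One possible rule: two arrays share a device iff they share a store.
        return isinstance(other, StoreDevice) and self.store is other.store

    def __repr__(self):
        return f"StoreDevice({type(self.store).__name__})"


# z.device could then return StoreDevice(z.store), so two arrays backed by
# the same S3 bucket (or the same local directory) compare as same-device.
```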

Array methods / functions

The array API defines a LOT of functions and methods that transform arrays into new arrays or scalars. Besides indexing, I think implementing these routines in zarr-python would be a lot of work -- to implement something like `x: zarr.Array = zarray.mean(0).std(0)` we would need to use or create a lazy graph-based computation system, and I don't see that happening any time soon (although it would be super cool).

So I would suggest that we support operations that select data (i.e., indexing), but not operations that transform data. `astype` could be an exception here, since calling `.astype` on some chunks after loading them is pretty cheap. But this would still require breaking ground on a lazy evaluation system (or depending on one, which seems undesirable right now).
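To illustrate the select-vs-transform boundary with today's zarr-python API (the numpy fallback at the end is just what a user would write now, not a proposed interface):

```python
import numpy as np
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f8")

# Selection: supportable. Only the chunks intersecting the request are read,
# so indexing maps naturally onto zarr's storage model.
block = z[0:100, 0:100]  # one chunk read, returns an in-memory numpy array

# Transformation: returning z.mean(axis=0) as an unevaluated array would
# require a deferred-execution system; today the user materializes a
# selection and computes with numpy instead.
col_means = np.asarray(z[:, :5]).mean(axis=0)
```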

How much of the API standard can we support?

Without data-transforming functions and methods, not much, in percentage terms! I couldn't find guidelines for libraries that only support a subset of the standard, but maybe this describes most array libraries in use today other than numpy. However, I think this is fine. As long as zarr arrays can be coerced to numpy / cupy / ... arrays as needed, users should be able to compute what they need using the numpy / cupy / ... APIs.
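That coercion path already works, since `zarr.Array` implements `__array__`; a quick illustration:

```python
import numpy as np
import zarr

z = zarr.array(np.arange(12).reshape(3, 4))

# The cheap array-API attributes are already available:
assert z.shape == (3, 4) and z.ndim == 2 and z.size == 12

# Everything else: coerce to numpy (this reads the whole array into memory)
# and compute with numpy's own API from there.
result = np.asarray(z).mean(axis=0)
```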

I'm curious to hear what other people think about this approach.

Replies: 1 comment

rabernat commented:

Today, in practice, we rely on dask.array pretty heavily to defer execution against Zarr-backed arrays. Dask already implements a "lazy graph-based computation system", and we should definitely not try to create a new alternative to it here. We could, however, aim to integrate with other similar libraries, such as cubed. There may be room for a more lightweight deferred-execution array library (similar to Xarray's duck arrays), but I don't think that belongs in Zarr. We should document better how users can wrap their Zarr arrays in Dask / Cubed arrays in order to obtain deferred execution.
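For example, with `example.zarr` standing in for any Zarr store or path:

```python
import dask.array as da

# Wrap Zarr storage in a dask array: this builds a lazy representation and
# reads no chunk data until compute time.
d = da.from_zarr("example.zarr")

# Deferred execution: mean/std become a task graph over the Zarr chunks,
# evaluated only when .compute() is called.
result = d.mean(axis=0).std(axis=0).compute()
```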

In Zarr, we should focus on implementing operations that can be pushed down to the storage layer to optimize computational pipelines. This is primarily indexing...

> So I would suggest that we support operations that select data (i.e., indexing), but not operations that transform data.

...I think this is a very good rule of thumb.

> `astype` could be an exception here, since calling `.astype` on some chunks after loading them is pretty cheap

This is an interesting example and hints at one ambiguity behind the idea that we don't support "operations that transform data". The fact is, this is exactly what codecs do. We already have a dtype codec. You could also imagine a generalized arithmetic codec that operates elementwise on each item -- kind of like the FixedScaleOffset filter (which basically implements `a*x + b`).

So one idea might be: if we know how to express an array-API operation as a codec, we could push it into the codec pipeline. This is something we could explore incrementally, one operation at a time.
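A rough sketch of what such a push-down could look like, using numcodecs' `Codec` interface. The codec itself is hypothetical and hard-codes a float64 dtype for brevity; a real version would carry dtype metadata the way `FixedScaleOffset` does:

```python
import numpy as np
from numcodecs.abc import Codec


class ElementwiseLinear(Codec):
    """Hypothetical codec applying ``a * x + b`` elementwise on decode.

    Loosely modeled on numcodecs' FixedScaleOffset; illustrative only.
    """

    codec_id = "elementwise_linear"  # not a registered numcodecs id

    def __init__(self, a=1.0, b=0.0):
        self.a = a
        self.b = b

    def encode(self, buf):
        # Invert the transform on write so round-trips are lossless.
        x = np.asarray(buf, dtype="f8")
        return ((x - self.b) / self.a).tobytes()

    def decode(self, buf, out=None):
        # Apply the transform as chunks are read: the array-API operation
        # ``a * x + b`` is "pushed down" into the codec pipeline.
        x = np.frombuffer(buf, dtype="f8")
        result = self.a * x + self.b
        if out is not None:
            out[...] = result.reshape(out.shape)
            return out
        return result
```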

> One interpretation for zarr-python could be that the device represents the particular storage backend for the chunks of the array

I think this is a perfectly reasonable idea. My reading of the API is that there is no expectation of cross-library understanding of device, so we are free to define it however we want. Could we use this information somehow? For example, if we know two arrays are on the same device, can we use that for any sort of optimization?
