
how to fix slicing in zarr-python #1603

d-v-b started this conversation in General
Dec 10, 2023 · 6 comments · 16 replies

Slicing is weird in zarr-python

In zarr-python, slicing a zarr array returns a numpy array (or some other ndarray-flavored thing, depending on meta_array). From a type perspective, we have zarr.Array.__getitem__(slice) -> np.ndarray. But slicing a collection should have the type signature T.__getitem__(slice) -> T, not T.__getitem__(slice) -> X (I'm omitting self from the function signature here).

I think "slicing a collection should return an instance of the collection" is a pretty simple rule, butzarr-python fails to follow this rule today, which setszarr-python apart from base python collections (list,tuple), or array libraries likenumpy,dask,tensorstore, among others.

I understand that zarr-python's slicing behavior is consistent with h5py (which probably wanted to be consistent with numpy), but I'm pretty sure that slicing this way is incorrect, and we should fix it.

Why this matters

  • Users who want lazy evaluation of slicing must use external libraries like dask, but dask has well-documented performance limitations for arrays with many small chunks. This forces zarr users to learn how to use (and debug) dask if they want to do basic IO at scale, which means that zarr-python is failing those users.
  • Eagerly performing IO when an array is sliced is potentially very inefficient. If slicing is lazy, then multiple sliced arrays can be fetched in a batch operation that optimizes IO over the entire set of arrays. A trivial example: suppose a user requests identical slices from a zarr array. If these requests are batched, then it's possible to infer that the two slices are identical, and only a single IO operation is needed (note that this deduplication is even more efficient than caching the first request). Another simple example is when a user requests two halves of the same chunk in two separate slicing operations, which can be simplified to a single full-chunk read (zarr-python today does two full-chunk reads plus slicing for each chunk). See the sketch after this list.
  • Fetching chunks, decoding chunks, and packing chunks into output actually takes a lot of parameters, but none of those parameters can be given if the __getitem__ or __setitem__ methods are used, because these methods don't take keyword arguments. To work around this problem, we have cluttered the array API with things like synchronizer, write_empty_chunks, meta_array, etc. These attributes are important for tuning how zarr arrays do IO, but making them array attributes means that these parameters cannot be changed cleanly during the lifetime of a single array. It would be better if users could supply these parameters when they actually need to do IO, which is not necessarily when they are slicing the array.
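For illustration, here is a minimal sketch of the batching/deduplication idea from the second bullet, using a toy 1-D chunked array; none of these names are zarr-python API:

```python
from typing import Callable

CHUNK = 10  # toy chunk length along a single axis

def chunks_for(sel: slice, length: int) -> set[int]:
    """Chunk indices touched by a slice over a 1-D array of the given length."""
    start, stop, _ = sel.indices(length)
    if stop <= start:
        return set()
    return set(range(start // CHUNK, (stop - 1) // CHUNK + 1))

def batched_read(selections: list[slice], length: int,
                 fetch_chunk: Callable[[int], bytes]) -> dict[int, bytes]:
    """Fetch each needed chunk exactly once, even if many selections touch it."""
    needed: set[int] = set()
    for sel in selections:
        needed |= chunks_for(sel, length)
    return {i: fetch_chunk(i) for i in sorted(needed)}

# Two identical (or overlapping) lazy slices resolve to a single chunk fetch:
# batched_read([slice(0, 5), slice(0, 5), slice(5, 10)], 100, fetch) reads chunk 0 once.
```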

what we should do

I think we should deprecate this behavior of zarr arrays in version 3 of zarr-python. In concrete terms, I'm currently looking at tensorstore for inspiration for how to move zarr-python toward a cleaner slicing story.

It would look something like this:

z = zarr.Array(store='root.zarr/', path='foo', dtype='uint8', shape=(100, 100))
z
#> zarr.Array<shape=(100, 100), store=...>

# synchronous write, like tensorstore
z[:] = 100
z[:] == z
#> True

z0 = z[0]  # lazy
#> zarr.Array<shape=(1, 100), store=....>

np.array(z0)  # trigger chunk fetching, decoding, etc.

what would break

  • If zarr.Array[0,0,:] returns a new, smaller instance of zarr.Array, then the .shape attribute of zarr.Array is no longer just a copy of the shape field in .zarray / zarr.json. Instead, the .shape attribute must take into account the sequence of slicing operations that generated the instance of zarr.Array.
  • Similarly, in order for slicing to compose, zarr.Array[slice1][slice2] must compose slice1 with slice2 (see the sketch after this list). This means that zarr.Array will end up carrying around its slice, perhaps materialized as something other than a python slice object, to accommodate boolean array indexing and the like. This means that, like tensorstore arrays, zarr.Array will necessarily be endowed with something like coordinates. This has implications I have not fully thought through.
  • Probably lots of other things about today's API will break 😅 .
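For illustration, a minimal sketch of 1-D slice composition (positive steps only); real composition would need something richer, along the lines of TensorStore's IndexTransform, to cover negative steps, fancy indexing, and multiple dimensions:

```python
def compose_slices(outer: slice, inner: slice, length: int) -> slice:
    """Return a slice equivalent to arr[outer][inner] for a 1-D array of `length`.
    Handles positive steps only; fancy indexing needs a richer representation."""
    o_start, o_stop, o_step = outer.indices(length)
    outer_len = max(0, (o_stop - o_start + o_step - 1) // o_step)
    i_start, i_stop, i_step = inner.indices(outer_len)
    return slice(o_start + i_start * o_step,
                 o_start + i_stop * o_step,
                 o_step * i_step)

# list(range(100))[10:50:2][2:8] ==
#     list(range(100))[compose_slices(slice(10, 50, 2), slice(2, 8), 100)]
```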

discussion points

  • should we do this at all? I think we have to, and it will ultimately be a big win for users and the maintainability of the codebase, but I'm curious to hear other perspectives.
  • should we do this now? I think so, because introducing V3 necessarily entails breaking changes. If now is not the time, then when?
  • If we do this, how can we make the transition smooth for users? Perhaps we could keep the old API around as a legacy layer on top of the underlying lazy-slicing base layer (this is similar to the strategy in zarrita).

very curious to hear everyone's thoughts!

cc @jni (because I think this direction would have implications for napari)
cc @jbms (because tensorstore)


Replies: 6 comments 16 replies

Comment options

A few thoughts:

  • I think we should provide an API like this. There is enough prior art to suggest that it is a good idea.
  • I don't think we should break the core reading and writing API of zarr.Array. That means zarr.Array.__getitem__ and zarr.Array.__setitem__ would return or take numpy arrays. If we break this core functionality, all users would have to change their code. I think passing around slices of an array is useful, but I would think of it more as a power feature or something for libraries.
  • This API could very well be implemented in zarr.AsyncArray, which is a new API anyway.
  • I think __getitem__ and __setitem__ should behave consistently. That means that if __getitem__ returned a future, __setitem__ would also have to. Everything else would be confusing to me. I don't think __setitem__ can be made awaitable, though.
  • I don't think zarr.AsyncArray.__getitem__ would need to return zarr.AsyncArray. It could also be a zarr.AsyncArraySlice with a new .shape, an .origin (or bundled as .domain), and a reference to the underlying zarr.AsyncArray (or to the store path and metadata).
  • I think this zarr.AsyncArraySlice should contain methods for reading and writing. I think there are many good use cases for writing into slices as well (tensorstore also supports that).

This is basically how zarrita works. I had put some thought into that design. We also have implemented these concepts (albeit not async) on a higher level in the webknossos package (see wk.dataset.View).
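For concreteness, a rough sketch of what such an AsyncArraySlice could look like; the class and the _get_selection/_set_selection helpers are hypothetical, not existing zarr-python or zarrita API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class AsyncArraySlice:
    array: Any                   # the underlying zarr.AsyncArray (or store path + metadata)
    origin: tuple[int, ...]      # offset of this view within the underlying array
    shape: tuple[int, ...]       # shape of this view

    def _selection(self) -> tuple[slice, ...]:
        return tuple(slice(o, o + s) for o, s in zip(self.origin, self.shape))

    async def read(self):
        # Only here do chunk fetching and decoding actually happen.
        return await self.array._get_selection(self._selection())

    async def write(self, value) -> None:
        await self.array._set_selection(self._selection(), value)
```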

3 replies
@jbms

This could be exposed as what I call a "subscript method", i.e. syntax like arr.lazy[10:20] or arr.lazy.oindex[10:20].
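For illustration, a minimal sketch of the subscript-method pattern; LazySlicer and LazyArraySlice are hypothetical names, not zarr-python or TensorStore API:

```python
class LazyArraySlice:
    """A deferred selection: nothing is read until .read() is called."""
    def __init__(self, array, selection):
        self.array = array          # the underlying array
        self.selection = selection  # the selection to apply at read time

    def read(self):
        # IO happens only here.
        return self.array[self.selection]

class LazySlicer:
    """Exposes `arr.lazy[10:20]` without changing `arr.__getitem__` itself."""
    def __init__(self, array):
        self.array = array

    def __getitem__(self, selection) -> LazyArraySlice:
        return LazyArraySlice(self.array, selection)

# On the array class, `lazy` would be a property returning LazySlicer(self):
#     @property
#     def lazy(self):
#         return LazySlicer(self)
```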

Note that while __setitem__ can return a value, when using __setitem__ in the normal way, as x[a] = b, which is the whole point of __setitem__, there is no way to access the return value. Consequently, it does not make sense for __setitem__ to return a Future-like type. In TensorStore, while __getitem__ just does lazy indexing, __setitem__ always does a synchronous assignment. However, when using Transactions (https://google.github.io/tensorstore/python/api/tensorstore.Transaction.html#), __setitem__ synchronously assigns within the transaction, but the separate commit operation (which can be done asynchronously) is needed to actually do the write. A similar transaction mechanism implemented in zarr-python may be one way to use __setitem__ even when async behavior is desired.

A natural way to implement this type of composition of indexing operations would be to use something similar to the IndexTransform (https://google.github.io/tensorstore/index_space.html#index-transform) concept used in TensorStore.

@ivirshup

I would second not changing the core API. I think it's super useful that zarr and h5py are basically interchangeable at the API level, and I wouldn't want to break that. I also don't want to introduce a large barrier to upgrading.

@jhamman

I also lean toward keeping the behavior as is at this time unless we have a strong argument for how zarr-python could optimize indexing operations at load time.

I would look closely at Xarray's "NamedArray" and Lazy Indexing classes. We have talked for a long time about how these classes would be very useful outside of Xarray, and this seems like a great place to test that idea. cc @andersy005, who has been working hard on this topic lately.

Honestly, the NamedArray object is going to fit so well with Zarr V3 that we should think more about how to integrate them directly.

Comment options

Prior discussions + other prior art

Something similar has previously been discussed here:

The xarray team had some thoughts here, especially around their internal LazilyIndexedArray class (cc: @rabernat).

I personally quite like the design of Julia's array views, which accomplish something quite like this in a generic way.

HDF5 also provides some functionality like this with its "hyperslabs", but I don't believe h5py provides a nice python interface for working with these (or that the C interface is nice, for that matter).

0 replies
Comment options

np.array(z0) # trigger chunk fetching, decoding, etc

This is misleading if you read into something that is not a numpy array (e.g. sparse), or might even error if you read directly into a cupy array. An explicit .load seems like a good idea for the user to indicate "load this thing into memory please" (Xarray implements this).

5 replies
@d-v-b (Maintainer, Author) · Jan 18, 2024

The way I was imagining it, users who don't want a numpy array would not be calling np.array(z0) -- they would instead call my_non_numpy_library.array(z0), e.g. cupy.array(z0)

@dcherian

This would require xarray, dask, et al. to look at meta_array then. Seems OK in principle. This is the relevant line in dask: https://github.com/dask/dask/blob/4eb026cc9c495a9e64130764498f5ff8b61d7602/dask/array/core.py#L106

@d-v-b (Maintainer, Author) · Jan 18, 2024

zarr-python already uses meta_array (for chunks that should decompress to cupy arrays), so hopefully xarray and dask are OK with it in practice, too!

@jni (Maintainer) · Jan 19, 2024

Yeah, +1 to what @d-v-b says and to using meta_array. From the perspective of napari and any general array-processing library, a .load is very frustrating because it may exist in some cases but not in others (e.g. plain numpy arrays). np.array(lazy_or_gpu_or_whatever_array) should always work, imho. Some libraries want a force= keyword, and it's on my perpetual to-do list to make a NEP for that. 😮‍💨

@TomNicholas

napari and any general array-processing library [finds] a .load very frustrating because it may exist in some cases but not in others (e.g. plain numpy arrays).

If napari used xarray we would paper over this for you 😉

Comment options

users who want lazy evaluation of slicing must use external libraries like dask

If slicing is lazy, then multiple sliced arrays can be fetched in a batch operation that optimizes IO operation over the entire set of arrays.

Similarly, in order for slicing to compose, zarr.Array[slice1][slice2] must compose slice1 with slice2.

Data view / slice of zarr array without loading entire array #980

I'm currently looking at tensorstore for inspiration for how to move zarr-python toward a cleaner slicing story.

Following this logic to its ultimate conclusion leads to a "VirtualZarrArray" class that implements lazy indexing and concatenation.

In zarr-developers/zarr-specs#288, @jbms, @rabernat and I have proposed a "Virtual Concatenation ZEP", imagining a general implementation of lazy indexing/concatenation/stacking of zarr objects, and how the record of such operations could still be serialized into the store on-disk.


Awkwardly, this class would then have much of the functionality of a conventional "duck array", but not all of it (because it wouldn't support computations like arithmetic or reductions). To summarize some of the above discussion, there are two models for array behaviour we could follow: "duck arrays" and "disk arrays".

"Duck arrays" are normally things you can compute, and anything duck-array-like should endeavour to follow the patterns agreed upon by the python community in thepython array API standard. This includesexplicitly requiring that__getitem__ returns typeSelf, as@d-v-b is advocating.1

But zarr.Array currently sees itself as a "disk array" (similar to hdf5). Xarray recognizes this concept in its backend infrastructure, where it asks backend developers to subclass from xarray.backends.BackendArray. (You can also see this distinction emerge in other languages.)

I personally think that zarr-python v3 should expose an array type like VirtualZarrArray that supports lazy indexing and concatenation; the question is whether we should also continue to expose a "disk array" zarr.Array, or whether we can get away with only exposing one array class.


I also don't find backwards-compatibility arguments super compelling here... This is the first breaking change of Zarr in how many years? We should improve everything we can whilst we have the chance!

We might also imagine softening the v2->v3 transition for user libraries by providing convenience adapter classes, e.g. an EagerZarrArray which wraps the VirtualZarrArray but triggers loading on any indexing operation.
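For illustration, a minimal sketch of such an adapter, assuming a hypothetical VirtualZarrArray whose __getitem__ is lazy and whose slices expose __array__:

```python
import numpy as np

class EagerZarrArray:
    """Wraps a lazy array and triggers loading on any indexing operation,
    preserving today's h5py-style behaviour for existing user code."""
    def __init__(self, virtual_array):
        self._virtual = virtual_array

    @property
    def shape(self):
        return self._virtual.shape

    def __getitem__(self, selection) -> np.ndarray:
        # Lazy slice first, then force materialization into memory.
        return np.asarray(self._virtual[selection])

    def __setitem__(self, selection, value):
        self._virtual[selection] = value   # assumed to write eagerly
```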

Footnotes

  1. Annoyingly one thing they haven't agreed upon yet is how to coerce to numpy. Xarray currently papers over this for many array types, but in zarr we should pick an approach (.load or something) and stick to it rather than forcing loading to happen when it doesn't need to.

3 replies
@jbms

To convert an arbitrary type to a numpy array, numpy supports __array__: https://numpy.org/devdocs/user/basics.interoperability.html

For synchronous conversion that is probably adequate, but indeed there is no standard for async conversion. TensorStore uses read().
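For illustration, a toy example of the __array__ protocol: numpy calls this method whenever np.array()/np.asarray() receives the object. LazyThing is a made-up stand-in, not a zarr class:

```python
import numpy as np

class LazyThing:
    """Toy stand-in for a lazy array."""
    def __init__(self, shape: tuple[int, ...]):
        self.shape = shape

    def __array__(self, dtype=None):
        # In a real lazy zarr array, this is where chunks would be fetched and decoded.
        out = np.ones(self.shape)
        return out if dtype is None else out.astype(dtype)

np.asarray(LazyThing((2, 3)))  # -> a 2x3 ndarray of ones
```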

@TomNicholas

Yes, sorry, I should have written "in-memory array" rather than numpy array -- see Deepak's comment: #1603 (comment)

@jbms

I see --- so the goal is to be able to somehow configure, for a ZarrArray object, the associated "memory region" (i.e. system memory, the memory of a particular attached accelerator device, or perhaps a particular system-memory NUMA node) and possibly the associated representation (i.e. a regular dense array or a particular sparse array representation), and this would constrain the output array for read calls where an existing output array is not specified.

Presumably the associated memory type and representation would also influence how the entire read operation is done (and therefore would require rather deep integration), since if it is just reading into a dense system memory array and then copying at the end, there isn't much benefit.

I can see how this may be useful but I am also wary of trying to solve this problem immediately unless someone has a concrete proposal of how it will work, since it also adds a lot of complexity and the design will likely require careful consideration of how the API might be used. For example when using accelerator devices it is often desired to issue operations asynchronously, in a way that is not necessarily compatible with Python asyncio.

In terms of high-level API, I could imagine that instead of configuring the output array constraints on the ZarrArray object itself, the output array constraints are specified when calling read. Whether that is more natural for users and/or implementations is unclear to me, though.

In any case, I would agree that it would be valuable to standardize synchronous and asynchronous read and write APIs.

Comment options

I just want to provide some feedback as a scientist and user of zarr, h5py/hdf5, pandas, and xarray that deals with very large N-dimensional data. I need an efficient store when reading and writing large datasets that is reasonably straightforward to use and is careful not to use an excessive amount of memory.

I first used pandas and xarray, as xarray was built to mimic pandas slicing, processing, and analysis tools for N-dimensional arrays. Gradually, xarray became more frustrating due to the lazy loading/caching built into the package, which is quite opaque to the user. Lots of care needed to be taken to ensure that memory wouldn't top out, because xarray gradually loads data into the cache of the original object. There was also no way to create a very large netcdf file by iteratively writing parts of an array to the file. All of my colleagues have just come to expect that they have to create tons of small netcdf files because of this lack of appropriate tooling (which exists for appending to files everywhere else in programming).

That's when I switched to using h5py. It was a breath of fresh air to be able to easily and iteratively write data to an existing array/dataset in an hdf5 file. And reading data was ideal as well. I could slice the dataset like in numpy and it returns a numpy array. No concerns for a gradually increasing hidden cache that I have to worry about. If I want to use another python package for doing my processing and analysis, it's super easy to convert a numpy array to pretty much anything.

I greatly appreciate that zarr currently returns a numpy array when slicing, as h5py does, and consequently it is much more straightforward to know how I should handle the inputs and outputs. If this causes certain use-cases to be less efficient, then I'd still prefer the simplicity.

4 replies
@d-v-b (Maintainer, Author) · Feb 24, 2024

Thanks for the feedback, @mullenkamp. I want to emphasize that it will always be easy to convert zarr-backed data into a numpy array. But from my experience, eagerly converting large datasets into numpy arrays is not a design decision that reduces complexity.

I work with multi-TB datasets. In zarr today, I can basically never run this line of code: my_array[1:]. If I run that line, my machine will lock up as zarr attempts to sequentially load the entire multi-TB dataset into memory before converting it to a numpy array. My system will run out of memory before this completes. So I have to use dask, or some other library, for the simple task of slicing my array, which drastically increases the complexity of my workflow. That's what we want to fix: zarr users with large arrays should be able to call my_array[1:] and get back something useful without locking up their computer.

We know that most zarr users (myself included) want an easy path to get a numpy array. That path will always be there. But the current design of zarr puts an undue burden on users with big datasets, and the proposed design actually simplifies the situation considerably for those users.

When we do start writing code to implement this design, it will be extremely useful to get feedback, so I hope you try out what we end up writing and give us your thoughts!

@jhamman

@mullenkamp - thanks for sharing your thoughts here.

Putting my Xarray developer hat on briefly, I would love to hear more about the problems you ran into (memory, cache management, etc). Any chance you could open a separate discussion on this set of pain points?

@mullenkamp

Hi @d-v-b. I really appreciate your reply and feedback.

Personally, if someone would write my_array[1:] in their code when they wanted to do some computation on the data, then the code should fail, and the user shouldn't write such code when it's obvious that it will return an in-memory numpy array. As with h5py, they should iterate through the chunks and process the data accordingly. Just like with my normal use cases, they would likely then want to save their results to another persistent store, possibly in the same shape as the original. The user would process the data in chunks, then immediately write each chunk to a "file".
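For concreteness, a minimal sketch of that chunk-wise read/process/write pattern with today's zarr-python; the paths and the sqrt computation are made up:

```python
import numpy as np
import zarr

src = zarr.open("input.zarr", mode="r")
dst = zarr.open("output.zarr", mode="a",
                shape=src.shape, chunks=src.chunks, dtype=src.dtype)

# Process one chunk-sized block along the first axis at a time, so memory stays bounded.
step = src.chunks[0]
for start in range(0, src.shape[0], step):
    block = src[start:start + step]            # reads only this block into memory
    dst[start:start + step] = np.sqrt(block)   # process and immediately write it back out
```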

I do recognise that zarr (and the developers) might be more Dask-minded, in the sense of wanting to let Dask handle the data and processing. So doing my_array[1:] followed by some computation that dask will handle magically seems to be the route that zarr (and xarray) has chosen. Am I correct? I do want to better understand the data handling and processing coding logic of zarr and xarray.

Thanks!

@mullenkamp

Hi @jhamman.
Sure. I'll open a discussion in the xarray repo. I and many of my colleagues use xarray a lot. So I'm very happy to provide feedback if it'll help.

Comment options

With zarr 3 this appears to still be an issue?
Is there a recommended way to lazily split a zarr array of, say, (3, 50_000, 150_000) into 3 arrays of (50_000, 150_000)?
I was surprised that when using slicing I didn't get zarr arrays out and everything was loaded into memory; googling pointed me here.

1 reply
@jhamman

xarray and/or dask are still the state of the art in terms of lazy loading.
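For example, a sketch of the dask route, assuming a store named "data.zarr" holding the (3, 50_000, 150_000) array:

```python
import dask.array as da

arr = da.from_zarr("data.zarr")                 # lazy: no chunks are read yet
parts = [arr[i] for i in range(arr.shape[0])]   # three lazy (50_000, 150_000) views
sub = parts[0][:1024, :1024].compute()          # IO happens only at .compute()
```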

Category: General
10 participants: @d-v-b @normanrz @jni @jhamman @dcherian @mullenkamp @jbms @ivirshup @TomNicholas @psobolewskiPhD
