Slicing is weird in …
-
A few thoughts:
This is basically how zarrita works. I had put some thought into that design. We also have implemented these concepts (albeit not async) on a higher level in the …
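A minimal sketch of the design being described, assuming the zarrita-style model where subscripting is cheap and I/O only happens on an explicit (async) read; all class and method names here are hypothetical, not zarrita's actual API:

```python
import asyncio
import numpy as np

class Selection:
    """Records a selection on a lazy array; no I/O until read()."""
    def __init__(self, array, key):
        self.array = array
        self.key = key

    async def read(self) -> np.ndarray:
        # A real store would fetch only the chunks overlapping self.key;
        # here we just slice an in-memory buffer.
        return self.array._data[self.key]

class LazyArray:
    def __init__(self, data: np.ndarray):
        self._data = data

    def __getitem__(self, key) -> Selection:
        return Selection(self, key)  # cheap: no I/O yet

async def main():
    arr = LazyArray(np.arange(100).reshape(10, 10))
    sel = arr[2:5, :]          # just records the key
    block = await sel.read()   # I/O happens here
    print(block.shape)         # (3, 10)

asyncio.run(main())
```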
-
This could be exposed as what I call a "subscript method", i.e. syntax like …. Note that while …. A natural way to implement this type of composition of indexing operations would be to use something similar to the IndexTransform concept used in TensorStore (https://google.github.io/tensorstore/index_space.html#index-transform).
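A rough sketch of the composition idea (not TensorStore's actual IndexTransform API): successive subscripts can be folded into a single index before any I/O, so only the final read touches storage. The `compose_slices` helper is invented for illustration and handles positive steps only:

```python
import numpy as np

def compose_slices(outer: slice, inner: slice, length: int) -> slice:
    """Fold `inner` applied after `outer` into one slice (positive steps only)."""
    o_start, o_stop, o_step = outer.indices(length)
    n = max(0, (o_stop - o_start + o_step - 1) // o_step)  # length of outer view
    i_start, i_stop, i_step = inner.indices(n)
    return slice(o_start + i_start * o_step,
                 o_start + i_stop * o_step,
                 o_step * i_step)

data = np.arange(20)
composed = compose_slices(slice(2, 18), slice(1, None, 2), len(data))
# composing first, then reading once, matches slicing twice
assert np.array_equal(data[2:18][1::2], data[composed])
```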
-
I would second not changing the core API. I think it's super useful that zarr and …
-
I also lean toward keeping the behavior as-is at this time, unless we have a strong argument for how zarr-python could optimize indexing operations at load time. I would look closely at Xarray's "NamedArray" and lazy indexing classes. We have talked for a long time about how these classes would be very useful outside of Xarray, and this seems like a great place to test that idea. cc @andersy005, who has been working hard on this topic lately. Honestly, the …
-
Prior discussions + other prior art

Something similar has previously been discussed here: …. The xarray team had some thoughts here, especially around their internal ….

I personally quite like the design of Julia's array views, which accomplish something quite like this in a generic way. HDF5 also provides some functionality like this with their "hyperslabs", but I don't believe …
-
This is misleading if you read to something that is not a numpy array (e.g. sparse), or might even error if you read directly to a …
-
The way I was imagining it, users who don't want a numpy array would not be calling …
-
This would require xarray, dask et al. to look at …
-
Yeah, +1 to what @d-v-b says and using meta-array. From the perspective of napari and any general array-processing libraries, a …
-
If napari used xarray we would paper over this for you 😉
-
Following this logic to its ultimate conclusion leads to a "…".

In zarr-developers/zarr-specs#288, @jbms, @rabernat and I have proposed a "Virtual Concatenation ZEP", imagining a general implementation of lazy indexing/concatenation/stacking of zarr objects, and how the record of such operations could still be serialized into the store on-disk. Awkwardly, this class would then have much of the functionality of a conventional "duck array", but not all of it (because it wouldn't support computations like arithmetic or reductions).

To summarize some of the above discussion, there are two models for array behaviour we could follow: "duck arrays" and "disk arrays". "Duck arrays" are normally things you can compute on, and anything duck-array-like should endeavour to follow the patterns agreed upon by the python community in the python array API standard. This includes explicitly requiring that ….

But I personally think that zarr-python v3 should expose an array type like ….

I also don't find backwards-compatibility arguments super compelling here... This is the first breaking change of Zarr in how many years? We should improve everything we can whilst we have the chance! We might also imagine softening the v2->v3 transition for user libraries by providing convenience adapter classes, e.g. an … (see the sketch below).
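A hedged sketch of what such an adapter class could look like; the `EagerAdapter` name and the `_DiskArray`/`_LazySelection` stand-ins are invented for illustration, not proposed zarr-python API:

```python
import numpy as np

class _LazySelection:
    # stand-in for a lazy selection object with a synchronous read()
    def __init__(self, data, key):
        self._data, self._key = data, key
    def read(self) -> np.ndarray:
        return self._data[self._key]

class _DiskArray:
    # stand-in for a v3-style "disk array": subscripting is cheap, no I/O
    def __init__(self, data):
        self._data = data
    def __getitem__(self, key) -> _LazySelection:
        return _LazySelection(self._data, key)

class EagerAdapter:
    """v2-style wrapper: subscripting returns a numpy array immediately."""
    def __init__(self, lazy_array):
        self._lazy = lazy_array
    def __getitem__(self, key) -> np.ndarray:
        # compose the selection lazily, then materialize right away
        return self._lazy[key].read()

arr = EagerAdapter(_DiskArray(np.arange(12).reshape(3, 4)))
print(arr[1:, :2])  # eager: a plain numpy array, as in zarr-python v2
```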
-
To convert an arbitrary type to a numpy array, numpy supports `__array__`: https://numpy.org/devdocs/user/basics.interoperability.html

For synchronous conversion that is probably adequate, but indeed there is no standard for async conversion. Tensorstore uses `read`.
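A minimal illustration of that synchronous protocol: any object implementing `__array__` can be converted with `np.asarray`. The `DiskBacked` class here is a toy stand-in for a store-backed array:

```python
import numpy as np

class DiskBacked:
    def __init__(self, data):
        self._data = data  # stands in for data fetched from a store

    def __array__(self, dtype=None, copy=None):
        # numpy calls this when np.asarray()/np.array() receives the object
        return np.asarray(self._data, dtype=dtype)

obj = DiskBacked([[1, 2], [3, 4]])
print(np.asarray(obj))        # triggers obj.__array__
print(np.asarray(obj).sum())  # 10
```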
-
Yes, sorry, I should have written "in-memory array" rather than numpy array - see Deepak's comment #1603 (comment)
-
I see --- so the goal is to be able to somehow configure, for a …, ….

Presumably the associated memory type and representation would also influence how the entire read operation is done (and therefore would require rather deep integration), since if it is just reading into a dense system-memory array and then copying at the end, there isn't much benefit.

I can see how this may be useful, but I am also wary of trying to solve this problem immediately unless someone has a concrete proposal of how it will work, since it adds a lot of complexity and the design will likely require careful consideration of how the API might be used. For example, when using accelerator devices it is often desired to issue operations asynchronously, in a way that is not necessarily compatible with Python asyncio.

In terms of high-level API, I could imagine that instead of configuring the output array constraints on the …, ….

In any case I would agree that it would be valuable to standardize synchronous and asynchronous …
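A purely hypothetical sketch of the "configure the output array at read time" idea: the caller supplies a prototype/namespace whose `empty()` allocates the destination buffer, so a read could target numpy, cupy, or another array module. Neither `read_selection` nor the `prototype` parameter is an existing zarr or tensorstore API:

```python
import numpy as np

def read_selection(chunk_iter, shape, dtype, prototype=np):
    """Assemble a selection into an array allocated by `prototype`."""
    out = prototype.empty(shape, dtype=dtype)
    for index, chunk in chunk_iter:   # (slice-tuple, ndarray) pairs
        out[index] = chunk            # a device copy if prototype is cupy
    return out

chunks = [((slice(0, 2),), np.array([1, 2])),
          ((slice(2, 4),), np.array([3, 4]))]
print(read_selection(chunks, (4,), np.int64))  # [1 2 3 4]
```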
-
I just want to provide some feedback as a scientist and user of zarr, h5py/hdf5, pandas, and xarray who deals with very large N-dimensional data. I need an efficient store for reading and writing large datasets that is reasonably straightforward to use and is careful not to use an excessive amount of memory.

I first used pandas and xarray, as xarray was built to mimic pandas' slicing, processing, and analysis tools for N-dimensional arrays. Gradually, xarray became more frustrating due to the lazy loading/caching built into the package, which is quite opaque to the user. Lots of care needed to be taken to ensure that memory wouldn't top out, because xarray gradually loads data into the cache of the original object. There was also no way to create a very large netcdf file by iteratively writing parts of an array to the file. All of my colleagues have just come to expect that they have to create tons of small netcdf files because of this lack of appropriate tooling (which exists for appending to files everywhere else in programming).

That's when I switched to using h5py. It was a breath of fresh air to be able to easily and iteratively write data to an existing array/dataset in an hdf5 file. And reading data was ideal as well: I could slice the dataset like in numpy and it returns a numpy array. No concerns about a gradually increasing hidden cache. If I want to use another python package for my processing and analysis, it's super easy to convert a numpy array to pretty much anything.

I greatly appreciate that zarr currently returns a numpy array when slicing, as h5py does, and consequently it is much more straightforward to know how I should handle the inputs and outputs. If this causes certain use-cases to be less efficient, then I'd still prefer the simplicity.
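For concreteness, the workflow described above looks like this in zarr-python (v2 semantics, where slicing returns a numpy array): create a large on-disk array, write it block by block with bounded memory, and read back only the slices you need:

```python
import numpy as np
import zarr

z = zarr.open("example.zarr", mode="w",
              shape=(10_000, 100), chunks=(1_000, 100), dtype="f8")

# iteratively write blocks, keeping memory bounded
for i in range(0, 10_000, 1_000):
    z[i:i + 1_000] = np.random.random((1_000, 100))

block = z[2_000:2_010]           # reads only the chunks it needs
print(type(block), block.shape)  # <class 'numpy.ndarray'> (10, 100)
```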
-
Thanks for the feedback @mullenkamp. I want to emphasize that it will always be easy to convert zarr-backed data into a numpy array. But in my experience, eagerly converting large datasets into numpy arrays is not a design decision that reduces complexity. I work with multi-TB datasets. In zarr today, I can basically never run this line of code: ….

We know that most zarr users (myself included) want an easy path to get a numpy array. That path will always be there. But the current design of zarr puts an undue burden on users with big datasets, and the proposed design actually simplifies the situation considerably for those users.

When we do start writing code to implement this design, it will be extremely useful to get feedback, so I hope you try out what we end up writing and give us your thoughts!
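A hedged illustration of the multi-TB point: a full read such as `arr[...]` can never run, but chunk-aligned access stays cheap. `chunked_sum` is an invented helper, and `arr` can be any zarr array:

```python
import numpy as np

def chunked_sum(arr, step=1_000):
    """Reduce a large array along axis 0 without materializing all of it."""
    total = 0.0
    for i in range(0, arr.shape[0], step):
        total += float(np.sum(arr[i:i + step]))  # loads one block at a time
    return total
```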
-
@mullenkamp - thanks for sharing your thoughts here. Putting my Xarray developer hat on briefly: I would love to hear more about the problems you ran into (memory, cache management, etc.). Any chance you could open a separate discussion on this set of pain points?
-
Hi @d-v-b. I really appreciate your reply and feedback.

Personally, if someone would write ….

I do recognise that zarr (and the developers) might be more Dask-minded, in the sense of wanting to let Dask handle the data and processing. So doing … (see the sketch below).

Thanks!
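For reference, the Dask-mediated pattern mentioned here looks roughly like this, assuming an existing `example.zarr` array; slicing and computation stay lazy until `.compute()`:

```python
import dask.array as da

darr = da.from_zarr("example.zarr")  # no data read yet
result = darr[2_000:2_010].mean()    # still lazy
print(result.compute())              # I/O happens here, chunk by chunk
```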
-
Hi @jhamman. …
-
With zarr 3 this appears to still be an issue?
-
xarray and/or dask are still the state of the art in terms of lazy loading.
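For example, lazy loading via xarray looks like this, assuming `example.zarr` is a zarr group written with xarray that contains a `time` dimension; data is only read at `.compute()`:

```python
import xarray as xr

ds = xr.open_zarr("example.zarr")    # reads metadata only
subset = ds.isel(time=slice(0, 10))  # still lazy
loaded = subset.compute()            # data is read here
```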