obstore-based Store implementation #1661


Merged

Conversation

@kylebarron (Contributor) commented Feb 8, 2024 (edited):

A Zarr store based on obstore, a Python library that uses the Rust object_store crate under the hood.

object-store is a Rust crate for interoperating with remote object stores like S3, GCS, Azure, etc. See the highlights section of its docs.

obstore maps async Rust functions to async Python functions and is able to stream GET and LIST requests, all of which makes it a good candidate for use with the Zarr v3 Store protocol.

You should be able to test this branch with the latest version of obstore:

pip install --upgrade obstore
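
For example, here is a minimal reading sketch (assuming the store class is exposed as zarr.storage.ObjectStore, as in the merged version; the bucket, region, and array path are placeholders):

import zarr
from obstore.store import S3Store
from zarr.storage import ObjectStore

# Hypothetical bucket and region, purely illustrative
s3_store = S3Store("example-bucket", region="us-east-1")
store = ObjectStore(s3_store, read_only=True)

# Open an existing array through the obstore-backed store
arr = zarr.open_array(store=store, path="path/to/array", mode="r")
print(arr.shape)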

TODO:

  • Examples
  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@jhamman (Member):

Amazing @kylebarron! I'll spend some time playing with this today.

@kylebarron (Contributor, Author):

With roeap/object-store-python#9 it should be possible to fetch multiple ranges within a file concurrently with range coalescing (using get_ranges_async). Note that this object-store API accepts multiple ranges within one object, which is still not 100% aligned with the Zarr get_partial_values, because that allows fetches across multiple objects.

That PR also adds a get_opts function which now supports "offset" and "suffix" ranges, of the sort Range: N- and Range: -N, which would allow removing the raise NotImplementedError on line 37.
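
Roughly, the mismatch is between these two request shapes (a purely illustrative sketch; the keys and variable names are hypothetical, not from either API):

# object_store-style coalesced read: several byte ranges within ONE object
ranges_within_one_object = ("data/c/0/0", [(0, 100), (4096, 4196), (8192, 8292)])

# Zarr get_partial_values-style read: (key, byte range) pairs that may span MANY objects
ranges_across_objects = [
    ("data/c/0/0", (0, 100)),
    ("data/c/0/1", (0, 100)),
    ("data/c/1/0", (0, 100)),
]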

@normanrz (Member):

Great work @kylebarron!
What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

@martindurant (Member):

> What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

I suggest we see whether it makes any improvements first, so it's the author's choice for now.

@kylebarron (Contributor, Author):

While @rabernat has seen some impressive perf improvements in some settings when making many requests with Rust's tokio runtime, which would possibly also trickle down to a Python binding, the biggest advantage I see is improved ease of use in installation.

A common hurdle I've seen is handling dependency management, especially around boto3, aioboto3, etc. dependencies. Versions need to be compatible at runtime with any other libraries the user also has in their environment, and Python doesn't allow multiple versions of the same dependency in one environment. With a Python library wrapping a statically-linked Rust binary, you can remove all Python dependencies and remove this class of hardship.

The underlying Rust object-store crate is stable and under open governance via the Apache Arrow project. We'll just have to wait on some discussion in object-store-python for exactly where that should live.

I don't have an opinion myself on where this should live, but it should be on the order of 100 lines of code wherever it is (unless the v3 store API changes dramatically).


@jhamman (Member):

> I suggest we see whether it makes any improvements first, so it's the author's choice for now.

👍

> What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

I want to keep an open mind about what the core stores provided by Zarr-Python are. My current thinking is that we should just do a MemoryStore and a LocalFilesystemStore. Everything else can be opt-in by installing a 3rd-party package. That said, I like having a few additional stores in the mix as we develop the store interface, since it helps us think about the design more broadly.

@martindurant (Member):

> A common hurdle I've seen is handling dependency management, especially around boto3, aioboto3, etc. dependencies.

This is no longer an issue; s3fs has much more relaxed deps than it used to. Furthermore, it's very likely to already be part of an installation environment.


@normanrz (Member):

> I want to keep an open mind about what the core stores provided by Zarr-Python are. My current thinking is that we should just do a MemoryStore and a LocalFilesystemStore. Everything else can be opt-in by installing a 3rd-party package.

I agree with that. I think it is beneficial to keep the number of dependencies of core zarr-python small. But I am open to discussion.

> That said, I like having a few additional stores in the mix as we develop the store interface, since it helps us think about the design more broadly.

Sure! That is certainly useful.

@jhamman added the V3 label on Feb 13, 2024
@itsgifnotjiff:

This is awesome work, thank you all!!!

@kylebarron (Contributor, Author):

The object-store-python package is not very well maintained (roeap/object-store-python#24), so I took a few days to implement my own wrapper around the Rust object_store crate: https://github.com/developmentseed/object-store-rs

I'd like to update this PR soonish to use that library instead.


@martindurant (Member):

If the zarr group prefers object-store-rs, we can move it into the zarr-developers org, if you like. I would like to be involved in developing it, particularly if it can grow more explicit fsspec-compatible functionality.


@kylebarron (Contributor, Author) commented Oct 22, 2024 (edited):

I have a few questions because the Store API has changed a bit since the spring.

  • There's a new BufferPrototype object. Is the BufferPrototype chosen by the store implementation or the caller? It would be very nice if this prototype could be chosen by the store implementation, because then we could return a RustBuffer object that implements the Python buffer protocol but doesn't need to copy the buffer into Python memory.
  • Similarly for puts: is Buffer guaranteed to implement the buffer protocol? Contrary to fetching, we can't do zero-copy puts right now with object-store.

I like that list now returns an AsyncGenerator. That aligns well with the underlying object-store Rust API, but for technical reasons we can't expose that as an async iterable to Python yet (apache/arrow-rs#6587), even though we do expose the readable stream to Python as an async iterable.
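
As a small illustration of the async-generator-based listing (a sketch against zarr-python's Store API, where list() yields keys asynchronously; store is any Store instance):

import asyncio

async def count_keys(store) -> int:
    # Store.list() yields keys one at a time instead of materializing
    # the whole listing up front
    n = 0
    async for _key in store.list():
        n += 1
    return n

# e.g. asyncio.run(count_keys(some_store))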

@TomAugspurger (Contributor):

> Is the BufferPrototype chosen by the store implementation or the caller? It would be very nice if this prototype could be chosen by the store implementation, because then we could return a RustBuffer object that implements the Python buffer protocol but doesn't need to copy the buffer into Python memory.

This came up in the discussion at https://github.com/zarr-developers/zarr-python/pull/2426/files/5e0ffe80d039d9261517d96ce87220ce8d48e4f2#diff-bb6bb03f87fe9491ef78156256160d798369749b4b35c06d4f275425bdb6c4ad. By default, it's passed as default_buffer_prototype, though I think the user can override at the call site or globally.
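
A minimal sketch of that pattern, assuming zarr-python's buffer API (prototype.buffer.from_bytes wraps raw bytes in whatever Buffer class the caller asked for); fetch_bytes here is a hypothetical async helper standing in for the store's actual I/O:

from zarr.core.buffer import Buffer, BufferPrototype, default_buffer_prototype

async def get_like_a_store(fetch_bytes, key: str, prototype: BufferPrototype) -> Buffer | None:
    # The caller supplies the prototype; the store only wraps the raw bytes it fetched
    raw = await fetch_bytes(key)
    if raw is None:
        return None
    return prototype.buffer.from_bytes(raw)

# By default zarr passes default_buffer_prototype(); callers can supply another prototype:
# buf = await get_like_a_store(my_fetch, "array/c/0/0", default_buffer_prototype())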

Does it look compatible with what you need?

@kylebarron (Contributor, Author):

Now I'm just trying to get the tests to pass (re #1661 (comment)) and we should be good. (I can't get the tests to pass locally anyway; I get botocore.exceptions.ClientError: An error occurred (IllegalLocationConstraintException) on all the fsspec tests.)

@kylebarron (Contributor, Author):

In a3afa44 (#1661) I added intersphinx support, which allows automatic interlinking with the obstore docs. But reStructuredText drives me insane, so I hope those docs are good enough.
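
For readers unfamiliar with intersphinx: the mapping lives in the Sphinx docs/conf.py and looks roughly like the sketch below (the obstore docs URL is an assumption, not taken from this PR):

# docs/conf.py (sketch)
extensions = [
    "sphinx.ext.intersphinx",
    # ... other extensions ...
]

# Map external projects to their Sphinx object inventories so that references
# such as obstore.store.AzureStore resolve to the external docs automatically.
intersphinx_mapping = {
    "obstore": ("https://developmentseed.org/obstore/latest/", None),  # assumed URL
    "python": ("https://docs.python.org/3", None),
}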

@TomAugspurger (Contributor):

Planning to merge this tomorrow if there aren't any objections.


@TomAugspurger merged commit 9e8b50a into zarr-developers:main on Mar 24, 2025
30 checks passed
The github-project-automation bot moved this from In review to Done in Zarr-Python - 3.0 on Mar 24, 2025
@TomAugspurger (Contributor):

Thanks for the great work everyone!


@kylebarron deleted the kyle/object-store branch on March 24, 2025 17:10
@kylebarron (Contributor, Author) commented Mar 24, 2025 (edited):

Thanks all! Just published obstore 0.6, which adds easier, automatic-token-refreshing integration with Planetary Computer. And I was able to get their zarr example working with this latest main!

import matplotlib.pyplot as plt
import pystac_client
import xarray as xr
from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider
from obstore.store import AzureStore
from zarr.storage import ObjectStore

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1/")
collection = catalog.get_collection("daymet-daily-hi")
asset = collection.assets["zarr-abfs"]

# The PlanetaryComputerCredentialProvider automatically fetches Planetary
# Computer SAS tokens as necessary and refreshes them before they expire
credential_provider = PlanetaryComputerCredentialProvider.from_asset(asset)
azure_store = AzureStore(credential_provider=credential_provider)
zarr_store = ObjectStore(azure_store, read_only=True)

ds = xr.open_dataset(zarr_store, consolidated=True, engine="zarr")

fig, ax = plt.subplots(figsize=(12, 12))
ds.sel(time="2009")["tmax"].mean(dim="time").plot.imshow(ax=ax, cmap="inferno")
fig
uv pyproject.toml:

[project]
name = "zarr-obstore-pc"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "matplotlib>=3.10.1",
    "obstore>=0.6.0",
    "pystac-client>=0.8.6",
    "xarray>=2025.3.0",
    "zarr",
]

[tool.uv.sources]
zarr = { git = "https://github.com/zarr-developers/zarr-python" }

[dependency-groups]
dev = [
    "ipykernel>=6.29.5",
]

@jhamman (Member):

Huge props to @kylebarron and @maxrjones for sticking with this PR and getting it in! We'll get this out as part of Zarr 3.1.

👏 👏 👏 👏 👏


@ilan-gold (Contributor) commented Mar 28, 2025 (edited):

Hi, this PR is very exciting. I am curious: is the performance expected to be better than fsspec for similar ops (fetching from remote, etc.)? It could be good to highlight a bit why to choose this store specifically. I would love to understand!

EDIT: I see #1661 (comment); it could be great to highlight this work in the docs!

@kylebarron (Contributor, Author):

Yes, I expect it to be significantly faster, but we don't have rigorous benchmarks yet. I'd love to see some Zarr benchmarks, and then maybe we can update the docs to reflect those.

@itsgifnotjiff:

I am not sure if this is within the scope of your benchmarking, but if you can test the single-point query times and performance for Zarr stores in the 100 TB range, that would be great. Zarr v2 had problems with both the number of inodes required and performance, in my experience.

The groups/tree addition, along with the explosion of large-scale data, means Zarr stores are either already performant enough or not performant at all for different use cases (geospatial in mine).

@kylebarron (Contributor, Author):

I don't personally use Zarr much, so ideally I want to enable other people to do benchmarking. But I'm happy to pair or support in any way I can.

@itsgifnotjiff:

Makes perfect sense. I hope I get to benchmark it later this year. I will link my potential findings here as well 😊. Thank you so much for your work.


@maxrjones (Member):

> I am not sure if this is within the scope of your benchmarking, but if you can test the single-point query times and performance for Zarr stores in the 100 TB range, that would be great. Zarr v2 had problems with both the number of inodes required and performance, in my experience.
>
> The groups/tree addition, along with the explosion of large-scale data, means Zarr stores are either already performant enough or not performant at all for different use cases (geospatial in mine).

Hey @itsgifnotjiff, Davis Bennett wrote a great blog post for Earthmover that explains the general improvements in Zarr V3 with opening 100 TB-range datasets (which accounts for much of the time of single-point queries); you can read that here. The obstore store offers further improvements, as shown below. Full details are available in https://github.com/maxrjones/zarr-obstore-performance.

[Benchmark figures: Zarr load, xarray query, and xarray open performance comparisons]


@itsgifnotjiff:

Thank you very much for this. I can't wait to see if these kinds of performance improvements also apply to pseudo-Zarrs (Zarrs backed by our binary files).

@TomNicholas (Member):

@itsgifnotjiff what are "pseudo zarrs"? Is it similar to a virtual Zarr? https://github.com/zarr-developers/VirtualiZarr

@itsgifnotjiff:

Yes, I am trying to see if I can create Icechunk arrays and/or Zarr stores for petabytes of binary-format data.

I work with Environment and Climate Change Canada, where we have a wonderful binary format for NWP model outputs, and like all the organisations I've talked to, we cannot abandon it, but if we can build on top of it... (a bit like gribjump from ECMWF, or even some slides from Icechunk).

@TomNicholas (Member):

> like all the organisations I've talked to, we cannot abandon it, but if we can build on top of it

Yes, that's exactly the problem VirtualiZarr was built to solve.

> even some slides from Icechunk

Those slides are referring to VirtualiZarr, which has a facility for writing "virtual" Zarr chunks into Icechunk (see the virtualizarr docs or the icechunk docs).

> gribjump from ECMWF

Interesting - I hadn't heard of this.

But this issue is closed - @itsgifnotjiff let's continue this discussion on the VirtualiZarr repo, perhaps on this issue (or feel free to open a new one).



Reviewers

@TomAugspurger approved these changes

@jhamman left review comments

@martindurant left review comments

@maxrjones left review comments

@dcherian approved these changes

Assignees

No one assigned

Labels

None yet

Projects

Status: Done

Milestone

After 3.0.0


15 participants

@kylebarron @jhamman @martindurant @normanrz @itsgifnotjiff @TomAugspurger @d-v-b @madsbk @maxrjones @JoshCu @danielgafni @ilan-gold @TomNicholas @dcherian @dstansby
