ZCollection #995

fbriol started this discussion in Show and tell
Mar 30, 2022 · 1 comment

To process the measurements of the future SWOT mission, we decided to use the Zarr storage format. This format gives us excellent performance when parallelizing processing with Dask, whether on a laptop or on a cluster, allowing our processing to scale.

Zarr allows resizing the shape of stored tensors, so new data can be concatenated to a tensor without rewriting the entire stored data. On the other hand, an update that reorganizes the data (for example, inserting new data in the middle) requires copying the existing data after resizing the tensor.
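This asymmetry between appending and inserting can be illustrated with a toy model of a 1-D chunked array in plain Python (a conceptual sketch, not the Zarr API): appending only writes chunks at the end, while inserting shifts every element after the insertion point and therefore rewrites every subsequent chunk.

```python
CHUNK = 4


def to_chunks(values):
    """Split a flat list into fixed-size chunks (the 'stored' layout)."""
    return [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]


def chunks_rewritten(old_values, new_values):
    """Count stored chunks whose content changes between two layouts."""
    old, new = to_chunks(old_values), to_chunks(new_values)
    return sum(1 for i, chunk in enumerate(new)
               if i >= len(old) or old[i] != chunk)


data = list(range(12))  # three full chunks: [0..3] [4..7] [8..11]

# Appending only adds chunks at the end; existing chunks are untouched.
print(chunks_rewritten(data, data + [12, 13]))   # -> 1

# Inserting at the front shifts everything: every chunk is rewritten.
print(chunks_rewritten(data, [-1] + data))       # -> 4
```

The same reasoning is what makes date- or sequence-based partitioning attractive: an insertion only rewrites the partitions it falls into, not the whole collection.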

This problem also exists when using the Parquet format to store tabular data. The PyArrow library solved it by introducing a partitioned dataset made of multiple files.

The ZCollection library does the same, using Zarr as the storage format.

We have implemented partitioning by date (hour, day, month, etc.) or by sequence (to divide the satellite measurements by complete orbit). A collection partitioned by date, with monthly resolution, may look like this on disk:

    collection/
    ├── year=2022
    │    ├── month=01/
    │    │    ├── time/
    │    │    │    ├── 0.0
    │    │    │    ├── .zarray
    │    │    │    └── .zattrs
    │    │    ├── var1/
    │    │    │    ├── 0.0
    │    │    │    ├── .zarray
    │    │    │    └── .zattrs
    │    │    ├── .zattrs
    │    │    ├── .zgroup
    │    │    └── .zmetadata
    │    └── month=02/
    │         ├── time/
    │         │    ├── 0.0
    │         │    ├── .zarray
    │         │    └── .zattrs
    │         ├── var1/
    │         │    ├── 0.0
    │         │    ├── .zarray
    │         │    └── .zattrs
    │         ├── .zattrs
    │         ├── .zgroup
    │         └── .zmetadata
    └── .zcollection
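The mapping from a timestamp to a Hive-style partition directory like the ones above can be sketched with a small helper (a hypothetical illustration of the on-disk layout; the real library derives these paths internally from its partitioning handler):

```python
from datetime import datetime


def partition_path(ts: datetime, resolution: str = "M") -> str:
    """Map a timestamp to a 'year=.../month=...' style partition path.

    Hypothetical helper mirroring the directory tree shown above;
    `resolution` is one of "Y", "M", "D", or "h".
    """
    parts = [f"year={ts.year}"]
    if resolution in ("M", "D", "h"):
        parts.append(f"month={ts.month:02d}")
    if resolution in ("D", "h"):
        parts.append(f"day={ts.day:02d}")
    if resolution == "h":
        parts.append(f"hour={ts.hour:02d}")
    return "/".join(parts)


print(partition_path(datetime(2022, 1, 15)))       # -> year=2022/month=01
print(partition_path(datetime(2022, 2, 3), "D"))   # -> year=2022/month=02/day=03
```

Because the partition key is encoded in the path, a reader can prune partitions from the file listing alone, without opening any Zarr metadata.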

It's possible to set the partition update strategy. Currently, two options exist:

  • overwrite the old partition with the new one, or
  • update the time series only with the updated measurements, keeping the old ones to complete the partition data.
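The two strategies above can be sketched on plain dictionaries keyed by measurement time (a conceptual model, not the ZCollection API): "overwrite" discards the stored partition entirely, while "update" merges the new measurements into the existing ones.

```python
def overwrite(stored: dict, incoming: dict) -> dict:
    """Strategy 1: replace the whole partition with the new data."""
    return dict(incoming)


def update(stored: dict, incoming: dict) -> dict:
    """Strategy 2: keep old measurements, overriding only updated times."""
    merged = dict(stored)
    merged.update(incoming)
    return merged


partition = {"t0": 1.0, "t1": 2.0, "t2": 3.0}
new_data = {"t1": 2.5}  # a reprocessed measurement for time t1 only

print(overwrite(partition, new_data))  # -> {'t1': 2.5}
print(update(partition, new_data))     # -> {'t0': 1.0, 't1': 2.5, 't2': 3.0}
```

The second strategy is the one that preserves existing measurements when an update delivers only a subset of a partition's time steps.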

It's also possible to create views on a reference collection, in order to add and modify variables without touching the reference collection itself, which remains read-only.

This library can store data on POSIX, S3, or any other file system supported by "fsspec."

Examples of using the library are available here.


Replies: 1 comment


I wonder how this interacts with the possibility of implementing irregular chunk sizes for zarr (as dask.array does, but without depending on it)? This feature will be required for using zarr as a storage format for awkward and sparse data types. At first glance, I would say that the two are orthogonal and complementary, but I am not certain.

> The PyArrow library has solved this problem [for parquet] by introducing a partitioned dataset of multiple files

Clarification: this was done by Hive well before PyArrow existed, and is supported by all Parquet readers, such as Spark and fastparquet.

2 participants: @fbriol, @martindurant
