ZCollection #995

fbriol started this discussion in Show and tell
Mar 30, 2022 · 1 comment

To process the measurements of the future SWOT mission, we decided to use the Zarr storage format. This format gives us excellent performance when parallelizing processing with Dask, whether on a laptop or on a cluster, allowing our processing to scale.

Zarr allows resizing the shape of stored tensors, so new data can be concatenated to a tensor without rewriting the entire stored data. On the other hand, an update that reorganizes the data (for example, inserting new data in the middle) requires copying the existing data after resizing the tensor.
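This asymmetry between appending and inserting can be illustrated with a toy model of a 1-D chunked array in plain Python (a conceptual sketch, not the Zarr API): appending only writes chunks at the end, while inserting shifts every element after the insertion point and therefore rewrites every subsequent chunk.

```python
CHUNK = 4


def to_chunks(values):
    """Split a flat list into fixed-size chunks (the 'stored' layout)."""
    return [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]


def chunks_rewritten(old_values, new_values):
    """Count stored chunks whose content changes between two layouts."""
    old, new = to_chunks(old_values), to_chunks(new_values)
    return sum(1 for i, chunk in enumerate(new)
               if i >= len(old) or old[i] != chunk)


data = list(range(12))  # three full chunks: [0..3] [4..7] [8..11]

# Appending only adds chunks at the end; existing chunks are untouched.
print(chunks_rewritten(data, data + [12, 13]))   # -> 1

# Inserting at the front shifts everything: every chunk is rewritten.
print(chunks_rewritten(data, [-1] + data))       # -> 4
```

The same reasoning is what makes date- or sequence-based partitioning attractive: an insertion only rewrites the partitions it falls into, not the whole collection.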

This problem also exists when using the Parquet format to store tabular data. The PyArrow library solved it by introducing a partitioned dataset made of multiple files.

The ZCollection library does the same, using Zarr as the storage format.

We have implemented partitioning by date (hour, day, month, etc.) or by sequence (to divide the satellite measurements by complete orbit). A collection partitioned by date, with monthly resolution, may look like this on disk:

    collection/
    ├── year=2022
    │    ├── month=01/
    │    │    ├── time/
    │    │    │    ├── 0.0
    │    │    │    ├── .zarray
    │    │    │    └── .zattrs
    │    │    ├── var1/
    │    │    │    ├── 0.0
    │    │    │    ├── .zarray
    │    │    │    └── .zattrs
    │    │    ├── .zattrs
    │    │    ├── .zgroup
    │    │    └── .zmetadata
    │    └── month=02/
    │         ├── time/
    │         │    ├── 0.0
    │         │    ├── .zarray
    │         │    └── .zattrs
    │         ├── var1/
    │         │    ├── 0.0
    │         │    ├── .zarray
    │         │    └── .zattrs
    │         ├── .zattrs
    │         ├── .zgroup
    │         └── .zmetadata
    └── .zcollection
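The mapping from a timestamp to a Hive-style partition directory like the ones above can be sketched with a small helper (a hypothetical illustration of the on-disk layout; the real library derives these paths internally from its partitioning handler):

```python
from datetime import datetime


def partition_path(ts: datetime, resolution: str = "M") -> str:
    """Map a timestamp to a 'year=.../month=...' style partition path.

    Hypothetical helper mirroring the directory tree shown above;
    `resolution` is one of "Y", "M", "D", or "h".
    """
    parts = [f"year={ts.year}"]
    if resolution in ("M", "D", "h"):
        parts.append(f"month={ts.month:02d}")
    if resolution in ("D", "h"):
        parts.append(f"day={ts.day:02d}")
    if resolution == "h":
        parts.append(f"hour={ts.hour:02d}")
    return "/".join(parts)


print(partition_path(datetime(2022, 1, 15)))       # -> year=2022/month=01
print(partition_path(datetime(2022, 2, 3), "D"))   # -> year=2022/month=02/day=03
```

Because the partition key is encoded in the path, a reader can prune partitions from the file listing alone, without opening any Zarr metadata.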

It's possible to set the partition update strategy. Currently, two options exist:

  • overwrite the old partition with the new one, or
  • update the time series only with the updated measurements, keeping the old ones to complete the partition data.
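The two strategies above can be sketched on plain dictionaries keyed by measurement time (a conceptual model, not the ZCollection API): "overwrite" discards the stored partition entirely, while "update" merges the new measurements into the existing ones.

```python
def overwrite(stored: dict, incoming: dict) -> dict:
    """Strategy 1: replace the whole partition with the new data."""
    return dict(incoming)


def update(stored: dict, incoming: dict) -> dict:
    """Strategy 2: keep old measurements, overriding only updated times."""
    merged = dict(stored)
    merged.update(incoming)
    return merged


partition = {"t0": 1.0, "t1": 2.0, "t2": 3.0}
new_data = {"t1": 2.5}  # a reprocessed measurement for time t1 only

print(overwrite(partition, new_data))  # -> {'t1': 2.5}
print(update(partition, new_data))     # -> {'t0': 1.0, 't1': 2.5, 't2': 3.0}
```

The second strategy is the one that preserves existing measurements when an update delivers only a subset of a partition's time steps.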

It's also possible to create views on a reference collection, in order to add and modify variables without touching the reference collection itself, which remains read-only.

This library can store data on POSIX, S3, or any other file system supported by "fsspec."

Examples of using the library are available here.


Replies: 1 comment


I wonder how this interacts with the possibility of implementing irregular chunk sizes for zarr (as dask.array does, but without depending on it)? This feature will be required for using zarr as a storage format for awkward and sparse data types. At first glance, I would say that the two are orthogonal and complementary, but I am not certain.

> The PyArrow library has solved this problem [for parquet] by introducing a partitioned dataset of multiple files

Clarification: this was done by Hive well before PyArrow existed, and is supported by all Parquet readers, such as Spark and fastparquet.

2 participants: @fbriol, @martindurant
