Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork366
ZCollection#995
-
To process the measurement of the futureSWOT mission, we decided to use the Zarr storage format. This data format allows us to obtain excellent processing performance when parallelizing processing on a laptop or a cluster with Dask to scale our processing. Zarr allows resizing the shape of the stored tensors to concatenate new data into a tensor without rewriting the entire stored data. On the other hand, if we perform an update that requires reorganizing the data (insertion of new data), it will be necessary to copy the existing data after modifying the tensor shape to update the data we want. This problem also exists when using the Parquet format to store tabular data. The PyArrow library has solved this problem by introducing apartitioned dataset or multiple files. TheZCollection library does the same using Zarr as the storage format. We have implemented partitioning by date (hour, day, month, etc.) or by sequence (to divide the satellite measurement by complete orbit). A collection partitioned by date, with a monthly resolution, may look like on the disk: It's possible to set the partition update strategy. Now, two options exist:
It's possible to create views on a reference collection, to add and modify variables contained in a reference collection, accessible in reading only. This library can store data on POSIX, S3, or any other file system supported by "fsspec." It has examples of using the library availablehere. |
BetaWas this translation helpful?Give feedback.
All reactions
👍 1
Replies: 1 comment
-
I wonder how this interacts with the possibility of implementing irregular chunk sizes for zarr (as dask.array does, but without depending on it)? This feature will be required for using zarr as a storage format for awkward and sparse data types. At first glance, I would say that the two are orthogonal and complementary, but I am not certain.
Clarification: this was done by Hive well before pyarrow existed, and s supported by all parquet readers such as spark and fastparquet . |
BetaWas this translation helpful?Give feedback.