Incomplete zarr chunks are filled with filler #3035

Unanswered
bluppfisk asked this question in Q&A

We use zarr in a regular program (i.e. not for data analysis, but for saving massive, multi-dimensional time series). While the time dimension is always present and ever-growing, the lengths of the other dimensions may vary from store to store.

Let's take a simple time-by-position store. It makes sense to chunk by 12 hours and 1000 positions, because most queries will be in time, but some queries will also be along the position axis.

Some stores may have 1000 positions, others 1500, and we don't know which until the data is ready to be written. I assumed there was no harm in chunking 1000 positions together: if a store has 1500 positions, the last chunk would simply cover 12 hrs x 500 positions and be no larger than needed to hold exactly that data.

However, it turns out that setting the position chunk size to 500 has a massive positive effect on storage: the resulting chunks are much smaller than 1000-position chunks that only contain 500 real positions. I can only imagine this means that the remaining 500 positions are in fact written, but filled with empty values. Compression means they're not exactly twice the size, but they aren't far off.

I wouldn't have thought that zarr fills chunks with filler. Since an array cannot be larger than its dimensions, is it really necessary to waste that (disk) space? Or am I doing something wrong?
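
To make it concrete, here is a rough sketch of the kind of layout I mean (zarr v2, default compressor; the sizes are made up and the exact byte counts will depend on the codec):

```python
import numpy as np
import zarr

# 24 hours x 1500 positions, chunked as 12 hours x 1000 positions:
# the chunks covering positions 1000-1499 hold only 500 real columns
a = zarr.create(shape=(24, 1500), chunks=(12, 1000), dtype="f4")
a[:] = np.random.rand(24, 1500)

# the same data chunked as 12 hours x 500 positions, so every chunk is exactly full
b = zarr.create(shape=(24, 1500), chunks=(12, 500), dtype="f4")
b[:] = a[:]

# the 1000-position layout ends up noticeably larger on disk,
# even though both arrays hold identical data
print(a.nbytes_stored, b.nbytes_stored)
```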


@rabernat

> I wouldn't have thought that zarr fills chunks with filler. Since an array cannot be larger than its dimensions, is it really necessary to waste that (disk) space?

Yes, this is the currently implemented behavior. And it makes sense for the following reason: if every chunk is exactly the same size, then we can easily resize the array without ever having to rewrite chunks. Otherwise, Zarr would have to explicitly keep track of the size of each chunk somewhere.
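
As a rough sketch of what that buys you (zarr v2 API, arbitrary sizes):

```python
import numpy as np
import zarr

z = zarr.zeros((12, 1000), chunks=(12, 1000), dtype="f4")
z[:] = np.random.rand(12, 1000)

# Because every chunk has the same nominal shape, growing the array is just a
# metadata update; the chunk that was already written is left untouched.
z.resize(12, 1500)

print(z.shape)                # (12, 1500)
print(z.nchunks_initialized)  # still 1: nothing was rewritten
```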

Can you explain what exactly you're trying to optimize for? Is your goal truly to minimize disk space? Or some other performance metric? Why not just use smaller chunks?

If the goal is to minimize disk space, I'm confident that the right choice of compression algorithm can effectively compress away the missing data from these final chunks.

@bluppfisk

Hello Ryan, appreciate the rapid response.

I am indeed trying to minimise disk space (while retaining the performance of zarr). We're looking at storing over 100 terabytes in a continuous monitoring application with high-frequency spatial measurements (distributed acoustic sensing). I've modified the DirectoryStore backend so that older data can be trimmed while newer data is appended to the end of the zarr store.

Some stores are further processed and gain extra dimensions (3D, 4D); there the 'waste' of disk space is even more obvious. Smaller chunks would indeed be the solution, but I'm also trying to limit the number of files (we aren't confident we can migrate to zarr v3 just yet to take advantage of sharding).

We do use the Blosc compressor, and while it does seem to have an effect, the files are still much larger than they should be.

Another example: a store containing time x asset x metric (aggregating metrics from all our assets over time) had been assigned chunk sizes of 72 x 1000 x 500. I thought that would easily cover all our applications, as I cannot imagine having more than 1000 assets and 500 metrics in one system, even though most systems will only have 2 assets and 10 metrics.

A full chunk of that store is 572 kB, whereas if I reduce the chunks to 72 x 2 x 10 to match a more realistic scenario, a full chunk is only 5 kB. That's roughly 100 times smaller.
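
The uncompressed chunk sizes line up with that (assuming a 4-byte dtype such as float32; the compressed figures above are of course smaller):

```python
# raw, uncompressed size of one full chunk at 4 bytes per element
print(72 * 1000 * 500 * 4)  # 144_000_000 bytes, roughly 144 MB
print(72 * 2 * 10 * 4)      # 5_760 bytes, roughly 5.6 kB
```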

I actually thought it wasn't possible to resize a zarr array without rewriting the chunks. The implementation makes sense, but it does leave me with much higher disk space consumption than I initially expected. Smaller chunks seem to be the solution, but that means rewriting the existing stores :)

@bluppfisk

Hi @rabernat, I wonder if you have a suggestion for a compression algorithm that could make this less of an issue?

@rabernat

Default compression (zstd) seems to handle this perfectly fine. Here's an example

```python
import zarr
import numpy as np

group = zarr.create_group(zarr.storage.MemoryStore())

# two arrays of the same shape (100_000)
# the first is chunked with chunks too big
a1 = group.create(name='a1', shape=100_000, chunks=2_000_000, dtype='f4')
# the second is chunked exactly to the array shape
a2 = group.create(name='a2', shape=100_000, chunks=100_000, dtype='f4')

# fill each with random data
# for a1, this requires writing the entire 2M element chunk
a1[:] = np.random.rand(100_000)
# for a2, this just writes a 100_000 element chunk
a2[:] = np.random.rand(100_000)

print("--- a1 ---\n", a1.info_complete())
print("--- a2 ---\n", a2.info_complete())
```

output

```
--- a1 ---
Type               : Array
Zarr format        : 3
Data type          : DataType.float32
Shape              : (100000,)
Chunk shape        : (2000000,)
Order              : C
Read-only          : False
Store type         : MemoryStore
Filters            : ()
Serializer         : BytesCodec(endian=<Endian.little: 'little'>)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 400000 (390.6K)
No. bytes stored   : 358690
Storage ratio      : 1.1
Chunks Initialized : 1

--- a2 ---
Type               : Array
Zarr format        : 3
Data type          : DataType.float32
Shape              : (100000,)
Chunk shape        : (100000,)
Order              : C
Read-only          : False
Store type         : MemoryStore
Filters            : ()
Serializer         : BytesCodec(endian=<Endian.little: 'little'>)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 400000 (390.6K)
No. bytes stored   : 358590
Storage ratio      : 1.1
Chunks Initialized : 1
```

Both arrays are nearly identical in size on disk, despite a1 using chunks that are 20x bigger.

@bluppfisk

Thank you - did Zstd become the default in version 3? I'm on zarr 2 and here it seems to be Blosc+lz4.
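
A quick way to see the zarr 2 default (a sketch; the exact repr may differ across versions):

```python
import zarr

# zarr 2.x: an array created without an explicit compressor uses Blosc+lz4
z = zarr.create(shape=(10,), dtype="f4")
print(z.compressor)
# e.g. Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
```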

I'll leave some comparisons here:

```python
import numcodecs
import numpy as np
import zarr

compressor = numcodecs.Blosc(cname="lz4", clevel=5, shuffle=1, blocksize=0)

group = zarr.group(zarr.storage.MemoryStore())

# two arrays of the same shape (200, 100), both with oversized chunks
# a1 uses 200 x 100_000 chunks
a1 = group.create(name='a1', shape=(200, 100), chunks=(200, 100_000), dtype='<f4', compressor=compressor)
# a2 uses even larger 200 x 2_000_000 chunks
a2 = group.create(name='a2', shape=(200, 100), chunks=(200, 2_000_000), dtype='<f4', compressor=compressor)

# fill each with random data; in both cases the full chunk is written,
# padded with the fill value beyond the array shape
a1[:] = np.random.rand(200, 100)
a2[:] = np.random.rand(200, 100)

print("--- a1 --- \n", a1.info)
print("--- a2 --- \n", a2.info)
```

results:

```
--- a1 ---
Name               : /a1
Type               : zarr.core.Array
Data type          : float32
Shape              : (200, 100)
Chunk shape        : (200, 100000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000 (78.1K)
No. bytes stored   : 409057 (399.5K)
Storage ratio      : 0.2
Chunks initialized : 1/1

--- a2 ---
Name               : /a2
Type               : zarr.core.Array
Data type          : float32
Shape              : (200, 100)
Chunk shape        : (200, 2000000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000 (78.1K)
No. bytes stored   : 6543749 (6.2M)
Storage ratio      : 0.0
Chunks initialized : 1/1
```

With Zlib, I indeed get better results:

```
--- a1 ---
Name               : /a1
Type               : zarr.core.Array
Data type          : float32
Shape              : (200, 100)
Chunk shape        : (200, 100000)
Order              : C
Read-only          : False
Compressor         : Zlib(level=5)
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000 (78.1K)
No. bytes stored   : 161363 (157.6K)
Storage ratio      : 0.5
Chunks initialized : 1/1

--- a2 ---
Name               : /a2
Type               : zarr.core.Array
Data type          : float32
Shape              : (200, 100)
Chunk shape        : (200, 2000000)
Order              : C
Read-only          : False
Compressor         : Zlib(level=5)
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000 (78.1K)
No. bytes stored   : 1648111 (1.6M)
Storage ratio      : 0.0
Chunks initialized : 1/1
```

Blosc/zstd:

```
--- a1 ---
Name               : /a1
Type               : zarr.core.Array
Data type          : float32
Shape              : (200, 100)
Chunk shape        : (200, 100000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zstd', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000 (78.1K)
No. bytes stored   : 88584 (86.5K)
Storage ratio      : 0.9
Chunks initialized : 1/1

--- a2 ---
Name               : /a2
Type               : zarr.core.Array
Data type          : float32
Shape              : (200, 100)
Chunk shape        : (200, 2000000)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='zstd', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000 (78.1K)
No. bytes stored   : 279798 (273.2K)
Storage ratio      : 0.3
Chunks initialized : 1/1
```
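
For completeness, zarr 2 also accepts numcodecs' standalone Zstd codec (not wrapped in Blosc); a sketch I haven't benchmarked against the numbers above:

```python
import numcodecs
import numpy as np
import zarr

compressor = numcodecs.Zstd(level=5)

z = zarr.zeros((200, 100), chunks=(200, 100_000), dtype="<f4", compressor=compressor)
z[:] = np.random.rand(200, 100)
print(z.info)
```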
