zarr-developers/zarr-python


Are trailing partial chunks padded? #3056

maxrjones asked this question in Q&A · Answered by d-v-b

maxrjones
May 13, 2025
Maintainer

I'm having difficulties with the default ArrayBytesCodec for the last partial chunk of a dataset. Upon inspection, it seems that Zarr-Python pads any trailing partial chunks. Is this interpretation correct? If so, is it intentional? I'm asking because the expectation that all chunks are complete seems to cause issues for virtualization from other file formats and will presumably also cause errors for any future virtual array concatenation of Zarr stores (e.g., zarr-developers/zarr-specs#288).

Here's an example:

```python
import zarr
from zarr.storage import LocalStore

# Create an array with one full chunk of shape (3, 4) and one partial chunk (1, 4)
shape = (4, 4)
chunks = (3, 4)
new_dtype = "uint8"
overwrite = True
zarr_format = 3
store = LocalStore(root=".vscode/zarr-data/example.zarr", read_only=False)
arr = zarr.create_array(
    store,
    name="0",
    shape=shape,
    chunks=chunks,
    dtype=new_dtype,
    zarr_format=3,
    compressors=None,
    filters=None,
    overwrite=overwrite,
)
arr[:] = 42
```

Inspect the size of the two chunks on disk:

```shell
ls -l .vscode/zarr-data/example.zarr/0/c/0/0 | awk '{print $5}'  # 12 (expected)
ls -l .vscode/zarr-data/example.zarr/0/c/1/0 | awk '{print $5}'  # 12 (I would expect 4)
```
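The reported sizes match what a regular chunk grid would produce if every chunk is stored at the full chunk shape. Here is a minimal sketch of that arithmetic using only the standard library. The helper name `chunk_sizes` is hypothetical and not part of Zarr-Python, and the sketch assumes uncompressed, unfiltered chunks as in the example.

```python
from itertools import product
from math import ceil, prod


def chunk_sizes(shape, chunks, itemsize):
    """Map each chunk index to (stored_bytes, valid_bytes) for a regular grid.

    Hypothetical helper, not part of Zarr-Python. Assumes uncompressed chunks.
    """
    grid = [ceil(s / c) for s, c in zip(shape, chunks)]
    stored = prod(chunks) * itemsize  # every chunk is written at full size
    sizes = {}
    for idx in product(*(range(n) for n in grid)):
        # Extent of the array data that actually falls inside this chunk.
        valid_shape = [min(c, s - i * c) for i, c, s in zip(idx, chunks, shape)]
        sizes[idx] = (stored, prod(valid_shape) * itemsize)
    return sizes


sizes = chunk_sizes(shape=(4, 4), chunks=(3, 4), itemsize=1)  # uint8
for idx, (stored, valid) in sizes.items():
    print(idx, stored, valid)
```

For the `(4, 4)` array with `(3, 4)` chunks, chunk `(1, 0)` holds only 4 bytes of array data but is still accounted for at the full 12 bytes.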


Answered by d-v-b

May 13, 2025

> Is this interpretation correct? If so, is it intentional?

For the default chunk grid, yes to both questions.

This can still work for virtualization, but only if byte ranges are addressable in the virtualization scheme and the byte range for every boundary chunk has been calculated.


maxrjones May 13, 2025
Maintainer Author

Interesting, thanks for the link and explanation.

> This can still work for virtualization but only if byte ranges are addressable in the virtualization scheme, and the byte range for all the boundary chunks has been calculated.

Hmm, I think that alone is not sufficient. For example, I have the addressable byte ranges for boundary chunks but encounter a reshape error in the ArrayBytes codec because the buffer is shorter than a full chunk's buffer. IIUC, one option is to add a BytesBytes codec (e.g., Pad) between any compressors and the ArrayBytes codec that pads the buffer to the size of a full chunk, making sure that the Pad codec aligns with the defined ArrayBytes codec so that the correct bytes are truncated. That seems a bit risky though. Do you know offhand where in the codec pipeline the truncation happens?
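As a rough illustration of the padding idea, here is a standard-library sketch that right-pads a short boundary-chunk buffer to the full chunk size and then truncates back to the valid rows. Both helpers (`pad_to_full_chunk`, `rows`) are hypothetical and are not Zarr-Python codecs; the zero fill byte and the C (row-major) layout are assumptions.

```python
from math import prod


def pad_to_full_chunk(buf, chunk_shape, itemsize, fill=b"\x00"):
    """Right-pad a short chunk buffer so it matches a full chunk's byte length.

    Hypothetical helper, not a Zarr-Python codec. `fill` is the encoded
    bytes of one fill element and must have length `itemsize`.
    """
    full = prod(chunk_shape) * itemsize
    missing = full - len(buf)
    if missing < 0 or missing % itemsize or len(fill) != itemsize:
        raise ValueError("buffer does not fit the chunk shape")
    return buf + fill * (missing // itemsize)


def rows(buf, chunk_shape, itemsize):
    """Split a full 2D chunk buffer into rows, assuming C (row-major) order."""
    stride = chunk_shape[1] * itemsize
    return [buf[i * stride:(i + 1) * stride] for i in range(chunk_shape[0])]


# Boundary chunk from the example: 1 valid row of 4 uint8 values in a (3, 4) chunk.
short = bytes([42] * 4)
padded = pad_to_full_chunk(short, chunk_shape=(3, 4), itemsize=1)
valid_rows = rows(padded, (3, 4), 1)[:1]  # truncate back to the valid extent
```

Trailing padding like this only lines up when the partial extent lies along the first axis in C order. A chunk that is partial along the last axis would need fill bytes interleaved with the valid bytes, which is the alignment risk raised above.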

An alternative would be to propose a chunk grid extension, but that risks limiting interoperability.


maxrjones May 13, 2025
Maintainer Author

I guess we could also define a custom ArrayBytes codec that pads before reshaping. Again, I'm not sure whether this is way too hacky.


d-v-b May 13, 2025
Maintainer

Ah, and I was wrong about the byte addressing thing. On the encoding side, I think partial chunks are padded to full size before the codec pipeline runs? If so, there's no way a byte range can be helpful, because the entire padded chunk will be compressed.

Answer selected by maxrjones
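To illustrate why a byte range stops being useful once compression is involved, here is a small sketch using the standard-library `zlib` as a stand-in compressor. It assumes, per the comment above, that the partial chunk is padded to full size before encoding. The zero fill value is also an assumption.

```python
import zlib

chunk_nbytes = 12                  # full (3, 4) uint8 chunk
valid = bytes([42] * 4)            # the one row of real data
padded = valid + bytes(chunk_nbytes - len(valid))  # assumed zero fill

stored = zlib.compress(padded)     # stream for the padded boundary chunk
only_valid = zlib.compress(valid)  # stream for the valid region alone

# The stored stream decodes to the full padded chunk, not the valid region.
decoded = zlib.decompress(stored)
print(len(decoded), stored == only_valid)
```

The stored stream encodes valid data and padding together, so no sub-range of it corresponds to the stream that compressing only the valid region would produce.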


rabernat
May 13, 2025
Maintainer

Note that this is the same issue as #3035.

Here's the explanation for this behavior that I provided there.

> It makes sense for the following reason: if every chunk is exactly the same size, then we can easily resize the array without ever having to rewrite chunks. Otherwise, Zarr would have to explicitly keep track of the size of the chunks somewhere.

For the default chunk grid, every chunk is identical in terms of how it is stored. There is nothing special about the final chunk.
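The resize argument can be sketched with a small standard-library helper that lists the chunk keys of a regular grid, using the `c/<i>/<j>` layout seen in the on-disk paths earlier in this thread. `chunk_keys` is a hypothetical helper, not a Zarr-Python function.

```python
from itertools import product
from math import ceil


def chunk_keys(shape, chunks):
    """Keys of all chunks in a regular grid, in the `c/<i>/<j>` layout.

    Hypothetical helper, not part of Zarr-Python.
    """
    grid = [range(ceil(s / c)) for s, c in zip(shape, chunks)]
    return {"c/" + "/".join(map(str, idx)) for idx in product(*grid)}


before = chunk_keys(shape=(4, 4), chunks=(3, 4))
grown = chunk_keys(shape=(6, 4), chunks=(3, 4))   # grow within the last chunk
larger = chunk_keys(shape=(7, 4), chunks=(3, 4))  # grow past it

print(sorted(before))
print(grown == before)
print(sorted(larger - before))
```

Growing the array from 4 to 6 rows leaves the set of chunk keys unchanged, so under this model only the array metadata would need updating. Growing to 7 rows adds one new key without touching the existing ones.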
