POC implementation of ZEP003 #1483


Draft

martindurant wants to merge 16 commits into zarr-developers:main from martindurant:varchunk

Conversation

@martindurant
Member

Implemented on v2, since the array metadata is not validated (!).

Example:

```python
In [1]: import zarr

In [2]: import numpy as np

In [3]: store = {}

In [4]: z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=[[1, 2, 2, 5], 5], compression=None)

In [5]: z[:] = (np.arange(100)).reshape((10, 10))

In [6]: store  # notice the different numbers of data bytes
Out[6]:
{'.zarray': b'{\n    "chunks": [\n        [\n            1,\n            2,\n            2,\n            5\n        ],\n        5\n    ],\n    "compressor": null,\n    "dimension_separator": ".",\n    "dtype": "<i4",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        10,\n        10\n    ],\n    "zarr_format": 2\n}',
 '0.0': b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00',
 '0.1': b'\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00',
 '1.0': b'\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x14\x00\x00\x00\x15\x00\x00\x00\x16\x00\x00\x00\x17\x00\x00\x00\x18\x00\x00\x00',
 '1.1': b'\x0f\x00\x00\x00\x10\x00\x00\x00\x11\x00\x00\x00\x12\x00\x00\x00\x13\x00\x00\x00\x19\x00\x00\x00\x1a\x00\x00\x00\x1b\x00\x00\x00\x1c\x00\x00\x00\x1d\x00\x00\x00',
 '2.0': b'\x1e\x00\x00\x00\x1f\x00\x00\x00 \x00\x00\x00!\x00\x00\x00"\x00\x00\x00(\x00\x00\x00)\x00\x00\x00*\x00\x00\x00+\x00\x00\x00,\x00\x00\x00',
 '2.1': b"#\x00\x00\x00$\x00\x00\x00%\x00\x00\x00&\x00\x00\x00'\x00\x00\x00-\x00\x00\x00.\x00\x00\x00/\x00\x00\x000\x00\x00\x001\x00\x00\x00",
 '3.0': b'2\x00\x00\x003\x00\x00\x004\x00\x00\x005\x00\x00\x006\x00\x00\x00<\x00\x00\x00=\x00\x00\x00>\x00\x00\x00?\x00\x00\x00@\x00\x00\x00F\x00\x00\x00G\x00\x00\x00H\x00\x00\x00I\x00\x00\x00J\x00\x00\x00P\x00\x00\x00Q\x00\x00\x00R\x00\x00\x00S\x00\x00\x00T\x00\x00\x00Z\x00\x00\x00[\x00\x00\x00\\\x00\x00\x00]\x00\x00\x00^\x00\x00\x00',
 '3.1': b'7\x00\x00\x008\x00\x00\x009\x00\x00\x00:\x00\x00\x00;\x00\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00E\x00\x00\x00K\x00\x00\x00L\x00\x00\x00M\x00\x00\x00N\x00\x00\x00O\x00\x00\x00U\x00\x00\x00V\x00\x00\x00W\x00\x00\x00X\x00\x00\x00Y\x00\x00\x00_\x00\x00\x00`\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'}

In [7]: z[:]
Out[7]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]], dtype=int32)
```

The github-actions bot added the "needs release notes" label Aug 1, 2023
@ivirshup left a comment

Thanks for getting started on this!

What would you like to do here, finish up for v2, then do v3? Or try to just go for v3?

Btw, I noticed a bug in the indexing and have opened a PR here: martindurant#21 to fix it.

```python
nelem = (slice_end - slice_start) // step
self.projections.append(
    ChunkDimProjection(
        i, slice(slice_start, slice_end, step), slice(nfilled, nfilled + nelem)
    )
)
```


I believe this will only work if the chunk boundaries are multiples of the step size.

```python
import numpy as np, zarr
z = zarr.array(np.arange(10), chunks=[[1, 2, 2, 5]])
z[::2]  # array([6, 8], dtype=int32)
```

I've opened a PR into your branch fixing this plus adding some tests: martindurant#21
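For reference, here is a minimal standalone sketch (hypothetical helper, not the PR's actual code) of how a strided selection can be projected onto each variable-sized chunk correctly even when chunk boundaries are not multiples of the step:

```python
import numpy as np

def chunk_slice(start, stop, step, chunk_off, chunk_len):
    """In-chunk slice for the strided selection [start:stop:step],
    for a chunk covering global indices [chunk_off, chunk_off + chunk_len)."""
    lo = max(start, chunk_off)
    # First global index >= lo that lies on the stride.
    first = start + ((lo - start + step - 1) // step) * step
    hi = min(stop, chunk_off + chunk_len)
    if first >= hi:
        return None  # the stride skips this chunk entirely
    return slice(first - chunk_off, hi - chunk_off, step)

# Variable chunks [1, 2, 2, 5] over arange(10); select [::2].
data = np.arange(10)
offsets = np.cumsum([0, 1, 2, 2, 5])  # chunk start offsets [0, 1, 3, 5, 10]
out = []
for off, nxt in zip(offsets[:-1], offsets[1:]):
    sl = chunk_slice(0, 10, 2, off, nxt - off)
    if sl is not None:
        out.extend(data[off:nxt][sl])
# out collects [0, 2, 4, 6, 8], not the buggy [6, 8]
```

The key point is that the first selected element inside a chunk depends on the global `start` and `step`, not just on the chunk offset.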

@martindurant
Member (Author)

The point was to get feedback and show that what was proposed in ZEP0003 is very achievable if we can get some buy-in.

I would be happy to get this working for v2, since that alone solves the kerchunk case. However, it should be part of v3, and the implementation would presumably be identical except for how the chunks are written to/from metadata (and the correct API for creating such arrays).

As things stand here, bool indexing and fancy (int) indexing haven't been done yet; I expect the former to be very easy. Also, plenty of things that access `.chunks` are broken if there are variable chunks, such as `arr.info`.

@ivirshup
ivirshup commented Aug 7, 2023 (edited)

So, broadly, this code is relevant in any path forward and it would be good to flesh out so we can show off use cases?

I would be up for collaborating some more here, either via PRs to your branch or whatever.

> fancy (int) indexing haven't been done yet

I think I just got this working. It was basically replacing `dim_sel_chunk = dim_sel // dim_chunk_len` with `dim_sel_chunk = np.digitize(dim_sel, self.offsets[1:])`.

Also boolean indexing, which was indeed quite easy: essentially `np.add.reduceat`. Both cases did end up with a fair amount of copied code, though. However, I ran into some poor support for boolean masks (#1490), so I am not sure how commonly used this is.
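Both tricks can be sketched with plain NumPy (illustrative only; `offsets` here stands in for the per-dimension chunk-boundary array, mirroring `self.offsets`):

```python
import numpy as np

chunk_lens = np.array([1, 2, 2, 5])  # variable chunks over a length-10 axis
offsets = np.concatenate([[0], np.cumsum(chunk_lens)])  # [0, 1, 3, 5, 10]

# Fancy (integer) indexing: with uniform chunks the owning chunk is
# `dim_sel // dim_chunk_len`; with variable chunks it becomes a bin lookup.
dim_sel = np.array([0, 2, 4, 9])
dim_sel_chunk = np.digitize(dim_sel, offsets[1:])  # chunk index per element

# Boolean indexing: count selected elements per chunk with reduceat
# (cast to int first, since np.add on booleans saturates at True).
mask = np.arange(10) % 2 == 0  # select even positions
per_chunk = np.add.reduceat(mask.astype(int), offsets[:-1])
```

Here `dim_sel_chunk` comes out as `[0, 1, 2, 3]` and `per_chunk` as `[1, 1, 1, 2]`, which is enough information to route each selected element to its variable-sized chunk.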

@agoodm
Contributor

Great to see this happening, @martindurant! @alex-s-gardner and I are very interested in seeing the progress of this (zarr-developers/zarr-specs#138).

@martindurant
Member (Author)

@ivirshup, the current code does break a significant number of tests with standard indexing; it seems to always happen in the last chunk. Probably happening in `_process_chunk`, I think.

@joshmoore
Member

Just a quick note before I dive in more. ZEP0003 has the line "It would be reasonable to wish to backport the feature to v2.", and this is on v2. I'll just point out we really don't have the mechanisms to introduce something like this in v2: no process for updating the spec, and no way for implementations to know they're getting something they can't support.


@martindurant
Member (Author)

@joshmoore - the index code should be exactly the same. It's in v2 exactly because we didn't have to update the spec/metadata handling code to get it to work. Actually, it would be useful to kerchunk as-is, given that it would be for datasets that simply cannot otherwise be represented by zarr. But yes, the aim is to make this an implementation for v3. I still think it would be useful for v2 to be able to read such datasets, however.

@joshmoore
Member

> I still think it would be useful for v2 to be able to read such datasets, however.

Definitely no objections that it would be useful. But I'm concerned about the cost on all the other implementations. Correct me if I'm wrong, but they'd fall over spectacularly, no?

@martindurant
Member (Author)

> Correct me if I'm wrong, but they'd fall over spectacularly, no?

They would fall over, yes. It would be an early error before loading any data. However, if this were a kerchunk thing, they probably can't open the store object anyway.

@codecov
codecov bot commented Aug 14, 2023 (edited)

Codecov Report

Attention: Patch coverage is 97.09302% with 5 lines in your changes missing coverage. Please review.

Project coverage is 99.96%. Comparing base (f542fca) to head (f8d9bf2).
Report is 163 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| zarr/indexing.py | 96.42% | 5 Missing ⚠️ |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##              main    #1483      +/-   ##
===========================================
- Coverage   100.00%   99.96%   -0.04%
===========================================
  Files           37       37
  Lines        14729    14889     +160
===========================================
+ Hits         14729    14884     +155
- Misses           0        5       +5
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| zarr/core.py | 100.00% <100.00%> (ø) |
| zarr/tests/test_indexing.py | 100.00% <100.00%> (ø) |
| zarr/util.py | 100.00% <100.00%> (ø) |
| zarr/indexing.py | 99.25% <96.42%> (-0.75%) ⬇️ |

... and 1 file with indirect coverage changes

@martindurant
Member (Author)

martindurant commented Aug 14, 2023 (edited)

Just to point out: this works transparently with v3; we apparently DO NOT VALIDATE the chunks property.

```python
store_ = {}
store = zarr._storage.v3.KVStoreV3(store_)
z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None)
# z = zarr.open_array(store_, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None, zarr_version=3)  # same as above
z[:] = (np.arange(100)).reshape((10, 10))
assert (z[:] == np.arange(100).reshape((10, 10))).all()
```

metadata:

```python
>>> print(store_["meta/root/array.array.json"].decode())
{
    "attributes": {},
    "chunk_grid": {
        "chunk_shape": [
            [
                1,
                2,
                2,
                5
            ],
            5
        ],
        "separator": "/",
        "type": "regular"
    },
    "chunk_memory_layout": "C",
    "data_type": "<i4",
    "dimension_separator": "/",
    "extensions": [],
    "fill_value": 0,
    "shape": [
        10,
        10
    ]
}
```

@okz
okz commented Sep 15, 2023

Does this implementation already work with kerchunk?

@martindurant
Member (Author)

> Does this implementation already work with kerchunk?

In principle, the implementation works, but there is currently no code in kerchunk to produce zarr metadata that would use it.

@pbranson

+1 to this PR, would be great if this worked in V2


@martindurant
Member (Author)

> would be great if this worked in V2

It does work for V2! Some things will break, as noted here, e.g., `array.info()`, but `dask.array.from_zarr` should work as is.

@meggart
Member

This feature would be very interesting to have from the Julia side as well, and I would very much be in favor of having it, even as some kind of patch for v2, so that it is in a usable state earlier. I drafted an implementation for Zarr.jl (JuliaIO/Zarr.jl#126), but I think it is not compatible with this implementation. The main detail I stumbled across was that without ZEP0003 there was a guarantee that every chunk in a store would represent a chunk of exactly the same shape when decompressed. Even for incomplete chunks at the array boundaries this was achieved through padding with fill values that are ignored during reading, but allow zero-cost resizing.

This allowed easy re-use of compression/decompression buffers when reading/writing multiple chunks sequentially. My current Julia implementation keeps this behavior, in that it always compresses chunks of the maximum chunk size by padding with fill values, so that the invariant mentioned above is maintained.

Maybe it would be a good idea to clarify in the ZEP text the consequences this has for the uniformity of an uncompressed chunk. To illustrate this with a small example:

```python
import zarr
z1 = zarr.create((10,), chunks=(3,), dtype='i4', fill_value=1, store="./chunktests.zarr", compressor=None)
z2 = zarr.create((10,), chunks=([3, 3, 3, 1],), dtype='i4', fill_value=1, store="./chunktests.zarr", compressor=None)
```

The question is whether these two arrays should be equivalent and store the same binary information. I think with the current implementation they would not, because in the last chunk z1 would pad fill values while z2 would not.
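To make the byte-level difference concrete, here is a plain-NumPy sketch (illustrative only; it does not use zarr) of what each scheme would store for the final chunk of the example above, with no compressor:

```python
import numpy as np

data = np.arange(10, dtype="<i4")
fill_value = 1

# z1 (regular chunks of 3): the final chunk holds one real element but is
# padded with fill_value to the full chunk shape before encoding.
padded_last = np.full(3, fill_value, dtype="<i4")
padded_last[0] = data[9]

# z2 (explicit chunk sizes [3, 3, 3, 1]): the final chunk is stored exactly.
exact_last = data[9:]

padded_bytes = len(padded_last.tobytes())  # 12 bytes stored
exact_bytes = len(exact_last.tobytes())    # 4 bytes stored
```

The two stores hold different bytes for the same logical array, which is the uniformity question the ZEP text could spell out.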

@meggart
Member

Thinking more about this, I realized that your non-padding implementation is the only one that would work well together with kerchunk, so this is definitely the way to go. We might still want to mention this point somewhere in the ZEP draft.

@martindurant
Member (Author)

Indeed, the kerchunk workflow is very important to me, if not to everyone. Furthermore, we read multiple chunks concurrently and in the future will decompress in parallel too, which means you can't easily reuse buffers. In Python's memory model, the buffer will not actually be released to the OS for a while anyway, so maybe it's no win for Python at all. In the final, best implementation, we would even like to read or decompress directly into the target array's memory buffer for contiguous cases.

@ivirshup
ivirshup commented Oct 16, 2023 (edited)

I've started a POC on top of this POC for `kerchunk.combine.concatenate_arrays` at fsspec/kerchunk#374.

I'm also pretty sure I can only get this working without padding.


@sanketverma1704
Member

sanketverma1704 commented Oct 24, 2023 (edited)

Hi @martindurant, thanks for sending the PR.

I have requested reviews from the Zarr-Python devs. Additionally, if anyone from @zarr-developers/python-core-devs can review the PR, I'd be grateful.

Also, @normanrz, if you could review this PR, that'd be great.

@jhamman
Member

Meta comment: we are working on a fresh Array implementation that covers both v2 and v3 arrays (see #1583 for details). We are actively seeking input/participation from those invested in ZEP003. Variable chunking is likely to require some changes to the codec API, and we want to get your input as we roll out the new design. cc @d-v-b and @normanrz.

xref: #1595


@jhamman
Member

@martindurant and others - now that v3 has come together, I'd be very interested to see this move to that branch. Who is interested in trying out an implementation on top of the new array api?

@jhamman added the "V2" (affects the v2 branch) label Oct 11, 2024
@TomNicholas
Member

I would be - this is something that @abarciauskas-bgse and I want to get funding to work on... Don't let that stop anyone else having a go in the meantime though!


@martindurant
Member (Author)

It's nice to see people wanting to see this move ahead. I'm not familiar enough with the v3 code to know how easy it is to port the partial implementation here.

Before progressing, do we need action on the original ZEP? It has not been accepted, and has a specific prescription on how to store the chunk sizes.

@jhamman
Member

Now would be a great time to start thinking about how this will work with Zarr-Python 3.


@martindurant
Member (Author)

@jhamman: some things are still up for general discussion, particularly how to store the chunk sizes, which is a change to the spec (https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#regular-grids; I suppose we need a new grid class?).

@d-v-b
Contributor

How about `{"name": "rectilinear", "configuration": {"chunk_shapes": [[2, 5], [1, 3]]}}`?

@martindurant
Member (Author)

@d-v-b: I'm OK with that, and it closely matches this draft code. Probably any of those lists can be replaced by a single int in the case that the chunking happens to be regular on that axis. I suppose it's enough to get going, anyway.
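For illustration only, the mixed form suggested here might look like the following (hypothetical; not part of any accepted spec), where the second axis is regular with chunk length 3:

```json
{
    "name": "rectilinear",
    "configuration": {
        "chunk_shapes": [
            [2, 5],
            3
        ]
    }
}
```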
