POC implementation of ZEP003 #1483


Draft

martindurant wants to merge 16 commits into zarr-developers:main from martindurant:varchunk

Conversation

@martindurant
Member

Implemented on v2, since the array metadata is not validated (!).

Example:

```python
In [1]: import zarr

In [2]: import numpy as np

In [3]: store = {}

In [4]: z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=[[1, 2, 2, 5], 5], compression=None)

In [5]: z[:] = (np.arange(100)).reshape((10, 10))

In [6]: store  # notice the different numbers of data bytes
Out[6]:
{'.zarray': b'{\n    "chunks": [\n        [\n            1,\n            2,\n            2,\n            5\n        ],\n        5\n    ],\n    "compressor": null,\n    "dimension_separator": ".",\n    "dtype": "<i4",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        10,\n        10\n    ],\n    "zarr_format": 2\n}',
 '0.0': b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00',
 '0.1': b'\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00',
 '1.0': b'\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x14\x00\x00\x00\x15\x00\x00\x00\x16\x00\x00\x00\x17\x00\x00\x00\x18\x00\x00\x00',
 '1.1': b'\x0f\x00\x00\x00\x10\x00\x00\x00\x11\x00\x00\x00\x12\x00\x00\x00\x13\x00\x00\x00\x19\x00\x00\x00\x1a\x00\x00\x00\x1b\x00\x00\x00\x1c\x00\x00\x00\x1d\x00\x00\x00',
 '2.0': b'\x1e\x00\x00\x00\x1f\x00\x00\x00 \x00\x00\x00!\x00\x00\x00"\x00\x00\x00(\x00\x00\x00)\x00\x00\x00*\x00\x00\x00+\x00\x00\x00,\x00\x00\x00',
 '2.1': b"#\x00\x00\x00$\x00\x00\x00%\x00\x00\x00&\x00\x00\x00'\x00\x00\x00-\x00\x00\x00.\x00\x00\x00/\x00\x00\x000\x00\x00\x001\x00\x00\x00",
 '3.0': b'2\x00\x00\x003\x00\x00\x004\x00\x00\x005\x00\x00\x006\x00\x00\x00<\x00\x00\x00=\x00\x00\x00>\x00\x00\x00?\x00\x00\x00@\x00\x00\x00F\x00\x00\x00G\x00\x00\x00H\x00\x00\x00I\x00\x00\x00J\x00\x00\x00P\x00\x00\x00Q\x00\x00\x00R\x00\x00\x00S\x00\x00\x00T\x00\x00\x00Z\x00\x00\x00[\x00\x00\x00\\\x00\x00\x00]\x00\x00\x00^\x00\x00\x00',
 '3.1': b'7\x00\x00\x008\x00\x00\x009\x00\x00\x00:\x00\x00\x00;\x00\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00E\x00\x00\x00K\x00\x00\x00L\x00\x00\x00M\x00\x00\x00N\x00\x00\x00O\x00\x00\x00U\x00\x00\x00V\x00\x00\x00W\x00\x00\x00X\x00\x00\x00Y\x00\x00\x00_\x00\x00\x00`\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'}

In [7]: z[:]
Out[7]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]], dtype=int32)
```

The github-actions bot added the "needs release notes" label Aug 1, 2023
@ivirshup left a comment

Thanks for getting started on this!

What would you like to do here, finish up for v2, then do v3? Or try to just go for v3?

Btw, I noticed a bug in the indexing and have opened a PR here: martindurant#21 to fix it.

```python
nelem = (slice_end - slice_start) // step
self.projections.append(
    ChunkDimProjection(
        i, slice(slice_start, slice_end, step), slice(nfilled, nfilled + nelem)
    )
)
```


I believe this will only work if the chunk boundaries are multiples of the step size.

```python
import numpy as np, zarr
z = zarr.array(np.arange(10), chunks=[[1, 2, 2, 5]])
z[::2]  # array([6, 8], dtype=int32)
```

I've opened a PR into your branch fixing this plus adding some tests: martindurant#21
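For reference, here is a minimal standalone sketch (hypothetical helper, not the PR's actual code) of how a strided selection can be projected onto each variable-sized chunk correctly even when chunk boundaries are not multiples of the step:

```python
import numpy as np

def chunk_slice(start, stop, step, chunk_off, chunk_len):
    """In-chunk slice for the strided selection [start:stop:step],
    for a chunk covering global indices [chunk_off, chunk_off + chunk_len)."""
    lo = max(start, chunk_off)
    # First global index >= lo that lies on the stride.
    first = start + ((lo - start + step - 1) // step) * step
    hi = min(stop, chunk_off + chunk_len)
    if first >= hi:
        return None  # the stride skips this chunk entirely
    return slice(first - chunk_off, hi - chunk_off, step)

# Variable chunks [1, 2, 2, 5] over arange(10); select [::2].
data = np.arange(10)
offsets = np.cumsum([0, 1, 2, 2, 5])  # chunk start offsets [0, 1, 3, 5, 10]
out = []
for off, nxt in zip(offsets[:-1], offsets[1:]):
    sl = chunk_slice(0, 10, 2, off, nxt - off)
    if sl is not None:
        out.extend(data[off:nxt][sl])
# out collects [0, 2, 4, 6, 8], not the buggy [6, 8]
```

The key point is that the first selected element inside a chunk depends on the global `start` and `step`, not just on the chunk offset.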

@martindurant
Member (Author)

The point was to get feedback and show that what was proposed in ZEP0003 is very achievable if we can get some buy-in.

I would be happy to get this working for v2, since that alone solves the kerchunk case. However, it should be part of v3, and the implementation would presumably be identical except for how the chunks are written to/from metadata (and the correct API for creating such arrays).

As things stand here, bool indexing and fancy (int) indexing haven't been done yet; I expect the former to be very easy. Also, plenty of things that access `.chunks` are broken if there are variable chunks, such as `arr.info`.

@ivirshup
ivirshup commented Aug 7, 2023 (edited)

So, broadly, this code is relevant in any path forward and it would be good to flesh out so we can show off use cases?

I would be up for collaborating some more here, either via PRs to your branch or whatever.

> fancy (int) indexing haven't been done yet

I think I just got this working. It was basically replacing `dim_sel_chunk = dim_sel // dim_chunk_len` with `dim_sel_chunk = np.digitize(dim_sel, self.offsets[1:])`.

Also boolean indexing, which was indeed quite easy: essentially `np.add.reduceat`. Both cases did end up with a fair amount of copied code, though. However, I ran into some poor support for boolean masks (#1490), so I am not sure how commonly used this is.
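Both tricks can be sketched with plain NumPy (illustrative only; `offsets` here stands in for the per-dimension chunk-boundary array, mirroring `self.offsets`):

```python
import numpy as np

chunk_lens = np.array([1, 2, 2, 5])  # variable chunks over a length-10 axis
offsets = np.concatenate([[0], np.cumsum(chunk_lens)])  # [0, 1, 3, 5, 10]

# Fancy (integer) indexing: with uniform chunks the owning chunk is
# `dim_sel // dim_chunk_len`; with variable chunks it becomes a bin lookup.
dim_sel = np.array([0, 2, 4, 9])
dim_sel_chunk = np.digitize(dim_sel, offsets[1:])  # chunk index per element

# Boolean indexing: count selected elements per chunk with reduceat
# (cast to int first, since np.add on booleans saturates at True).
mask = np.arange(10) % 2 == 0  # select even positions
per_chunk = np.add.reduceat(mask.astype(int), offsets[:-1])
```

Here `dim_sel_chunk` comes out as `[0, 1, 2, 3]` and `per_chunk` as `[1, 1, 1, 2]`, which is enough information to route each selected element to its variable-sized chunk.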

@agoodm
Contributor

Great to see this happening, @martindurant! @alex-s-gardner and I are very interested in seeing the progress of this (zarr-developers/zarr-specs#138).

@martindurant
Member (Author)

@ivirshup, the current code does break a significant number of tests with standard indexing; it seems to always happen in the last chunk. Probably happening in `_process_chunk`, I think.

@joshmoore
Member

Just a quick note before I dive in more. ZEP0003 has the line "It would be reasonable to wish to backport the feature to v2.", and this is on v2. I'll just point out we really don't have the mechanisms to introduce something like this in v2: no process for updating the spec, and no way for implementations to know they're getting something they can't support.


@martindurant
Member (Author)

@joshmoore - the index code should be exactly the same. It's in v2 exactly because we didn't have to update the spec/metadata handling code to get it to work. Actually, it would be useful to kerchunk as-is, given that it would be for datasets that simply cannot otherwise be represented by zarr. But yes, the aim is to make this an implementation for v3. I still think it would be useful for v2 to be able to read such datasets, however.

@joshmoore
Member

> I still think it would be useful for v2 to be able to read such datasets, however.

Definitely no objections that it would be useful. But I'm concerned about the cost on all the other implementations. Correct me if I'm wrong, but they'd fall over spectacularly, no?

@martindurant
Member (Author)

> Correct me if I'm wrong, but they'd fall over spectacularly, no?

They would fall over, yes. It would be an early error before loading any data. However, if this were a kerchunk thing, they probably can't open the store object anyway.

@codecov
codecov bot commented Aug 14, 2023 (edited)

Codecov Report

Attention: Patch coverage is 97.09302% with 5 lines in your changes missing coverage. Please review.

Project coverage is 99.96%. Comparing base (f542fca) to head (f8d9bf2).
Report is 163 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| zarr/indexing.py | 96.42% | 5 Missing ⚠️ |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##              main    #1483      +/-   ##
===========================================
- Coverage   100.00%   99.96%   -0.04%
===========================================
  Files           37       37
  Lines        14729    14889     +160
===========================================
+ Hits         14729    14884     +155
- Misses           0        5       +5
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| zarr/core.py | 100.00% <100.00%> (ø) |
| zarr/tests/test_indexing.py | 100.00% <100.00%> (ø) |
| zarr/util.py | 100.00% <100.00%> (ø) |
| zarr/indexing.py | 99.25% <96.42%> (-0.75%) ⬇️ |

... and 1 file with indirect coverage changes

@martindurant
Member (Author)

martindurant commented Aug 14, 2023 (edited)

Just to point out: this works transparently with v3; we apparently DO NOT VALIDATE the chunks property.

```python
store_ = {}
store = zarr._storage.v3.KVStoreV3(store_)
z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None)
# z = zarr.open_array(store_, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None, zarr_version=3)  # same as above
z[:] = (np.arange(100)).reshape((10, 10))
assert (z[:] == np.arange(100).reshape((10, 10))).all()
```

metadata:

```python
>>> print(store_["meta/root/array.array.json"].decode())
{
    "attributes": {},
    "chunk_grid": {
        "chunk_shape": [
            [
                1,
                2,
                2,
                5
            ],
            5
        ],
        "separator": "/",
        "type": "regular"
    },
    "chunk_memory_layout": "C",
    "data_type": "<i4",
    "dimension_separator": "/",
    "extensions": [],
    "fill_value": 0,
    "shape": [
        10,
        10
    ]
}
```

@okz
okz commented Sep 15, 2023

Does this implementation already work with kerchunk?

@martindurant
Member (Author)

> Does this implementation already work with kerchunk?

In principle, the implementation works, but there is currently no code in kerchunk to produce zarr metadata that would use it.

@pbranson

+1 to this PR, would be great if this worked in V2


@martindurant
Member (Author)

> would be great if this worked in V2

It does work for V2! Some things will break, as noted here, e.g., `array.info()`, but `dask.array.from_zarr` should work as is.

@meggart
Member

This feature would be very interesting to have from the Julia side as well, and I would very much be in favor of having it, even as some kind of patch for v2, so that it is in a usable state earlier. I drafted an implementation for Zarr.jl (JuliaIO/Zarr.jl#126), but I think it is not compatible with this implementation. The main detail I stumbled across was that without ZEP0003 there was a guarantee that every chunk in a store would represent a chunk of exactly the same shape when decompressed. Even for incomplete chunks at the array boundaries this was achieved through padding with fill values that are ignored during reading, but allow zero-cost resizing.

This allowed easy re-use of compression/decompression buffers when reading/writing multiple chunks sequentially. My current Julia implementation keeps this behavior, in that it always compresses chunks of the maximum chunk size by padding with fill values, so that the invariant mentioned above is maintained.

Maybe it would be a good idea to clarify in the ZEP text the consequences this has for the uniformity of an uncompressed chunk. To illustrate this with a small example:

```python
import zarr
z1 = zarr.create((10,), chunks=(3,), dtype='i4', fill_value=1, store="./chunktests.zarr", compressor=None)
z2 = zarr.create((10,), chunks=([3, 3, 3, 1],), dtype='i4', fill_value=1, store="./chunktests.zarr", compressor=None)
```

The question is whether these two arrays should be equivalent and store the same binary information. I think with the current implementation they would not, because in the last chunk z1 would pad fill values while z2 would not.
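To make the byte-level difference concrete, here is a plain-NumPy sketch (illustrative only; it does not use zarr) of what each scheme would store for the final chunk of the example above, with no compressor:

```python
import numpy as np

data = np.arange(10, dtype="<i4")
fill_value = 1

# z1 (regular chunks of 3): the final chunk holds one real element but is
# padded with fill_value to the full chunk shape before encoding.
padded_last = np.full(3, fill_value, dtype="<i4")
padded_last[0] = data[9]

# z2 (explicit chunk sizes [3, 3, 3, 1]): the final chunk is stored exactly.
exact_last = data[9:]

padded_bytes = len(padded_last.tobytes())  # 12 bytes stored
exact_bytes = len(exact_last.tobytes())    # 4 bytes stored
```

The two stores hold different bytes for the same logical array, which is the uniformity question the ZEP text could spell out.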

@meggart
Member

Thinking more about this, I realized that your non-padding implementation is the only one that would work well together with kerchunk, so this is definitely the way to go. We might still want to mention this point somewhere in the ZEP draft.

@martindurant
Member (Author)

Indeed, the kerchunk workflow is very important to me, if not to everyone. Furthermore, we read multiple chunks concurrently and in the future will decompress in parallel too, which means you can't easily reuse buffers. In Python's memory model, the buffer will not actually be released to the OS for a while anyway, so maybe it's no win for Python at all. In the final, best implementation, we would even like to read or decompress directly into the target array's memory buffer for contiguous cases.

@ivirshup
ivirshup commented Oct 16, 2023 (edited)

I've started a POC on top of this POC for `kerchunk.combine.concatenate_arrays` at fsspec/kerchunk#374.

I'm also pretty sure I can only get this working without padding.


@sanketverma1704
Member

sanketverma1704 commented Oct 24, 2023 (edited)

Hi @martindurant, thanks for sending the PR.

I have requested reviews from the Zarr-Python devs. Additionally, if anyone from @zarr-developers/python-core-devs can review the PR, I'd be grateful.

Also, @normanrz, if you could review this PR, that'd be great.

@jhamman
Member

Meta comment: we are working on a fresh Array implementation that covers both v2 and v3 arrays (see #1583 for details). We are actively seeking input/participation from those invested in ZEP003. Variable chunking is likely to require some changes to the codec API, and we want to get your input as we roll out the new design. cc @d-v-b and @normanrz.

xref: #1595


@jhamman
Member

@martindurant and others - now that v3 has come together, I'd be very interested to see this move to that branch. Who is interested in trying out an implementation on top of the new array api?

@jhamman added the "V2" (affects the v2 branch) label Oct 11, 2024
@TomNicholas
Member

I would be - this is something that @abarciauskas-bgse and I want to get funding to work on... Don't let that stop anyone else having a go in the meantime though!


@martindurant
Member (Author)

It's nice to see people wanting to see this move ahead. I'm not familiar enough with the v3 code to know how easy it is to port the partial implementation here.

Before progressing, do we need action on the original ZEP? It has not been accepted, and has a specific prescription on how to store the chunk sizes.

@jhamman
Member

Now would be a great time to start thinking about how this will work with Zarr-Python 3.


@martindurant
Member (Author)

@jhamman: some things are still up for general discussion, particularly how to store the chunk sizes, which is a change to the spec (https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#regular-grids; I suppose we need a new grid class?).

@d-v-b
Contributor

How about `{"name": "rectilinear", "configuration": {"chunk_shapes": [[2, 5], [1, 3]]}}`?

@martindurant
Member (Author)

@d-v-b: I'm OK with that, and it closely matches this draft code. Probably any of those lists can be replaced by a single int in the case that the chunking happens to be regular on that axis. I suppose it's enough to get going, anyway.
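For illustration only, the mixed form suggested here might look like the following (hypothetical; not part of any accepted spec), where the second axis is regular with chunk length 3:

```json
{
    "name": "rectilinear",
    "configuration": {
        "chunk_shapes": [
            [2, 5],
            3
        ]
    }
}
```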
