Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

refactor v3 data types#2874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
d-v-b merged 164 commits intozarr-developers:mainfromd-v-b:feat/fixed-length-strings
Jun 16, 2025

Conversation

@d-v-b
Copy link
Contributor

As per#2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 as well with the same data structures, and confine the v2 / v3 differences to the places where they vary.

Inmain,all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, that are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes is probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

norlandrhagen and Jan-Willem reacted with hooray emoji
@github-actionsgithub-actionsbot added the needs release notesAutomatically applied to PRs which haven't added release notes labelFeb 28, 2025
@d-v-b
Copy link
ContributorAuthor

copying a comment@nenb made inthis zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like:jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types likeU* from parametrized instances of those types (likeU3). Defining a class with the character codeU means instances of the class can be initialized with a "length" parameter, and then we can makeU2,U3, etc, as instances of the same class. If instead we bind a concrete numpy dtype as class attributes, we need a separate class forU2,U3, etc, which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.

@nenb
Copy link

nenb commentedMar 3, 2025

Summarising from azulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Makeendianness an attribute on in the dtype instance. This will be an implementation detail used byzarr-python to handle endianness, but won't be part of the dtype on disk (as requuired by the spec).

@d-v-b
Copy link
ContributorAuthor

Summarising from azulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Makeendianness an attribute on in the dtype instance. This will be an implementation detail used byzarr-python to handle endianness, but won't be part of the dtype on disk (as requuired by the spec).

Thanks for the summary! I have implemented the proposed solution.

nenb reacted with thumbs up emoji

@d-v-bd-v-b mentioned this pull requestMar 5, 2025
6 tasks
@d-v-bd-v-b merged commit6798466 intozarr-developers:mainJun 16, 2025
30 checks passed
@TomAugspurger
Copy link
Contributor

Nice work@d-v-b!

TomNicholas reacted with hooray emoji

@jhamman
Copy link
Member

Huge props to@d-v-b for pulling this across the finish line. This was a huge and highly-impactful lift. 🙌 🙌 🙌 🙌

For those that are following this thread, expect this to come out in Zarr 3.1.


@ngoldbaum - thanks for all your work on NumPy. We'd love open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is?

TomNicholas reacted with rocket emojiTomNicholas reacted with eyes emoji

@ilan-gold
Copy link
Contributor

@d-v-b Not sure if it's a bug or not, but here is something interesting.

Run this using, say, zarr 3.0.8:

importzarr,numcodecsz=zarr.open("foo.zarr",zarr_format=2)z.create_array("bar", (10,),dtype=object,filters=[numcodecs.VLenUTF8()],fill_value="")z["bar"][...]# array(['', '', '', '', '', '', '', '', '', ''], dtype=object)

Then after this PR:

importzarr,numcodecsz=zarr.open("foo.zarr")z["bar"][...]# array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType())

i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue.

@d-v-b
Copy link
ContributorAuthor

@d-v-b Not sure if it's a bug or not, but here is something interesting.

Run this using, say, zarr 3.0.8:

importzarr,numcodecsz=zarr.open("foo.zarr",zarr_format=2)z.create_array("bar", (10,),dtype=object,filters=[numcodecs.VLenUTF8()],fill_value="")z["bar"][...]# array(['', '', '', '', '', '', '', '', '', ''], dtype=object)

Then after this PR:

importzarr,numcodecsz=zarr.open("foo.zarr")z["bar"][...]# array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType())

i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue.

Numpy 2.0 introduced a new data type specifically for variable length strings. that's theStringDtype() you are seeing in the numpy array. I suspect that prior to my PR, zarr-python was not usingStringDtype when reading variable-length strings in zarr v2. But it should have been, IMO, so I think the current behavior is an improvement.

@ilan-gold
Copy link
Contributor

Agreed. The only reason it came up was that we were implicitly relying on the nullability of the object dtype in strings. So we'll transition now away from that, I think

@d-v-b
Copy link
ContributorAuthor

I think the numpy string dtype hasnullability semantics, that might be worth investigating

ilan-gold reacted with heart emoji

@dstansby
Copy link
Contributor

Was this intentionally merged into main, and not the 3.1.0 branch?

@d-v-b
Copy link
ContributorAuthor

Was this intentionally merged into main, and not the 3.1.0 branch?

Yes, but we can revisit that decision. I was under the impression that we were going to merge the 3.1 prs directly into main, but if we want to do a 3.0.9 release first, then we can revert this merge.

@dstansby
Copy link
Contributor

👍 - I'll delete the 3.1.0 branch then, and create a 3.0.x branch from before when this was merged.

@ngoldbaum
Copy link

thanks for all your work on NumPy. We'd love open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is?

Sorry for the delay on this. We can schedule a call, maybe?

We have a biweekly NumPy community call you could drop into or we come to as well but unfortunately I need to miss the first half hour of that. Seethe NumPy community calendar for meeting links.

There's also a NumPy developer slack I can send you an invite to.

jhamman reacted with thumbs up emoji

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@joshmoorejoshmoorejoshmoore left review comments

@jnijnijni left review comments

@jhammanjhammanjhamman left review comments

@maxrjonesmaxrjonesmaxrjones left review comments

@rabernatrabernatrabernat approved these changes

@dstansbydstansbyAwaiting requested review from dstansby

+4 more reviewers

@ianhiianhiianhi approved these changes

@TomNicholasTomNicholasTomNicholas left review comments

@ilan-goldilan-goldilan-gold left review comments

@nenbnenbnenb left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

Status: Done

Milestone

3.1.0

Development

Successfully merging this pull request may close these issues.

16 participants

@d-v-b@nenb@joshmoore@dstansby@normanrz@LDeakin@rabernat@jhamman@christine-e-smit@TomNicholas@ianhi@ilan-gold@ngoldbaum@TomAugspurger@jni@maxrjones

[8]ページ先頭

©2009-2025 Movatter.jp