Prototype of new DType interface #2750


Closed

Conversation

@nenb

This is a PoC for a new standalone interface for managing dtypes in the zarr-python codebase.

Among other things, it tries to provide support for adding dtype extensions, as outlined by the Zarr V3 spec.

I have provided several examples below of how it is currently intended to be used. I will also add a number of comments to the files in this PR to help further clarify intent.

Examples

Core dtype registration

The following example demonstrates how to register a built-in dtype in the core codebase:

    from zarr.core.dtype import ZarrDType
    from zarr.registry import register_v3dtype


    class Float16(ZarrDType):
        zarr_spec_format = "3"
        experimental = False
        endianness = "little"
        byte_count = 2
        to_numpy = np.dtype('float16')


    register_v3dtype(Float16)

Entrypoint extension

The following example demonstrates how users can register a new bfloat16 dtype for Zarr. This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring consistency with other extensions. The code below would typically be part of a Python package that specifies the entrypoints for the extension:

    import ml_dtypes
    import numpy as np

    from zarr.core.dtype import ZarrDType


    # User inherits from ZarrDType when creating their dtype
    class Bfloat16(ZarrDType):
        zarr_spec_format = "3"
        experimental = True
        endianness = "little"
        byte_count = 2
        to_numpy = np.dtype('bfloat16')  # Enabled by importing ml_dtypes
        configuration_v3 = {
            "version": "example_value",
            "author": "example_value",
            "ml_dtypes_version": "example_value",
        }

dtype lookup

The following examples demonstrate how to perform a lookup for the relevant ZarrDType inside the zarr-python codebase, given a string that matches the Zarr specification ID for the dtype, or a numpy dtype object:

    from zarr.registry import get_v3dtype_class, get_v3dtype_class_from_numpy

    get_v3dtype_class('complex64')  # returns little-endian Complex64 ZarrDType
    get_v3dtype_class('not_registered_dtype')  # ValueError

    get_v3dtype_class_from_numpy('>i2')  # returns big-endian Int16 ZarrDType
    get_v3dtype_class_from_numpy(np.dtype('float32'))  # returns little-endian Float32 ZarrDType
    get_v3dtype_class_from_numpy('i10')  # ValueError

String dtypes

The following indicates one possibility for supporting variable-length strings, via the entrypoint mechanism as in the previous example. The Apache Arrow specification does not currently include a dtype for fixed-length strings (only for fixed-length bytes), so I am using string here to refer implicitly to variable-length string data (there may be some subtleties with codecs that mean this needs to be refined further):

    import numpy as np

    from zarr.core.dtype import ZarrDType

    # User inherits from ZarrDType when creating their dtype
    try:
        to_numpy = np.dtypes.StringDType()
    except AttributeError:
        to_numpy = np.dtypes.ObjectDType()


    class String(ZarrDType):
        zarr_spec_format = "3"
        experimental = True
        endianness = 'little'
        byte_count = None  # None is defined to mean variable
        to_numpy = to_numpy

int4 dtype

There is currently considerable interest in the AI community in 'quantising' models - storing models at reduced precision, while minimising loss of information content. There are a number of sub-byte dtypes that the community is using, e.g. int4. Unfortunately numpy does not currently have much support for sub-byte dtypes. However, raw bits/bytes can still be held in a numpy array and then passed (in a zero-copy way) to something like pytorch, which can handle them more conveniently:

    import numpy as np

    from zarr.core.dtype import ZarrDType


    # User inherits from ZarrDType when creating their dtype
    class Int4(ZarrDType):
        zarr_spec_format = "3"
        experimental = True
        endianness = 'little'
        byte_count = 1  # this is ugly, but I could change this from byte_count to bit_count if there was consensus
        to_numpy = np.dtype('B')  # could also be np.dtype('V1'), but this would prevent bit-twiddling
        configuration_v3 = {
            "version": "example_value",
            "author": "example_value",
        }

@@ -0,0 +1,204 @@
"""
Author

Mostly the same information as what is included in the PR conversation.

import numpy as np


# perhaps over-complicating, but I don't want to allow the attributes to be patched
Author

Implementation detail: I decided to try to freeze the class attributes after a dtype has been created. I used metaclasses for this. It's not essential.
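For illustration, a minimal sketch of what such a freezing metaclass could look like (an assumed implementation; the actual FrozenClassVariables in this PR may differ):

```python
# Sketch: a metaclass that rejects re-assignment of any attribute that
# was already set in the class body, so dtype definitions stay immutable.
class FrozenClassVariables(type):
    def __setattr__(cls, name, value):
        if name in cls.__dict__:
            raise AttributeError(f"cannot modify {name!r} after class creation")
        super().__setattr__(name, value)


class Float16(metaclass=FrozenClassVariables):
    byte_count = 2
```

Attempting `Float16.byte_count = 4` afterwards then raises AttributeError.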

)


class ZarrDType(metaclass=FrozenClassVariables):
Author

The most important thing here IMO is that ZarrDType should contain all attributes required when introspecting dtypes at runtime.

I would like to replace all statements like np.dtype.kind in ["S", "U"] or np.dtype.itemsize > 0 in the codebase with statements like ZarrDType.byte_count > 0 etc. Basically, replacing the numpy dtype API with a new zarr-specific API.

I have included the attributes that I currently believe are necessary. But some may be unnecessary, and I may have forgotten others. It's a first attempt!


zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
experimental: bool  # is this in the core spec or not
endianness: Literal[
Author

Zarr V3 has made the decision to use a codec for endianness. Endianness is not to be attached to the dtype. This creates some problems for the Zarr API, which is still linked to numpy's API in a number of ways, including the ability to create in-memory arrays of arbitrary endianness.

Currently, I think that the practical solution is for zarr-python to have dtypes that distinguish between big and little endianness in memory, but that when serialised to disk, always serialise the little endian dtype.

I can elaborate on this with examples if helpful, but basically, endianness would just be an implementation detail for zarr-python that would allow it to track the endianness of an object in memory, and it wouldn't actually be used when serialising to disk.

    "big", "little", None
]  # None indicates not defined i.e. single byte or byte strings
byte_count: int | None  # None indicates variable count
to_numpy: np.dtype[
Author

See the bfloat16 example about how this might require new packages to be installed.

    Any
]  # may involve installing a numpy extension e.g. ml_dtypes

configuration_v3: (
Author

Wasn't clear to me how this is intended to be used in the spec...

Member

Basically, dtypes can be represented in the json metadata as a short-hand (str) or a dict ({ "name": str, "configuration": None | { ... } }). The configuration key is optional and could be used for dtypes that need additional configuration. If there is no configuration key, the short-hand version is equivalent to the dict with just a name key.

Member

For example, bfloat16 is equivalent to {"name": "bfloat16"}.

Author

Ah, interesting, thanks!

My current thinking is that every dtype that is not in the core spec should include a configuration key. I would like to introduce a convention where extension dtypes also provide metadata like 'author', 'extension_version', etc., to give the best chance of reproducibility/re-use in the future. At least, until an extension dtype becomes a core dtype.

Is the configuration key an appropriate location for such metadata?

Member

Assigning names for extensions, such as dtypes, is something that the Zarr core spec should define to coordinate between the different Zarr implementations. However, the spec currently has some gaps in providing clear guidance for doing that. In the Zarr steering council, we are currently evaluating different options that we will propose to the community shortly. Our goal is to achieve a naming mechanism that avoids naming conflicts. Our current favorite is to have 2 types of names:

  • URI-based names, e.g. https://nenb.github.io/bfloat16, which can be freely used by anybody who reasonably controls the URI. The URI doesn't need to resolve to anything; it is just a name. However, it makes sense to have some useful information under the URI, e.g. a spec document.
  • Raw names, e.g. bfloat16, which would be assigned through a centralized registry (e.g. a git repo) through a to-be-defined process. This will entail a bit more process than the URI-based names and will come with some expectations w.r.t. backwards compatibility.

Is the configuration key an appropriate location for such metadata?

Not necessarily. I think this information would be better placed in specification documents of the dtypes.


_zarr_spec_identifier: str  # implementation detail used to map to core spec

def __init_subclass__(  # enforces all required fields are set and basic sanity checks
Author

Implementation detail: I thought it would be helpful to prevent class creation unless all attributes were defined.
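A simplified sketch of the pattern (attribute names taken from this PoC; the validation logic here is assumed, not the PR's actual code):

```python
# Sketch: __init_subclass__ runs at subclass-creation time, so missing
# required attributes fail fast rather than at first use.
REQUIRED_ATTRS = ("zarr_spec_format", "experimental", "endianness",
                  "byte_count", "to_numpy")


class ZarrDTypeSketch:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        missing = [a for a in REQUIRED_ATTRS if not hasattr(cls, a)]
        if missing:
            raise TypeError(f"{cls.__name__} is missing: {missing}")
```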

if not hasattr(cls, "configuration_v3"):
    cls.configuration_v3 = None

cls._zarr_spec_identifier = (
Author

Again, the way I am proposing endianness is just as an implementation detail for zarr-python to track the endianness of in-memory objects. When serialised to disk, this big_ prefix would always be removed.


# TODO: add further checks
@classmethod
def _validate(cls):
Author

Just an example of the sort of validation that could happen.

from importlib.metadata import entry_points as get_entry_points
from typing import TYPE_CHECKING, Any, Generic, TypeVar

import numpy as np
Author

The changes in this module are a bit hacky. They are mainly just to show how something might work - the actual details can certainly be changed quite a bit.

self.register(e.load())
cls = e.load()

if hasattr(cls, "_zarr_spec_identifier") and hasattr(cls, "to_numpy"):
Author

The motivation here is that I don't want a user to be able to add a dtype whose numpy representation overrides an already registered dtype, e.g. MyFakeBool should not be able to override the core Bool dtype.

The exceptions are for the V and B dtypes, because these are raw byte strings, and multiple dtypes might be mapped to these, e.g. int2 and int4 might both be mapped to the 'B' dtype in numpy because numpy has no support for sub-byte dtypes.
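That check could be sketched as follows (a hypothetical standalone helper; in the PoC the logic sits inside the registry itself):

```python
import numpy as np


# Sketch: refuse to register a dtype whose numpy representation is
# already claimed, except for the raw-byte containers 'V' and 'B',
# which several sub-byte zarr dtypes may legitimately share.
def check_numpy_not_taken(registry: dict, candidate: np.dtype) -> None:
    if candidate.char in ("V", "B"):
        return
    if candidate in registry.values():
        raise ValueError(f"numpy dtype {candidate} is already registered")
```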

if class_registry_key is None:
    self[fully_qualified_name(cls)] = cls
else:
    if class_registry_key in self:
Author

Again, a bit hacky, but I don't want a user to be able to override a core dtype.

__v3_dtype_registry.register(cls, class_registry_key=cls._zarr_spec_identifier)


def register_v2dtype(cls: type[ZarrDType]) -> None:
Author

Ideally, we could also backport this to Zarr V2? I'll admit that I haven't really researched this much though. Putting it here just to start a discussion.

Member

I don't think there is a strong desire in the community to further evolve the v2 spec.


# TODO: merge the get_vXdtype_class_ functions
# these can be used instead of the various parse_X functions (hopefully)
def get_v3dtype_class(dtype: str) -> type[ZarrDType]:
Author

See the examples for how these can be used. As in the comment, the idea is for these to replace the various parse_X functions in the codebase.

@d-v-b
Contributor

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library, and I will try to give this close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?


class ZarrDType(metaclass=FrozenClassVariables):

    zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
Member

Suggested change:
- zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
+ zarr_format: Literal["2", "3"]  # the version of the zarr spec used

    dict | None
)  # TODO: understand better how this is recommended by the spec

_zarr_spec_identifier: str  # implementation detail used to map to core spec
Member

I guess this would be mandatory? Or, how would the identifier for the metadata be specified otherwise?

Author

nenb commented Jan 23, 2025 (edited)

TL;DR
This isn't an essential detail, and the implementation can probably be considerably improved if there is consensus around adding a new dtype interface. I've included some details below about how I ended up with this attribute if you are interested though.

Details
At the moment, this is generated from the class name (see the logic in __init_subclass__). The user doesn't specify it; rather, the user specifies the class name and the identifier is generated from it (potentially prefixed with big_).

Example

    class Float16(ZarrDType):
        zarr_spec_format = "3"
        experimental = False
        endianness = "big"
        byte_count = 2
        to_numpy = np.dtype('float16')

This would generate big_float16 for _zarr_spec_identifier.


It's probably a bit clumsy in its current implementation. I ended up with this pattern i) as a way of tracking the in-memory endianness of the dtype and ii) to make sure the class name stays consistent with the identifier - the class name pops up a few times in the entrypoint registration logic, and I didn't want a situation where a class name could have a different value to the spec identifier.

Obviously, the convention is that a class needs to be named according to how the dtype is identified in the spec.

Member

As I wrote in my other comment, we generally want to follow some guidelines around how and what metadata goes into the zarr.json files to achieve interoperability with other libraries/languages that implement Zarr. Unfortunately, some of these guidelines still need to be specced out.

In any case, maybe I missed it, but why do we need to persist the endianness as part of the dtype? Shouldn't that be handled by the appropriate array-to-bytes codec (e.g. bytes)? The runtime endianness might need to be handled through runtime configuration, see the ArrayConfig class.

Author

The confusion is my fault - I should probably have made it more clear what metadata I was proposing to serialise to disk.

I am proposing to serialise metadata to disk exactly as the current V3 specs outline, i.e. only the dtype identifier and the configuration object.

(But see our discussion above on configuration; I am still learning how it is intended to be used.)

All other attributes here (endianness, experimental, etc.) are implementation details to express intent to zarr-python at runtime. This is already done by the numpy dtype API in zarr-python, e.g. the use of statements like np.dtype.kind in ["U", "S"]. But this numpy API has limitations, e.g. it can't recognise new dtypes like bfloat16 correctly. Which is a large reason why I am proposing that zarr-python have its own dtype interface in this PoC.

In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype?

This was a suggestion on my part, and may not actually turn out to be helpful. A lot of zarr code does things like dtype='>i2' - a good example from the current issues. There will need to be a way of tracking this runtime endianness in Zarr V3.

As you pointed out, it seems likely that this could be handled through a runtime configuration (ArrayConfig). But it felt more natural (to me) to keep track of this information on the dtype itself.

It might be the case that I need to flesh out a complete implementation to see what both options look like. But I think it seems likely that there will need to be some way to keep track of the runtime endianness.

And to be clear, in this dtype implementation, I'm not proposing to serialise the information in the endianness attribute to disk in the zarr metadata. It would purely be an implementation detail that zarr-python would use to keep track of runtime endianness.

@nenb
Author

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library, and I will try to give this close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?

@d-v-b Thanks for being so receptive! I'm happy to attend and give an overview.

What's the best format to present at the meeting? (I haven't attended before.) If I give a 5 minute overview (including some motivation from the AI side of things), and then leave 10 minutes for questions, would that work? Is there anything specific that you think is worth focusing on, e.g. the extension mechanism?

@d-v-b
Contributor

@nenb that plan sounds good. I'm looking forward to it!


@nenb
Author

@d-v-b @normanrz Thanks so much for the support during today's meeting, I'm very excited about the potential for zarr to help accelerate the AI field!

My current thinking is to close this PR (as it was just intended as an illustration of what I was hoping to gain support for), and then sketch out how to proceed elsewhere. Would you agree?

If so, could you give me a few pointers about how best to support coordination of this work please e.g. where to document a proposed list of tasks, who needs to be tagged etc.?

As mentioned, I'm very flexible as to how the work proceeds! My main interests are i) allowing users to register their own dtypes, ii) decoupling the zarr-python internal dtype API from the numpy dtype API (because I think the numpy dtype API is not appropriate for AI work at the moment) and iii) trying to achieve these things in the next couple of weeks.

I know this last point is a little artificial, but as mentioned during the meeting earlier, a lot of this field moves fast at the moment!


@joshmoore
Member

A heads up of a discussion around ZEP9 & extension naming:

https://ossci.zulipchat.com/#narrow/channel/423692-Zarr/topic/ZEP9.3A.20extension.20points

/me reminds himself yet again to look for a bot to do these cross-linkings...


@d-v-b
Contributor

d-v-b commented Feb 21, 2025 (edited)

@nenb could you explain the motivation for centering the design around classes that have no instance attributes? E.g., you have

    class Float16(ZarrDType):
        zarr_spec_format = "3"
        experimental = False
        endianness = "little"
        byte_count = 2
        to_numpy = np.dtype('float16')

With this design, __init__ is a no-op, so from what I can tell there's not really a difference between Float16 and Float16().

An alternative implementation would be:

    float16 = ZarrDType(zarr_spec_format=3, experimental=False, ...)

I'm guessing there was a reason why you didn't opt for the "specify the dtype in __init__" approach?


@nenb
Author

@d-v-b
TL;DR
I don't remember there being anything important behind this. From memory, I was just trying to hack something together quickly, and I think this is what I ended up with. I had a couple of 'requirements' in mind, which are described in detail below, and satisfying these was more important to me than any of the (throwaway) code that I wrote here.

Details
Requirements that I had:

  • once a dtype has been created, don't allow patching the attributes later
  • don't allow creation of a dtype instance with invalid attributes or with missing required attributes
  • make sure that the dtype can be discovered via the current entrypoint mechanism

For the first point, I thought it would lead to weird and confusing situations if users were able to (for example) change the byte_count to a different value after creation. I thought it best to make it as difficult as possible to modify the dtypes after creation, because they are fundamental to a lot of behaviour. (Personal choice)

For the second point, this was just my attempt at re-implementing Pydantic with metaclasses and convoluted code. 🤡 I think it's useful though to prevent creating/importing bad dtype classes, for the same reason I mentioned in the previous paragraph.

I didn't want to make many changes to the entrypoint code, as I felt it would distract from what I was trying to show here about the dtypes. This is probably why I ended up with the weird design for classes that you mentioned i.e. I just kept hacking until I got everything to work with the current entrypoint mechanism. Obviously, this is not a requirement when doing this work properly.

I would personally be happy with any code that can satisfy the above three requirements. And I'm happy to discuss any of these requirements if they are considered unhelpful for whatever reason!


@d-v-b
Contributor

@nenb what are your thoughts on the following things:

  • handling parametric dtypes, like [datetime64[ms], datetime64[s], ...], [S1, S2, ...]? For the datetimes the set of units is small enough that these can be created in advance, but the fixed length types like <S | V><length> are harder to pre-generate. If a user shows up with a concrete numpy dtype like S2, it's possible to create the corresponding class at runtime with something like type(f"FixedLengthString{length}", (ZarrDType,), {...kwargs}), but this feels a bit clunky compared to creating an instance of a variable-length string class, e.g. VarLenString(length=2).
  • I'm not sure we should flag experimental vs non-experimental dtypes. Any experimentalness should be represented by the context in which the dtype is used, not the dtype itself, because e.g. extension dtypes will mature, and then cease to be experimental, and that should not require changing the API of the dtype object itself.
  • I'm not sure the zarr format should be part of the dtype metadata. We should instead have a zarr-format-independent representation of dtypes, and treat the zarr format as a serialization issue.
  • Endianness is tricky because it isn't part of the dtype model of zarr v3 -- in v3, endianness is specified by a codec, not the dtype itself. So from a v3 POV, little-endian uint16 and big-endian uint16 specify two arrays with the same dtype but different codecs. This might mean that any dtype object has to take endianness as a parameter when serializing or generating numpy dtypes.

@nenb
Author

nenb commented Feb 26, 2025 (edited)

@d-v-b (I'll answer the rest of this message tomorrow)

handling parametric dtypes ...

Looking at the Arrow data types (Arrow is reasonably mature at this point; I think that it's probably reasonable to adopt their design decisions, unless there is a clear reason not to?), it looks like having a type parameter is sensible. I don't think I gave this enough thought previously. I think an explicit reference to type parameters in the Zarr v3 spec is also lacking. What are your thoughts about type parameters?

A potential challenge that I see would be in naming these parametric dtypes. Do you see issues with assigning the parameters in the configuration field, e.g.

    {
      "data_type": {
        "name": "DateTime",
        "configuration": {
          "unit": "ms",
          "timezone": "UTC"
        }
      }
    }

Something similar could happen for fixed-width strings. Is it critical that a dtype be identified by its name, or is it allowed to stuff identifying information in configuration like this?

The alternative seems a rather crude naming scheme where we attach the parameters to the type name.

it's possible to create the corresponding class at runtime ...

For (numpy) data types such as fixed-width strings and bytestrings (codes U and S respectively), I was leaning towards creating them at runtime. I hadn't considered how to handle parametric dtypes more generally (see previous paragraph), which was an oversight on my part.

I personally don't see an issue with creating parametric dtypes at runtime, but open to change my mind on this.
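The runtime-creation idea above can be sketched like this (ZarrDType here is a bare stand-in for the PoC base class, and the attribute set is illustrative):

```python
import numpy as np


class ZarrDType:  # stand-in for the PoC base class
    pass


# Sketch: generate a fixed-width bytestring dtype class on demand with
# three-argument type(), as discussed in the bullet above.
def make_fixed_length_bytes(length: int) -> type:
    return type(
        f"FixedLengthBytes{length}",
        (ZarrDType,),
        {"byte_count": length, "to_numpy": np.dtype(f"S{length}")},
    )
```

make_fixed_length_bytes(2) then behaves like a hand-written class for the numpy S2 dtype.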

@d-v-b
Contributor

Looking at the Arrow data types (Arrow is reasonably mature at this point; I think that it's probably reasonable to adopt their design decisions, unless there is a clear reason not to?), it looks like having a type parameter is sensible. I don't think I gave this enough thought previously. I think an explicit reference to type parameters in the Zarr v3 spec is also lacking. What are your thoughts about type parameters?

You're right that the Zarr V3 spec is not really clear about parameters for data types. On one hand it says that externally-defined data types can be identified with a {"name": <name>, "configuration": {...}} JSON object, and as you note the configuration is a natural place to put type parameters. On the other hand, the v3 spec already defines a parametric data type (r*, fixed-size bytes, nearly the same as the numpy V<num_bytes> dtype), and in this case the string name contains the parameters. So it seems like "stringly typed" and typed are both on the table.

In terms of personal preference I would lean toward putting parameters in a configuration object, and it seems like we are aligned there. But if someone really wants superficial similarity to numpy, then this would argue for baking parameters into the string name of the dtype.

From an API POV, each variation of a parametrized type could be a separate class, or instances of a class:

    class S1:
        capacity: ClassVar[Literal[1]] = 1

    class S2:
        capacity: ClassVar[Literal[2]] = 2

    s1 = S1()
    s2 = S2()
    ...

vs

    @dataclass
    class S:
        capacity: int

    s1 = S(capacity=1)
    s2 = S(capacity=2)

I'm leaning toward the second formulation, but I'd be open to feedback on whether the first is better / preferred.

@nenb
Author

nenb commented Feb 26, 2025 (edited)

I'm leaning toward the second formulation, but I'd be open to feedback on whether the first is better / preferred.

I vote for this as well. If there is more than one type parameter, the first formulation could be confusing.

I'm not sure we should flag experimental vs non experimental dtypes. Any experimentalness should be represented by the context in which the dtype is used, not the dtype itself, because e.g. extension dtypes will mature, and then cease to be experimental, and that should not require changing the API of the dtype object itself.

I had in mind that this would be a private part of the dtype API, so that someone could inspect a dtype at runtime and know whether or not it was part of the spec. Let's say I wanted to write code that only applied to experimental dtypes (e.g. emit warnings). How would code be able to infer if a dtype was experimental or not at runtime? Remember that the definition of what is a core type can change from one day to the next, but the (zarr-python) code itself will still be the same. The only way I can solve this is for the creator to signal the intent via an attribute like experimental. Perhaps what I'm saying is that experimentalness is a function of how the dtype is created rather than how it is used? This might need more discussion though; I'm not terribly confident in what I've written.

I'm not sure the zarr format should be part of the dtype metadata. We should instead have a zarr-format-independent representation of dtypes, and treat the zarr format as a serialization issue.

I think this is another variation of the last paragraph. How do I know what format a dtype belongs to at runtime? I could imagine this being useful at some point in the future (but maybe in the very distant future 😆). I don't have strong opinions on this though, and if we were able to infer the format of a dtype at runtime another way, I would also be happy with that.

Endianness is tricky because it isn't part of the dtype model of zarr v3 -- in v3, endianness is specified by a codec, not the dtype itself. So from a v3 POV, little-endian uint16 and big-endian uint16 specify two arrays with the same dtype but different codecs. This might mean that any dtype object has to take endianness as a parameter when serializing or generating numpy dtypes.

Yeah, I remember this causing me quite a few headaches. I think given all the zarr-python code currently out there, folks are going to be trying to do things like <i2 for quite a while into the future. I added the endianness attribute as a (private) part of the API, to try and help zarr-python developers support this use-case. Basically, zarr-python could use this attribute to track the endianness of the dtype in memory. So, when I create my array in zarr-python using >i2, zarr-python is still able to support this use-case by creating a big-endian dtype in memory. But, when serialising to disk, it would follow the spec and ultimately store this information in the codec. It would basically be an implementation detail of zarr-python.

I think I remember you pointing out in earlier issues that using codecs in this way diverges from the numpy dtype model, and this makes it difficult for zarr-python, given its close historical relationship with numpy. Adding the attribute on the dtype was my way of trying to support this. Again, it would ultimately be serialised to disk as part of the codec, but it would be part of the (Python) dtype private API as well, to continue supporting use cases like creating arrays with >i2 and <i2.

Let me know if this isn't clear. Given the number of issues related to endianness for v3 in zarr-python, I would love to try and solve as many of these as possible with a new dtype API.

@d-v-b
Contributor

With the spiritual successor to this PR merged, can we close this?


@nenb
Author

Closing after merge of #2874.

nenb closed this Jul 16, 2025

Reviewers

@normanrz left review comments

4 participants

@nenb @d-v-b @joshmoore @normanrz