Prototype of new DType interface #2750


Closed

Conversation

@nenb

This is a PoC for a new standalone interface for managing dtypes in the zarr-python codebase.

Among other things, it tries to provide support for adding dtype extensions, as outlined by the Zarr V3 spec.

I have provided several examples below of how it is currently intended to be used. I will also add a number of comments to the files in this PR to help further clarify intent.

Examples

Core dtype registration

The following example demonstrates how to register a built-in dtype in the core codebase:

    from zarr.core.dtype import ZarrDType
    from zarr.registry import register_v3dtype


    class Float16(ZarrDType):
        zarr_spec_format = "3"
        experimental = False
        endianness = "little"
        byte_count = 2
        to_numpy = np.dtype('float16')


    register_v3dtype(Float16)

Entrypoint extension

The following example demonstrates how users can register a new bfloat16 dtype for Zarr. This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring consistency with other extensions. The code below would typically be part of a Python package that specifies the entrypoints for the extension:

    import ml_dtypes
    import numpy as np

    from zarr.core.dtype import ZarrDType


    # User inherits from ZarrDType when creating their dtype
    class Bfloat16(ZarrDType):
        zarr_spec_format = "3"
        experimental = True
        endianness = "little"
        byte_count = 2
        to_numpy = np.dtype('bfloat16')  # Enabled by importing ml_dtypes
        configuration_v3 = {
            "version": "example_value",
            "author": "example_value",
            "ml_dtypes_version": "example_value",
        }

dtype lookup

The following examples demonstrate how to perform a lookup for the relevant ZarrDType inside the zarr-python codebase, given a string that matches the Zarr specification ID for the dtype, or a numpy dtype object:

    from zarr.registry import get_v3dtype_class, get_v3dtype_class_from_numpy

    get_v3dtype_class('complex64')  # returns little-endian Complex64 ZarrDType
    get_v3dtype_class('not_registered_dtype')  # ValueError

    get_v3dtype_class_from_numpy('>i2')  # returns big-endian Int16 ZarrDType
    get_v3dtype_class_from_numpy(np.dtype('float32'))  # returns little-endian Float32 ZarrDType
    get_v3dtype_class_from_numpy('i10')  # ValueError

String dtypes

The following indicates one possibility for supporting variable-length strings, via the entrypoint mechanism as in the previous example. The Apache Arrow specification does not currently include a dtype for fixed-length strings (only for fixed-length bytes), so I am using string here to refer implicitly to variable-length string data (there may be some subtleties with codecs that mean this needs to be refined further):

    import numpy as np

    from zarr.core.dtype import ZarrDType

    # User inherits from ZarrDType when creating their dtype
    try:
        to_numpy = np.dtypes.StringDType()
    except AttributeError:
        to_numpy = np.dtypes.ObjectDType()


    class String(ZarrDType):
        zarr_spec_format = "3"
        experimental = True
        endianness = 'little'
        byte_count = None  # None is defined to mean variable
        to_numpy = to_numpy

int4 dtype

There is currently considerable interest in the AI community in 'quantising' models - storing models at reduced precision, while minimising loss of information content. There are a number of sub-byte dtypes that the community is using, e.g. int4. Unfortunately numpy does not currently have much support for sub-byte dtypes. However, raw bits/bytes can still be held in a numpy array and then passed (in a zero-copy way) to something like pytorch, which can handle them more conveniently:

    import numpy as np

    from zarr.core.dtype import ZarrDType


    # User inherits from ZarrDType when creating their dtype
    class Int4(ZarrDType):
        zarr_spec_format = "3"
        experimental = True
        endianness = 'little'
        byte_count = 1  # this is ugly, but I could change this from byte_count to bit_count if there was consensus
        to_numpy = np.dtype('B')  # could also be np.dtype('V1'), but this would prevent bit-twiddling
        configuration_v3 = {
            "version": "example_value",
            "author": "example_value",
        }

@@ -0,0 +1,204 @@
"""
Author

Mostly the same information as what is included in the PR conversation.

import numpy as np


# perhaps over-complicating, but I don't want to allow the attributes to be patched
Author

Implementation detail: I decided to try to freeze the class attributes after a dtype has been created. I used metaclasses for this. It's not essential.
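For illustration, a minimal sketch of what such a freezing metaclass could look like (an assumed implementation; the actual FrozenClassVariables in this PR may differ):

```python
# Sketch: a metaclass that rejects re-assignment of any attribute that
# was already set in the class body, so dtype definitions stay immutable.
class FrozenClassVariables(type):
    def __setattr__(cls, name, value):
        if name in cls.__dict__:
            raise AttributeError(f"cannot modify {name!r} after class creation")
        super().__setattr__(name, value)


class Float16(metaclass=FrozenClassVariables):
    byte_count = 2
```

Attempting `Float16.byte_count = 4` afterwards then raises AttributeError.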

)


class ZarrDType(metaclass=FrozenClassVariables):
Author

The most important thing here IMO is that ZarrDType should contain all attributes required when introspecting dtypes at runtime.

I would like to replace all statements like np.dtype.kind in ["S", "U"] or np.dtype.itemsize > 0 in the codebase with statements like ZarrDType.byte_count > 0 etc. Basically, replacing the numpy dtype API with a new zarr-specific API.

I have included the attributes that I currently believe are necessary. But some may be unnecessary, and I may have forgotten others. It's a first attempt!


zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
experimental: bool  # is this in the core spec or not
endianness: Literal[
Author

Zarr V3 has made the decision to use a codec for endianness. Endianness is not to be attached to the dtype. This creates some problems for the Zarr API, which is still linked to numpy's API in a number of ways, including the ability to create in-memory arrays of arbitrary endianness.

Currently, I think that the practical solution is for zarr-python to have dtypes that distinguish between big and little endianness in memory, but that when serialised to disk, always serialise the little endian dtype.

I can elaborate on this with examples if helpful, but basically, endianness would just be an implementation detail for zarr-python that would allow it to track the endianness of an object in memory, and it wouldn't actually be used when serialising to disk.

    "big", "little", None
]  # None indicates not defined i.e. single byte or byte strings
byte_count: int | None  # None indicates variable count
to_numpy: np.dtype[
Author

See the bfloat16 example about how this might require new packages to be installed.

    Any
]  # may involve installing a numpy extension e.g. ml_dtypes

configuration_v3: (
Author

Wasn't clear to me how this is intended to be used in the spec...

Member

Basically, dtypes can be represented in the json metadata as a short-hand (str) or a dict ({ "name": str, "configuration": None | { ... } }). The configuration key is optional and could be used for dtypes that need additional configuration. If there is no configuration key, the short-hand version is equivalent to the dict with just a name key.

Member

For example, bfloat16 is equivalent to {"name": "bfloat16"}.

Author

Ah, interesting, thanks!

My current thinking is that every dtype that is not in the core spec should include a configuration key. I would like to introduce a convention where extension dtypes also provide metadata like 'author', 'extension_version', etc., to give the best chance of reproducibility/re-use in the future. At least, until an extension dtype becomes a core dtype.

Is the configuration key an appropriate location for such metadata?

Member

Assigning names for extensions, such as dtypes, is something that the Zarr core spec should define to coordinate between the different Zarr implementations. However, the spec currently has some gaps in providing clear guidance for doing that. In the Zarr steering council, we are currently evaluating different options that we will propose to the community shortly. Our goal is to achieve a naming mechanism that avoids naming conflicts. Our current favorite is to have 2 types of names:

  • URI-based names, e.g. https://nenb.github.io/bfloat16, which can be freely used by anybody who reasonably controls the URI. The URI doesn't need to resolve to anything; it is just a name. However, it makes sense to have some useful information under the URI, e.g. a spec document.
  • Raw names, e.g. bfloat16, which would be assigned through a centralized registry (e.g. a git repo) through a to-be-defined process. This will entail a bit more process than the URI-based names and will come with some expectations w.r.t. backwards compatibility.

Is the configuration key an appropriate location for such metadata?

Not necessarily. I think this information would be better placed in specification documents of the dtypes.


_zarr_spec_identifier: str  # implementation detail used to map to core spec

def __init_subclass__(  # enforces all required fields are set and basic sanity checks
Author

Implementation detail: I thought it would be helpful to prevent class creation unless all attributes were defined.
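A simplified sketch of the pattern (attribute names taken from this PoC; the validation logic here is assumed, not the PR's actual code):

```python
# Sketch: __init_subclass__ runs at subclass-creation time, so missing
# required attributes fail fast rather than at first use.
REQUIRED_ATTRS = ("zarr_spec_format", "experimental", "endianness",
                  "byte_count", "to_numpy")


class ZarrDTypeSketch:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        missing = [a for a in REQUIRED_ATTRS if not hasattr(cls, a)]
        if missing:
            raise TypeError(f"{cls.__name__} is missing: {missing}")
```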

if not hasattr(cls, "configuration_v3"):
    cls.configuration_v3 = None

cls._zarr_spec_identifier = (
Author

Again, the way I am proposing endianness is just as an implementation detail for zarr-python to track the endianness of in-memory objects. When serialised to disk, this big_ prefix would always be removed.


# TODO: add further checks
@classmethod
def _validate(cls):
Author

Just an example of the sort of validation that could happen.

from importlib.metadata import entry_points as get_entry_points
from typing import TYPE_CHECKING, Any, Generic, TypeVar

import numpy as np
Author

The changes in this module are a bit hacky. They are mainly just to show how something might work - the actual details can certainly be changed quite a bit.

self.register(e.load())
cls = e.load()

if hasattr(cls, "_zarr_spec_identifier") and hasattr(cls, "to_numpy"):
Author

The motivation here is that I don't want a user to be able to add a dtype whose numpy representation overrides an already registered dtype, e.g. MyFakeBool should not be able to override the core Bool dtype.

The exceptions are for the V and B dtypes, because these are raw byte strings, and multiple dtypes might be mapped to these, e.g. int2 and int4 might both be mapped to the 'B' dtype in numpy because numpy has no support for sub-byte dtypes.
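That check could be sketched as follows (a hypothetical standalone helper; in the PoC the logic sits inside the registry itself):

```python
import numpy as np


# Sketch: refuse to register a dtype whose numpy representation is
# already claimed, except for the raw-byte containers 'V' and 'B',
# which several sub-byte zarr dtypes may legitimately share.
def check_numpy_not_taken(registry: dict, candidate: np.dtype) -> None:
    if candidate.char in ("V", "B"):
        return
    if candidate in registry.values():
        raise ValueError(f"numpy dtype {candidate} is already registered")
```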

if class_registry_key is None:
    self[fully_qualified_name(cls)] = cls
else:
    if class_registry_key in self:
Author

Again, a bit hacky, but I don't want a user to be able to override a core dtype.

__v3_dtype_registry.register(cls, class_registry_key=cls._zarr_spec_identifier)


def register_v2dtype(cls: type[ZarrDType]) -> None:
Author

Ideally, we could also backport this to Zarr V2? I'll admit that I haven't really researched this much though. Putting it here just to start a discussion.

Member

I don't think there is a strong desire in the community to further evolve the v2 spec.


# TODO: merge the get_vXdtype_class_ functions
# these can be used instead of the various parse_X functions (hopefully)
def get_v3dtype_class(dtype: str) -> type[ZarrDType]:
Author

See the examples for how these can be used. As in the comment, the idea is for these to replace the various parse_X functions in the codebase.

@d-v-b
Contributor

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library, and I will try to give this close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?


class ZarrDType(metaclass=FrozenClassVariables):

    zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
Member

Suggested change:
- zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
+ zarr_format: Literal["2", "3"]  # the version of the zarr spec used

    dict | None
)  # TODO: understand better how this is recommended by the spec

_zarr_spec_identifier: str  # implementation detail used to map to core spec
Member

I guess this would be mandatory? Or, how would the identifier for the metadata be specified otherwise?

Author

nenb commented Jan 23, 2025 (edited)

TL;DR
This isn't an essential detail, and the implementation can probably be considerably improved if there is consensus around adding a new dtype interface. I've included some details below about how I ended up with this attribute if you are interested though.

Details
At the moment, this is generated from the class name (see the logic in __init_subclass__). The user doesn't specify it; rather, the user specifies the class name and the identifier is generated from it (potentially prefixed with big_).

Example

    class Float16(ZarrDType):
        zarr_spec_format = "3"
        experimental = False
        endianness = "big"
        byte_count = 2
        to_numpy = np.dtype('float16')

This would generate big_float16 for _zarr_spec_identifier.


It's probably a bit clumsy in its current implementation. I ended up with this pattern i) as a way of tracking the in-memory endianness of the dtype and ii) to make sure the class name stays consistent with the identifier - the class name pops up a few times in the entrypoint registration logic, and I didn't want a situation where a class name could have a different value to the spec identifier.

Obviously, the convention is that a class needs to be named according to how the dtype is identified in the spec.

Member

As I wrote in my other comment, we generally want to follow some guidelines around how and what metadata goes into the zarr.json files to achieve interoperability with other libraries/languages that implement Zarr. Unfortunately, some of these guidelines still need to be specced out.

In any case, maybe I missed it, but why do we need to persist the endianness as part of the dtype? Shouldn't that be handled by the appropriate array-to-bytes codec (e.g. bytes)? The runtime endianness might need to be handled through runtime configuration, see the ArrayConfig class.

Author

The confusion is my fault - I should probably have made it more clear what metadata I was proposing to serialise to disk.

I am proposing to serialise metadata to disk exactly as the current V3 specs outline, i.e. only the dtype identifier and the configuration object.

(But see our discussion above on configuration; I am still learning how it is intended to be used.)

All other attributes here (endianness, experimental, etc.) are implementation details to express intent to zarr-python at runtime. This is already done by the numpy dtype API in zarr-python, e.g. the use of statements like np.dtype.kind in ["U", "S"]. But this numpy API has limitations, e.g. it can't recognise new dtypes like bfloat16 correctly. Which is a large reason why I am proposing that zarr-python have its own dtype interface in this PoC.

In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype?

This was a suggestion on my part, and may not actually turn out to be helpful. A lot of zarr code does things like dtype='>i2' - a good example from the current issues. There will need to be a way of tracking this runtime endianness in Zarr V3.

As you pointed out, it seems likely that this could be handled through a runtime configuration (ArrayConfig). But it felt more natural (to me) to keep track of this information on the dtype itself.

It might be the case that I need to flesh out a complete implementation to see what both options look like. But I think it seems likely that there will need to be some way to keep track of the runtime endianness.

And to be clear, in this dtype implementation, I'm not proposing to serialise the information in the endianness attribute to disk in the zarr metadata. It would purely be an implementation detail that zarr-python would use to keep track of runtime endianness.

@nenb
Author

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library, and I will try to give this close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?

@d-v-b Thanks for being so receptive! I'm happy to attend and give an overview.

What's the best format to present at the meeting? (I haven't attended before.) If I give a 5 minute overview (including some motivation from the AI side of things), and then leave 10 minutes for questions, would that work? Is there anything specific that you think is worth focusing on, e.g. the extension mechanism?

@d-v-b
Contributor

@nenb that plan sounds good. I'm looking forward to it!


@nenb
Author

@d-v-b @normanrz Thanks so much for the support during today's meeting, I'm very excited about the potential for zarr to help accelerate the AI field!

My current thinking is to close this PR (as it was just intended as an illustration of what I was hoping to gain support for), and then sketch out how to proceed elsewhere. Would you agree?

If so, could you give me a few pointers about how best to support coordination of this work please e.g. where to document a proposed list of tasks, who needs to be tagged etc.?

As mentioned, I'm very flexible as to how the work proceeds! My main interests are i) allowing users to register their own dtypes, ii) decoupling the zarr-python internal dtype API from the numpy dtype API (because I think the numpy dtype API is not appropriate for AI work at the moment) and iii) trying to achieve these things in the next couple of weeks.

I know this last point is a little artificial, but as mentioned during the meeting earlier, a lot of this field moves fast at the moment!


@joshmoore
Member

A heads up of a discussion around ZEP9 & extension naming:

https://ossci.zulipchat.com/#narrow/channel/423692-Zarr/topic/ZEP9.3A.20extension.20points

/me reminds himself yet again to look for a bot to do these cross-linkings...


@d-v-b
Contributor

d-v-b commented Feb 21, 2025 (edited)

@nenb could you explain the motivation for centering the design around classes that have no instance attributes? E.g., you have

    class Float16(ZarrDType):
        zarr_spec_format = "3"
        experimental = False
        endianness = "little"
        byte_count = 2
        to_numpy = np.dtype('float16')

With this design, __init__ is a no-op, so from what I can tell there's not really a difference between Float16 and Float16().

An alternative implementation would be:

    float16 = ZarrDType(zarr_spec_format=3, experimental=False, ...)

I'm guessing there was a reason why you didn't opt for the "specify the dtype in __init__" approach?


@nenb
Author

@d-v-b
TL;DR
I don't remember there being anything important behind this. From memory, I was just trying to hack something together quickly, and I think this is what I ended up with. I had a couple of 'requirements' in mind, which are described in detail below, and satisfying these was more important to me than any of the (throwaway) code that I wrote here.

Details
Requirements that I had:

  • once a dtype has been created, don't allow patching the attributes later
  • don't allow creation of a dtype instance with invalid attributes or with missing required attributes
  • make sure that the dtype can be discovered via the current entrypoint mechanism

For the first point, I thought it would lead to weird and confusing situations if users were able to (for example) change the byte_count to a different value after creation. I thought it best to make it as difficult as possible to modify the dtypes after creation, because they are fundamental to a lot of behaviour. (Personal choice)

For the second point, this was just my attempt at re-implementing Pydantic with metaclasses and convoluted code. 🤡 I think it's useful though to prevent creating/importing bad dtype classes, for the same reason I mentioned in the previous paragraph.

I didn't want to make many changes to the entrypoint code, as I felt it would distract from what I was trying to show here about the dtypes. This is probably why I ended up with the weird design for classes that you mentioned i.e. I just kept hacking until I got everything to work with the current entrypoint mechanism. Obviously, this is not a requirement when doing this work properly.

I would personally be happy with any code that can satisfy the above three requirements. And I'm happy to discuss any of these requirements if they are considered unhelpful for whatever reason!


@d-v-b
Contributor

@nenb what are your thoughts on the following things:

  • handling parametric dtypes, like [datetime64[ms], datetime64[s], ...], [S1, S2, ...]? For the datetimes the set of units is small enough that these can be created in advance, but the fixed length types like <S | V><length> are harder to pre-generate. If a user shows up with a concrete numpy dtype like S2, it's possible to create the corresponding class at runtime with something like type(f"FixedLengthString{length}", (ZarrDType,), {...kwargs}), but this feels a bit clunky compared to creating an instance of a variable-length string class, e.g. VarLenString(length=2).
  • I'm not sure we should flag experimental vs non-experimental dtypes. Any experimentalness should be represented by the context in which the dtype is used, not the dtype itself, because e.g. extension dtypes will mature, and then cease to be experimental, and that should not require changing the API of the dtype object itself.
  • I'm not sure the zarr format should be part of the dtype metadata. We should instead have a zarr-format-independent representation of dtypes, and treat the zarr format as a serialization issue.
  • Endianness is tricky because it isn't part of the dtype model of zarr v3 -- in v3, endianness is specified by a codec, not the dtype itself. So from a v3 POV, little-endian uint16 and big-endian uint16 specify two arrays with the same dtype but different codecs. This might mean that any dtype object has to take endianness as a parameter when serializing or generating numpy dtypes.

@nenb
Author

nenb commented Feb 26, 2025 (edited)

@d-v-b (I'll answer the rest of this message tomorrow)

handling parametric dtypes ...

Looking at the Arrow data types (Arrow is reasonably mature at this point; I think that it's probably reasonable to adopt their design decisions, unless there is a clear reason not to?), it looks like having a type parameter is sensible. I don't think I gave this enough thought previously. I think an explicit reference to type parameters in the Zarr v3 spec is also lacking. What are your thoughts about type parameters?

A potential challenge that I see would be in naming these parametric dtypes. Do you see issues with assigning the parameters in the configuration field, e.g.

    {
      "data_type": {
        "name": "DateTime",
        "configuration": {
          "unit": "ms",
          "timezone": "UTC"
        }
      }
    }

Something similar could happen for fixed-width strings. Is it critical that a dtype be identified by its name, or is it allowed to stuff identifying information in configuration like this?

The alternative seems a rather crude naming scheme where we attach the parameters to the type name.

it's possible to create the corresponding class at runtime ...

For (numpy) data types such as fixed-width strings and bytestrings (codes U and S respectively), I was leaning towards creating them at runtime. I hadn't considered how to handle parametric dtypes more generally (see previous paragraph), which was an oversight on my part.

I personally don't see an issue with creating parametric dtypes at runtime, but open to change my mind on this.
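The runtime-creation idea above can be sketched like this (ZarrDType here is a bare stand-in for the PoC base class, and the attribute set is illustrative):

```python
import numpy as np


class ZarrDType:  # stand-in for the PoC base class
    pass


# Sketch: generate a fixed-width bytestring dtype class on demand with
# three-argument type(), as discussed in the bullet above.
def make_fixed_length_bytes(length: int) -> type:
    return type(
        f"FixedLengthBytes{length}",
        (ZarrDType,),
        {"byte_count": length, "to_numpy": np.dtype(f"S{length}")},
    )
```

make_fixed_length_bytes(2) then behaves like a hand-written class for the numpy S2 dtype.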

@d-v-b
Contributor

Looking at the Arrow data types (Arrow is reasonably mature at this point; I think that it's probably reasonable to adopt their design decisions, unless there is a clear reason not to?), it looks like having a type parameter is sensible. I don't think I gave this enough thought previously. I think an explicit reference to type parameters in the Zarr v3 spec is also lacking. What are your thoughts about type parameters?

You're right that the Zarr V3 spec is not really clear about parameters for data types. On one hand it says that externally-defined data types can be identified with a {"name": <name>, "configuration": {...}} JSON object, and as you note the configuration is a natural place to put type parameters. On the other hand, the v3 spec already defines a parametric data type (r*, fixed-size bytes, nearly the same as the numpy V<num_bytes> dtype), and in this case the string name contains the parameters. So it seems like "stringly typed" and typed are both on the table.

In terms of personal preference I would lean toward putting parameters in a configuration object, and it seems like we are aligned there. But if someone really wants superficial similarity to numpy, then this would argue for baking parameters into the string name of the dtype.

From an API POV, each variation of a parametrized type could be a separate class, or instances of a class:

    class S1:
        capacity: ClassVar[Literal[1]] = 1

    class S2:
        capacity: ClassVar[Literal[2]] = 2

    s1 = S1()
    s2 = S2()
    ...

vs

    @dataclass
    class S:
        capacity: int

    s1 = S(capacity=1)
    s2 = S(capacity=2)

I'm leaning toward the second formulation, but I'd be open to feedback on whether the first is better / preferred.

@nenb
Author

nenb commented Feb 26, 2025 (edited)

I'm leaning toward the second formulation, but I'd be open to feedback on whether the first is better / preferred.

I vote for this as well. If there is more than one type parameter, the first formulation could be confusing.

I'm not sure we should flag experimental vs non experimental dtypes. Any experimentalness should be represented by the context in which the dtype is used, not the dtype itself, because e.g. extension dtypes will mature, and then cease to be experimental, and that should not require changing the API of the dtype object itself.

I had in mind that this would be a private part of the dtype API, so that someone could inspect a dtype at runtime and know whether or not it was part of the spec. Let's say I wanted to write code that only applied to experimental dtypes (e.g. emit warnings). How would code be able to infer if a dtype was experimental or not at runtime? Remember that the definition of what is a core type can change from one day to the next, but the (zarr-python) code itself will still be the same. The only way I can solve this is for the creator to signal the intent via an attribute like experimental. Perhaps what I'm saying is that experimentalness is a function of how the dtype is created rather than how it is used? This might need more discussion though; I'm not terribly confident in what I've written.

I'm not sure the zarr format should be part of the dtype metadata. We should instead have a zarr-format-independent representation of dtypes, and treat the zarr format as a serialization issue.

I think this is another variation of the last paragraph. How do I know what format a dtype belongs to at runtime? I could imagine this being useful at some point in the future (but maybe in the very distant future 😆). I don't have strong opinions on this though, and if we were able to infer the format of a dtype at runtime another way, I would also be happy with that.

Endianness is tricky because it isn't part of the dtype model of zarr v3 -- in v3, endianness is specified by a codec, not the dtype itself. So from a v3 POV, little-endian uint16 and big-endian uint16 specify two arrays with the same dtype but different codecs. This might mean that any dtype object has to take endianness as a parameter when serializing or generating numpy dtypes.

Yeah, I remember this causing me quite a few headaches. I think given all the zarr-python code currently out there, folks are going to be trying to do things like <i2 for quite a while into the future. I added the endianness attribute as a (private) part of the API, to try and help zarr-python developers support this use-case. Basically, zarr-python could use this attribute to track the endianness of the dtype in memory. So, when I create my array in zarr-python using >i2, zarr-python is still able to support this use-case by creating a big-endian dtype in memory. But, when serialising to disk, it would follow the spec and ultimately store this information in the codec. It would basically be an implementation detail of zarr-python.

I think I remember you pointing out in earlier issues that using codecs in this way diverges from the numpy dtype model, and this makes it difficult for zarr-python, given its close historical relationship with numpy. Adding the attribute on the dtype was my way of trying to support this. Again, it would ultimately be serialised to disk as part of the codec, but it would be part of the (Python) dtype private API as well, to continue supporting use cases like creating arrays with >i2 and <i2.

Let me know if this isn't clear. Given the number of issues related to endianness for v3 in zarr-python, I would love to try and solve as many of these as possible with a new dtype API.

@d-v-b
Contributor

With the spiritual successor to this PR merged, can we close this?


@nenb
Author

Closing after merge of #2874.

nenb closed this Jul 16, 2025

Reviewers

@normanrz left review comments

4 participants

@nenb @d-v-b @joshmoore @normanrz