add a runtime type checker for metadata objects #3400


Open

d-v-b wants to merge 32 commits into zarr-developers:main from d-v-b:feat/type-checker

Conversation

d-v-b (Contributor) commented Aug 24, 2025 (edited)

This PR adds a runtime type checker specifically for checking JSON-like data against a type definition. It's currently a draft while I get the test suite happy and refine the API, but it's also ready for people to look at and try out. I'm pretty convinced of its utility, but I also think we should have a good discussion about whether this feature is a good idea.

Demo

The basic API looks like this:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/type-checker",
# ]
# ///
from typing import Literal, NotRequired, TypedDict

from zarr.core.type_check import check_type

class MyChunkGridConfig(TypedDict):
    prop_a: int
    prop_b: str
    prop_c: str | int
    optional_prop: NotRequired[bool]

class MyChunkGrid(TypedDict):
    name: Literal["awesome_chunk_grid"]
    configuration: MyChunkGridConfig

valid_data = [
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "string", "prop_c": 10}},
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "string", "prop_c": "10"}},
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "string", "prop_c": "10", "optional_prop": True}},
]
for d in valid_data:
    result = check_type(d, MyChunkGrid)
    print(result.success, result.errors)
"""
True []
True []
True []
"""

# prop_b should be a string, but it's an int
invalid_data = [
    # invalid type
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": 10, "prop_c": 10}},
    # missing key
    {"name": "awesome_chunk_grid", "configuration": {"prop_a": 10, "prop_b": "10"}},
]
for d in invalid_data:
    result = check_type(d, MyChunkGrid)
    print(result.success, result.errors)
"""
False ["value['configuration']['prop_b'] expected an instance of <class 'str'> but got 10 with type <class 'int'>"]
False ["value['configuration'] missing required key 'prop_c'"]
"""
```

Some aspects might evolve while this is a draft, like the nature of the error messages.

Supported types

This is not a general-purpose type checker. It is targeted at the types relevant to Zarr metadata documents, and so it supports the following narrow set of types (a sketch exercising a few of them follows the list):

  • int, bool, float, str
  • union
  • tuple
  • list
  • sequence
  • mapping
  • typeddict
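
Here is a small sketch exercising a few of these, assuming the `check_type` API from this branch:

```python
# A sketch exercising a few of the supported types, assuming the
# check_type API from this branch (zarr.core.type_check).
from collections.abc import Mapping

from zarr.core.type_check import check_type

# unions of scalars
print(check_type(10, int | str).success)   # True
print(check_type(1.5, int | str).success)  # False

# fixed-length and variadic tuples
print(check_type((1, "a"), tuple[int, str]).success)   # True
print(check_type((1, 2, 3), tuple[str, ...]).success)  # False

# mappings with typed keys and values
print(check_type({"a": 1}, Mapping[str, int]).success)  # True
```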

cost

maintenance burden

The type checker itself is ~530 lines of commented code, broken up into functions which are mostly easy to understand. The TypedDict part, and the logic for resolving generic types, is convoluted and potentially sensitive to changes in how Python exposes type annotations at runtime. Many type annotation features have been designed for static type checkers, not for use within a Python program, so some of this is rather fiddly. But I don't think we are relying on any brittle or private APIs here.

performance

As currently implemented, the type checker will report all detectable errors:

```python
>>> check_type(tuple(range(4)), tuple[str, ...])
TypeCheckResult(
    success=False,
    errors=[
        "value[0] expected an instance of <class 'str'> but got 0 with type <class 'int'>",
        "value[1] expected an instance of <class 'str'> but got 1 with type <class 'int'>",
        "value[2] expected an instance of <class 'str'> but got 2 with type <class 'int'>",
        "value[3] expected an instance of <class 'str'> but got 3 with type <class 'int'>",
    ],
)
```

This is wasted compute when we don't care exactly how mismatched the data is, but it is a better user experience. We might need to tune this if performance becomes a problem, e.g. by introducing a "fail_fast" option that returns on the first error.
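
As a rough sketch of what that could look like (hypothetical; `check_tuple_items` is invented for illustration, though `TypeCheckResult` mirrors the shape used in this PR):

```python
# Hypothetical sketch of a "fail_fast" mode; nothing here is in the PR.
# The idea: each checking routine stops at the first error instead of
# accumulating all of them.
from dataclasses import dataclass, field

@dataclass
class TypeCheckResult:
    success: bool
    errors: list[str] = field(default_factory=list)

def check_tuple_items(
    items: tuple[object, ...], item_type: type, *, fail_fast: bool = False
) -> TypeCheckResult:
    errors: list[str] = []
    for i, item in enumerate(items):
        if not isinstance(item, item_type):
            errors.append(f"value[{i}] expected an instance of {item_type} but got {item!r}")
            if fail_fast:
                break  # skip checking the remaining items entirely
    return TypeCheckResult(success=not errors, errors=errors)

# with fail_fast=True, the example above would report only value[0]
```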

benefit

We can instantly remove a lot of special-purpose functions. Most of the functions named parse_* (~30+ functions) and essentially all of the functions named *check_json* (~30 functions) could be replaced or simplified with the check_type function.
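
To make that concrete, here is a sketch of the before/after for one such function; `parse_separator` and `SeparatorConfig` are invented for illustration, only `check_type` and `guard_type` come from this branch:

```python
# Hypothetical before/after for a parse_*-style function. SeparatorConfig
# and both parse functions are invented; check_type and guard_type are
# from this branch.
from typing import Literal, TypedDict

from zarr.core.type_check import check_type, guard_type

class SeparatorConfig(TypedDict):
    separator: Literal[".", "/"]

# the hand-written style used in main today
def parse_separator_by_hand(data: object) -> SeparatorConfig:
    if not isinstance(data, dict) or data.get("separator") not in (".", "/"):
        raise ValueError(f"Invalid separator config: {data!r}")
    return data  # type: ignore[return-value]

# the same function using the type checker from this PR
def parse_separator(data: object) -> SeparatorConfig:
    if guard_type(data, SeparatorConfig):
        return data  # narrowed by the type guard, no ignore needed
    raise ValueError(check_type(data, SeparatorConfig).errors)
```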

We can also make our JSON loading routines type-safe:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/type-checker",
# ]
# ///
from zarr.core.common import ArrayMetadataJSON_V3
from zarr.core.type_check import guard_type

unknown_data: dict[str, object] = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": (10, 10),
    "data_type": "uint8",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": (10, 10)}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "codecs": ({"name": "bytes"},),
    "fill_value": 0,
}
if guard_type(unknown_data, ArrayMetadataJSON_V3):
    reveal_type(unknown_data)
```

mypy could not infer the type correctly, but basedpyright does:

```
zarr-python git:(feat/type-checker) ✗ uvx basedpyright test.py
/Users/d-v-b/dev/zarr-python/test.py
  /Users/d-v-b/dev/zarr-python/test.py:25:17 - information: Type of "unknown_data" is "ArrayMetadataJSON_V3"
0 errors, 0 warnings, 1 note
```

We could write a bespoke function that specifically checks all the possibilities for Zarr V3 metadata, but then we would need to painfully modify that function by hand to support something like this:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/type-checker",
# ]
# ///
from collections.abc import Mapping

from typing_extensions import TypedDict

from zarr.core.common import ArrayMetadataJSON_V3
from zarr.core.type_check import check_type

class GeoZarrAttrs(TypedDict):
    geozarr: Mapping[str, object]

class GeoZarrArray(ArrayMetadataJSON_V3):
    attributes: GeoZarrAttrs

unknown_data: dict[str, object] = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": (10, 10),
    "data_type": "uint8",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": (10, 10)}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "codecs": ({"name": "bytes"},),
    "fill_value": 0,
    "attributes": {"not_geozarr": "bar"},
}
result = check_type(unknown_data, GeoZarrArray)
print(result)
"""
TypeCheckResult(success=False, errors=["value['attributes'] missing required key 'geozarr'"])
"""
```

alternatives

we could use an external JSON validation / type checking library like pydantic, attrs, msgspec, beartype, etc. But I would rather not add a dependency. With the approach in this PR, we keep control in-house, and because this PR just adds functions, it composes with the rest of our codebase as it stands. (FWIW, right now this type checker doesn't do any parsing, it only validates. If you think we should parse instead of just validating, then IMO that's a job for our array metadata classes.)

we could also do nothing, and continue writing JSON parsing code by hand. But I would rather not do that, because it invites bugs and makes it hard to keep up with sneaky spec changes. Specifically, I'm planning on writing a lot of new types to model the codecs defined in #3376, and I would rather just write the type and get the type checking (and type safety) for free.

closes #3285

github-actions bot added the 'needs release notes' label (Automatically applied to PRs which haven't added release notes) Aug 24, 2025
d-v-b changed the title from 'add a runttime type checker for metadata objects' to 'add a runtime type checker for metadata objects' Aug 24, 2025
github-actions bot removed the 'needs release notes' label Aug 25, 2025
d-v-b marked this pull request as ready for review August 25, 2025 13:09
d-v-b (Contributor, Author) commented Aug 25, 2025 (edited)

this is pretty substantial, so I would appreciate a lot of eyes @zarr-developers/python-core-devs

if anyone has concerns about whether we should do any runtime type checking at all, maybe send those thoughts to the issue this PR closes

I'm going to keep working on tests for the type checker, but so far it's working great.

This PR does violate Liskov substitution for a few subclasses of our Metadata ABC, because that class requires that to_dict and from_dict use dict[str, JSON], which is not very accurate. After this PR we need to change that class so that it supports type safety; then we won't be violating Liskov any more.

Similarly, there are lots of # type: ignore comments added in various places. As you might guess, those were added because without them mypy flags type errors. Many of these will go away when we fix the lax typing of the Metadata ABC and other classes.

@TomAugspurger I think you in particular will appreciate some of the effects of this PR. Since we can annotate methods like ArrayV3Metadata.from_dict() as taking ArrayMetadataJSON_V3, we don't need to do any runtime validation inside from_dict. The assumption is that the caller has already checked the input. This is not possible without the types defined in this PR, and it's not practical without the type checker. Once we extrapolate the type-safe style, we can push almost all our type checking to the IO boundary.

That being said, I think the ArrayMetadata class will still need to do some internal consistency checks, like ensuring that the number of dimension names matches the length of shape. I don't think we want our type checker to be smart enough to catch that kind of thing.
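
The kind of check I mean would look something like this (a hypothetical sketch; `check_dimension_names` is not a function in this PR):

```python
# Hypothetical sketch of an internal consistency check that the type
# checker deliberately should not catch: it concerns the relationship
# between two fields, not the type of either one.
def check_dimension_names(
    shape: tuple[int, ...],
    dimension_names: tuple[str | None, ...] | None,
) -> None:
    if dimension_names is not None and len(dimension_names) != len(shape):
        raise ValueError(
            f"dimension_names has {len(dimension_names)} entries, "
            f"but shape has {len(shape)} dimensions"
        )
```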

codecov bot commented Aug 25, 2025 (edited)

Codecov Report

❌ Patch coverage is 98.01980% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.90%. Comparing base (7d43f32) to head (c7096b1).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/zarr/core/type_check.py | 97.95% | 5 Missing ⚠️ |
| src/zarr/core/array.py | 86.66% | 2 Missing ⚠️ |
| src/zarr/core/metadata/v3.py | 96.15% | 1 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #3400      +/-   ##
==========================================
- Coverage   94.92%   94.90%   -0.03%
==========================================
  Files          79       80       +1
  Lines        9503     9752     +249
==========================================
+ Hits         9021     9255     +234
- Misses        482      497      +15
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/zarr/abc/codec.py | 98.50% <100.00%> (-0.22%) ⬇️ |
| src/zarr/api/asynchronous.py | 90.00% <100.00%> (-0.04%) ⬇️ |
| src/zarr/core/common.py | 95.40% <100.00%> (+2.06%) ⬆️ |
| src/zarr/core/dtype/common.py | 70.58% <100.00%> (-14.96%) ⬇️ |
| src/zarr/core/dtype/npy/bool.py | 100.00% <100.00%> (ø) |
| src/zarr/core/dtype/npy/bytes.py | 99.49% <100.00%> (-0.01%) ⬇️ |
| src/zarr/core/dtype/npy/complex.py | 98.80% <100.00%> (+0.01%) ⬆️ |
| src/zarr/core/dtype/npy/float.py | 98.91% <100.00%> (+0.01%) ⬆️ |
| src/zarr/core/dtype/npy/int.py | 99.37% <100.00%> (+<0.01%) ⬆️ |
| src/zarr/core/dtype/npy/string.py | 97.79% <100.00%> (+0.01%) ⬆️ |

... and 10 more

d-v-b (Contributor, Author) commented:

In the dev meeting today @jhamman rightly challenged me to justify writing our own in-house runtime type checker instead of bringing in pydantic. I had a few answers:

  • The type checker here is small (~500 LoC, with docstrings) and it has a very simple structure -- a bunch of functions. Pydantic is huge and complex. It relies on dynamic class creation, and bugs in pydantic can be hard to debug.
  • pydantic is class-centric. I think a functional API for type checking is a better fit for our codebase. Replacing all our classes with BaseModel is relatively disruptive compared to bringing in a few functions to help with a type guard.
  • Pydantic contains a lot of stuff we don't need -- for Zarr, we only need to check the types that could fit in zarr.json, but pydantic covers a lot more. The scale of pydantic is also an upside -- it means that new features will get added, and bugs will be squashed, by a large number of developers. But as long as we test our in-house type checker extensively against the small set of types that matter to us, I don't think we would gain much from adding pydantic as a dependency.
  • Pydantic makes breaking changes. If we depend on pydantic, we then become responsible for managing how those breaking changes impact our users. This is a cost that I think we can avoid.
  • The changes here are very non-invasive (because it's just functions). We can always switch to pydantic later, if this type checker sucks. We could also revert and replace all instances of this type checker with a hand-written type checking function.

My calculation was that 500 LoC for an API that does exactly what we need (and nothing else) was a better deal than bringing in a big, celebrity dependency that would require pretty substantial changes to basic Zarr Python classes. Other folks might see this calculation differently.

d-v-b (Contributor, Author) commented:

As an illustration, consider this diff enabled by this PR:

```diff
@@ -286,14 +283,7 @@ class NullTerminatedBytes(ZDType[np.dtypes.BytesDType[int], np.bytes_], HasLengt
             True if the input is a valid representation of this class in Zarr V3, False
             otherwise.
         """
-        return (
-            isinstance(data, dict)
-            and set(data.keys()) == {"name", "configuration"}
-            and data["name"] == cls._zarr_v3_name
-            and isinstance(data["configuration"], dict)
-            and "length_bytes" in data["configuration"]
-            and isinstance(data["configuration"]["length_bytes"], int)
-        )
+        return check_type(data, NullTerminatedBytesJSON_V3).success
```

I worry that if type safety requires writing tedious, error-prone functions like what we have in main, we might not do it, and then we lose type safety. Boiling it down to a much simpler function makes type safety look a lot easier.

TomAugspurger (Contributor) left a comment:

Gave a quick look, but haven't gone through in detail. At a high level:

  1. I'd prefer to avoid a dependency on pydantic (or any other runtime validators)
  2. At a glance, the implementation looks reasonable, but I haven't had a chance to go through it in detail (and likely won't for a while)
  3. I'd say anything we can do to limit scope, with the goal of limiting complexity, is worth considering.

Comment on lines -43 to -44
```python
class CodecJSON_V2(TypedDict, Generic[TName]):
    """The JSON representation of a codec for Zarr V2"""
```
TomAugspurger (Contributor):

Probably should reexport these types here for backwards compatibility.

d-v-b (Contributor, Author):

done in 30d48a8

```diff
         If the dictionary data is invalid or incompatible with either Zarr format 2 or 3 array creation.
         """
-        metadata = parse_array_metadata(data)
+        metadata = parse_array_metadata(data)  # type: ignore[call-overload]
```
TomAugspurger (Contributor):

Is the type: ignore needed due to the presence of ArrayMetadata in the signature for __init__?

d-v-b (Contributor, Author):

we need the overload because the data input to from_dict is typed as dict[str, JSON]. I could change that to the union of the two TypedDict metadata types, but this will incompatibly override the base class implementation of from_dict, and so I will need another # type: ignore there. The only way to clean all of this up is to redo our base Metadata ABC, which I deemed out of scope for this PR.
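
For concreteness, a minimal sketch of the constraint (the class bodies are stand-ins, not the real zarr classes):

```python
# Hypothetical sketch of the override problem described above.
from zarr.core.common import ArrayMetadataJSON_V3

class Metadata:
    @classmethod
    def from_dict(cls, data: dict[str, object]) -> "Metadata":
        raise NotImplementedError

class ArrayV3Metadata(Metadata):
    @classmethod
    def from_dict(cls, data: ArrayMetadataJSON_V3) -> "ArrayV3Metadata":  # type: ignore[override]
        # narrowing the parameter type incompatibly overrides the base
        # class signature, hence the ignore
        raise NotImplementedError
```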

"""

kind:Literal["inline"]
must_understand:Literal["false"]
TomAugspurger (Contributor):

Is this correct for v2 consolidated metadata? IIRC, it didn't have must_understand and maybe had a version field? I guess maybe I'm mixing up consolidated metadata as written by zarr-python 2.x, and the v3 spec.

d-v-b (Contributor, Author):

wasn't v2 consolidated metadata exclusively via an external .zmetadata file, with no change to .zattrs?

d-v-b (Contributor, Author):

you're right, the .zmetadata file had version and "metadata" keys:

```python
out = {
    "zarr_consolidated_format": 1,
    "metadata": {key: json_loads(store[key]) for key in store if is_zarr_key(key)},
}
```

But the above TypedDict is for modelling the use of our inline consolidated metadata model for Zarr V2 arrays. I would need a totally separate type for the .zmetadata contents.

```python
    return tp


def check_type(obj: Any, expected_type: Any, path: str = "value") -> TypeCheckResult:
```
TomAugspurger (Contributor):

Can we narrow the type on expected_type here? And do we gain anything if we do?

I'd like to minimize complexity. We know we have a relatively constrained set of expected objects we're going to be passing here (node metadata, codec configuration, ...). If there's anything we can leverage to make this simpler, let's do it.

d-v-b (Contributor, Author):

I think you're kind of asking if we could define a "type-checkable type", i.e. a union of all the types we support runtime type checking for. I'd like that, but I haven't looked into setting that up yet. I don't know what the blockers would be, and I think it would be potentially helpful to more clearly define the scope of this checker.

although, in its current implementation, any type that isn't associated with a specific checking routine falls back to isinstance, which does work for pretty much all user-defined classes...
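
For instance, under that fallback a check against an arbitrary class should behave like a plain `isinstance` call (sketch, assuming the fallback described above):

```python
# Sketch of the isinstance fallback described above: a type with no
# dedicated checking routine is checked with plain isinstance.
from pathlib import Path

from zarr.core.type_check import check_type

print(check_type(Path("/tmp"), Path).success)  # True, via the isinstance fallback
print(check_type("/tmp", Path).success)        # False: a str is not a Path
```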

d-v-b (Contributor, Author):

have a look at c7096b1, it wasn't too painful to narrow from Any to type | types.UnionType | ForwardRef | None

```python
T = TypeVar("T")


def ensure_type(obj: object, expected_type: type[T], path: str = "value") -> T:
```
TomAugspurger (Contributor):

path is unused?

d-v-b (Contributor, Author):

it's passed to check_type a few lines down
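
i.e. something along these lines (a sketch of the shape, not the PR's exact implementation):

```python
# Sketch of how ensure_type plausibly threads path into check_type; not
# the PR's exact implementation.
from typing import TypeVar

from zarr.core.type_check import check_type

T = TypeVar("T")

def ensure_type(obj: object, expected_type: type[T], path: str = "value") -> T:
    result = check_type(obj, expected_type, path)  # path seeds the error messages
    if not result.success:
        raise TypeError("; ".join(result.errors))
    return obj  # type: ignore[return-value]
```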

"array":ArrayV3Metadata.from_dict(
{
**array_metadata,
**array_metadata,# type: ignore[typeddict-item]
d-v-b (Contributor, Author):

mypy doesn't know about structural assignability in TypedDicts, i.e. the fact that TypedDicts tolerate extra keys
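
A small example of what's being worked around (assuming standard mypy behavior on TypedDict unpacking):

```python
# At runtime a TypedDict value may carry extra keys, but mypy rejects
# unpacking a wider TypedDict where a narrower one is expected.
from typing import TypedDict

class Narrow(TypedDict):
    name: str

class Wide(TypedDict):
    name: str
    extra: int

wide: Wide = {"name": "x", "extra": 1}
narrow: Narrow = {**wide}  # flagged by mypy, though tolerated at runtime
```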


Reviewers

@TomAugspurger left review comments

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

unified runtime type checking for our json data

2 participants

@d-v-b, @TomAugspurger
