Fix fill_value serialization issues #2802
Conversation
The bad news is that this issue is going to be slightly more than I've said here. The good news is that the property-based tests caught some edge cases.
Nice, I've been meaning to add this to Zarr:

```python
@st.composite
def v3_array_metadata(draw: st.DrawFn) -> bytes:
    from zarr.codecs.bytes import BytesCodec
    from zarr.core.chunk_grids import RegularChunkGrid
    from zarr.core.chunk_key_encodings import DefaultChunkKeyEncoding
    from zarr.core.metadata.v3 import ArrayV3Metadata

    # separator = draw(st.sampled_from(['/', '\\']))
    shape = draw(array_shapes)
    ndim = len(shape)
    chunk_shape = draw(npst.array_shapes(min_dims=ndim, max_dims=ndim))
    dtype = draw(zrst.v3_dtypes())
    fill_value = draw(npst.from_dtype(dtype))
    dimension_names = draw(
        st.none() | st.lists(st.none() | simple_text, min_size=ndim, max_size=ndim)
    )
    metadata = ArrayV3Metadata(
        shape=shape,
        data_type=dtype,
        chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape),
        fill_value=fill_value,
        attributes=draw(simple_attrs),
        dimension_names=dimension_names,
        chunk_key_encoding=DefaultChunkKeyEncoding(separator="/"),  # FIXME
        codecs=[BytesCodec()],
        storage_transformers=(),
    )
    return metadata.to_buffer_dict(prototype=default_buffer_prototype())["zarr.json"]
```

What do you think of a strategy like this?
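Independent of zarr internals, the property such a strategy buys us can be sketched with plain `json` — this is a simplified stand-in for `to_buffer_dict`/`from_dict`, not zarr's actual API:

```python
import json

def roundtrip(doc: dict) -> dict:
    # serialize strictly (rejecting bare NaN/Infinity tokens), then parse back
    return json.loads(json.dumps(doc, allow_nan=False))

# any generated metadata document should survive serialize -> parse unchanged
doc = {"zarr_format": 3, "node_type": "array", "shape": [4], "fill_value": 0}
assert roundtrip(doc) == doc
```

Hypothesis then simply feeds every drawn metadata document through this round trip and shrinks any counterexample it finds.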
moradology commented Feb 5, 2025 (edited)
I love the idea. I was thinking the other day that the obvious path out of the bugs that are currently popping up would be property-based testing, so I was pretty pleased to see that there's already some work in that direction. Out of curiosity, do we have something like JSON Schema that we could apply against the outputs to at least verify structure? We'd still need to define all the rules that exist in terms of value/type dependencies, etc., but that's an easy win if it exists somewhere.
I'm not aware of a JSON Schema definition for the array metadata. If one existed, it would necessarily only support partial validation, because JSON Schema can't express certain invariants in the metadata document, like the requirement that dimensional attributes (shape, chunk_shape, etc.) be consistent.
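To make the limitation concrete, here is a minimal sketch (not zarr code) of a hand-rolled structural validator; the key names follow the v3 metadata document, and the last check is exactly the kind of cross-field invariant JSON Schema alone cannot express:

```python
def validate_v3_structure(doc: dict) -> list[str]:
    """Partial structural checks, plus one cross-field invariant
    (dimensionality consistency) that JSON Schema cannot state."""
    errors = []
    for key in ("zarr_format", "shape", "data_type", "chunk_grid"):
        if key not in doc:
            errors.append(f"missing key: {key}")
    shape = doc.get("shape", [])
    chunk_shape = (
        doc.get("chunk_grid", {}).get("configuration", {}).get("chunk_shape", [])
    )
    if len(shape) != len(chunk_shape):
        errors.append("shape and chunk_shape must have the same length")
    return errors

doc = {
    "zarr_format": 3,
    "shape": [10, 10],
    "data_type": "int32",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [5]}},
}
assert validate_v3_structure(doc) == ["shape and chunk_shape must have the same length"]
```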
A fairly easy alternative way to handle this would be to simply write a test that takes the … I still think a generic metadata strategy is probably useful.
Yeah, you'd definitely need JSON Schema and custom validation rules that encode relationships among different fields.
Did some refactoring and organization of the property-based testing code and added round-trip testing for …
Hey @moradology, I added an …
moradology commented Feb 18, 2025 (edited)
Definitely - do you mind my pulling the generators up into their own submodule as done here, given their similar and unique functionality?
I think `testing/strategies.py` is the right place. Everything in there already handles the v2 vs v3 complexity.
moradology commented Feb 18, 2025 (edited)
@dcherian Maybe I got out ahead of myself. Do you think the small amount of extra organization I added is undesirable? Basically, I broke … My thought was just that continuing to add property-based tests is probably a good idea, and the single …
dcherian commented Feb 18, 2025 (edited)
It feels a bit premature to me. In any case, I find it nice to keep such refactoring PRs separate so that we can review (much) smaller diffs. In my experience, keeping diffs small is the best way to keep open-source contributions easy to merge. How about just keeping them in strategies for now? We can refactor later as needed.
Sounds like a plan. As a relatively new contributor to this library, I appreciate the firm opinions!
Thank you for considering the opinions in a constructive manner!
OK, so to get the property-based tests working I had to make some decisions about what serialization strategies looked like. I tried to follow the instructions available here, but other issues have come to the surface in the history of …
A bit confused about the code coverage going down. The tested constraints are definitely a bit tighter than they were, if anything.
tests/test_properties.py (outdated)
```python
@given(npst.from_dtype(dtype=np.dtype("float64"), allow_nan=True, allow_infinity=True))
def test_v2meta_nan_and_infinity(fill_value):
    ...
```
This could look more like https://github.com/zarr-developers/zarr-python/pull/2847/files#diff-d318cba7c9e4a6983338cf21df1db66aab796137a2fb4a76ce48c0afa17de2f9

We could abstract out an `assert_valid_v2_json_dict` and similarly for v3.
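A sketch of what such a helper might check — the name `assert_valid_v2_json_dict` comes from the suggestion above, but this body is only an illustration, not the eventual implementation:

```python
import json

def assert_valid_v2_json_dict(d: dict) -> None:
    # strict JSON round trip: raises on bare NaN/Infinity tokens and on
    # any value the encoder cannot represent; equality checks losslessness
    assert json.loads(json.dumps(d, allow_nan=False)) == d

# a NaN fill value must already be the *string* "NaN" at this point
assert_valid_v2_json_dict({"zarr_format": 2, "fill_value": "NaN"})
```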
moradology Feb 19, 2025 (edited)
Makes sense to me - would you like me to wait for that branch to go in and then add these validations to the broader test you have in mind?
Let's just copy the tests over and you can expand them here. I can handle the merge conflicts. It may very well be that this one goes in first :)
🫡
moradology Feb 19, 2025 (edited)
OK, so I'm curious about something. See this line: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/metadata/v3.py#L413C1-L414C1

Is that type correct? If so, it seems that numpy types should be serialized after `to_dict`. As of now, they're not (this PR changes that for v2 but not yet for v3). So what's the desired behavior?
Don't know. Ping @d-v-b
No, that type is wrong and it bugs me! For v3 metadata, `to_dict` returns an instance of `DataType` for the `data_type` key, and I think we registered that type with a custom JSON encoder.
I don't remember why this decision was made, but we should definitely fix it. FWIW, `dict[str, JSON]` is also sub-optimal: given that the keys of the metadata document are (almost) entirely static, we could use TypedDicts here.
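For readers unfamiliar with the "registered custom encoder" pattern mentioned above, a minimal sketch of how an enum like `DataType` can be made JSON-serializable via `json.JSONEncoder` — the `DataType` here is a stand-in for illustration, not zarr's actual class:

```python
import json
from enum import Enum

class DataType(Enum):  # stand-in for zarr's DataType enum
    int32 = "int32"
    float64 = "float64"

class MetadataEncoder(json.JSONEncoder):
    def default(self, o):
        # called only for values the default encoder cannot handle;
        # fall back to the enum member's value
        if isinstance(o, Enum):
            return o.value
        return super().default(o)

serialized = json.dumps({"data_type": DataType.float64}, cls=MetadataEncoder)
assert serialized == '{"data_type": "float64"}'
```

The drawback, as noted in the thread, is that the `dict[str, JSON]` annotation on `to_dict` is then misleading: the dict is not plain JSON until the custom encoder has run.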
moradology Feb 20, 2025 (edited)
OK, so the immediately obvious thing is that the serialization logic I've added to `to_dict` needs to be pushed down into `to_buffer_dict`, and `to_dict` should retain its (potentially) not-directly-serializable Python types. I assumed serialization should happen in `to_dict` because of `JSON` in the type signature, but also because there's some serialization happening in that function already: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/metadata/v2.py#L199-L204. I will address that, too.
From there, a TypedDict can be implemented for the `to_dict` output.
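As a sketch of that idea, a `TypedDict` pins down the static keys of the metadata document so type checkers can validate `to_dict` output; the field set below follows the v2 metadata document but is purely illustrative:

```python
from typing import TypedDict

class ArrayV2MetadataDict(TypedDict):
    # hypothetical illustration; the eventual TypedDict in zarr may differ
    zarr_format: int
    shape: tuple[int, ...]
    chunks: tuple[int, ...]
    dtype: str
    fill_value: object
    order: str

meta: ArrayV2MetadataDict = {
    "zarr_format": 2,
    "shape": (10,),
    "chunks": (5,),
    "dtype": "<i4",
    "fill_value": 0,
    "order": "C",
}
assert meta["zarr_format"] == 2
```

Unlike `dict[str, JSON]`, a misspelled or missing key here becomes a static type error rather than a runtime surprise.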
moradology Feb 24, 2025 (edited)
Alright, after a fair bit of back and forth I've gotten to a better (but imperfect) place in terms of serialization/deserialization. A couple of high points:

- fill value serialization happens in exactly one function, inside the `to_buffer_dict` method rather than `to_dict`: https://github.com/zarr-developers/zarr-python/pull/2802/files#diff-535b8bc19e7bf0cd5b07eb0d481ba8676f21b41cd352ddcbfc92c4667f124804R112-R133
- deserialization/parsing happens entirely inside one function rather than in a few different places. Hopefully this makes things a bit easier to follow: https://github.com/zarr-developers/zarr-python/pull/2802/files#diff-535b8bc19e7bf0cd5b07eb0d481ba8676f21b41cd352ddcbfc92c4667f124804R318-R392

I also played around with removing the `dict[str, JSON]` type signature from the `to_dict`/`from_dict` methods, but upon further inspection those seem to be expected/baked in for other `Metadata` subclasses, in particular `ChunkKeyEncoding`, `GroupMetadata`, and `ChunkGrid`. It seems to me that really disentangling this type signature might be out of scope here and probably deserves a separate PR.
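The heart of that single fill-value serialization function can be sketched like this (a simplified stand-in, not the PR's actual code) — the zarr v2 spec encodes non-finite floats as the strings `"NaN"`, `"Infinity"`, and `"-Infinity"`:

```python
import math

def serialize_float_fill(v: float):
    # non-finite floats have no valid JSON literal, so the v2 spec
    # uses sentinel strings instead
    if math.isnan(v):
        return "NaN"
    if math.isinf(v):
        return "Infinity" if v > 0 else "-Infinity"
    return v

assert serialize_float_fill(float("nan")) == "NaN"
assert serialize_float_fill(float("-inf")) == "-Infinity"
assert serialize_float_fill(1.5) == 1.5
```

Centralizing this in `to_buffer_dict` means `to_dict` can keep returning the raw float while the on-disk document stays spec-compliant.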
@d-v-b or @dcherian - thoughts on this set of changes? No pressure, I just don't want this to get too stale if possible. The PR is now non-trivial, but I think it does a decent job of organizing and defining serialization/deserialization behaviors that were previously hard to see at a glance or else not well defined. It also verifies round-tripping with property-based tests, which is how certain undefined behaviors were uncovered (e.g. 'NaT' for datetime dtypes).
```python
                f"fill_value {fill_value!r} is not valid for dtype {dtype}; must be a unicode string"
            )
        elif dtype.kind in "SV" and isinstance(fill_value, str):
            fill_value = base64.standard_b64decode(fill_value)
```
Out of scope, but does anyone know why we base64-encode scalars with dtype `S`? `S` is ASCII-encoded strings, which shouldn't need any encoding/decoding.
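One plausible answer (my reading, not settled in this thread): NumPy `S` scalars are raw byte strings, and nothing stops them from containing non-ASCII or unprintable bytes that cannot be embedded in a JSON document directly, so a byte-safe text encoding is needed:

```python
import base64

# arbitrary bytes, including NUL and values above 0x7F
raw = b"\x00\xfe\xff"

# base64 gives a JSON-safe ASCII representation that round-trips losslessly
encoded = base64.standard_b64encode(raw).decode("ascii")
assert base64.standard_b64decode(encoded) == raw
```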
This looks good to me; I'm going to merge soon unless there are objections. Good to keep things moving.
This looks great overall - I think there are some improvements that could be made in terms of code structuring/readability; I left some comments inline.

There are also lots of lines that codecov claims aren't covered here - are these real, or is codecov not working?
moradology commented Mar 5, 2025 (edited)
Glad I got the pushback to go through things again, because the property-based tests came across an error related to floating point precision at really big numbers (in extremely rare cases of integers near …).
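The underlying hazard (my illustration of the general failure mode, not the PR's exact bug): float64 cannot represent every integer above 2**53, so an int -> float -> int round trip can silently change the value for very large integers:

```python
big = 2**63 - 1                          # near the int64 maximum
assert int(float(big)) == big + 1        # rounds up to exactly 2**63

assert float(2**53 + 1) == float(2**53)  # consecutive integers collide
```

This is exactly the kind of rare, boundary-value case Hypothesis is good at flushing out.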
moradology commented Mar 5, 2025 (edited)
@dstansby Bumping this again just to make sure I don't lose track of it. If you all need some help doing PR reviews, etc. (obviously there's a lot going on), I'd be happy to lend a hand on smaller/less controversial PRs!
I spotted a wild `print()` statement that should probably be removed 😄 - otherwise looks good to me now, thanks for making the changes!
Co-authored-by: David Stansby <dstansby@gmail.com>
3b6565b into zarr-developers:main
* Fix fill_value serialization of NaN
* Round trip serialization for array metadata v2/v3
* Unify metadata v2 fill value parsing
* Test structured fill_value parsing
* Remove redundancies, fix integral handling
* Reorganize structured fill parsing
* Bump up hypothesis deadline
* Remove hypothesis deadline
* Update tests/test_v2.py (Co-authored-by: David Stansby <dstansby@gmail.com>)

---------

Co-authored-by: Deepak Cherian <deepak@cherian.net>
Co-authored-by: David Stansby <dstansby@gmail.com>

The current serialization of `fill_value` in `ArrayV2Metadata` does not fully conform to the spec, particularly for: …

Changes

- set `allow_nan` to `False` on `json.dumps`

Resolves: #2741
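The motivation for the `allow_nan=False` change can be seen directly with the standard library: by default `json.dumps` emits bare `NaN`/`Infinity` tokens, which are not valid JSON, while `allow_nan=False` makes the encoder raise and forces an explicit, spec-compliant encoding:

```python
import json

# default behavior: a bare NaN token, which strict JSON parsers reject
assert json.dumps(float("nan")) == "NaN"

# allow_nan=False raises instead, surfacing the bug at serialization time
try:
    json.dumps(float("nan"), allow_nan=False)
    raise AssertionError("expected ValueError")
except ValueError:
    pass

# the spec-compliant alternative is the explicit string form
assert json.dumps("NaN") == '"NaN"'
```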
TODO:

- `docs/user-guide/*.rst`
- `changes/`