
Fix fill_value serialization of NaN; add property-based tests #2802

Draft: moradology wants to merge 1 commit into main

Conversation

moradology
Contributor

The current serialization of fill_value in ArrayV2Metadata does not fully conform to the spec, particularly for:

  • NaN and Infinity values, which must be serialized as strings ("NaN", "Infinity", "-Infinity").
  • Complex numbers (np.complex64, np.complex128), which must be stored as two-element arrays [real, imag], with each element following the NaN/Infinity rules above (see the sketch below).
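
For concreteness, a minimal sketch of these encoding rules (illustrative only; the helper name is hypothetical and this is not the PR's actual _sanitize_fill_value()):

import math

import numpy as np


def encode_v2_fill_value(value):
    """Encode a fill value per the v2 spec rules described above (sketch)."""
    if isinstance(value, (complex, np.complexfloating)):
        # complex values become [real, imag], each element following the float rules
        return [encode_v2_fill_value(value.real), encode_v2_fill_value(value.imag)]
    if isinstance(value, (float, np.floating)):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return float(value)
    return value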

Changes

  • Updated _sanitize_fill_value() to enforce correct JSON serialization.
  • Fixed test_v2meta_fill_value_serialization() to compare expected and actual JSON using a normalized representation.
  • Introduced property-based testing with Hypothesis to generate valid input cases and verify compliance.
  • Enforced compliance (to some degree) by setting allow_nan=False in json.dumps() (see the snippet below).
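
With allow_nan=False, any NaN or Infinity that slips through un-encoded fails loudly at serialization time, because the standard library refuses to emit the non-standard JSON tokens:

import json

json.dumps({"fill_value": float("nan")}, allow_nan=False)  # raises ValueError
json.dumps({"fill_value": "NaN"}, allow_nan=False)         # fine: the string form is valid JSON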

Resolves: #2741

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Feb 5, 2025
@moradology
Contributor Author

The bad news: this issue is going to be slightly more involved than I've described here. The good news: the property-based tests caught some edge cases.

@dcherian
Contributor

dcherian commented Feb 5, 2025

Nice, I've been meaning to add this to Zarr:

import hypothesis.extra.numpy as npst
import hypothesis.strategies as st
import zarr.testing.strategies as zrst
from zarr.core.buffer import default_buffer_prototype

# array_shapes, simple_text, and simple_attrs are assumed to be strategies
# defined alongside this composite (e.g. in the surrounding test module)
@st.composite
def v3_array_metadata(draw: st.DrawFn) -> bytes:
    from zarr.codecs.bytes import BytesCodec
    from zarr.core.chunk_grids import RegularChunkGrid
    from zarr.core.chunk_key_encodings import DefaultChunkKeyEncoding
    from zarr.core.metadata.v3 import ArrayV3Metadata

    # separator = draw(st.sampled_from(['/', '\\']))
    shape = draw(array_shapes)
    ndim = len(shape)
    chunk_shape = draw(npst.array_shapes(min_dims=ndim, max_dims=ndim))
    dtype = draw(zrst.v3_dtypes())
    fill_value = draw(npst.from_dtype(dtype))
    dimension_names = draw(
        st.none() | st.lists(st.none() | simple_text, min_size=ndim, max_size=ndim)
    )

    metadata = ArrayV3Metadata(
        shape=shape,
        data_type=dtype,
        chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape),
        fill_value=fill_value,
        attributes=draw(simple_attrs),
        dimension_names=dimension_names,
        chunk_key_encoding=DefaultChunkKeyEncoding(separator="/"),  # FIXME
        codecs=[BytesCodec()],
        storage_transformers=(),
    )

    return metadata.to_buffer_dict(prototype=default_buffer_prototype())["zarr.json"]

What do you think of an array_metadata_json(zarr_formats...) strategy that just returns the JSON, so we can test whether it satisfies the spec for V2 and V3?
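
A rough sketch of what such a strategy could look like, building on the composite above (the parameters and the v2_array_metadata counterpart are illustrative, not an agreed-on API):

import json

@st.composite
def array_metadata_json(draw: st.DrawFn, *, zarr_formats=(2, 3)) -> dict:
    # v2_array_metadata is a hypothetical v2 analogue of the composite above,
    # returning the serialized ".zarray" buffer for an ArrayV2Metadata draw
    fmt = draw(st.sampled_from(zarr_formats))
    buf = draw(v3_array_metadata()) if fmt == 3 else draw(v2_array_metadata())
    # hand back parsed JSON so tests can assert spec compliance directly
    return json.loads(buf.to_bytes())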

@moradology
Contributor Author

moradology commented Feb 5, 2025

I love the idea. I was thinking the other day that the obvious path out of the bugs that are currently popping up would be property-based testing, so I was pretty pleased to see that there's already some work in that direction.

Out of curiosity, do we have something like json schema that we could apply against the outputs to at least verify structure? We'd still need to define all the rules that exist in terms of value/type dependencies, etc., but that's an easy win if it exists somewhere.

@dcherian
Contributor

dcherian commented Feb 5, 2025

Out of curiosity, do we have something like json schema that we could apply against the outputs to at least verify structure?

Don't know. ping @jhamman @d-v-b

@d-v-b
Contributor

d-v-b commented Feb 6, 2025

I'm not aware of a JSON schema definition for the array metadata. If one existed, it would necessarily only support partial validation, because JSON schema can't express certain invariants in the metadata document, like the requirement that dimensional attributes (shape, chunk_shape, etc) be consistent.
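
As an illustration of that kind of partial validation, a fragment covering fill_value alone could be written with the third-party jsonschema package (this schema is hypothetical, not something that ships with zarr):

import jsonschema

# fill_value may be null, a number, one of the sentinel strings, or a
# two-element [real, imag] array; cross-field invariants such as
# len(shape) == len(chunks) cannot be expressed here, as noted above
fill_value_schema = {
    "oneOf": [
        {"type": "null"},
        {"type": "number"},
        {"enum": ["NaN", "Infinity", "-Infinity"]},
        {
            "type": "array",
            "minItems": 2,
            "maxItems": 2,
            "items": {"oneOf": [{"type": "number"}, {"enum": ["NaN", "Infinity", "-Infinity"]}]},
        },
    ]
}

jsonschema.validate("NaN", fill_value_schema)              # passes
jsonschema.validate([0.0, "Infinity"], fill_value_schema)  # passes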

@dcherian
Contributor

dcherian commented Feb 6, 2025

A fairly easy alternative way to handle this would be to simply write a test that takes the arrays strategy, extracts the metadata, converts it to JSON, and then asserts that the JSON meets the spec (as best we can).

I still think a generic metadata strategy is probably useful.
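
A minimal sketch of the test described above, assuming the arrays() strategy already in zarr.testing.strategies and pytest (illustrative only, not part of this PR):

import json

import pytest
from hypothesis import given
from zarr.core.buffer import default_buffer_prototype
from zarr.testing.strategies import arrays


@given(arrays())
def test_metadata_json_meets_spec(arr):
    key = "zarr.json" if arr.metadata.zarr_format == 3 else ".zarray"
    raw = arr.metadata.to_buffer_dict(prototype=default_buffer_prototype())[key].to_bytes()
    # parse_constant fires only for bare NaN / Infinity / -Infinity tokens, which
    # the spec forbids: compliant documents encode these as strings instead
    json.loads(raw, parse_constant=lambda token: pytest.fail(f"non-compliant token {token!r}"))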

@moradology
Contributor Author

Yeah, you'd definitely need json-schema plus custom validation rules that encode the relationships among different fields.


Successfully merging this pull request may close: Broken NaN encoding when writing v2 storage format from v3 library