Disallow extra fields other than "@context" #266

Draft
wants to merge 8 commits into
base: master

Conversation

candleindark
Member

@candleindark candleindark commented Nov 18, 2024

This PR closes #75. It addresses #75 in the following manner.

  1. Make the models disallow data instances that have extra fields other than the "@context" field, at both the Pydantic and JSON levels.
  2. The Pydantic models continue to not have a context field.
  3. Data instances with an "@context" field can be validated against a Pydantic model; the field is ignored during validation.

The solution implemented in this PR is based on the following script.
from typing import Any
import json

from pydantic import BaseModel, ConfigDict, model_validator
from pydantic.json_schema import JsonDict, JsonValue
from jsonschema import validate, Draft202012Validator
import jsonschema


def get_dict_without_context(d: Any) -> Any:
    """
    If a given object is a dictionary, return a copy of it without the
    `@context` key. Otherwise, return the input object as is.

    :param d: The given object
    :return: If the object is a dictionary, a copy of it without the `@context` key;
             otherwise, the input object as is.
    """
    if isinstance(d, dict):
        return {k: v for k, v in d.items() if k != "@context"}
    return d


def add_context(json_schema: JsonDict) -> None:
    """
    Add the `@context` key to the given JSON schema

    :param json_schema: The dictionary representing the JSON schema

    :raises ValueError: if the `@context` key is already present in the given dictionary
    """
    context_key = "@context"
    context_key_title = "@Context"
    properties: JsonDict = json_schema.get("properties", {})
    required: list[JsonValue] = json_schema.get("required", [])

    if context_key in properties or context_key in required:
        msg = f"The '{context_key}' key is already present in the given JSON schema."
        raise ValueError(msg)

    properties[context_key] = {
        "format": "uri",
        "minLength": 1,
        "title": context_key_title,
        "type": "string",
    }
    # required.append(context_key)  # Uncomment this line to make `@context` required

    # Update the schema
    # This is needed to handle the case in which the keys are newly created
    json_schema["properties"] = properties
    json_schema["required"] = required


class Foo(BaseModel):
    x: int

    # Model validator to remove the `"@context"` key from data instance before
    # "base" validation is performed.
    _remove_context_key = model_validator(mode="before")(get_dict_without_context)

    model_config = ConfigDict(extra="forbid", json_schema_extra=add_context)


json_schema_ = Foo.model_json_schema()
print(json.dumps(json_schema_, indent=2))
"""
{
  "additionalProperties": false,
  "properties": {
    "x": {
      "title": "X",
      "type": "integer"
    },
    "@context": {
      "format": "uri",
      "minLength": 1,
      "title": "@Context",
      "type": "string"
    }
  },
  "required": [
    "x"
  ],
  "title": "Foo",
  "type": "object"
}
"""

instance_json_str = '{"x": 1}'
instance_json_str_with_context = '{"@context": "not a valid URI", "x": 1}'
instance_json_str_with_extra = '{"x": 1, "e": 42}'

vv = Foo.model_validate_json(instance_json_str)
print("\n====================================")
print(f"vv: {vv!r}")
"vv: Foo(x=1)"

# The context field is ignored at the Pydantic level
vv_with_context = Foo.model_validate_json(instance_json_str_with_context)
print("\n====================================")
print(f"vv_with_context: {vv_with_context!r}")
"vv_with_context: Foo(x=1)"

# Other extra fields are disallowed at the Pydantic level
try:
    Foo.model_validate_json(instance_json_str_with_extra)
except ValueError as e:
    print("\n====================================")
    print(e)
    """
    1 validation error for Foo
    e
      Extra inputs are not permitted [type=extra_forbidden, input_value=42, input_type=int]
        For further information visit https://errors.pydantic.dev/2.9/v/extra_forbidden
    """

instance = {"@context": "https://schema.org", "x": 1}
instance_with_invalid_context = {"@context": "invalid context", "x": 1}
instance_missing_context = {"x": 1}
instance_with_extra = {"@context": "https://schema.org", "x": 1, "e": 42}

# Validate an instance with valid context and x field
validate(instance, json_schema_, format_checker=Draft202012Validator.FORMAT_CHECKER)

# Instance with invalid context fails validation
try:
    validate(
        instance_with_invalid_context,
        json_schema_,
        format_checker=Draft202012Validator.FORMAT_CHECKER,
    )
except jsonschema.exceptions.ValidationError as e:
    print("\n====================================")
    print(e)
    """
    'invalid context' is not a 'uri'
    Failed validating 'format' in schema['properties']['@context']:
        {'format': 'uri', 'minLength': 1, 'title': '@Context', 'type': 'string'}
    On instance['@context']:
        'invalid context'
    """

# The context field is optional
validate(
    instance_missing_context,
    json_schema_,
    format_checker=Draft202012Validator.FORMAT_CHECKER,
)
print("\n====================================")
print("Instance without the `@context` key is valid")
"Instance without the `@context` key is valid"


# Instance with extra field fails validation
try:
    validate(
        instance_with_extra,
        json_schema_,
        format_checker=Draft202012Validator.FORMAT_CHECKER,
    )
except jsonschema.exceptions.ValidationError as e:
    print("\n====================================")
    print(e)
    """
    Additional properties are not allowed ('e' was unexpected)
    Failed validating 'additionalProperties' in schema:
        {'additionalProperties': False,
         'properties': {'x': {'title': 'X', 'type': 'integer'},
                        '@context': {'format': 'uri',
                                     'minLength': 1,
                                     'title': '@Context',
                                     'type': 'string'}},
         'required': ['x'],
         'title': 'Foo',
         'type': 'object'}
    On instance:
        {'@context': 'https://schema.org', 'x': 1, 'e': 42}
    """

TODOs


codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 92.21%. Comparing base (6fd04fb) to head (5ca9e5e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dandischema/models.py | 85.00% | 3 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (6fd04fb) and HEAD (5ca9e5e).

HEAD has 217 uploads less than BASE

| Flag | BASE (6fd04fb) | HEAD (5ca9e5e) |
|---|---|---|
| unittests | 240 | 23 |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #266      +/-   ##
==========================================
- Coverage   97.54%   92.21%   -5.34%     
==========================================
  Files          16       16              
  Lines        1753     1772      +19     
==========================================
- Hits         1710     1634      -76     
- Misses         43      138      +95     
| Flag | Coverage Δ |
|---|---|
| unittests | 92.21% <85.00%> (-5.34%) ⬇️ |


…Dandiset`

This allows validation of data instances of `Asset` and
`Dandiset` to contain the `"@context"` key
…t` and `Dandiset`

This requires `"@context"` field at the
JSON level
This also sets the `"additionalProperties"` of each
corresponding model in JSON schema to `false`
@candleindark
Member Author

@satra @mvandenburgh This solution will generate JSON schemas that have "@context" as a property. Currently, I have made "@context" a required property, but I can make it optional. Will the additional "@context" property, required or optional, be a problem for the web UI? Can we make the web UI ignore this property if needed?

@yarikoptic
Member

are failing tests expected?

@mvandenburgh please point @candleindark to where to change URL to the json schema so some alternative one could be tested.

@mvandenburgh
Member

@candleindark the json schema used by the frontend is set here - https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/views/info.py#L11-L14
If you want to use a custom schema, you could serve out the json file locally and point it at that url.
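
For reference, serving a schema file locally could look roughly like the sketch below. It uses only the Python standard library. The directory layout (`0.6.8/dandiset.json`), the placeholder schema content, and the helper name `serve_directory` are assumptions for illustration, not something taken from dandi-archive. The sketch also fetches the file once to confirm that the URL is reachable and returns valid JSON.

```python
import functools
import http.server
import json
import tempfile
import threading
import urllib.request
from pathlib import Path


def serve_directory(directory: Path, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP on localhost in a background thread.

    `port=0` lets the OS pick a free port; pass e.g. 9999 for a fixed one.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Hypothetical layout: <root>/0.6.8/dandiset.json with placeholder content.
root = Path(tempfile.mkdtemp())
(root / "0.6.8").mkdir()
(root / "0.6.8" / "dandiset.json").write_text(json.dumps({"title": "Dandiset"}))

server = serve_directory(root)
schema_url = f"http://127.0.0.1:{server.server_address[1]}/0.6.8/dandiset.json"

# Bypass any configured HTTP proxy for this local request, then check that
# the schema is reachable and parses as JSON.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(schema_url) as resp:
    fetched = json.load(resp)

server.shutdown()
server.server_close()
```

Whether a URL served this way is reachable from the API server or the browser depends on where each of them runs, so this only verifies the file server itself.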

@candleindark
Member Author

are failing tests expected?

Yes. Because I made "@context" a required property in the JSON schemas, the tests need to be updated.
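
To illustrate why the required variant affects existing tests, here is a minimal sketch on plain dictionaries. The `required` keyword, `make_schema`, and `missing_required` are hypothetical names introduced for this example, and `missing_required` checks only the `required` list; it is not a JSON Schema validator.

```python
from typing import Any


def add_context(json_schema: dict[str, Any], *, required: bool) -> None:
    """Add the `@context` property and optionally list it under `required`."""
    properties = json_schema.setdefault("properties", {})
    properties["@context"] = {
        "format": "uri",
        "minLength": 1,
        "title": "@Context",
        "type": "string",
    }
    if required:
        json_schema.setdefault("required", []).append("@context")


def missing_required(instance: dict[str, Any], json_schema: dict[str, Any]) -> list[str]:
    """Return the names listed in `required` that are absent from `instance`."""
    return [key for key in json_schema.get("required", []) if key not in instance]


def make_schema(*, context_required: bool) -> dict[str, Any]:
    """Build a toy schema resembling the `Foo` example, then add `@context`."""
    schema: dict[str, Any] = {
        "type": "object",
        "properties": {"x": {"type": "integer"}},
        "required": ["x"],
        "additionalProperties": False,
    }
    add_context(schema, required=context_required)
    return schema


# A fixture written before `@context` existed in the schema.
old_fixture = {"x": 1}
print(missing_required(old_fixture, make_schema(context_required=True)))   # ['@context']
print(missing_required(old_fixture, make_schema(context_required=False)))  # []
```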

These are not made available in `__all__`. Use of them
triggers complaints in the IDE.
@@ -1815,6 +1867,12 @@ class Asset(BareAsset):
json_schema_extra={"readOnly": True, "nskey": "schema"}
)
Member

look here for an example on how we add extra to json schema.

Maybe you could add an extra for your @context that would state something like "includeInUI": False
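
A rough sketch of that idea, on a plain dictionary schema: the `includeInUI` key comes from the suggestion above, while the `include_in_ui` parameter and the `ui_properties` helper are hypothetical and only show how a consumer could honor the hint. Nothing here reflects what the web UI currently does.

```python
from typing import Any


def add_context(json_schema: dict[str, Any], *, include_in_ui: bool = False) -> None:
    """Add an `@context` property that carries an `includeInUI` hint."""
    properties = json_schema.setdefault("properties", {})
    if "@context" in properties:
        raise ValueError("The '@context' key is already present in the schema.")
    properties["@context"] = {
        "format": "uri",
        "minLength": 1,
        "title": "@Context",
        "type": "string",
        # Hypothetical hint; a UI would have to be taught to honor it.
        "includeInUI": include_in_ui,
    }


def ui_properties(json_schema: dict[str, Any]) -> list[str]:
    """Return the property names a UI would render, skipping hidden ones."""
    return [
        name
        for name, spec in json_schema.get("properties", {}).items()
        if spec.get("includeInUI", True)
    ]


schema = {"type": "object", "properties": {"x": {"type": "integer"}}}
add_context(schema)
print(ui_properties(schema))  # ['x']
```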

This is needed to handle the case in which the keys are newly created
dandischema/models.py (outdated, resolved)
@candleindark
Member Author

candleindark commented Dec 9, 2024

@candleindark the json schema used by the frontend is set here - https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/views/info.py#L11-L14 If you want to use a custom schema, you could serve out the json file locally and point it at that url.

@mvandenburgh I completed the initial setup, started the API server using docker compose up, and brought up the web app using yarn run dev. Both the API server and the web app seemed to be running normally. However, whenever I modified the value of schema_url in https://github.com/dandi/dandi-archive/blob/dc454f44b9e814b8a09781e9b798662e6cb8528f/dandiapi/api/views/info.py#L11-L14 to a URL of a locally served schema I got the "Connection to server failed." error. This was true even for the unmodified dandiset schema at 0.6.8, which seems to be the current default version (as schema_url = "http://localhost:9999/0.6.8/dandiset.json" or schema_url = "https://localhost:8443/0.6.8/dandiset.json").

Screenshot 2024-12-08 at 7 54 45 PM

On a related note, I was wondering if the web app/UI depends on the published-dandiset.json, asset.json, published-asset.json as well as the dandiset.json?

@candleindark
Member Author

I just tried the native development setup with dandiset.json hosted at a public URL using ngrok (and set schema_url = "https://4ed0-2600-6c50-60f0-6500-9dec-627c-49b1-f7a0.ngrok-free.app/0.6.8/dandiset.json"). I got the same result as I mentioned in the last post.

@mvandenburgh
Member

On a related note, I was wondering if the web app/UI depends on the published-dandiset.json, asset.json, published-asset.json as well as the dandiset.json

The web UI only depends on dandiset.json.

@mvandenburgh I completed the initial setup, started the API server using docker compose up, and brought up the web app using yarn run dev. Both the API server and the web app seemed to be running normally. However, whenever I modified the value of schema_url in dandi/dandi-archive@dc454f4/dandiapi/api/views/info.py#L11-L14 to a URL of a locally served schema I got the "Connection to server failed." error. This was true even for the unmodified dandiset schema at 0.6.8, which seems to be the current default version (as schema_url = "http://localhost:9999/0.6.8/dandiset.json" or schema_url = "https://localhost:8443/0.6.8/dandiset.json").

Screenshot 2024-12-08 at 7 54 45 PM

On a related note, I was wondering if the web app/UI depends on the published-dandiset.json, asset.json, published-asset.json as well as the dandiset.json?

Does the web UI always display that "Connection to server failed" message, or just when you modify schema_url? If the former, then that indicates something is going wrong with the local Django server.

@candleindark
Member Author

Does the web UI always display that "Connection to server failed" message, or just when you modify schema_url? If the former, then that indicates something is going wrong with the local Django server.

It is the latter. I got the attached landing page if no code is modified. In fact, there is no problem even if I hardcode the schema_url to a particular version. For example,

schema_url = (
    'https://raw.githubusercontent.com/dandi/schema/master/'
    f'releases/0.6.8/dandiset.json'
)

or even

schema_url = (
    'https://raw.githubusercontent.com/dandi/schema/master/'
    f'releases/0.6.7/dandiset.json'
)

The problem occurred when schema_url was set to the URLs I mentioned in previous posts (though the localhost ones are problematic if I run the Django service in a container).

Screenshot 2024-12-11 at 9 30 21 AM

To reflect that `@context` is not set to
be required
@candleindark
Member Author

candleindark commented Jan 20, 2025

This post provides an analysis of the Pydantic validation diff reports, which compare validation of the manifests against the latest version of dandischema on PyPI with validation against this PR.

Causes of Pydantic validation errors for Dandiset or PublishedDandiset instances as depicted in https://github.com/dandi/dandi-schema-status/blob/main/reports/diff_reports/dandiset/pydantic_errs_summary.md#pydantic-errs-diff are the following.

  1. ('extra_forbidden', 'Extra inputs are not permitted', ('contributor', '[*]', 'Person', 'affiliation', '[*]', 'contactPoint'))
    - Some data instances include contactPoint, an attribute not defined for Affiliation, in elements of Person.affiliation.

  2. ('extra_forbidden', 'Extra inputs are not permitted', ('contributor', '[*]', 'Person', 'affiliation', '[*]', 'includeInCitation'))
    - Some data instances include includeInCitation, an attribute not defined for Affiliation, in elements of Person.affiliation.

  3. ('extra_forbidden', 'Extra inputs are not permitted', ('contributor', '[*]', 'Person', 'affiliation', '[*]', 'roleName'))
    - Some data instances include roleName, an attribute not defined for Affiliation, in elements of Person.affiliation.

  4. ('extra_forbidden', 'Extra inputs are not permitted', ('datePublished',))
    - The draft versions of some dandiset instances contain the datePublished attribute, which is specified in PublishedDandiset but not in Dandiset. This result is to be expected according to @yarikoptic's previous analysis.

  5. ('extra_forbidden', 'Extra inputs are not permitted', ('doi',))
    - The draft versions of some dandiset instances contain the doi attribute, which is specified in PublishedDandiset but not in Dandiset. This result is as acceptable as the one in the previous point.

  6. ('extra_forbidden', 'Extra inputs are not permitted', ('publishedBy',))
    - The draft versions of some dandiset instances contain the publishedBy attribute, which is specified in PublishedDandiset but not in Dandiset. This result is as acceptable as the ones in the previous two points.

  7. ('value_error', 'Value error, Contact person must have an email address.', ('contributor', '[*]', 'Person'))
    - This error no longer appears for some dandiset instances because the "after" model validator is not executed once the "core" validation of the model has failed. For example, in the following script, the "after" validator, check_x, is not executed since x fails to be validated as a str.
    After model validator
    from pydantic import BaseModel, model_validator, ValidationError
    
    class A(BaseModel):
        x: str
    
        @model_validator(mode="after")
        def check_x(self):
            print("I am executed")
            return self
    
    
    try:
        a = A.model_validate({"x": 1})
    except ValidationError as e:
        print(e.json())
        """
        [{"type":"string_type","loc":["x"],"msg":"Input should be a valid string","input":1,"url":"https://errors.pydantic.dev/2.9/v/string_type"}]
        """

Causes of Pydantic validation errors for Asset or PublishedAsset instances as depicted in https://raw.githubusercontent.com/dandi/dandi-schema-status/refs/heads/main/reports/diff_reports/asset/pydantic_errs_summary.md are the following.

  1. ('extra_forbidden', 'Extra inputs are not permitted', ('datePublished',))
    - The draft versions of some asset instances contain the datePublished attribute, which is specified in PublishedAsset but not in Asset. This result is to be expected according to @yarikoptic's previous analysis.
  2. ('extra_forbidden', 'Extra inputs are not permitted', ('publishedBy',))
    - The draft versions of some asset instances contain the publishedBy attribute, which is specified in PublishedAsset but not in Asset. This result is as acceptable as the one in the previous point.
  3. ('extra_forbidden', 'Extra inputs are not permitted', ('xxxx',))
    - As depicted in the last table of https://raw.githubusercontent.com/dandi/dandi-schema-status/refs/heads/main/reports/diff_reports/asset/pydantic_errs_summary.md, some versions of the 000029 dataset contain an asset instance with an undefined "xxxx" attribute. (Note from @yarikoptic: he inserted it "manually" somehow, so it is legitimately "bad metadata".)
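
The error categories above are (type, message, location) triples in which list indices appear as '[*]'. As a sketch only, such categories could be derived from Pydantic-style error dictionaries (the shape returned by `ValidationError.errors()`) as shown below. The helper names `categorize` and `summarize` and the sample errors are assumptions for illustration; this is not necessarily how the reports in dandi-schema-status are produced.

```python
from collections import Counter
from typing import Any

ErrCategory = tuple[str, str, tuple[str, ...]]


def categorize(error: dict[str, Any]) -> ErrCategory:
    """Reduce one error dict to a (type, msg, loc) category.

    Integer components of `loc` (list indices) are replaced by "[*]" so that
    the same problem at different list positions falls into one category.
    """
    loc = tuple("[*]" if isinstance(part, int) else str(part) for part in error["loc"])
    return error["type"], error["msg"], loc


def summarize(errors: list[dict[str, Any]]) -> Counter:
    """Count how many errors fall into each category."""
    return Counter(categorize(error) for error in errors)


# Made-up sample errors in the shape of Pydantic error dictionaries.
errors = [
    {
        "type": "extra_forbidden",
        "msg": "Extra inputs are not permitted",
        "loc": ("contributor", 0, "Person", "affiliation", 1, "contactPoint"),
    },
    {
        "type": "extra_forbidden",
        "msg": "Extra inputs are not permitted",
        "loc": ("contributor", 3, "Person", "affiliation", 0, "contactPoint"),
    },
    {
        "type": "extra_forbidden",
        "msg": "Extra inputs are not permitted",
        "loc": ("datePublished",),
    },
]

summary = summarize(errors)
for category, count in summary.most_common():
    print(count, category)
```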

Summary

Among the categories of errors listed above, those concerning contactPoint, includeInCitation, roleName, and "xxxx" result from data instances containing undefined attributes. The remaining errors are either expected outcomes or have reasonable justifications. At the Pydantic validation level, there is no indication that the changes introduced in this PR result in any unintended behavior in the schema.

Bring in up-to-date changes from the base branch
Successfully merging this pull request may close these issues.

add "unevaluatedProperties": false, (or "additionalProperties": false,) to jsonschema dump
3 participants