Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust ItemProperties Validation. #131

Conversation

eseglem
Copy link
Contributor

@eseglem eseglem commented Dec 7, 2023

Fixes #130.

Rather than using python to parse / validate the values first, this lets pydantic do its thing and then ensures the fields are set properly. This means it no longer modifies the input values in place and is significantly faster.

Previously it was set as Union[dt, str] which I believe is not actually valid based on the spec. And does not match the other datetime fields in the library. As with the others, using Optional[dt] allows for the input null value, while requiring any string to be a datetime.

Before:
image

After:
image

Yes, the test_pydantic_model_validate is "slower" between the two, but that is because in the initial benchmarks it was modifying the input datetime in place and essentially cheating by already having it as a datetime for all the benchmarks.

Otherwise the test_pydantic_orjson_model_validate and test_pydantic_model_validate_json are much faster than they were before.

Benchmarks gist is here: https://gist.github.com/eseglem/ddae063eefab545c122c93a2afc6cb86

if not self.datetime and (not self.start_datetime or not self.end_datetime):
raise ValueError(
"start_datetime and end_datetime must be specified when datetime is null"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eseglem 👋
The issue with this is that pydantic will allow a lot of datetime format while the stac specification says they should conform with https://datatracker.ietf.org/doc/html/rfc3339#section-5.6

I've updated this portion in #133

I didn't touch the overall logic of the validation but I think we could totally have:

  • before validator that convert string to datetime object
  • after check the datetime/start_datetime/end_datetime conflic

Note: because of the specification, the serialized is required

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My primary purpose here was to avoid messing with the input data. Also, note that StacCommonMetadata just uses Optional[datetime] as well. So I was just matching up to it. Looks like #75 was where that got changed.

But yes, when parsing input data, Pydantic does allow some flexibility. I think the most important part is that the output is valid RFC3339, and allowing some flexibility on input is not a bad thing. And I would just use Pydantic for it. As there is a lot of efficiency gain there not passing values back and forth. And I figured they were already being parsed elsewhere too. But if others feel strongly about not supporting it, that is not the focus of this PR.

I believe the best way to integrate your changes would be something like this:

# Put this in shared maybe?
rfc3339_datetime = Annotated[datetime, BeforeValidator(parse_rfc3339)]

class ItemProperties(StacCommonMetadata):
    datetime: Optional[rfc3339_datetime] = Field(..., alias="datetime")

That would avoid the issue and force the format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a middle ground possibility: Pydantic has AwareDatetime which would avoid inputs without any timezone info that could possibly cause some output issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yes, when parsing input data, Pydantic does allow some flexibility. I think the most important part is that the output is valid RFC3339, and allowing some flexibility on input is not a bad thing.

well if you use stac-pydantic to validate item, IMO it should fails if the datetime is not a valid rfc3339 string (per specification)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I improved the logic a bit since the original comment, but there are still a few formats that are accepted but are not actually to spec. I was aiming for "be permissive for inputs, strict on outputs" with this iteration. It is a middle ground between where things were (before your updates) and completely strict.

As of right now, this will allow space and underscore in addition to T that RFC3339 specifies. While anything other than T is out of spec, it is not the end of the world. Its still 100% clear what the meaning of the string is. And in all likelihood the string will have it since that is the standard way for it to be. It will also allow for seconds, and not just ms, to be optional in the input. Again, its clear what the meaning of that string is.

The thing not addressed by any of these things so far is that RFC3339 does allow for any timezone, though says you probably should use UTC. And the STAC spec specifies it must be in UTC. Which, to me, feels excessive. As long as it follows includes the timezone the machine does not care if the string says "2024-02-16T03:00:00+00:00" or "2024-02-15T22:00:00-05:00" they are the same thing.

So, sure these things violate the spec, but not in any meaningful way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I kindly disagree (but I'm not using stac-pydantic a lot so I'll defer to you for the final 👍 😄)

I'm just worry that people are going to use stac-pydantic to validate Items then put them in a database or whatever and then 2 cases:

  • the database will throw an error because the datetime is invalid
  • serve the items to the user with the incorrect datetime format

Again I appreciate the time you're spending on this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not much of a user either, so I would prefer to defer to others as well. So, I went towards the permissive input strict output direction.

Those are totally valid concerns, which I also had. I tried to address them in how I implemented it. And I do not mind spending the time on it to make sure things are done well. And, if possible, in a manner that makes everyone comfortable.


the database will throw an error because the datetime is invalid

Once the input data has been validated by pydantic and the object is created, they are all python datetime objects with timezone awareness which have been cast to UTC. At that point it is up to the user and whatever library they are using to interact with the database to get things correct. I am primarily experienced with sqlalchemy and psycopg2, and both of them will have no issues. And even if they are, for some reason, interacting with the database in a timezone naive manner, the datetime object is in UTC, so they should still be fine.

UtcDatetimeAdapter = TypeAdapter(UtcDatetime)
EXPECTED = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
all(
    UtcDatetimeAdapter.validate_strings(value) == EXPECTED
    for value in (
        # Valid RFC3339 datetime strings in UTC
        "2024-01-01T00:00:00Z",
        "2024-01-01t00:00:00z",
        "2024-01-01T00:00:00.0Z",
        "2024-01-01t00:00:00.00000z",
        "2024-01-01T00:00:00+00:00",
        "2024-01-01t00:00:00+00:00",
        # Other valid RFC3339 datetime strings
        "2023-12-31T19:00:00-05:00",
        "2024-01-01T01:00:00+01:00",
        # Other formats pydantic accepts
        "2024-01-01T00:00Z",  # No seconds
        "2024-01-01 00:00:00Z",  # Space
        "2024-01-01_00:00:00Z",  # Underscore
    )
)
# True

It will not accept these:

"2024-01-0100:00:00Z",  # No T
"2024-01-01T00:00:00",  # No timezone
"20240101T000000Z",  # No separators
"20240101000000+0000",  # No separators
"2024-01-01T00Z",  # No minutes

It is worth noting, there is a bug / regression in pydantic 2.6 with "2024-01-01" incorrectly being accepted, but that was patched yesterday (2024-02-15). I believe it only affects 2.6.0 and 2.6.1, but I can dig up exact versions if there is a concern about that.

And the pydantic docs only really mention T so the space and underscore are either undocumented feature, or a bug. I'll see about digging into that more as well.


serve the items to the user with the incorrect datetime format

Under normal usage, I do not see a way for that to happen. If pydantic validated the value then the datetime object is in UTC. And either option for serializing produces valid a valid RFC3339 string in UTC. As required by the STAC spec.

EXPECTED.isoformat()
# '2024-01-01T00:00:00+00:00'
TypeAdapter(datetime).dump_json(EXPECTED)
# b'"2024-01-01T00:00:00Z"'

Because pydantic does not validate on assignment by default, a user could assign a bad value and then output a bad value because of that. That is not a problem unique to datetimes though, and exists for every value within any library built on top of pydantic. Unfortunately there is not really much to do about that. The guardrails are in place, and if they choose not to use them, that is outside the library's control. I do have a few other ideas that could be explored here, but it feels wrong to try and address issues with un-validated values at serialization time. And they would all come with some loss of performance.


@field_serializer("datetime")
def serialize_datetime(self, v: dt, _info: Any) -> str:
return v.strftime(DATETIME_RFC339)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need the serializer to make sure we respect the RFC3339 spec

This could be one implementation (to make sure we keep microseconds, if provided)

    @field_serializer("datetime", "start_datetime", "end_datetime", when_used="always")
    def serialize_datetime(self, v: Optional[dt], _info: Any) -> str:
        if v:
            if str(v)[19] == ".":
                return v.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            else:
                return v.strftime("%Y-%m-%dT%H:%M:%SZ")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the datetime object has all the required values, then the pydantic output will always be valid RFC3339. But that is not always the case. Pydantic allows in time without seconds, but then the output would not be valid. If input is validated through parse_rfc3339 then it would be fine to use no serializer and let pydantic handle it.

If, however, we want to accept a wider range of inputs (AwareDatetime) then a serializer is probably necessary. I believe the best solution would actually be to do .isoformat() on the datetime object, not an .strftime. As RFC3339 allows any timezone.

   time-numoffset  = ("+" / "-") time-hour ":" time-minute
   time-offset     = "Z" / time-numoffset

And the use of .strftime("%Y-%m-%dT%H:%M:%S.%fZ") is not guaranteed to be correct. Depending on the input timezone.

from ciso8601 import parse_rfc3339
from datetime import datetime, timezone
from pydantic import TypeAdapter
d = parse_rfc3339("2011-11-04T00:05:23.283+04:00")
d.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
# '2011-11-04T00:05:23.283000Z' <- Not actually the correct timezone
d.isoformat()
# '2011-11-04T00:05:23.283000+04:00'
d.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
# '2011-11-03T20:05:23.283000Z'
d.astimezone(timezone.utc).isoformat()
# '2011-11-03T20:05:23.283000+00:00'
TypeAdapter(datetime).dump_json(d)
# b'"2011-11-04T00:05:23.283000+04:00"'
TypeAdapter(datetime).dump_json(d.astimezone(timezone.utc))
# b'"2011-11-03T20:05:23.283000Z"'

All of those are valid RFC3339 outputs, but one of them is lying about the timezone.

And now looking back at https://github.com/radiantearth/stac-spec/blob/v1.0.0/item-spec/item-spec.md#properties-object seems like there is actually a bit more of an issue. The spec requires UTC on top of the RFC3339. So even with parse_rfc3339 that part is not being enforced on the input. So it would depend on if that needs to be enforced on input or not.

I see a few possibilities:

def parse_rfc3339_asutc(value: Any) -> datetime:
    return parse_rfc3339(value).astimezone(timezone.utc)

def parse_utc_rfc3339(value: Any) -> datetime:
    d = parse_rfc3339(value)
    if not d.tzinfo == timezone.utc:
        raise ValueError("Input must be UTC")
    return d

# Flexible input as long as there is a timezone. Converts to UTC. Use isoformat to get valid RFC format. 
Annotated[AwareDatetime, AfterValidator(lambda d: d.astimezone(timezone.utc)), PlainSerializer(lambda d: d.isoformat())]
# Input must be RFC, any timzone. Converts to UTC. Pydantic serializer outputs valid RFC format.
Annotated[datetime, BeforeValidator(parse_rfc3339_asutc)]
# Input must be RFC and UTC. Pydantic serializer outputs valid RFC format.
Annotated[datetime, BeforeValidator(parse_utc_rfc3339)]

I would probably still go with Pydantic for parsing. Gives the most flexibility on input, and can still guarantee valid output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @gadomski comments below in regards to "be permissive for inputs, strict on outputs."
Pystac solution for this uses dateutil.parser for converting strings and setting datetimes with missing timezone to UTC when parsing to json.

This has the advantage that any date format can be accepted including 2020, 2020-01 and 2020-01-01.

I added a possible solution in here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomas-maschler if going the dateutil.parser route, it would probably make sense to just let pydantic do its thing instead. It should be more performant.

The way it is current written in this PR is to accept a datetime with a timezone, and convert it to UTC. Rather than accept a naive value and assume UTC. Though the spec here does say it must be in UTC. So, its not a horrible thing to assume either. Tons of possible solutions. I was trying to optimize on keeping as much as possible inside the pydantic rust code and minimal python code to maintain here.

Its funny how a spec trying to simplify things or add clarity can make things more complicated than necessary.

@eseglem
Copy link
Contributor Author

eseglem commented Feb 9, 2024

And probably worth a ping to @gadomski for any thoughts as well.

@gadomski
Copy link
Member

gadomski commented Feb 9, 2024

And probably worth a ping to @gadomski for any thoughts as well.

I apologize but I don't have the bandwidth right now for a full dive-in, but I generally am in favor of "be permissive for inputs, strict on outputs." I'm not as familiar with pydantic as either yourself or @vincentsarago, so I'll have to defer on the correct spot to transition from permissiveness-to-strictness (whether that's on serialization or deserialization).

@@ -15,9 +22,16 @@

SEMVER_REGEX = r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"

# https://tools.ietf.org/html/rfc3339#section-5.6
# Unused, but leaving it here since it's used by dependencies
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed from the tests. And, if there are downstream users which use this outside of that, they should do their own formatting, as there are issues with this format. Its missing ms and the Z part is not always correct.

@eseglem
Copy link
Contributor Author

eseglem commented Feb 14, 2024

@vincentsarago Hopefully all make sense. Let me know if there are concerns. Tests are still passing for me, so should be good on that front.

@codecov-commenter
Copy link

codecov-commenter commented Feb 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (main@c03272e). Click here to learn what that means.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #131   +/-   ##
=======================================
  Coverage        ?   96.67%           
=======================================
  Files           ?       26           
  Lines           ?      572           
  Branches        ?        0           
=======================================
  Hits            ?      553           
  Misses          ?       19           
  Partials        ?        0           
Flag Coverage Δ
unittests 96.67% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

AfterValidator(lambda d: d.astimezone(timezone.utc)),
# Use `isoformat` to serialize the value in an RFC3339 compatible format
# for example: "2024-01-01T00:00:00+00:00"
PlainSerializer(lambda d: d.isoformat()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the tests pass without this serializer. I was thinking there was a case where the seconds were not included in the pydantic serializer, but I cannot reproduce that right now. I will try run through some more testing and see what I can figure out. There may be differences in pydantic versions.

Would prefer to leave it the pydantic serializer unless absolutely necessary. Also has the advantage of going back to Z over +00:00.

@vincentsarago
Copy link
Member

@eseglem I'm in favour of moving forward here and for this to be published on 3.1

Can we resolve the conflicts and update the changelog?

Maybe a quick addition in the Usage section of the Readme will be a good idea as well

@vincentsarago
Copy link
Member

friendly ping @eseglem 😄

@vincentsarago vincentsarago requested review from thomas-maschler, gadomski and vincentsarago and removed request for thomas-maschler April 24, 2024 14:38
@vincentsarago
Copy link
Member

@thomas-maschler @gadomski can I have your 👍 before we merge this 🙏

Copy link
Contributor

@thomas-maschler thomas-maschler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. I left two comments about validators and and moving datetime to StacCommonMetadata. This can be addressed in follow up PRs

My personal preference would be to be more permissive and convert naive datetime to UTC on input but we can try and see how others perceive this.

stac_pydantic/item.py Show resolved Hide resolved
Comment on lines 138 to 139
start_datetime: Optional[UtcDatetime] = None
end_datetime: Optional[UtcDatetime] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, the datetime/ start_datetime-end_datetime validator should live here.
It must verify that either datetime OR start_datetime AND end_datetime are set.

Copy link
Contributor Author

@eseglem eseglem May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear. You are saying to move the datetime field to here along with the corresponding validator. Correct? Reading through your link, that does appear to be what the spec is looking for.

It also appears to be missing license from here. That would be a separate issue, but I wanted to at least point it out.

Edit: I ended up at this comment in the code review screen without the overall context. This is much clearer now from the PR screen.

stac_pydantic/api/search.py Show resolved Hide resolved
@eseglem
Copy link
Contributor Author

eseglem commented May 13, 2024

@vincentsarago Sorry I have not been responsive lately, been focused on other work. Trying to catch back up now. Is there anything you still need from me on this?

@eseglem
Copy link
Contributor Author

eseglem commented May 13, 2024

Pushed two tweaks I had been messing with. Can roll them back if preferred.

I also pushed a commit moving datetime from ItemProperties to StacCommonMetadata to a separate branch: f4426b4

It was not quite as simple / clean as I initially thought, so I wanted to allow for a separate conversation if necessary. Can create a PR from it after this, or add it here. Whatever works.

@vincentsarago
Copy link
Member

@eseglem thanks 🙏

feel free to push f4426b4 in this branch 👍

It also appears to be missing license from here. That would be a separate issue, but I wanted to at least point it out.

feel also free to add this here

@vincentsarago
Copy link
Member

FYI I pushed f4426b4 in this branch

model_config = ConfigDict(
populate_by_name=True, use_enum_values=True, extra="allow"
)

@model_validator(mode="after")
def validate_datetime_or_start_end(self) -> Self:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well IMO the Asset model should not herit from the StacCommonMetadata but from StacBaseModel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do this in another PR

@vincentsarago vincentsarago merged commit 7b1cb44 into stac-utils:main May 20, 2024
5 checks passed
@vincentsarago
Copy link
Member

thanks a ton @eseglem

excellent work as usual 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ItemProperties modifies input data in place
5 participants