-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust ItemProperties Validation. #131
Changes from 2 commits
e4a284f
ed35e75
b026dc7
dd54306
af87bcf
28461bf
bd913a8
64360f4
202e33b
2aa30eb
f4426b4
72d1ace
3e6648d
8b7881b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,25 +1,22 @@ | ||
from datetime import datetime as dt | ||
from typing import Any, Dict, List, Optional, Union | ||
from typing import Any, Dict, List, Optional | ||
|
||
from geojson_pydantic import Feature | ||
from pydantic import ( | ||
AnyUrl, | ||
ConfigDict, | ||
Field, | ||
field_serializer, | ||
model_serializer, | ||
model_validator, | ||
) | ||
|
||
from stac_pydantic.links import Links | ||
from stac_pydantic.shared import ( | ||
DATETIME_RFC339, | ||
SEMVER_REGEX, | ||
Asset, | ||
StacBaseModel, | ||
StacCommonMetadata, | ||
) | ||
from stac_pydantic.utils import parse_datetime | ||
from stac_pydantic.version import STAC_VERSION | ||
|
||
|
||
|
@@ -28,38 +25,19 @@ class ItemProperties(StacCommonMetadata): | |
https://github.com/radiantearth/stac-spec/blob/v1.0.0/item-spec/item-spec.md#properties-object | ||
""" | ||
|
||
datetime: Union[dt, str] = Field(..., alias="datetime") | ||
datetime: Optional[dt] = Field(..., alias="datetime") | ||
|
||
# Check https://docs.pydantic.dev/dev-v2/migration/#changes-to-config for more information. | ||
model_config = ConfigDict(extra="allow") | ||
|
||
@model_validator(mode="before") | ||
@classmethod | ||
def validate_datetime(cls, data: Dict[str, Any]) -> Dict[str, Any]: | ||
datetime = data.get("datetime") | ||
start_datetime = data.get("start_datetime") | ||
end_datetime = data.get("end_datetime") | ||
|
||
if not datetime or datetime == "null": | ||
if not start_datetime and not end_datetime: | ||
raise ValueError( | ||
"start_datetime and end_datetime must be specified when datetime is null" | ||
) | ||
|
||
if isinstance(datetime, str): | ||
data["datetime"] = parse_datetime(datetime) | ||
|
||
if isinstance(start_datetime, str): | ||
data["start_datetime"] = parse_datetime(start_datetime) | ||
|
||
if isinstance(end_datetime, str): | ||
data["end_datetime"] = parse_datetime(end_datetime) | ||
|
||
return data | ||
@model_validator(mode="after") | ||
def validate_datetime(self) -> "ItemProperties": | ||
if not self.datetime and (not self.start_datetime or not self.end_datetime): | ||
raise ValueError( | ||
"start_datetime and end_datetime must be specified when datetime is null" | ||
) | ||
|
||
@field_serializer("datetime") | ||
def serialize_datetime(self, v: dt, _info: Any) -> str: | ||
return v.strftime(DATETIME_RFC339) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we still need the serializer to make sure we respect the RFC3339 spec This could be one implementation (to make sure we keep microseconds, if provided) @field_serializer("datetime", "start_datetime", "end_datetime", when_used="always")
def serialize_datetime(self, v: Optional[dt], _info: Any) -> str:
if v:
if str(v)[19] == ".":
return v.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
else:
return v.strftime("%Y-%m-%dT%H:%M:%SZ") There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Assuming the datetime object has all the required values, then the pydantic output will always be valid RFC3339. But that is not always the case. Pydantic allows in time without seconds, but then the output would not be valid. If input is validated through If, however, we want to accept a wider range of inputs (
And the use of from ciso8601 import parse_rfc3339
from datetime import datetime, timezone
from pydantic import TypeAdapter
d = parse_rfc3339("2011-11-04T00:05:23.283+04:00")
d.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
# '2011-11-04T00:05:23.283000Z' <- Not actually the correct timezone
d.isoformat()
# '2011-11-04T00:05:23.283000+04:00'
d.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
# '2011-11-03T20:05:23.283000Z'
d.astimezone(timezone.utc).isoformat()
# '2011-11-03T20:05:23.283000+00:00'
TypeAdapter(datetime).dump_json(d)
# b'"2011-11-04T00:05:23.283000+04:00"'
TypeAdapter(datetime).dump_json(d.astimezone(timezone.utc))
# b'"2011-11-03T20:05:23.283000Z"' All of those are valid RFC3339 outputs, but one of them is lying about the timezone. And now looking back at https://github.com/radiantearth/stac-spec/blob/v1.0.0/item-spec/item-spec.md#properties-object seems like there is actually a bit more of an issue. The spec requires UTC on top of the RFC3339. So even with I see a few possibilities: def parse_rfc3339_asutc(value: Any) -> datetime:
return parse_rfc3339(value).astimezone(timezone.utc)
def parse_utc_rfc3339(value: Any) -> datetime:
d = parse_rfc3339(value)
if not d.tzinfo == timezone.utc:
raise ValueError("Input must be UTC")
return d
# Flexible input as long as there is a timezone. Converts to UTC. Use isoformat to get valid RFC format.
Annotated[AwareDatetime, AfterValidator(lambda d: d.astimezone(timezone.utc)), PlainSerializer(lambda d: d.isoformat())]
# Input must be RFC, any timzone. Converts to UTC. Pydantic serializer outputs valid RFC format.
Annotated[datetime, BeforeValidator(parse_rfc3339_asutc)]
# Input must be RFC and UTC. Pydantic serializer outputs valid RFC format.
Annotated[datetime, BeforeValidator(parse_utc_rfc3339)] I would probably still go with Pydantic for parsing. Gives the most flexibility on input, and can still guarantee valid output. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with @gadomski comments below in regards to "be permissive for inputs, strict on outputs." This has the advantage that any date format can be accepted including I added a possible solution in here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @thomas-maschler if going the The way it is current written in this PR is to accept a datetime with a timezone, and convert it to UTC. Rather than accept a naive value and assume UTC. Though the spec here does say it must be in UTC. So, its not a horrible thing to assume either. Tons of possible solutions. I was trying to optimize on keeping as much as possible inside the pydantic rust code and minimal python code to maintain here. Its funny how a spec trying to simplify things or add clarity can make things more complicated than necessary. |
||
return self | ||
|
||
|
||
class Item(Feature, StacBaseModel): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eseglem 👋
The issue with this is that pydantic will allow a lot of datetime format while the stac specification says they should conform with https://datatracker.ietf.org/doc/html/rfc3339#section-5.6
I've updated this portion in #133
I didn't touch the overall logic of the validation but I think we could totally have:
before
validator that convert string to datetime objectafter
check the datetime/start_datetime/end_datetime conflicNote: because of the specification, the serialized is required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My primary purpose here was to avoid messing with the input data. Also, note that
StacCommonMetadata
just usesOptional[datetime]
as well. So I was just matching up to it. Looks like #75 was where that got changed.But yes, when parsing input data, Pydantic does allow some flexibility. I think the most important part is that the output is valid RFC3339, and allowing some flexibility on input is not a bad thing. And I would just use Pydantic for it. As there is a lot of efficiency gain there not passing values back and forth. And I figured they were already being parsed elsewhere too. But if others feel strongly about not supporting it, that is not the focus of this PR.
I believe the best way to integrate your changes would be something like this:
That would avoid the issue and force the format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a middle ground possibility: Pydantic has
AwareDatetime
which would avoid inputs without any timezone info that could possibly cause some output issues.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well if you use stac-pydantic to validate item, IMO it should fails if the datetime is not a valid rfc3339 string (per specification)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I improved the logic a bit since the original comment, but there are still a few formats that are accepted but are not actually to spec. I was aiming for "be permissive for inputs, strict on outputs" with this iteration. It is a middle ground between where things were (before your updates) and completely strict.
As of right now, this will allow space and underscore in addition to T that RFC3339 specifies. While anything other than T is out of spec, it is not the end of the world. Its still 100% clear what the meaning of the string is. And in all likelihood the string will have it since that is the standard way for it to be. It will also allow for seconds, and not just ms, to be optional in the input. Again, its clear what the meaning of that string is.
The thing not addressed by any of these things so far is that RFC3339 does allow for any timezone, though says you probably should use UTC. And the STAC spec specifies it must be in UTC. Which, to me, feels excessive. As long as it follows includes the timezone the machine does not care if the string says
"2024-02-16T03:00:00+00:00"
or"2024-02-15T22:00:00-05:00"
they are the same thing.So, sure these things violate the spec, but not in any meaningful way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I kindly disagree (but I'm not using stac-pydantic a lot so I'll defer to you for the final 👍 😄)
I'm just worry that people are going to use stac-pydantic to validate Items then put them in a database or whatever and then 2 cases:
incorrect
datetime formatAgain I appreciate the time you're spending on this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not much of a user either, so I would prefer to defer to others as well. So, I went towards the permissive input strict output direction.
Those are totally valid concerns, which I also had. I tried to address them in how I implemented it. And I do not mind spending the time on it to make sure things are done well. And, if possible, in a manner that makes everyone comfortable.
Once the input data has been validated by pydantic and the object is created, they are all python datetime objects with timezone awareness which have been cast to UTC. At that point it is up to the user and whatever library they are using to interact with the database to get things correct. I am primarily experienced with sqlalchemy and psycopg2, and both of them will have no issues. And even if they are, for some reason, interacting with the database in a timezone naive manner, the datetime object is in UTC, so they should still be fine.
It will not accept these:
It is worth noting, there is a bug / regression in pydantic 2.6 with
"2024-01-01"
incorrectly being accepted, but that was patched yesterday (2024-02-15). I believe it only affects 2.6.0 and 2.6.1, but I can dig up exact versions if there is a concern about that.And the pydantic docs only really mention
T
so the space and underscore are either undocumented feature, or a bug. I'll see about digging into that more as well.Under normal usage, I do not see a way for that to happen. If pydantic validated the value then the datetime object is in UTC. And either option for serializing produces valid a valid RFC3339 string in UTC. As required by the STAC spec.
Because pydantic does not validate on assignment by default, a user could assign a bad value and then output a bad value because of that. That is not a problem unique to datetimes though, and exists for every value within any library built on top of pydantic. Unfortunately there is not really much to do about that. The guardrails are in place, and if they choose not to use them, that is outside the library's control. I do have a few other ideas that could be explored here, but it feels wrong to try and address issues with un-validated values at serialization time. And they would all come with some loss of performance.