[DISCUSSION] Dataset "layer" conventions in STAC (dashboard specific) #32

Open
leothomas opened this issue Feb 25, 2022 · 0 comments
leothomas commented Feb 25, 2022

Context/Background:

Dashboard Evolution has various datasets that the frontend (or more accurately, the configuration repo used by the frontend) is aware of. Some datasets have "layers" or "variations" but not all. The concept of a “layer” or variation (as opposed to a separate dataset) is that all layers of a dataset should have the same temporal domain and geometries and should all relate to the same underlying data “capture” event. The most classic example of this is spectral bands in satellite images. All bands are captured at the same time and place by a satellite, and all relate to the same time and place on earth, and provide variations of the same fundamental measured quantity. In the case of Dashboard Evolution, some datasets will have layers and others will not.

Dashboard Evolution specific examples of layers:

Atmospheric Datasets (NO2/CO2):

These datasets each have a single monthly average value (avg), and also a month-by-month difference from a baseline (diff), where the baseline is the average of that same month from 2005 to 2015. E.g.: Jan 2020 (avg) vs. Jan 2020 - AVG[Jan 2015, …, Jan 2005] (diff).
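As a minimal sketch of the diff calculation described above: the function name, the dictionary layout, and the numeric values are made up for illustration, and the baseline is assumed to include both 2005 and 2015.

```python
from statistics import mean


def baseline_diff(monthly_avg, year, month, baseline_years=range(2005, 2016)):
    """Return the monthly average minus the baseline for that calendar month.

    monthly_avg maps (year, month) -> average value (hypothetical layout).
    The baseline is the mean of the same month over baseline_years,
    assumed here to be 2005 through 2015 inclusive.
    """
    baseline = mean(monthly_avg[(y, month)] for y in baseline_years)
    return monthly_avg[(year, month)] - baseline


# Made-up values: every baseline January averages 10.0, January 2020 is 12.5.
monthly_avg = {(y, 1): 10.0 for y in range(2005, 2016)}
monthly_avg[(2020, 1)] = 12.5

jan_2020_diff = baseline_diff(monthly_avg, 2020, 1)
```

Here the "avg" layer for Jan 2020 would hold 12.5 and the "diff" layer would hold `jan_2020_diff`.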

CMIP6:

This is possibly the most complicated “layered” dataset, since the layers have a multi-dimensional/multi-tiered hierarchy.

Each variable has been evaluated at daily values by a dozen or so ensemble models. To reduce granularity (for user experience), the Ames researchers have provided ensemble models, which average across all the models over the entire month. There are 3 such ensemble models:

CMIP6_ensemble_median/
CMIP6_ensemble_p10/
CMIP6_ensemble_p90/

Each model has been generated in 2 different scenarios, called Shared Socio-economic Pathways, SSP (there may be more coming):

ssp245 # intermediate ("middle of the road") scenario with moderate mitigation
ssp585 # "business as usual" scenario

In this case, the layers of the CMIP6 dataset are the combinations (cross product) of the SSPs and the ensemble models (e.g. ensemble_median-ssp245, ensemble_median-ssp585, ensemble_p10-ssp245, etc.).
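The full set of CMIP6 layers can be enumerated as a cross product. This is only a sketch: the list names are illustrative, and the `<ensemble>-<ssp>` naming follows the examples given above rather than any confirmed convention.

```python
from itertools import product

# Ensemble models and scenarios listed in this issue.
ENSEMBLES = ["ensemble_median", "ensemble_p10", "ensemble_p90"]
SSPS = ["ssp245", "ssp585"]

# One layer per (ensemble, SSP) pair.
layer_names = [f"{ensemble}-{ssp}" for ensemble, ssp in product(ENSEMBLES, SSPS)]

for name in layer_names:
    print(name)
```

Adding a new SSP or ensemble model multiplies the number of layers, which is why the choice of STAC representation matters most for this dataset.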

Nightlights:

The Nightlights dataset is an example of a dataset that has no layers.

Options Considered:

  1. Each "layer" as a different asset of the same STAC Item
/collections/co2/items
>>> { "datetime": "...", "bbox": [...], "assets":{ "avg":{...}, "diff":{...} } }

This seems like the most “correct” approach in the sense that we’ve based our idea of “layers” on the idea of spectral bands, and the official Asset documentation uses spectral bands as an example:

Item has a multispectral analytic asset, a 3-band full resolution visual asset, a down-sampled preview asset, and a cloud mask asset

src: https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#asset-roles
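For illustration, an Item under the one-asset-per-layer option might be built roughly as follows. The helper name, item id, and `s3://example-bucket/...` hrefs are hypothetical. Note that in the STAC Item spec `datetime` sits under `properties`, whereas the shorthand above shows it at the top level.

```python
def make_layered_item(item_id, datetime_str, bbox, layer_hrefs):
    """Build a minimal STAC-Item-like dict with one asset per layer."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": bbox,
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": datetime_str},
        # Each layer ("avg", "diff", ...) becomes a separate asset key.
        "assets": {
            layer: {"href": href, "roles": ["data"]}
            for layer, href in layer_hrefs.items()
        },
        "links": [],
    }


item = make_layered_item(
    "co2-2020-01",  # hypothetical item id
    "2020-01-01T00:00:00Z",
    [-180.0, -90.0, 180.0, 90.0],
    {
        "avg": "s3://example-bucket/co2/2020-01-avg.tif",    # hypothetical href
        "diff": "s3://example-bucket/co2/2020-01-diff.tif",  # hypothetical href
    },
)
```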

  2. Each "layer" as its own STAC collection
/collections
>>> co2-avg, co2-diff, nightlights, cmip6-ssp245-median-ensemble, cmip6-ssp585-median-ensemble, cmip6-ssp245-p90-ensemble, cmip6-ssp585-p90-ensemble
  3. Each "layer" as a custom (filterable) property of the STAC Item
/collections/co2/items
>>> {"datetime": "...", "bbox":"...", "properties": {"custom:layer-name": "avg"}}

This approach is the most well suited to the CMIP6 dataset since it enables all models/ssps to live in the same collection and new models/ssps can be easily added as new items (ref: https://planetarycomputer.microsoft.com/dataset/nasa-nex-gddp-cmip6#Example-Notebook)
Note: The above example considers each of the 9 variables as assets of each STAC Item. While each of the 9 variables has the same temporal and geographic domain, they don’t necessarily relate to the same underlying data. I’m not sure how much of a departure this is from the intended usage of STAC assets.
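A rough sketch of the custom-property option, using plain dictionaries in place of a real STAC API: each Item carries a `custom:layer-name` property and a layer is selected by filtering on it. The item ids and the `items_for_layer` helper are made up for illustration.

```python
# In-memory stand-in for the Items of a single "co2" collection.
items = [
    {"id": "co2-2020-01-avg",
     "properties": {"datetime": "2020-01-01T00:00:00Z", "custom:layer-name": "avg"}},
    {"id": "co2-2020-01-diff",
     "properties": {"datetime": "2020-01-01T00:00:00Z", "custom:layer-name": "diff"}},
    {"id": "co2-2020-02-avg",
     "properties": {"datetime": "2020-02-01T00:00:00Z", "custom:layer-name": "avg"}},
]


def items_for_layer(items, layer_name):
    """Return only the Items whose custom layer property matches layer_name."""
    return [
        item for item in items
        if item["properties"].get("custom:layer-name") == layer_name
    ]


avg_items = items_for_layer(items, "avg")
```

Against a real API the same selection would be expressed as a property filter on the search request; adding a new layer then only means adding new Items, with no changes to existing ones.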

Discussion:

While option 1 seems the most “correct”, I’m concerned about the case of adding a new layer to an existing dataset, which would entail updating the assets key of each STAC item in the database, something we don’t yet have functionality for. Recurring ingests add further complications (e.g. the ingestion pipeline would need logic to find a STAC record to add an asset to, or to create it if it doesn’t yet exist).
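The "find a record or create it" logic mentioned above could look something like the sketch below. It uses a plain dict as a stand-in for the database, and every name here (`upsert_asset`, the key layout, the hrefs) is hypothetical rather than part of any existing pipeline.

```python
def upsert_asset(store, collection_id, datetime_str, layer, href):
    """Attach a layer asset to the matching Item, creating the Item if absent.

    store maps (collection_id, datetime_str) -> Item dict (hypothetical layout).
    """
    key = (collection_id, datetime_str)
    item = store.get(key)
    if item is None:
        # No Item for this collection and timestep yet: create a skeleton record.
        item = {
            "id": f"{collection_id}-{datetime_str}",
            "collection": collection_id,
            "properties": {"datetime": datetime_str},
            "assets": {},
        }
        store[key] = item
    # Add the new layer, or overwrite it on a repeated ingest.
    item["assets"][layer] = {"href": href}
    return item


store = {}
upsert_asset(store, "co2", "2020-01-01T00:00:00Z", "avg", "s3://example-bucket/avg.tif")
upsert_asset(store, "co2", "2020-01-01T00:00:00Z", "diff", "s3://example-bucket/diff.tif")
```

After both calls the store holds a single Item with two assets. Backfilling a new layer into an existing dataset would require running this once per existing Item.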

Option 3 is the best adapted to the CMIP6 dataset; however, it fails against an important constraint: the dashboard needs access to the date domain of each layer. With option 3, the dashboard would have to make a paginated query against the STAC API (with filters corresponding to the desired layer) and then extract the date from each result.
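To make the cost concrete, here is a sketch of what the dashboard would have to do under option 3. `fake_search` is an in-memory stand-in for a paginated STAC search, not a real client call, and the page size and integer token are arbitrary assumptions.

```python
ITEMS = [
    {"properties": {"datetime": "2020-01-01T00:00:00Z", "custom:layer-name": "avg"}},
    {"properties": {"datetime": "2020-02-01T00:00:00Z", "custom:layer-name": "avg"}},
    {"properties": {"datetime": "2020-03-01T00:00:00Z", "custom:layer-name": "avg"}},
    {"properties": {"datetime": "2020-01-01T00:00:00Z", "custom:layer-name": "diff"}},
]


def fake_search(layer_name, token=0, page_size=2):
    """Stand-in for one page of a filtered search; returns (page, next_token)."""
    matches = [i for i in ITEMS if i["properties"]["custom:layer-name"] == layer_name]
    page = matches[token:token + page_size]
    next_token = token + page_size if token + page_size < len(matches) else None
    return page, next_token


def date_domain(layer_name):
    """Walk every page for a layer and collect its datetimes."""
    dates, token = [], 0
    while token is not None:
        page, token = fake_search(layer_name, token)
        dates.extend(item["properties"]["datetime"] for item in page)
    return sorted(dates)


avg_dates = date_domain("avg")
```

The number of requests grows with the number of Items in the layer, which is the overhead the collection-level approach avoids.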

@Alexandra K has already worked on an implementation for a custom date domain query by adding a domain key to the collection level summary - this requires each layer to be its own collection.

Decision:

For the time being we will go with option 2, and ingest each “layer” as a separate collection. This will create a large number of collections (especially in the case of CMIP6), which may negatively affect discoverability through the STAC API. We can mitigate this with custom dataset collections that group all layer-level collections for the sake of discoverability (for usage outside of the raster API).
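One possible shape for that mitigation, sketched with the collection ids listed under option 2. It assumes the dataset name is whatever precedes the first hyphen in the collection id, which is a naming assumption for this example and not an agreed convention.

```python
LAYER_COLLECTIONS = [
    "co2-avg",
    "co2-diff",
    "nightlights",
    "cmip6-ssp245-median-ensemble",
    "cmip6-ssp585-median-ensemble",
    "cmip6-ssp245-p90-ensemble",
    "cmip6-ssp585-p90-ensemble",
]


def group_by_dataset(collection_ids):
    """Group layer-level collection ids under a dataset-level key."""
    groups = {}
    for collection_id in collection_ids:
        # Assumption: the dataset name is the prefix before the first hyphen.
        dataset = collection_id.split("-", 1)[0]
        groups.setdefault(dataset, []).append(collection_id)
    return groups


dataset_groups = group_by_dataset(LAYER_COLLECTIONS)
```

A dataset-level collection could then link to each of its layer collections, so that users browsing the API see three datasets rather than seven or more collections.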

Perhaps at a later date, once the datasets to ingest have stabilized, we can consider switching to an asset-based model (option 1).

3 participants