Transfer docs on metadata for collections and items #4

Closed
88 changes: 88 additions & 0 deletions new-collections.md
@@ -0,0 +1,88 @@
<!-- Dashboard-specific notes that supplement the full [stac-api collection specification](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md). Note that there is no schema enforcement on the collection table content in pgstac—this provides flexibility but also requires caution when creating and modifying Collections.

>STATUS: revised with review comments [2022-03-20]

>STATUS: under review [2022-02-24] -->

## Collection field, extension, and naming recommendations
| **Field &/or Extension** | **Recommendations** |
| --- | --- |
| **id** | If the dataset exists in NASA's Earthdata (or, presumably, in another data provider's catalog such as ESA's), use that ID. If appropriate, add a suffix for any additional processing that has been performed, e.g. "OMSO2PCA_cog". If the dataset is not from NASA's Earthdata, use a human-readable name with underscores, e.g. "facebook_population_density". |
| **dashboard extension** | To support the delta-ui we have added two new fields in a proposed dashboard extension. For now we are just adding the fields but after testing things out, we can formalize the extension with a [hosted json schema](https://github.com/stac-extensions/template). **_Dashboard extension properties are only required for collections that will be viewed in the delta-ui dashboard._** |
| **dashboard:is_periodic** | `True/False` This boolean is used when summarizing the collection—if the collection is periodic, the temporal range of the items in the collection and the time density are all the front end needs to generate a time picker. If the items in the collection are not periodic, a complete list of the unique item datetimes is needed. |
| **dashboard:time_density** | `year`, `month`, `day`, `hour`, `minute`, or `null`. These time steps should be treated as an enum when the extension is formalized. For collections with a single time snapshot this value is `null`. |
| **item_assets** | [stac-extension/item_assets](https://github.com/stac-extensions/item-assets/blob/main/README.md) is used to explain the assets that are provided for each item in the collection. We’re not providing thumbnails yet, but the example below includes a thumbnail asset to illustrate how the extension will be used. This property is not populated automatically; the creator of the collection writes the item assets documentation. **_Item assets are only required for collections that will be viewed in the delta-ui dashboard._** |
| **summaries**| The implementation of this [core stac-spec](https://github.com/radiantearth/stac-api-spec/blob/master/stac-spec/collection-spec/collection-spec.md#summaries) field is use-case specific. Our implementation is intended to support the dashboard and will supply datetime and raster statistics for the default map layer asset across the entire collection. **_Currently summaries are manually updated with a delta-ui specific [user defined function in pgstac](https://github.com/NASA-IMPACT/delta-backend/issues/31)._** |
| **title and description** | Use these properties to provide specific information about the collection to API users and catalog browsers. These properties correspond to [dataset name and info in the covid-api](https://github.com/NASA-IMPACT/covid-api/blob/develop/covid_api/db/static/datasets/no2-diff.json). The delta dashboard will use delta-config to set these values in the UI, so the information in our STAC collections is for data curators and API users. |
| **collection name style choices** | Prefer lower-case kebab-case collection names. Decision: Should names align with the underlying data identifiers, or should they be interpreted names? `omi-trno2-dhrm` and `omi-trno2-dhrm-difference` vs `no2-monthly` and `no2-monthly-diff`; `bmhd-30m-monthly` vs `nightlights-hd-monthly` |
| **license** | Use an [SPDX license id](https://spdx.org/licenses/). The license is likely available in CMR, but we may need to research other sources of data. Default open license: `CC0-1.0` |
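
The dashboard extension fields described in the table can be checked before a Collection is loaded, which matters because pgstac does not enforce a schema. The following is a minimal sketch, not part of the delta-backend code: the function name `check_dashboard_fields` is hypothetical, and it assumes the two `dashboard:` fields sit at the top level of the Collection JSON.

```python
# Allowed values for dashboard:time_density, taken from the table above.
# None corresponds to JSON null (a collection with a single time snapshot).
ALLOWED_TIME_DENSITIES = ("year", "month", "day", "hour", "minute", None)


def check_dashboard_fields(collection: dict) -> list:
    """Return a list of problems found in the dashboard extension fields.

    Assumption: both fields are stored at the top level of the Collection.
    """
    problems = []

    # dashboard:is_periodic must be a real boolean, not the string "True".
    if not isinstance(collection.get("dashboard:is_periodic"), bool):
        problems.append("dashboard:is_periodic must be true or false")

    # dashboard:time_density must be present; null is a valid value.
    if "dashboard:time_density" not in collection:
        problems.append("dashboard:time_density is missing")
    elif collection["dashboard:time_density"] not in ALLOWED_TIME_DENSITIES:
        problems.append(
            "dashboard:time_density must be one of "
            "year, month, day, hour, minute, or null"
        )

    return problems
```

An empty list means both fields look usable by the dashboard; anything else can be reported to the person creating the Collection.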

**item_assets example**

```json
"item_assets": {
    "thumbnail": {
        "type": "image/jpeg",
        "roles": [
            "thumbnail"
        ],
        "title": "Thumbnail",
        "description": "A medium sized thumbnail"
    },
    "cog_default": {
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "roles": [
            "data",
            "layer"
        ],
        "title": "Default COG Layer",
        "description": "Cloud optimized default layer to display on map"
    }
}
```

**summaries example for periodic collection**

```json
"summaries": {
    "datetime": ["2016-01-01T00:00:00Z", "2022-01-01T00:00:00Z"],
    "cog_default": {
        "max": 50064805976866820,
        "min": -6618294421291008
    }
}
```

**summaries example for non-periodic collection**

```json
"summaries": {
    "datetime": [
        "2020-01-01T00:00:00Z",
        "2020-02-01T00:00:00Z",
        "2020-03-01T00:00:00Z",
        "2020-04-01T00:00:00Z",
        "2020-05-01T00:00:00Z",
        "2020-06-01T00:00:00Z",
        "2020-07-01T00:00:00Z",
        "2020-08-01T00:00:00Z",
        "2020-09-01T00:00:00Z",
        "2020-10-01T00:00:00Z",
        "2020-11-01T00:00:00Z",
        "2020-12-01T00:00:00Z",
        "2021-01-01T00:00:00Z",
        "2021-02-01T00:00:00Z",
        "2021-03-01T00:00:00Z",
        "2021-04-01T00:00:00Z",
        "2021-05-01T00:00:00Z",
        "2021-06-01T00:00:00Z",
        "2021-07-01T00:00:00Z",
        "2021-08-01T00:00:00Z",
        "2021-09-01T00:00:00Z"
    ],
    "cog_default": {
        "max": 255,
        "min": 0
    }
}
```
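
The two summaries shapes above differ only in how `datetime` is reported: a `[start, end]` range for periodic collections and the full list of unique datetimes otherwise. The sketch below illustrates that logic in Python. It is not the pgstac user defined function referenced in the table; `build_summaries` and its arguments are hypothetical names, and it assumes timezone-aware datetimes and one `(min, max)` pair per item for the default asset.

```python
from datetime import datetime, timezone


def _format_utc(dt: datetime) -> str:
    """Format a timezone-aware datetime like the examples above."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_summaries(item_datetimes, band_ranges, is_periodic,
                    asset_name="cog_default"):
    """Build a summaries dict for a collection.

    item_datetimes: iterable of timezone-aware datetimes, one per item.
    band_ranges: iterable of (min, max) tuples for the default asset.
    """
    unique = sorted(set(item_datetimes))
    ranges = list(band_ranges)
    if not unique or not ranges:
        raise ValueError("at least one item is required")

    if is_periodic:
        # Periodic: the temporal range is enough for the time picker.
        datetimes = [_format_utc(unique[0]), _format_utc(unique[-1])]
    else:
        # Non-periodic: the front end needs every unique datetime.
        datetimes = [_format_utc(dt) for dt in unique]

    mins, maxs = zip(*ranges)
    return {
        "datetime": datetimes,
        asset_name: {"max": max(maxs), "min": min(mins)},
    }
```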
85 changes: 85 additions & 0 deletions new-items.md
@@ -0,0 +1,85 @@
<!-- This document is intended to define a set of conventions for generating STAC Items consistently for the dashboard UI and future API users. After the conventions are reviewed and finalized these should represent the minimum metadata API users can expect from the backend.
>STATUS: under review -->

## Rio-stac conventions for generating STAC Items
We use [rio-stac](https://developmentseed.org/rio-stac/) to generate item metadata for COGs, so the notes below are organized around the input parameters of the [create_stac_item](https://developmentseed.org/rio-stac/api/rio_stac/stac/#create_stac_item) function.

**example rio-stac python usage**

```python
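from rio_stac.stac import create_stac_item

# item_id, collection_id, obj (an S3 object), and item_datetime
# (a datetime.datetime for the item) are supplied by the caller.
item = create_stac_item(
    id=item_id,
    source=f"s3://{obj.bucket_name}/{obj.key}",
    collection=collection_id,
    input_datetime=item_datetime,
    with_proj=True,  # add projection extension metadata
    with_raster=True,  # add per-band raster statistics
    asset_name="cog_default",
    asset_roles=["data", "layer"],
    asset_media_type="image/tiff; application=geotiff; profile=cloud-optimized",
)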
```

**Rio-stac create item parameter recommendations**

These recommendations are for generating STAC Item metadata for collections intended for the dashboard and may not be applicable to all ARCO collections.

| Parameter | **Recommendations** |
| --- | --- |
| **id** | (1) When STAC Item metadata is generated from a COG file, strip the full file extension from the filename for the item id. (2) When ids are not unique across collections, append the collection id to the item id. For example, the no2-monthly and no2-monthly-diff COGs are stored with unique bucket prefixes, but within each prefix the filenames are the same, so the collection id is appended: `OMI_trno2_0.10x0.10_201604_Col3_V4` → `OMI_trno2_0.10x0.10_201604_Col3_V4-no2-monthly`. |
| **with_proj** | `True`. Generate projection extension metadata for the item for future ARCO datastore users. |
| **with_raster** | `True`. This generates GDAL statistics for every band in the COG; we use these to get the range of values for the full collection. |
| **asset_name** | A meaningful asset name for the default cloud-optimized asset to be displayed on a map. `cog_default` is a placeholder; we need to choose and commit to an asset name for all collections. If not set, the name defaults to `asset`. **TODO Decision:** For items with many assets, we should ingest all of them with appropriate keys and duplicate one preferred display asset as the default COG. We should also consider the [metadata conventions in titiler-pgstac](https://github.com/stac-utils/titiler-pgstac/issues/30). |
| **asset_roles** | `["data", "layer"]`. `data` is an appropriate role; we may also choose to add something like `layer` to indicate that the asset is optimized to be used as a map layer ([stac specification for asset roles](https://github.com/radiantearth/stac-api-spec/blob/master/stac-spec/item-spec/item-spec.md#asset-role-types)). |
| **asset_media_type** | `"image/tiff; application=geotiff; profile=cloud-optimized"` ([stac best practices for asset media type](https://github.com/radiantearth/stac-api-spec/blob/master/stac-spec/best-practices.md#working-with-media-types)). |
| **properties** | CMIP6: TODO. CMR: TODO. If we don’t store links to the original data, downstream users will not be able to pair STAC records with the versioned parent data in CMR. |
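
The **id** recommendation can be sketched as a small helper. This is an illustration only: `item_id_from_key` and the `KNOWN_EXTENSIONS` tuple are hypothetical, and the list of extensions is an assumption. A fixed list is used because ids such as `OMI_trno2_0.10x0.10_201604_Col3_V4` contain dots of their own, so splitting on every dot would truncate the id.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed set of COG file extensions, longest first so that a compound
# extension such as ".cog.tif" is stripped in full.
KNOWN_EXTENSIONS = (".cog.tiff", ".cog.tif", ".tiff", ".tif")


def item_id_from_key(key: str, collection_id: Optional[str] = None) -> str:
    """Derive an item id from an object key.

    Strips the full file extension; if collection_id is given, appends it
    so ids stay unique when filenames repeat across collections.
    """
    name = PurePosixPath(key).name  # drop the bucket prefix
    lowered = name.lower()
    for ext in KNOWN_EXTENSIONS:
        if lowered.endswith(ext):
            name = name[: -len(ext)]
            break
    return f"{name}-{collection_id}" if collection_id else name
```

Pass `collection_id` only for collections whose filenames collide with another collection, as in the no2-monthly case described in the table.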


## Data provenance convention
When adding STAC Items that were derived from previously published data (such as CMR records), there are multiple ways to preserve the linkage between the item and the more complete source metadata. At a minimum, we should provide metadata assets for any items derived from previously published data. Here are three examples from HLS:

**metadata are assets**

The CMR properties question in the table above (how to refer the STAC Item to its CMR source metadata) could instead be solved by adding a metadata asset. This does not require creating a new extension for CMR; it just involves creating an asset from the CMR granule metadata, which should be in the event context for CMR search-driven ingests. The example below is from [documentation for using HLS cloud optimized data](https://lpdaac.usgs.gov/resources/e-learning/getting-started-with-cloud-native-harmonized-landsat-sentinel-2-hls-data-in-r/).
```json
"assets": {
    "metadata": {
        "href": "https://cmr.earthdata.nasa.gov/search/concepts/G2099379244-LPCLOUD.xml",
        "type": "application/xml"
    },
    "thumbnail": { ...}
}
```

**stac-spec [scientific extension](https://github.com/stac-extensions/scientific)**

```json
"properties": {
    "sci:doi": "10.5067/HLS/HLSS30.002",
    ...
}
```

**Item links to metadata**

Use a `cite-as` Item link to the DOI for the source data.
```json
"links": [
    {
        "rel": "cite-as",
        "href": "https://doi.org/10.5067/HLS/HLSS30.002"
    },
    ...
]
```
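
The three options above can be applied to an item programmatically. The sketch below works on a plain item dict and applies all three at once purely for illustration; which of them to adopt is still the decision described in this section. `add_provenance` is a hypothetical helper, and the `application/xml` media type is carried over from the HLS example.

```python
import copy


def add_provenance(item: dict, cmr_metadata_href: str, doi: str) -> dict:
    """Return a copy of the item with source-metadata linkage added."""
    updated = copy.deepcopy(item)  # leave the caller's item untouched

    # Option 1: a metadata asset pointing at the CMR granule record.
    updated.setdefault("assets", {})["metadata"] = {
        "href": cmr_metadata_href,
        "type": "application/xml",
    }

    # Option 2: the scientific extension DOI property.
    updated.setdefault("properties", {})["sci:doi"] = doi

    # Option 3: a cite-as link to the DOI, added only once.
    cite_link = {"rel": "cite-as", "href": f"https://doi.org/{doi}"}
    links = updated.setdefault("links", [])
    if cite_link not in links:
        links.append(cite_link)

    return updated
```

If the `sci:doi` property is used, the scientific extension schema also needs to be declared in the item's `stac_extensions` list; that step is left out here.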


## STAC Item validation convention

We are producing [pystac.Item](https://pystac.readthedocs.io/en/stable/api/item.html) objects with rio-stac’s create_stac_item function, and we should validate them before publishing them to S3. Testing found that it is possible to produce structurally sound but invalid STAC Items with create_stac_item.

The built-in pystac validator on the pystac.Item returned by create_stac_item can be used to validate the metadata: `item.validate()` raises an exception for invalid metadata. Pystac does need to be [installed with the appropriate dependencies for validation](https://pystac.readthedocs.io/en/stable/api.html?highlight=validation#validation).
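
One way to apply this before publishing is to split a batch of items into those that pass `validate()` and those that do not. The helper below is a hedged sketch: `split_valid_items` is a hypothetical name, it accepts any object exposing a `validate()` method (such as a pystac Item), and it catches `Exception` broadly so the sketch has no pystac dependency. With pystac installed, the exception raised for invalid metadata would typically be `pystac.errors.STACValidationError`.

```python
def split_valid_items(items):
    """Separate items that pass validate() from those that raise.

    Returns (valid, invalid), where invalid is a list of
    (item, error message) tuples that should not be published.
    """
    valid, invalid = [], []
    for item in items:
        try:
            item.validate()
        except Exception as exc:  # narrow to the pystac error in real use
            invalid.append((item, str(exc)))
        else:
            valid.append(item)
    return valid, invalid
```

Only the `valid` list would then be written to S3, and the `invalid` list can be logged for the data curator.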


## Convention for default map layer assets for spectral data
Many of the collections for the dashboard have a clear default map layer asset that we can name `cog_default`. This convention does not map as well to spectral data with many assets (B01, B02, ...). For consistency, a preferred band asset could be duplicated to define a default map layer asset, but this still needs to be decided.
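
If the duplication approach were chosen, it could look roughly like the sketch below, which copies a preferred band asset under the default key on a plain item dict. This is an illustration of an undecided option, not an agreed convention: `with_default_asset` is a hypothetical helper, and adding the `data` and `layer` roles to the copy is an assumption based on the asset_roles recommendation above.

```python
import copy


def with_default_asset(item: dict, preferred_key: str,
                       default_key: str = "cog_default") -> dict:
    """Return a copy of the item with one band asset duplicated as default."""
    assets = item.get("assets", {})
    if preferred_key not in assets:
        raise KeyError(f"asset {preferred_key!r} not found in item")

    updated = copy.deepcopy(item)
    default_asset = copy.deepcopy(assets[preferred_key])

    # Assumption: the duplicated asset carries the data and layer roles.
    roles = list(default_asset.get("roles", []))
    for role in ("data", "layer"):
        if role not in roles:
            roles.append(role)
    default_asset["roles"] = roles

    updated["assets"][default_key] = default_asset
    return updated
```

The original band asset keeps its own key and roles, so API users who want a specific band are unaffected.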