STAC Collection Creation Conventions (Dashboard Specific) #29

Closed
anayeaye opened this issue Feb 24, 2022 · 19 comments
Labels
question Further information is requested

Comments

@anayeaye
Collaborator

anayeaye commented Feb 24, 2022

Dashboard-specific notes that supplement the full stac-api collection specification. Note that there is no schema enforcement on the collection table content in pgstac—this provides flexibility but also requires caution when creating and modifying Collections.

STATUS: revised with review comments [2022-03-20]

STATUS: under review [2022-02-24]

Collection field, extension, and naming recommendations

| Field and/or extension | Recommendations |
| --- | --- |
| id | If the dataset exists in NASA's Earthdata, or presumably comes from another data provider such as ESA, use that ID. If appropriate, add a suffix for any additional processing that has been performed, e.g. "OMSO2PCA_cog". If the dataset is not from NASA's Earthdata, we can use a human-readable name with underscores like "facebook_population_density". |
| dashboard extension | To support the delta-ui we have added two new fields in a proposed dashboard extension. For now we are only adding the fields; after testing things out, we can formalize the extension with a hosted JSON schema. Dashboard extension properties are only required for collections that will be viewed in the delta-ui dashboard. |
| dashboard:is_periodic | True/False. This boolean is used when summarizing the collection: if the collection is periodic, the temporal range of the items in the collection and the time density are all the front end needs to generate a time picker. If the items in the collection are not periodic, a complete list of the unique item datetimes is needed. |
| dashboard:time_density | year, month, day, hour, minute, or null. These time steps should be treated as an enum when the extension is formalized. For collections with a single time snapshot this value is null. |
| item_assets | stac-extension/item_assets is used to explain the assets that are provided for each item in the collection. We're not providing thumbnails yet, but the example below includes a thumbnail asset to illustrate how the extension will be used. The population of this property is not automated; the creator of the collection writes the item assets documentation. Item assets are only required for collections that will be viewed in the delta-ui dashboard. |
| summaries | The implementation of this core stac-spec field is use-case specific. Our implementation is intended to support the dashboard and will supply datetime and raster statistics for the default map layer asset across the entire collection. Currently summaries are updated manually with a delta-ui specific user-defined function in pgstac. |
| title and description | Use these properties to provide specific information about the collection to API users and catalog browsers. These properties correspond to dataset name and info in the covid-api, but the delta dashboard will use delta-config to set these values in the UI, so the information in our STAC collections will be for data curators and API users. |
| collection name style choices | Prefer lower-case kebab-case collection names. Decision: should names align with underlying data identifiers, or should they be interpreted names? omi-trno2-dhrm and omi-trno2-dhrm-difference vs no2-monthly and no2-monthly-diff; bmhd-30m-monthly vs nightlights-hd-monthly |
| license | SPDX license id. The license is likely available in CMR, but we may need to research other sources for some data. Default open license: CC0-1.0 |
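
Because pgstac does not enforce a schema on the collection table, a small pre-insert check can catch the most common slips in the recommendations above. The following is only a sketch: the function name `check_dashboard_fields` is hypothetical, and it assumes the dashboard fields are top-level keys of the collection, which this thread does not settle.

```python
from typing import Any

# Allowed time steps, copied from the dashboard:time_density recommendation.
# A tuple is used so that unhashable values (e.g. a list) do not raise.
TIME_DENSITIES = ("year", "month", "day", "hour", "minute", None)


def check_dashboard_fields(collection: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found.

    Assumption: dashboard fields are top-level keys of the collection.
    """
    problems = []
    if not isinstance(collection.get("dashboard:is_periodic"), bool):
        problems.append("dashboard:is_periodic must be true or false")
    if (
        "dashboard:time_density" not in collection
        or collection["dashboard:time_density"] not in TIME_DENSITIES
    ):
        problems.append(
            "dashboard:time_density must be year, month, day, hour, minute, or null"
        )
    if "item_assets" not in collection:
        problems.append("item_assets is required for dashboard collections")
    if not collection.get("license"):
        problems.append("license should be an SPDX license id such as CC0-1.0")
    return problems
```

An empty result only means these few conventions hold; it is not a substitute for validating against the STAC collection specification.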

item_assets example


"item_assets": {
    "thumbnail": {
      "type": "image/jpeg",
      "roles": [
        "thumbnail"
      ],
      "title": "Thumbnail",
      "description": "A medium sized thumbnail"
    },
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": [
        "data",
        "layer"
      ],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  }

summaries example for periodic collection

"summaries": {
    "datetime": ["2016-01-01T00:00:00Z", "2022-01-01T00:00:00Z"],
    "cog_default": {
      "max": 50064805976866820,
      "min": -6618294421291008
    }
  }
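
To illustrate why a start/end pair plus a time density is enough for a periodic collection, here is a rough sketch of how a client could expand that pair into time picker steps. The function names are hypothetical and this is not the delta-ui implementation. It assumes the end datetime is inclusive and that the day of month exists in every month (for example, the first of the month).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%SZ"
FIXED_STEPS = {
    "day": timedelta(days=1),
    "hour": timedelta(hours=1),
    "minute": timedelta(minutes=1),
}


def add_step(moment: datetime, density: str) -> datetime:
    """Advance one time step. Year and month steps keep the day of month,
    so this assumes that day exists in the target month."""
    if density == "year":
        return moment.replace(year=moment.year + 1)
    if density == "month":
        index = moment.year * 12 + moment.month  # zero-based index of next month
        return moment.replace(year=index // 12, month=index % 12 + 1)
    return moment + FIXED_STEPS[density]  # KeyError for an unknown density


def expand_periodic(start: str, end: str, density: str) -> list[str]:
    """List every step from start to end (inclusive, by assumption)."""
    current = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    steps = []
    while current <= stop:
        steps.append(current.strftime(FMT))
        current = add_step(current, density)
    return steps


years = expand_periodic("2016-01-01T00:00:00Z", "2022-01-01T00:00:00Z", "year")
print(len(years))  # 7 yearly steps, 2016 through 2022
```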

summaries example for non-periodic collection

"summaries": {
    "datetime": [
      "2020-01-01T00:00:00Z",
      "2020-02-01T00:00:00Z",
      "2020-03-01T00:00:00Z",
      "2020-04-01T00:00:00Z",
      "2020-05-01T00:00:00Z",
      "2020-06-01T00:00:00Z",
      "2020-07-01T00:00:00Z",
      "2020-08-01T00:00:00Z",
      "2020-09-01T00:00:00Z",
      "2020-10-01T00:00:00Z",
      "2020-11-01T00:00:00Z",
      "2020-12-01T00:00:00Z",
      "2021-01-01T00:00:00Z",
      "2021-02-01T00:00:00Z",
      "2021-03-01T00:00:00Z",
      "2021-04-01T00:00:00Z",
      "2021-05-01T00:00:00Z",
      "2021-06-01T00:00:00Z",
      "2021-07-01T00:00:00Z",
      "2021-08-01T00:00:00Z",
      "2021-09-01T00:00:00Z"
    ],
    "cog_default": {
      "max": 255,
      "min": 0
    }
  }
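
The difference between the two summaries examples can be sketched as a single function. This is an illustration, not the pgstac user-defined function; the name `summarize_datetimes` is hypothetical, and it assumes all item datetimes are UTC strings in the same format, so that sorting them as text also sorts them chronologically.

```python
def summarize_datetimes(item_datetimes: list[str], is_periodic: bool) -> list[str]:
    """Build the summaries.datetime value from the item datetimes.

    Periodic collections get [first, last]; non-periodic collections get
    every unique datetime in order.
    """
    unique = sorted(set(item_datetimes))
    if is_periodic and unique:
        return [unique[0], unique[-1]]
    return unique


item_datetimes = [
    "2020-03-01T00:00:00Z",
    "2020-01-01T00:00:00Z",
    "2020-02-01T00:00:00Z",
    "2020-01-01T00:00:00Z",
]
print(summarize_datetimes(item_datetimes, is_periodic=True))
print(summarize_datetimes(item_datetimes, is_periodic=False))
```
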
@anayeaye
Collaborator Author

#32

@anayeaye anayeaye added the question Further information is requested label Feb 25, 2022
@anayeaye
Collaborator Author

anayeaye commented Mar 2, 2022

@abarciauskas-bgse @jvntf @slesaad Can you weigh in on these tickets for STAC metadata conventions with regard to the data ingests we are doing, and point out anything that should be adjusted or added? We are definitely going to need to make adjustments for datetimes (start/end vs nominal datetime). Anything else?

@abarciauskas-bgse
Contributor

abarciauskas-bgse commented Mar 9, 2022

This is really great @anayeaye, your table and examples are 💯

Questions about naming of some of the fields:

  • Was there a precedent for time_density? What do you think about time_unit or time_period?
  • I'm wondering if cog_default is too specific and whether it should be something more generic like tiling_defaults. I think that information will be used to generate default rescaling, right? I could imagine generating rescaling parameters where the source is something other than a COG.

Additional questions about values in summaries:

  • How do we derive cog_default min/max/avg values? Do we ask the science teams or inspect all the files being ingested (or sample them if there are a sufficiently large number)?
  • Is datetime required if it is the same as the temporal interval?

Is this a valid example of the conventions you are proposing:

{
    "id": "OMSO2PCA",
    "type": "Collection",
    "links": [],
    "title": "OMSO2PCA", 
    "extent": {
        "spatial": {
            "bbox": [
                [
                    -180,
                    -90,
                    180,
                    90
                ]
            ]
        },
        "temporal": {
            "interval": [
                [
                    "2005-01-01T00:00:00Z",
                    "2021-01-01T00:00:00Z"
                ]
            ]
        }
    },
    "license": "MIT",
    "description": "OMI/Aura Sulfur Dioxide (SO2) Total Column L3 1 day Best Pixel in 0.25 degree x 0.25 degree V3",
    "stac_version": "1.0.0",
    "summaries": {
        "datetime": [
            "2005-01-01T00:00:00Z",
            "2021-01-01T00:00:00Z"
        ],
        "cog_default": {
            "avg": 287.90577560637,
            "max": 478.89999389648,
            "min": 51
        }
    },
    "properties": {
        "dashboard:is_periodic": true,
        "dashboard:time_density": "year"
    }    
}

@anayeaye
Collaborator Author

anayeaye commented Mar 10, 2022

@abarciauskas-bgse thanks for digging into this! To unblock UI development we settled on a few solutions that we could commit to delivering for the dashboard UI. I don't know that it is too late to make changes, but at this point any change will impact the front end, so we'd have to coordinate to avoid breaking anything.

> Was there a precedent for time_density? What do you think about time_unit or time_period?

We did discuss other keys but since this dashboard extension is purely for the front end, UI got final vote on preferences.

--

> I'm wondering if cog_default is too specific and it should be something more generic like tiling_defaults. I think that information will be used to generate default rescaling right? I could imagine generating rescaling parameters where the source is something other than a COG

Yeah, it is for rescaling parameters. This is an incremental solution that only supports simple products with single band COG assets. I think it needs to be somewhat specific but other asset keys might fit better (even just cog but at this point we would need to coordinate a change). I think that if we added non-COG assets that needed rescaling values, we'd add a new asset key and write a new function to derive the values (what is implemented here actually uses the stac raster extension and is COG specific).

cog_default came out of the need to create a consistent asset key for the map tiler and to make it easier to automate the summary. It is supposed to be a catch-all asset key for all of our basic single-band dataset products. cog_default doesn't cover collections like HLS, which will not have a default asset to display, and we don't really want to be calculating full collection statistics for the reflectance data anyway. There is a running delta-config discussion describing how the UI handles these two types of collections differently, including the settings for map layers that require something more complex than a simple rescale.

--

> How do we derive cog_default min/max/avg values? Do we ask the science teams or inspect all the files being ingested (or sample them if there are a sufficiently large number)?

We have a user-defined function for pgstac. The evolution of the function is in issue #31, and I am working on adding that function to our deployment in [PR 34](#43). It is not a perfect solution, but the goal is to make a simple function call that could be the terminal step in an ingest pipeline (maybe a fan-in to a single pgstac function call that will dynamically create the summary). We're also creating an update-all method that will update any collection that has the necessary dashboard metadata attributes, which we might want to schedule to run regularly. It would be preferable if we could identify trigger events so that it runs when needed for a given collection.

The latest iteration nixes the average because the way it is derived is not useful (min of mins is a valid metric; average of means is less so).
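
A small worked example shows why the average was dropped. The numbers below are made up for illustration, and `collection_stats` is a hypothetical helper, not the pgstac function. The minimum of per-item minimums equals the true collection minimum, but the plain average of per-item means only equals the true mean when every item covers the same number of pixels.

```python
def collection_stats(items: list[dict[str, float]]) -> dict[str, float]:
    """Combine per-item statistics into collection-level statistics."""
    total_pixels = sum(item["count"] for item in items)
    return {
        "min": min(item["min"] for item in items),
        "max": max(item["max"] for item in items),
        # Unweighted: treats a small item the same as a large one.
        "mean_of_means": sum(item["mean"] for item in items) / len(items),
        # Weighted by pixel count: the true mean over all pixels.
        "weighted_mean": sum(item["count"] * item["mean"] for item in items)
        / total_pixels,
    }


# Hypothetical per-item statistics, not real data.
items = [
    {"count": 100, "min": 0.0, "max": 10.0, "mean": 2.0},
    {"count": 900, "min": 5.0, "max": 50.0, "mean": 20.0},
]
stats = collection_stats(items)
print(stats["min"], stats["max"])  # 0.0 50.0
print(stats["mean_of_means"])      # 11.0
print(round(stats["weighted_mean"], 1))  # 18.2
```

Recovering the true mean therefore needs per-item pixel counts, which a summary built only from min, max, and mean values does not have.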

--

> Is datetime required if it is the same as the temporal interval?

For now we have committed to maintaining this information in one place for the dashboard UI. But we intend to make the creating and updating of summaries hands-off.

--

> Is this a valid example of the conventions you are proposing?

Yes, but we will have a function to automatically generate the summaries for all of our non-spectral datasets if there is an item_assets property on the collection. Totally open to discussing that, but for now the SQL routine looks at the item_assets property to decide whether or not to create a cog_default summary; if it does not, it will only create a datetime summary. And one nit: the license should be one of the predefined SPDX licenses, because STAC browsers will link to the SPDX license on an id lookup. But this does not impact any of our features, so it's the kind of thing we'll probably want to circle back on when we have easier ways to edit the metadata.
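
The decision described above can be sketched in Python. The real routine is SQL inside pgstac and is not reproduced here; `cog_default_summary` is a hypothetical name. This sketch assumes that the trigger is a cog_default entry in item_assets, and that each item's cog_default asset carries STAC raster extension statistics for a single band.

```python
from typing import Any, Optional


def cog_default_summary(
    collection: dict[str, Any], items: list[dict[str, Any]]
) -> Optional[dict[str, float]]:
    """Return a {"min", "max"} summary, or None when only a datetime
    summary should be created."""
    # Assumption: a cog_default entry in item_assets is what triggers this.
    if "cog_default" not in collection.get("item_assets", {}):
        return None
    stats = [
        item["assets"]["cog_default"]["raster:bands"][0]["statistics"]
        for item in items
        if "cog_default" in item.get("assets", {})
    ]
    if not stats:
        return None
    return {
        "min": min(s["minimum"] for s in stats),
        "max": max(s["maximum"] for s in stats),
    }
```

A collection like HLS, which has no default asset, would fall through to the `None` branch and keep only its datetime summary.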

@abarciauskas-bgse
Contributor

Thanks so much for all these detailed responses and apologies for my belated comments and that you may have had to repeat any information I should have been aware of. You obviously have thought through this solution comprehensively and developed some really cool functionality. I'm more than happy to implement the conventions as defined above and seek your review of all the metadata moving forward 🙇🏽‍♀️

@abarciauskas-bgse
Contributor

I also updated https://j2wlly6xg8.execute-api.us-east-1.amazonaws.com/collections/OMSO2PCA and the example above with the "MIT" as the license, given that's what you used for the other datasets.

@abarciauskas-bgse
Contributor

Decision about id's and titles:

  • If dataset exists in NASA's Earthdata or presumably from some other data provider like ESA, use that ID. If appropriate add an underscore for any additional processing that has been performed, e.g. "OMSO2PCA_cog"
  • If dataset is not from NASA's Earthdata, we can use a human readable name with underscores like "facebook_population_density"

@anayeaye I'm inclined to keep title and ID the same but do you know if there is a good use case where they should be different? Like dataset landing pages where the title might be a more descriptive name?

@anayeaye
Collaborator Author

@abarciauskas-bgse

  1. Either _ or - works, but the hyphen is more consistent with other STAC catalogs. This is style only and I'll use whatever we settle on.
  2. I think having a title that is more descriptive than the id is going to be really helpful for new users when discovering data. CMR has a Title property that would be appropriate, we'd just need to append COG.
  3. A third convention to decide on: when we pull datasets from the covid-19 dashboard: do we want to use the same ids as the source dashboard datasets? I'm inclined to make the stac collections as similar as possible to the covid-19 datasets.

@jvntf

jvntf commented Mar 10, 2022

@anayeaye a small nit, should we add hour to time_density or is it left out for a reason?

@abarciauskas-bgse
Contributor

abarciauskas-bgse commented Mar 10, 2022

  1. I suggested underscore because I find it more readable but don't have a strong preference. I would be interested to know if there is a reason other STAC catalogs are using dashes, perhaps there is some tool or database convention for which it is required to use dashes instead of underscores. But again, I don't have a strong preference.
  2. 👍🏽
  3. I haven't inspected each dataset json file but the filenames in https://github.com/NASA-IMPACT/covid-api/tree/develop/covid_api/db/static/datasets all look reasonable to me

I think my takeaways with respect to id and title at this point are:

  • If the dataset has been sourced from another data archive, such as OMI from Earthdata, the id of that dataset should be re-used somewhere in the id or title or both
  • If the dataset has been sourced from another data archive, but changed in any way this should also be transparent in the title (such as COG)
  • Where it makes sense, make both the title and id as human-readable and descriptive as is reasonable, but the id should be a (much) abbreviated identifier with dashes in places of spaces.
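
One possible reading of these takeaways, combined with the lower-case kebab-case preference in the first post, is sketched below. The helper name `make_collection_id` is hypothetical, and lower-casing a provider id such as OMSO2PCA is an assumption that the thread does not settle.

```python
import re


def make_collection_id(name: str, suffix: str = "") -> str:
    """Build a lower-case, dash-separated id from a dataset name.

    The optional suffix records additional processing, e.g. "cog".
    """
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{slug}-{suffix}" if suffix else slug


print(make_collection_id("Facebook Population Density"))  # facebook-population-density
print(make_collection_id("OMSO2PCA", "cog"))              # omso2pca-cog
```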

@anayeaye what do you think of that summary ⬆️

@abarciauskas-bgse
Contributor

abarciauskas-bgse commented Mar 10, 2022

Question on summaries: if we're going to implement summaries in a terminal step, should we be adding them at all right now while we are creating the collections? I'm going to leave summaries out for now, assuming we can run the summaries function after ingest.

@abarciauskas-bgse
Contributor

One more thought about naming: s3 data directories should match the ids of the datasets. It would be nice to enforce this in the future but for now just stating it for the group.
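
Until this is enforced, a check along these lines could flag mismatches during ingest. This is only a sketch: `prefix_matches_id` is a hypothetical helper, the bucket name is made up, and it assumes the data directory is the first path segment under the bucket.

```python
from urllib.parse import urlparse


def prefix_matches_id(s3_uri: str, collection_id: str) -> bool:
    """True when the first directory under the bucket equals the collection id."""
    path = urlparse(s3_uri).path.strip("/")
    return path.split("/")[0] == collection_id


# "example-bucket" is a made-up bucket name for illustration.
print(prefix_matches_id("s3://example-bucket/OMSO2PCA_cog/2005.tif", "OMSO2PCA_cog"))
print(prefix_matches_id("s3://example-bucket/omi-so2/2005.tif", "OMSO2PCA_cog"))
```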

@anayeaye
Collaborator Author

@jvntf - totally missed hour. Definitely adding that in the edit.

@anayeaye
Collaborator Author

Creative Commons Zero licensing recommendation
We may want to add some additional guidance on choosing the correct license when not provided explicitly with the data to future data curator documentation so I'm recording some notes here.

Still true: choose an SPDX license id or use proprietary (STAC community tools are built to link to the SPDX license when an id is provided in the metadata).

Snippets/links from discussion about choosing a license

cc: @slesaad @j08lue @ashiklom

@j08lue

j08lue commented Nov 10, 2022

Thanks for recording this here, @anayeaye. I take it from our discussion on Slack that the decision was made in favor of CC0-1.0. Where do we need to document this?

@slesaad
Member

slesaad commented Nov 11, 2022

@j08lue probably edit the first post in this thread and specify it there

@j08lue

j08lue commented Nov 12, 2022

Done.

@j08lue

j08lue commented Jun 12, 2023

These docs are now published at https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/stac-collection-conventions.html

Perhaps this issue can now be closed, and in the future we can maintain this information on the docs site?

@jsignell
Contributor

I'm going to close and lock this issue.

@NASA-IMPACT NASA-IMPACT locked as resolved and limited conversation to collaborators Jun 12, 2023

6 participants