
Add a test set of CMIP6 datasets #204

Closed
moradology opened this issue Oct 18, 2022 · 28 comments
@moradology
Contributor

moradology commented Oct 18, 2022

Research + prepare processing for the dataset: Identify the dataset and what the processing needs are

The data is hosted in the S3 Bucket: s3://nex-gddp-cmip6-cog/
Already cogs: yes
Point of contact: @leothomas

Design the metadata and publish to the Dev API

collection grouping:
s3://nex-gddp-cmip6-cog/daily/
id: cmip6-daily-___
title and description: TBD; collection specific
dashboard:is_periodic: True
dashboard:time_density: daily

collections:

  • hurs: near-surface relative humidity
  • huss: near-surface specific humidity
  • pr: daily mean precipitation rate
  • rlds: surface downwelling longwave radiation
  • rsds: surface downwelling shortwave radiation
  • sfcWind: daily-mean surface wind speed
  • tasmax: daily maximum near-surface air temperature
  • tasmin: daily minimum near-surface air temperature
  • tas: daily mean near-surface air temperature

collection grouping:
s3://nex-gddp-cmip6-cog/monthly/
id: cmip6-monthly-___
title and description: TBD; collection specific
dashboard:is_periodic: True
dashboard:time_density: monthly

collections:

  • ensemble_p10
  • ensemble_p50 (ensemble_median)
  • ensemble_p90

collection grouping:
s3://nex-gddp-cmip6-cog/crossing/
title: Projected date of 4 degree average temperature increase
description: The projected date at which the average temperature in a given region is anticipated to have risen by 4 degrees under different scenarios
dashboard:is_periodic: False

shared metadata:
item_assets: { "cog_default": { "type": "image/tiff; application=geotiff; profile=cloud-optimized", "roles": [ "data", "layer" ], "title": "Default COG Layer", "description": "Cloud optimized default layer to display on map" } }
license: CC0-1.0
temporal_interval: Jan 1950 - Dec 2100
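
For concreteness, here is a minimal Python sketch of how one of the daily collection records could be assembled from the grouping fields and shared metadata above. The function name, title/description text, and extent values are illustrative placeholders, not final metadata; only the id pattern, license, dashboard fields, and item_assets come from this issue.

ITEM_ASSETS = {
    "cog_default": {
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "roles": ["data", "layer"],
        "title": "Default COG Layer",
        "description": "Cloud optimized default layer to display on map",
    }
}

def build_daily_collection(variable: str, title: str, description: str) -> dict:
    # Minimal STAC Collection dict following the cmip6-daily-___ grouping above.
    return {
        "type": "Collection",
        "stac_version": "1.0.0",
        "id": f"cmip6-daily-{variable}",
        "title": title,
        "description": description,
        "license": "CC0-1.0",
        "dashboard:is_periodic": True,
        "dashboard:time_density": "daily",
        "item_assets": ITEM_ASSETS,
        "extent": {
            "spatial": {"bbox": [[-180, -90, 180, 90]]},  # placeholder global bbox
            "temporal": {"interval": [["1950-01-01T00:00:00Z", "2100-12-31T00:00:00Z"]]},
        },
    }

# e.g. build_daily_collection("tas", "CMIP6 Daily Mean Near-Surface Air Temperature", "TBD")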

Step 1 from issue #191

@moradology self-assigned this on Oct 18, 2022
@moradology changed the title from "Add a new dataset to the API (high-level steps)" to "Add CMIP6 datasets" on Oct 18, 2022
@leothomas

Thanks for taking the lead on this @moradology !

Let me know if I can be helpful in any way!

@moradology
Contributor Author

@leothomas This is a complicated dataset, so I appreciate the assistance. Looking at the public storage, I'm finding more categorizations than I know how to navigate at the moment. In particular, s3://nex-gddp-cmip6-cog/daily/GFDL-CM4/ and s3://nex-gddp-cmip6-cog/daily/GISS-E2-1-G/ both have their own ssp245/ssp585 directories underneath them. Should I simply create enough collections to capture all the directory branches or is this a case where the ingest should be selective?

@moradology
Contributor Author

moradology commented Oct 19, 2022

I'm also curious about the meaning of this part of the S3 URL: r1i1p1f2. Does it have special meaning I should capture? e.g.: s3://nex-gddp-cmip6-cog/daily/GFDL-CM4/ssp245/r1i1p1f1/

@moradology
Contributor Author

Another small thing: the ingests are currently built such that each item gets a default_cog. It occurs to me that this is not obviously the best way to manage datasets like this: tas/tasmax/tasmin/pr etc. ought to be assets on items rather than duplicating items, one per asset.

@leothomas

Should I simply create enough collections to capture all the directory branches or is this a case where the ingest should be selective?

This is a tough question that we've discussed a bit here.

Some of the options we identified:

  • One collection for each directory branch: would lead to 36 collections (2 models: GISS-E2-1-G / GFDL-CM4 * 2 SSPs: ssp245/ssp585 * 9 variables: hurs/huss/pr/rlds/rsds/sfcWind/tas/tasmax/tasmin)
  • One single collection for the entire dataset with the groupings (model/ssp/variable) as filterable custom parameters on each Item (see the filter sketch after this list)
  • Since collections can be composable, you could have one STAC Item for each COG, all belonging to the parent CMIP6 collection, and then create sub-collections for each sub-group. Let me know if this makes sense - I'm happy to keep brainstorming around these options.
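
To illustrate the second option above (a hedged sketch, not a decision): one flat collection whose items carry the grouping dimensions as queryable properties, filtered through CQL2 with pystac-client. The STAC endpoint, the cmip6 collection id, and the cmip6:model / cmip6:experiment_id property names are assumptions for illustration; cmip6:variable_id matches the property used later in this thread, and the server would need to support the filter extension.

from pystac_client import Client

catalog = Client.open("https://example.com/stac")  # hypothetical VEDA STAC API

search = catalog.search(
    collections=["cmip6"],  # hypothetical single parent collection
    filter_lang="cql2-json",
    filter={
        "op": "and",
        "args": [
            {"op": "=", "args": [{"property": "cmip6:variable_id"}, "tas"]},
            {"op": "=", "args": [{"property": "cmip6:model"}, "GFDL-CM4"]},
            {"op": "=", "args": [{"property": "cmip6:experiment_id"}, "ssp245"]},
        ],
    },
    max_items=10,
)
for item in search.items():
    print(item.id, item.assets["cog_default"].href)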

It occurs to me that this is not obviously the best way to manage datasets like this: tas/tasmax/tasmin/pr etc. ought to be assets on items rather than duplicating items, one per asset

Absolutely agreed, and this is the way that Microsoft Planetary Computer has organized their CMIP6 STAC Collection. We made the assumption that each STAC Item would have a single asset due to a limitation in the way the VEDA VisEx searches for available collections to display in the frontend. Not sure if that limitation still stands, or if the dashboard would be able to use the assets object to display different datasets to the user.
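
As a quick way to see that layout, a rough pystac-client sketch that pulls one item from the Planetary Computer collection (the nasa-nex-gddp-cmip6 id comes from the dataset page linked later in this thread) and lists its asset keys; whatever keys come back are the provider's, so this is purely exploratory.

from pystac_client import Client

pc = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
item = next(pc.search(collections=["nasa-nex-gddp-cmip6"], max_items=1).items())
print(item.id)
print(sorted(item.assets))  # expected: one asset per variable rather than one item per variable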

Note: Tom Augspurger has also proposed a CMIP6 STAC extension

@moradology
Contributor Author

We can certainly have catalogs with collections under them but IIRC, collections are not composable per se: note the discussion of child (collection or catalog) and item links in relation to the 'browseable' API. Also: my PR to implement browseable has grown incredibly stale from lack of love 😢, so I'm not sure how supported this would be in stac-fastapi

@leothomas

Got it, I'm very sorry to hear that about your PR!

Thinking a bit more on this:

It occurs to me that this is not obviously the best way to manage datasets like this: tas/tasmax/tasmin/pr etc. ought to be assets on items rather than duplicating items, one per asset

While I agree that this is completely reasonable, I think you could make the argument that any of the dataset's sub-classes could be assets:

eg:

ItemId: GFDL-CM4_ssp245_2022_01_01
Assets: 
    - hurs
    - huss
    - sfcWind
    - ...

=== OR ===

ItemId: hurs_ssp245_2022_01_01
Assets: 
    - GFDL-CM4
    - GISS-E2-1-G

=== OR ===

ItemId: hurs_GFDL-CM4_2022_01_01
Assets: 
    - ssp245
    - ssp585

Would it make sense to have 3 collections: daily, monthly and crossing-year (since they all have different temporal resolutions), with one STAC Item per COG in S3, across all the models/ssps/variables, and then add catalogs (or collections) corresponding to "filtered views" for any combination of model/ssp/variable, that point to the corresponding items that belong directly to the 3 collections above?

@leothomas

Also, I'm sorry I forgot to respond to this:

I'm also curious about the meaning of this part of the S3 URL: r1i1p1f2. Does it have special meaning I should capture? e.g.: s3://nex-gddp-cmip6-cog/daily/GFDL-CM4/ssp245/r1i1p1f1/

It looks like r1i1p1f2 is the Variant Label (see: https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit#bookmark=id.n7tsog12ccvr). As far as I can tell it's not a meaningful distinction for the downscaled CMIP6 - so I've been treating it as an ID that depends on the model ID
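
If we ever need to record it, a minimal sketch of parsing the variant label into its CMIP6 components (realization/initialization/physics/forcing indices, per the global attributes document above); the function name is just illustrative.

import re

VARIANT_RE = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")

def parse_variant_label(label: str) -> dict:
    # CMIP6 encodes the variant as r<k>i<l>p<m>f<n>.
    m = VARIANT_RE.match(label)
    if not m:
        raise ValueError(f"not a CMIP6 variant label: {label}")
    keys = ("realization_index", "initialization_index", "physics_index", "forcing_index")
    return dict(zip(keys, map(int, m.groups())))

# parse_variant_label("r1i1p1f1") -> {"realization_index": 1, "initialization_index": 1, ...}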

@anayeaye
Contributor

That google document is super helpful! I wonder if we could export a PDF, put that in veda-data-store-staging or somewhere that ASDI stores assets, and either link it as an asset in the item metadata or from the collection at some point.

@leothomas

Here's a PDF version: CMIP6_global_attributes_filenames_CVs.pdf! I definitely agree that it would be a good thing to have somewhere in terms of dataset documentation

@anayeaye
Contributor

@moradology, just for reference, this is the CMIP6 dashboard we will be deprecating: https://cmip6.surge.sh. I think there is a specific issue for comparing the new data to the deprecated dashboard but I thought the link might be helpful here, too. We obviously don't have to match the previous metadata structure but it might be helpful for comparison when making decisions.

@j08lue
Contributor

j08lue commented Nov 1, 2022

@vlulla
Contributor

vlulla commented Dec 14, 2022

Reading the Example Notebook for accessing CMIP6 data (https://planetarycomputer.microsoft.com/dataset/nasa-nex-gddp-cmip6#Example-Notebook) from Microsoft Planetary Computer's page, I learned that the CMIP6 data is stored in NetCDF format. It also appears that this CMIP6 NetCDF collection has a full list of files and associated md5 hashes (at https://nex-gddp-cmip6.s3.us-west-2.amazonaws.com/index.html). Comparing this with the COG bucket (s3://nex-gddp-cmip6-cog), it appears that there isn't such a list for the COG bucket. How come there isn't a list like this for the COGs? Wouldn't it be helpful to have a list of all the available COGs and associated md5 hashes, just like the NetCDF example above, at least for large collections like this CMIP6? Generating this list is a one-time cost but wouldn't it help all the subsequent discovery/ingest runs?

Since there is interest in thinking about how to do ingestion of large collections like this, why not have a separate ingestion process where the input is a list of COGs, possibly with a md5 hash, which is used as an input for the ingestor instead of doing s3 discoveries using regex?
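
A hedged sketch of the one-time inventory step described above, assuming the public bucket allows anonymous listing: walk a prefix with boto3 and write key, size, and ETag to CSV. Note that the S3 ETag only equals the MD5 for single-part, unencrypted uploads, so a true MD5 column would still require computing checksums at upload time or after download; the output filename is arbitrary.

import csv
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

with open("nex-gddp-cmip6-cog-inventory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["key", "size_bytes", "etag"])
    for page in paginator.paginate(Bucket="nex-gddp-cmip6-cog", Prefix="monthly/"):
        for obj in page.get("Contents", []):
            writer.writerow([obj["Key"], obj["Size"], obj["ETag"].strip('"')])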


@j08lue
Contributor

j08lue commented Dec 15, 2022

Thanks for pushing this further, @vlulla. Are you suggesting to have a list with all objects in S3 and their checksums? I think that would not be necessary here, since S3 already has checksums for all objects in the object metadata.

I think there are two possible goals:

  1. As a data ingestion operator, I want to make sure that the file in storage is identical to the original, so I can be sure about data integrity in storage.
  2. As a data user, I want to make sure that a local copy I make of a file is the same as the data in store, so I can be sure about data integrity in my local copy.

I think for the first goal (upload), calculating the checksum locally and submitting it with the S3 upload for automatic validation would be best - no need to also store checksums anywhere else.

Not sure how relevant the second (download) goal is, since we do not want users to make local copies but load the data directly from cloud storage and perhaps only portions of each file. But if they do download from S3, they can also access the checksum values stored with the object.
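
For the upload-time check, a minimal boto3 sketch (bucket and key names are placeholders): compute the MD5 locally and pass it as Content-MD5 so S3 rejects the upload if the received bytes do not match.

import base64
import hashlib
import boto3

def upload_with_md5(path: str, bucket: str, key: str) -> None:
    # Read the file, compute its MD5, and let S3 validate on write.
    with open(path, "rb") as f:
        body = f.read()
    md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode()
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body, ContentMD5=md5_b64)

# upload_with_md5("tas_2022.tif", "veda-staging-bucket", "cmip6/daily/tas/tas_2022.tif")  # placeholder names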

@vlulla
Contributor

vlulla commented Dec 15, 2022

Good points @j08lue. They make perfect sense. I was thinking about the usability a bit more broadly than just [up/down]loads. Let's say once a large collection is ingested successfully, and nothing changes, then instead of looking for what is available by doing either aws s3 ls or boto3's list_objects we can just download the listing file for cursory exploration. My thinking is based on the observation that it is very hard for me to find out how many assets belong to a STAC catalog. As far as I understand it there's no way of knowing this except walking the catalog (possibly inaccurate owing to my cursory understanding of STAC). So, for instance, to see how many NetCDF files are in the CMIP6 bucket all I have to do is download the csv listing and look at it. Actually, that's exactly what I did! I believe the asset list would be very useful for scientists exploring the data sets (I could be off here too!). Additionally, I believe that once we have this list we can use it for various operational software tasks (since there isn't any transaction control that'll be needed... unless I'm missing something completely) too.

Anyways, it appears to me that this requires more consideration and thought.

@j08lue
Contributor

j08lue commented Dec 18, 2022

I see! I remember seeing large collections elsewhere that come with some kind of inventory stored in an auxiliary file. They probably serve what you mentioned - an authoritative alternative to listing the buckets, which is cumbersome.

I'd say let us consider this as an enhancement of our catalogue service, but not a (blocking) requirement for the CMIP6 collection. Ok? I opened a new issue.

@leothomas

leothomas commented Dec 19, 2022

I don't know if MD5s are an accepted field in STAC Items - but I think it could be an option to calculate the MD5 of the file and add it to the STAC Item as part of the ingestion process

STAC Catalogs are intended to be very easy to parse automatically.

@j08lue
Contributor

j08lue commented Dec 19, 2022

I agree, @leothomas, that STAC might also fulfil the same needs as an inventory file. Would be great to understand the user needs behind this first, in the issue linked here: https://github.com/NASA-IMPACT/veda-architecture/issues/138

@vlulla
Contributor

vlulla commented Dec 19, 2022

I think that storing MD5 (or some other hash) as part of the STAC also has the added benefit of providing the ability to verify that the file[s]/asset[s] provided/rendered are in fact the same. This could also allow us to detect issues with COG generation.

@anayeaye
Contributor

In addition to the inventory use case discussion, which I think is super valuable especially in terms of informing ingest processes, there is a stable file stac-spec extension to store checksum (and other file info) in STAC Item Assets that we could adopt or learn from.
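
For reference, a small sketch of what adopting that extension could look like on an asset: file:checksum is a multihash-encoded digest (for sha2-256 that is the "1220" code/length prefix followed by the hex digest) and file:size is the byte count. The extension schema URL version here is illustrative; check the current release before publishing.

import hashlib

def file_ext_fields(path: str) -> dict:
    # Fields defined by the file STAC extension for an asset.
    data = open(path, "rb").read()
    return {
        "file:checksum": "1220" + hashlib.sha256(data).hexdigest(),  # multihash: sha2-256 prefix + digest
        "file:size": len(data),
    }

# item["stac_extensions"].append("https://stac-extensions.github.io/file/v2.1.0/schema.json")  # version illustrative
# item["assets"]["cog_default"].update(file_ext_fields("local_copy_of_cog.tif"))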

@anayeaye
Contributor

anayeaye commented Dec 19, 2022

I'd like to propose an alternate, complementary, incremental approach to integrating CMIP6 in the VEDA catalog.
EDIT: A lot of work has already been done; I think this pilot MVP proposal doesn't have to replace that--but I do think focusing in on a bite-sized chunk with a dashboard payoff will be worthwhile to inform the rest of the work in progress.

Pilot/MVP

I propose we narrow down our goals to start with. If we work with monthly data we have fewer immediate metadata concerns, no scaling concerns, and a deliverable for the front end.

  1. The crossing year data as single-scenario single-item collections (these will be great for dashboard storytelling)
  2. Start with only the monthly ensemble median data (after these work for us, we can add p10 and p90 if it makes sense for the dashboard)
  3. Skip the daily for now. If we choose to add daily data, we need to handle some metadata design work for the different scenarios. Daily records may never make sense for the dashboard so this could also open up our options for metadata design--the cog_default asset is dashboard specific.

I'd like to test an alternate PR that will not require custom orchestration (I may try to throw this together tomorrow against the SFN branch just to have a look at it)

  • 2 single item collections
  • 9 ~2000 item collections (one per variable)

This subset will let us start supporting the dashboard sooner and begin evaluating how we want to manage model data going forward. I'm sure as soon as we have something live we will begin to see new concerns for handling this data that we can use to inform the next step.

Inquire about need for p10 and p90 monthly ensembles

Did we need any additional metadata for the median monthly ensemble variables?
Would things have worked better with the cmip6 extension?
Would a filter based approach (rather than cog_default flat collection) work better for us?

Consider daily ingests

1 model-scenario-variable ≈ 55K COG assets
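
(For scale, assuming full coverage of the 1950-2099 range seen in the monthly data: 150 years × 12 ≈ 1,800 monthly items per variable, matching the item counts reported below, versus roughly 150 × 365 ≈ 55,000 daily COGs per model-scenario-variable.)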

@anayeaye
Contributor

anayeaye commented Jan 4, 2023

I added my update to the wrong issue (#231)

I've started some experimental work ingesting the monthly ensemble median data: https://github.com/NASA-IMPACT/veda-data-pipelines/compare/feature/cmip6-monthly-ensemble-median. I'll compose some notes but initially it looks like there are ways to use our current pipelines and very simple filters for multi variable collections.

The historical data indexing needs more consideration--I think we will find that the distinct time ranges of the historical and modeled data mean that different collections are needed for automated dashboard components like the date picker. I am going to update the pilot monthly ingest to separate historical and modeled because:

  • how would the user know to change selection from historical to a model at year 2015?
  • if the collections are separated into historical and modeled, the dashboard and users can trust that assets will be available for all dates in the temporal range of a given collection

@leothomas

how would the user know to change selection from historical to a model at year 2015?

We circumvented this issue in the cmip6 demo dashboard by duplicating the historical data into each model's collection. All the models had a timerange [1950-2100], where the data for [1950-2015] is the same historical data for each model. Not sure if this is a valid approach for VEDA.

@anayeaye
Contributor

anayeaye commented Jan 6, 2023

@leothomas thanks!

We circumvented this issue in the cmip6 demo dashboard by duplicating the historical data into each model's collection. All the models had a timerange [1950-2100], where the data for [1950-2015] is the same historical data for each model. Not sure if this is a valid approach for VEDA.

I think this could be the most intuitive approach. In that case we could manage all of the monthly ensemble data in two filterable collections that cover 1950-2100. I am going to mock up this organization in dev and give it a test run with the tiler (filtered search-->mosaic_id-->tiles).
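
Roughly, the flow being tested looks like the sketch below (plain HTTP against a titiler-pgstac deployment). The base URL, collection id, endpoint paths, and response field names are assumptions that differ between titiler-pgstac versions, so treat them as placeholders; the cmip6:variable_id / cmip6:ensemble properties match the summaries below.

import requests

RASTER_API = "https://example.com/raster"  # hypothetical VEDA titiler-pgstac endpoint

search = {
    "collections": ["cmip6-monthly-ssp245"],  # hypothetical collection id
    "filter-lang": "cql2-json",
    "filter": {
        "op": "and",
        "args": [
            {"op": "=", "args": [{"property": "cmip6:variable_id"}, "tas"]},
            {"op": "=", "args": [{"property": "cmip6:ensemble"}, "median"]},
        ],
    },
}
resp = requests.post(f"{RASTER_API}/mosaic/register", json=search)  # path varies by version
resp.raise_for_status()
mosaic_id = resp.json()["searchid"]  # field may be named "id" in newer versions

tile_url = f"{RASTER_API}/mosaic/tiles/{mosaic_id}/{{z}}/{{x}}/{{y}}?assets=cog_default"
print(tile_url)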

@anayeaye
Contributor

anayeaye commented Jan 9, 2023

Placeholder update: Following the monthly collections from the original PR associated with CMIP6 ingestion and @leothomas's suggestion, I created two collections, each with historical and experiment data combined to provide a continuous series of dates. With the caveat of a datetime formatting bug in the pipelines API and a missing year of data from two variables in the SSP585 series, we should now have enough ingested dev data to test out using filtered CMIP6 data with the dashboard. I manually created filtered summaries for the two collections and will provide better documentation to pull things together for the dashboard, but for now here are summaries of what is in the dev stack and how I captured the summaries.

CMIP6 Monthly Historical and SSP585 Experiments (Aggregated GFDL-CM4 and GISS-E2-1 Models)

"filtered_summaries": {
    "hurs_median": {"max": 99.15868377685547, "min": 3.0335476398468018, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "hurs"},
    "huss_median": {"max": 0.03447316214442253, "min": 0.0000037452555261552334, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "huss"},
    "pr_median": {"max": 1336.43212890625, "min": 0, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "pr"},
    "rlds_median": {"max": 572.8646850585938, "min": 82.66290283203125, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "rlds"},
    "rsds_median": {"max": 432.68817138671875, "min": -1.1011608839035034, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "rsds"},
    "sfcWind_median": {"max": 13.04371166229248, "min": 0.44544002413749695, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "sfcWind"},
    "tas_median": {"max": 318.7181091308594, "min": 219.6027374267578, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1788, "cmip6:ensemble": "median", "cmip6:variable_id": "tas"},
    "tasmax_median": {"max": 327.1618957519531, "min": 223.0437774658203, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "tasmax"},
    "tasmin_median": {"max": 311.8375244140625, "min": 215.48802185058594, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1788, "cmip6:ensemble": "median", "cmip6:variable_id": "tasmin"}
}

CMIP6 Monthly Historical and SSP245 Experiments (Aggregated GFDL-CM4 and GISS-E2-1 Models)

"filtered_summaries": {
    "hurs_median": {"max": 99.15803527832031, "min": 4.945906162261963, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "hurs"},
    "huss_median": {"max": 0.03276791423559189, "min": 0.0000037452555261552334, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "huss"},
    "pr_median": {"max": 1228.4393310546875, "min": 0, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "pr"},
    "rlds_median": {"max": 479.4415283203125, "min": 82.66290283203125, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "rlds"},
    "rsds_median": {"max": 432.68817138671875, "min": -2.5996298789978027, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "rsds"},
    "sfcWind_median": {"max": 12.933960914611816, "min": 0.44544002413749695, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "sfcWind"},
    "tas_median": {"max": 315.2054748535156, "min": 219.6027374267578, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "tas"},
    "tasmax_median": {"max": 323.5796813964844, "min": 223.0437774658203, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "tasmax"},
    "tasmin_median": {"max": 309.3370666503906, "min": 215.48802185058594, "datetime": ["1950-01-01T00:00:00Z", "2099-12-01T00:00:00Z"], "item_count": 1800, "cmip6:ensemble": "median", "cmip6:variable_id": "tasmin"}
}

Summary method
One-off method to get aggregated summaries grouped by ensemble (p10, median, p90) and variable (pr, hurs, ...). For now this should help us QA the ingest (why is a year of tas and tasmin missing from the ssp585 collection?) and prepare visualizations (what is the range of values for a given ensemble+variable?).

SQL
select 
	jsonb_build_object(
	variable_ensemble,
	sq1.summary
) filtered_summaries 
from (
select 
	concat(items."content"->'properties'->>'cmip6:variable_id','_',items."content"->'properties'->>'cmip6:ensemble') variable_ensemble, 
    jsonb_build_object(
    	'datetime', array[
            to_char(min(items.datetime) at time zone 'Z', 'YYYY-MM-DD"T"HH24:MI:SS"Z"'),
            to_char(max(items.datetime) at time zone 'Z', 'YYYY-MM-DD"T"HH24:MI:SS"Z"')
        ], 
        'min', min((items."content"->'assets'->'cog_default'->'raster:bands'-> 0 ->'statistics'->>'minimum')::float),
        'max', max((items."content"->'assets'->'cog_default'->'raster:bands'-> 0 ->'statistics'->>'maximum')::float),
        'cmip6:variable_id', items."content"->'properties'->>'cmip6:variable_id',
        'cmip6:ensemble', items."content"->'properties'->>'cmip6:ensemble',
        'item_count', count(*)
    ) summary,
	collections.id
from items 
join collections on collections.id=items.collection
where collections.id like '%nex-gddp-cmip6-cog-monthly-ensemble-ssp245%'
group by 
	collections.id, 
	items.collection, 
	collections."content", 
	items."content"->'properties'->>'cmip6:variable_id',
	items."content"->'properties'->>'cmip6:ensemble'
order by collections.id) sq1;
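
Python
One way to run the query above from Python and fold its one-column rows into a single filtered_summaries object (psycopg2 parses jsonb columns into dicts). The DSN and any step that writes the result back onto the collection record are left out of this sketch.

import psycopg2

def merge_filtered_summaries(dsn: str, summary_sql: str) -> dict:
    # Each row returned by the query above is a single-key object like
    # {"<variable>_<ensemble>": {...}}; merge them into one mapping.
    merged = {}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(summary_sql)
        for (group,) in cur.fetchall():
            merged.update(group)
    return {"filtered_summaries": merged}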

@j08lue changed the title from "Add CMIP6 datasets" to "Add a test set of CMIP6 datasets" on Feb 20, 2023
@anayeaye
Contributor

The next step for this issue is to do a working session with the front end team to see if the filters demo'd in this gist can be implemented in the front end. The methods should also be cleaned up and reviewed with Vincent to see if we are approaching the tiling properly as well (I'm not dropping an @ until I clean up that notebook with better notes).

@anayeaye
Contributor

anayeaye commented May 2, 2023

I temporarily pulled the ready label off this issue until we can better constrain the reduced scope

@j08lue
Contributor

j08lue commented May 30, 2023

Keeping this ticket for reference but not for tracking.

@j08lue closed this as not planned on May 30, 2023