Add a test set of CMIP6 datasets #204
Thanks for taking the lead on this @moradology! Let me know if I can be helpful in any way!
@leothomas This is a complicated dataset, so I appreciate the assistance. Looking at the public storage, I'm finding more categorizations than I know how to navigate at the moment. In particular, …
I'm also curious about the meaning of this part of the s3 URL: …
Another small thing: the ingests are currently built such that each item gets a …
This is a tough question that we've discussed a bit here. Some of the options we identified:
Absolutely agreed, and this is the way that Microsoft Planetary Computer has organized their CMIP6 STAC Collection. We made the assumption that each STAC Item would have a single asset due to a limitation in the way the VEDA VisEx searches for available collections to display in the frontend. Not sure if that limitation still stands, or if the dashboard would be able to use the … Note: Tom Augspurger has also proposed a CMIP6 STAC extension.
Got it, I'm very sorry to hear that about your PR! Thinking a bit more on this:
While I agree that this is completely reasonable, I think you could make the argument that any of the dataset's sub-classes could be assets, e.g.:
Would it make sense to have 3 collections: …
Also, I'm sorry I forgot to respond to this:
It looks like …
That Google document is super helpful! I wonder if we could export a PDF, put that in veda-data-store-staging or somewhere that ASDI stores assets, and either link it as an asset in the item metadata or from the collection at some point.
Here's a PDF version: CMIP6_global_attributes_filenames_CVs.pdf! I definitely agree that it would be a good thing to have somewhere in terms of dataset documentation.
@moradology, just for reference, this is the CMIP6 dashboard we will be deprecating: https://cmip6.surge.sh. I think there is a specific issue for comparing the new data to the deprecated dashboard, but I thought the link might be helpful here, too. We obviously don't have to match the previous metadata structure, but it might be helpful for comparison when making decisions.
Ping @amarouane-ABDELHAK for advice on payload passing via S3: https://docs.aws.amazon.com/step-functions/latest/dg/avoid-exec-failures.html
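As context for that link: the usual workaround for the Step Functions payload size limit is to stash the large payload in S3 and pass only a small reference between states. A minimal sketch with boto3, assuming a scratch bucket (the bucket name and key prefix below are placeholders, not part of the actual pipeline):

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
PAYLOAD_BUCKET = "veda-pipeline-scratch"  # placeholder bucket name


def stash_payload(payload: dict) -> dict:
    """Write a large payload to S3 and return a small reference that can be
    passed safely between Step Functions states."""
    key = f"sfn-payloads/{uuid.uuid4()}.json"
    s3.put_object(Bucket=PAYLOAD_BUCKET, Key=key, Body=json.dumps(payload))
    return {"payload_s3_uri": f"s3://{PAYLOAD_BUCKET}/{key}"}


def load_payload(ref: dict) -> dict:
    """Resolve the reference back into the full payload inside a downstream task."""
    bucket, key = ref["payload_s3_uri"].removeprefix("s3://").split("/", 1)
    return json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
```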
Reading the Example Notebook for accessing CMIP6 data (https://planetarycomputer.microsoft.com/dataset/nasa-nex-gddp-cmip6#Example-Notebook) from Microsoft Planetary Computer's page, I learned that the CMIP6 data is stored in NetCDF format. It appears that this CMIP6 NetCDF collection has a full list of files and associated MD5 hashes (at https://nex-gddp-cmip6.s3.us-west-2.amazonaws.com/index.html). Comparing this with the COG bucket (…), and since there is interest in thinking about how to do ingestion of large collections like this, why not have a separate ingestion process where the input is a list of COGs, possibly with an MD5 hash, which is used as input for the ingestor instead of doing S3 discoveries using regex?
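To illustrate that suggestion (this is not an existing pipeline feature, just a sketch): discovery could start from an inventory file of COG keys and hashes rather than from a regex listing of S3. The CSV column names and the payload shape below are assumptions:

```python
import csv


def payloads_from_inventory(inventory_csv: str, bucket: str):
    """Yield ingest payloads from an inventory file of (key, md5) rows,
    skipping the S3 listing / regex discovery step entirely."""
    with open(inventory_csv, newline="") as fh:
        for row in csv.DictReader(fh):  # assumed columns: key, md5
            yield {
                "s3_url": f"s3://{bucket}/{row['key']}",
                "expected_md5": row["md5"],
            }


# Hypothetical usage:
# payloads = list(payloads_from_inventory("cmip6_cog_inventory.csv", "nex-gddp-cmip6-cog"))
```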
Thanks for pushing this further, @vlulla. Are you suggesting that we keep a list of all objects in S3 and their checksums? I think that would not be necessary here, since S3 already has checksums for all objects in the object metadata. I think there are two possible goals:
I think for the first goal (upload), calculating the checksum locally and submitting it with the S3 upload for automatic validation would be best - no need to also store checksums anywhere else. Not sure how relevant the second (download) goal is, since we do not want users to make local copies but load the data directly from cloud storage and perhaps only portions of each file. But if they do download from S3, they can also access the checksum values stored with the object.
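For the upload side, S3 performs that validation automatically when the request includes a Content-MD5 value. A minimal boto3 sketch (bucket and key are placeholders):

```python
import base64
import hashlib

import boto3


def upload_with_md5_check(path: str, bucket: str, key: str) -> None:
    """Upload a file and have S3 reject the write if the received bytes do not
    match the locally computed MD5 (Content-MD5 validation)."""
    with open(path, "rb") as fh:
        data = fh.read()
    md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode()
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=data, ContentMD5=md5_b64)
```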
Good points, @j08lue. They make perfect sense. I was thinking about the usability a bit more broadly than just [up/down]loads. Let's say once a large collection is ingested successfully and nothing changes, then instead of looking for what is available by doing either … Anyway, it appears to me that this requires more consideration and thought.
I see! I remember seeing large collections elsewhere that come with some kind of inventory stored in an auxiliary file. They probably serve what you mentioned - an authoritative alternative to listing the buckets, which is cumbersome. I'd say let us consider this as an enhancement of our catalogue service, but not a (blocking) requirement for the CMIP6 collection. Ok? I opened a new issue.
I don't know if MD5s are an accepted field in STAC Items, but I think it could be an option to calculate the MD5 of the file and add it to the STAC Item as part of the ingestion process. STAC Catalogs are intended to be very easy to parse automatically.
I agree, @leothomas, that STAC might also fulfil the same needs as an inventory file. Would be great to understand the user needs behind this first, in the issue linked here: https://github.com/NASA-IMPACT/veda-architecture/issues/138
I think that storing the MD5 (or some other hash) as a part of the STAC also has the added benefit of providing the ability to verify that the file[s]/asset[s] provided/rendered are in fact the same. This could also allow us to detect issues with COG generation.
In addition to the inventory use case discussion, which I think is super valuable especially in terms of informing ingest processes, there is a stable file stac-spec extension for storing checksums (and other file info) in STAC Item Assets that we could adopt or learn from.
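For reference, the file extension records checksums as multihash strings under `file:checksum` on each asset. A rough sketch of what that could look like when assembling item metadata, using plain dicts and SHA-256 (whose multihash prefix is simply `1220`); the media type matches the `cog_default` asset definition in this issue, everything else is illustrative:

```python
import hashlib


def asset_with_checksum(href: str, data: bytes) -> dict:
    """Build a STAC asset dict that carries file:checksum and file:size.
    The file extension expects a multihash: for SHA-256 that is the two-byte
    prefix 0x12 0x20 followed by the hex-encoded digest."""
    return {
        "href": href,
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "roles": ["data", "layer"],
        "file:checksum": "1220" + hashlib.sha256(data).hexdigest(),
        "file:size": len(data),
    }


# The item's stac_extensions list also needs the file extension schema URI,
# e.g. "https://stac-extensions.github.io/file/v2.1.0/schema.json"
```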
I'd like to propose a Pilot/MVP and suggest we narrow down our goals to start with. If we work with monthly data, we have fewer immediate metadata concerns, no scaling concerns, and a deliverable for the front end.
I'd like to test an alternate PR that will not require custom orchestration (I may try to throw this together tomorrow against the SFN branch just to have a look at it)
This subset will let us start supporting the dashboard sooner and begin evaluating how we want to manage model data going forward. I'm sure as soon as we have something live we will begin to see new concerns for handling this data that we can use to inform the next step.

Inquire about need for p10 and p90 monthly ensembles
Did we need any additional metadata for the median monthly ensemble variables?

Consider daily ingests
1 model-scenario-variable = ~55K COG assets
I added my update to the wrong issue (#231)
The historical data indexing needs more consideration: I think we will find that the distinct time ranges of the historical and modeled data mean that different collections are needed for automated dashboard components like the date picker. I am going to update the pilot monthly ingest to separate historical and modeled because:
We circumvented this issue in the cmip6 demo dashboard by duplicating the historical data into each model's collection. All the models had a time range of [1950-2100], where the data for [1950-2015] is the same historical data for each model. Not sure if this is a valid approach for VEDA.
@leothomas thanks!
I think this could be the most intuitive approach. In that case we could manage all of the monthly ensemble data in two filterable collections that cover 1950-2100. I am going to mock up this organization in dev and give it a test run with the tiler (filtered search --> mosaic_id --> tiles).
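For anyone following along, the flow being tested is: register a filtered STAC search with the tiler, then request tiles for the returned mosaic/search id. The sketch below is only indicative; the host, endpoint paths, response keys, and the `cmip6:model` / `cmip6:scenario` property names are assumptions and will vary with the titiler-pgstac version deployed:

```python
import requests

TILER = "https://titiler-pgstac.example.test"  # placeholder host

# Register a filtered search (CQL2-JSON) so the tiler can mosaic it.
search = {
    "collections": ["cmip6-monthly-ssp585"],  # hypothetical collection id
    "filter-lang": "cql2-json",
    "filter": {
        "op": "and",
        "args": [
            {"op": "=", "args": [{"property": "cmip6:model"}, "GFDL-CM4"]},
            {"op": "=", "args": [{"property": "cmip6:scenario"}, "ssp585"]},
        ],
    },
}
resp = requests.post(f"{TILER}/mosaic/register", json=search, timeout=30)
body = resp.json()
mosaic_id = body.get("searchid") or body.get("id")  # key name differs across versions

# Tile URL template for the dashboard; assets/rescale parameters are illustrative.
tile_url = f"{TILER}/mosaic/{mosaic_id}/tiles/{{z}}/{{x}}/{{y}}?assets=cog_default&rescale=200,320"
print(tile_url)
```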
Placeholder update: I selected the monthly collections from the original PR associated with CMIP6 ingestion and @leothomas's suggestion and created two collections, each with historical and experiment data combined to provide a continuous series of dates. With the caveat of a datetime formatting bug in the pipelines API and a missing year of data from two variables in the SSP585 series, we should now have enough ingested dev data to test out using filtered CMIP6 data with the dashboard. I manually created filtered summaries for the two collections and will provide better documentation to pull things together for the dashboard, but for now here are summaries of what is in the dev stack and how I captured the summaries.

CMIP6 Monthly Historical and SSP585 Experiments (Aggregated GFDL-CM4 and GISS-E2-1 Models)
CMIP6 Monthly Historical and SSP245 Experiments (Aggregated GFDL-CM4 and GISS-E2-1 Models)
Summary method SQL
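The SQL itself is not reproduced above. As a rough illustration of the kind of query involved, assuming a pgstac database where item properties live under `content->'properties'` in `pgstac.items`, and with hypothetical `cmip6:model` / `cmip6:scenario` property names:

```python
import psycopg  # psycopg 3

SUMMARY_SQL = """
SELECT DISTINCT
    content->'properties'->>'cmip6:model'    AS model,
    content->'properties'->>'cmip6:scenario' AS scenario
FROM pgstac.items
WHERE collection = %s
ORDER BY model, scenario;
"""


def summarize(dsn: str, collection_id: str) -> list[tuple[str, str]]:
    """Pull distinct model/scenario values for one collection to seed the summaries."""
    with psycopg.connect(dsn) as conn:
        return conn.execute(SUMMARY_SQL, (collection_id,)).fetchall()
```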
The next step for this issue is to do a working session with the front end team to see if the filters demoed in this gist can be implemented in the front end. The methods should also be cleaned up and reviewed with Vincent to see if we are approaching the tiling properly as well (I'm not dropping an …).
I temporarily pulled the ready label off this issue until we can better constrain the reduced scope |
Keeping this ticket for reference but not for tracking. |
Research + prepare processing for the dataset: Identify the dataset and what the processing needs are
The data is hosted in the S3 Bucket: s3://nex-gddp-cmip6-cog/
Already cogs: yes
Point of contact: @leothomas
Design the metadata and publish to the Dev API
collection grouping:
s3://nex-gddp-cmip6-cog/daily/
id: cmip6-daily-___
title and description: TBD; collection specific
dashboard:is_periodic: True
dashboard:time_density: daily
collections:
collection grouping:
s3://nex-gddp-cmip6-cog/monthly/
id: cmip6-monthly-___
title and description: TBD; collection specific
dashboard:is_periodic: True
dashboard:time_density: monthly
collections:
collection grouping:
s3://nex-gddp-cmip6-cog/crossing/
title: Projected date of 4 degree average temperature increase
description: The projected date at which the average temperature in a given region is anticipated to have risen by 4 degrees under different scenarios
dashboard:is_periodic: False
shared metadata:
item_assets:
{ "cog_default": { "type": "image/tiff; application=geotiff; profile=cloud-optimized", "roles": [ "data", "layer" ], "title": "Default COG Layer", "description": "Cloud optimized default layer to display on map" } }
license: CC0-1.0
temporal_interval: jan, 1950 - dec, 2100
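Pulling the shared and per-grouping fields above together, a skeleton of what one of these collections could look like (written as a Python dict; the id, title, description, and spatial extent are placeholders, and only the fields listed in this issue are filled in):

```python
# Sketch only: a plausible STAC Collection skeleton for one monthly grouping.
cmip6_monthly_collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "cmip6-monthly-tasmax",  # placeholder following the cmip6-monthly-___ pattern
    "title": "TBD",
    "description": "TBD",
    "license": "CC0-1.0",
    "dashboard:is_periodic": True,
    "dashboard:time_density": "monthly",
    "item_assets": {
        "cog_default": {
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data", "layer"],
            "title": "Default COG Layer",
            "description": "Cloud optimized default layer to display on map",
        }
    },
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},  # assumed global coverage
        "temporal": {"interval": [["1950-01-01T00:00:00Z", "2100-12-31T23:59:59Z"]]},
    },
    "links": [],
}
```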
Step 1 from issue #191