CMIP6 COGs #191
Lastly, @sharkinsspatial made a really great point: if the main purpose of reprocessing all the NetCDF files into COGs is just to display them in a visual dashboard, we may be approaching the problem wrong. We can use tooling that provides "tile-like" access to the data without actually reprocessing it. The idea would be to build an index file that serves data from the NetCDFs in a zarr-like fashion, and then use the Zarr-TiTiler extension @vincentsarago has been working on to serve the data directly from the NetCDF files, eliminating the need for the CMIP6 COG dataset entirely. This should be discussed further to establish feasibility and timeline.
@leothomas are you talking about kerchunk? The issue for visualisation is then that you don't have overviews, which limits you to reading the raw resolution. IMO, when we need visualization, and if we can, we should create a COG (this might change if Zarr supports overviews in the future). Zarr-TiTiler is not yet ready and I'm pretty afraid of the performance!
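For reference, a minimal sketch of what the kerchunk-style indexing step could look like; the source key, variable name, and anonymous-access assumption are all illustrative, not the actual NEX-GDDP-CMIP6 layout:

```python
# Sketch: build a zarr-compatible reference index for one NetCDF so it can be
# read "zarr-like" without reprocessing. Paths are hypothetical placeholders.
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

src = "s3://nex-gddp-cmip6/SOME-MODEL/tasmax_day_2015.nc"  # hypothetical key

with fsspec.open(src, mode="rb", anon=True) as f:
    # Scan the NetCDF/HDF5 internals and record byte ranges for each chunk
    refs = SingleHdf5ToZarr(f, src, inline_threshold=300).translate()

# The resulting reference file can be opened as a zarr store (e.g. by xarray
# or, eventually, a Zarr-aware TiTiler) without touching the original data.
with open("tasmax_day_2015.json", "w") as out:
    json.dump(refs, out)
```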
While the quest for rendering from Zarr etc. is ongoing, we should finish getting the COGs into VEDA for now and consider replacing the setup when a Zarr-native (or otherwise better) solution is available.
Superseded by #204 |
I've finally gotten the CMIP6 COGs registered in AWS Open Data! This means I can begin handing off the task of ingesting them into the VEDA data holdings.
The data is hosted in the S3 Bucket: s3://nex-gddp-cmip6-cog
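For anyone who wants to poke at the bucket, a quick listing sketch (this assumes anonymous read access is enabled, as is typical for AWS Open Data buckets):

```python
# List a few objects from the public CMIP6 COG bucket without credentials.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="nex-gddp-cmip6-cog", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```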
Next steps that I see for getting the CMIP6 dataset into VEDA are:
Discuss the design of the STAC collection(s) for the CMIP6 dataset: this issue further details the complexity that CMIP6 brings in terms of STAC, but the gist of it is that the CMIP6 data can be organized at many different levels and sub-levels (i.e. the data can be grouped by any combination of model, variable, and ssp). If we create one collection per unique model/variable/ssp combination, that would entail 34 new collections for the CMIP6 daily data, 6 collections for the monthly data, and one more collection for a special data product called CrossingYear. A rough sketch of one such collection is below.
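To make the trade-off concrete, here is a pystac sketch of the "one collection per model/variable/ssp combination" option; the id scheme, extents, and the example combination are assumptions for discussion, not a final design:

```python
# Hypothetical collection for a single model/variable/ssp combination.
from datetime import datetime

import pystac

model, variable, ssp = "GISS-E2-1-G", "tasmax", "ssp585"  # illustrative combo

collection = pystac.Collection(
    id=f"CMIP6_daily_{model}_{ssp}_{variable}",
    description=f"Daily {variable} from {model} under {ssp} (NEX-GDDP-CMIP6 COGs)",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180, -90, 180, 90]]),
        temporal=pystac.TemporalExtent([[datetime(2015, 1, 1), datetime(2100, 12, 31)]]),
    ),
    license="proprietary",
)
```

Multiplied across every combination, this is where the 34 daily + 6 monthly + 1 CrossingYear collections come from, which is why the grouping level needs to be agreed on first.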
Discuss an indexing strategy for the PgSTAC database:
A quick description of the technical problem we encountered with PgSTAC < v0.5 when generating partitions for a dataset with a temporal range as large as CMIP6's:
Now that indexing strategies are customizable in PgSTAC, we should establish one for this dataset. David Bitner suggested that the data has such low temporal density (1 file/day AT MOST) that we may not want to partition the dataset for dates outside the ~30-year period (approx. 2000-2030) during which most of the VEDA datasets overlap, or may not want to partition it at all (see the sketch below).
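If I'm reading the newer PgSTAC behavior correctly, per-collection partitioning is driven by a partition_trunc field on the collection record ('year', 'month', or NULL), so the "don't partition at all" option could look roughly like this; treat it as an untested sketch with placeholder connection details and collection id:

```python
# Sketch (untested): disable temporal partitioning for the CMIP6 collection.
# Assumes PgSTAC >= 0.5, where pgstac.collections carries a partition_trunc
# column; the connection string and collection id are placeholders.
import psycopg

with psycopg.connect("postgresql://user:pass@host:5432/postgis") as conn:
    conn.execute(
        "UPDATE pgstac.collections SET partition_trunc = NULL WHERE id = %s",
        ("CMIP6_daily_GISS-E2-1-G_ssp585_tasmax",),
    )
```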
Missing data: I have identified 54 missing files (out of ~1.63M expected files, i.e. 0.0033% loss, not bad!). The missing data likely exists in the raw NetCDF data (in s3://nex-gddp-cmip6). Since each NetCDF contains a year's worth of data and each COG is for a single day, the 54 missing COGs come from only 11 different NetCDFs, meaning only those 11 files need to be reprocessed. See the attached file for the list of missing files:
missing_files.txt
[OPTIONAL] Set up a validation pipeline: the COG data (in s3://nex-gddp-cmip6-cog) is hosted in the NASA SMCE AWS account. Some reprocessing was necessary, for which I set up an SQS queue + Lambda function in the NASA Impact/UAH AWS account. This same setup could be reused to run rio cogeo validate to ensure that all the expected data is present and correctly formatted (a rough sketch of such a handler is below).
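A rough sketch of what the validation handler could look like if we reuse that SQS -> Lambda pipeline; the message shape (one COG key per SQS record body) and the function wiring are assumptions based on the description above:

```python
# Hypothetical Lambda handler: validate each COG named in the incoming SQS
# messages with rio-cogeo and report anything that fails.
from rio_cogeo.cogeo import cog_validate

def handler(event, context):
    invalid = []
    for record in event["Records"]:              # one SQS record per COG key
        key = record["body"].strip()             # assumed message format
        url = f"s3://nex-gddp-cmip6-cog/{key}"
        # cog_validate returns (is_valid, errors, warnings)
        is_valid, errors, _warnings = cog_validate(url)
        if not is_valid:
            invalid.append({"key": key, "errors": errors})
    return {"invalid": invalid}
```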
[HOUSEKEEPING]:
Delete the CMIP6 data from the NASA Impact/UAH account (in s3://climatedashboard-data/cmip6) to avoid unnecessary storage costs and data duplication.
Delete the SQS queue and Lambda function used for reprocessing the data in the SMCE account to avoid further unnecessary costs from empty "receives" by the SQS queue (sketch below).
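For completeness, a sketch of what the cleanup could look like with boto3; the queue and function names are placeholders, and this should obviously only run once the Open Data copy is confirmed complete:

```python
# Sketch: remove the duplicated COGs and tear down the reprocessing pipeline.
# All resource names below are placeholders, not the real queue/function names.
import boto3

# 1. Delete the duplicated CMIP6 data from the NASA Impact/UAH bucket
s3 = boto3.resource("s3")
s3.Bucket("climatedashboard-data").objects.filter(Prefix="cmip6/").delete()

# 2. Delete the SQS queue used for reprocessing
sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="cmip6-cog-reprocessing")["QueueUrl"]
sqs.delete_queue(QueueUrl=queue_url)

# 3. Delete the reprocessing Lambda function
boto3.client("lambda").delete_function(FunctionName="cmip6-cog-reprocessing")
```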