Custom data sources #79

nbren12 (Maintainer) · 2023-10-23T17:38:12Z

@NickGeneva How do you think we should support custom data sources? We can support certain file formats (i.e. hdf5+json) in the main code base, but sometimes users won't want to transform their data. Should we have a plugin mechanism for data similar to the one for models? Do we need a config layer for this?

@ankurmahesh Any thoughts?

ankurmahesh · 2023-10-25T19:56:36Z

I think it would be great to have a plugin mechanism for the dataset, similar to the model. When I used the codebase in a new computing environment, this was a large part of the integration. Such a mechanism would be very useful.

Do you think it would be worth supporting the ERA5 mirror in the format it's stored in here? Because this data has a globus endpoint, it's the most common mirror of ERA5 that I have seen on a few different computing environments.

If you think it would be worth supporting, I wrote a version of open_era5_xarray that works with this file structure. I could work this into a plugin mechanism for the data to load initial conditions.
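As a rough illustration of what such a data plugin could look like, here is a minimal sketch. It assumes a data source is simply an object with a list of channel names that can be indexed by time. The names `DataSource`, `CallableDataSource`, and `loader` are hypothetical and are not taken from the earth2mip code base; the loader stands in for any function like the ERA5 reader mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, List, Protocol


class DataSource(Protocol):
    """Hypothetical interface: indexing by time returns the requested channels."""

    channel_names: List[str]

    def __getitem__(self, time: datetime) -> Any: ...


@dataclass
class CallableDataSource:
    """Adapts any user-supplied loader function to the interface above."""

    channel_names: List[str]
    # Hypothetical signature: loader(time, channel_names) -> array-like
    loader: Callable[[datetime, List[str]], Any]

    def __getitem__(self, time: datetime) -> Any:
        # Delegate to the user's loader; no transformation of the data on disk.
        return self.loader(time, self.channel_names)
```

With something like this, a site-specific reader only has to be wrapped once, e.g. `CallableDataSource(["t2m", "u10m"], my_loader)[datetime(2018, 1, 1)]`, where `my_loader` is whatever function reads the local file layout.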

This format is fine for initial conditions, but it isn't sustainable for the scoring pipeline. I originally loaded the data from this directory structure. Eventually, though, I ended up reworking the data loader into a script that transforms the ERA5 data into a 73-channel preprocessed version. For the score_ensemble_outputs.py script, the I/O times became prohibitively slow (~30 minutes on a CPU node on Perlmutter) once it was necessary to load 60+ ERA5 time steps.


nbren12 (Maintainer, Author) · 2023-10-25T21:54:47Z

Great to hear from you @ankurmahesh. Indeed, implementing a plugin as an entrypoint for data sources makes sense. However, for now it is possible to score with a custom data source using the Python API like this: https://github.com/NVIDIA/earth2mip#basic-inference.

We've been discussing the merits of CLIs vs Python APIs. We plan to support both, but Python APIs will always be more flexible for new use cases. We could use your feedback on this.
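For reference, the entrypoint idea for data sources could be sketched roughly as follows, using only the standard library's `importlib.metadata`. The group name `earth2mip.data_sources` and the function `discover_data_sources` are assumptions made for illustration; nothing in this thread confirms that such a group exists in the project.

```python
from importlib.metadata import entry_points
from typing import Any, Dict

# Hypothetical entry-point group name, chosen only for this sketch.
DATA_SOURCE_GROUP = "earth2mip.data_sources"


def discover_data_sources(group: str = DATA_SOURCE_GROUP) -> Dict[str, Any]:
    """Return {entry point name: loaded object} for every installed plugin in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict of group -> entry points.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

A third-party package would then advertise its data source under that group in its packaging metadata (for example a `[project.entry-points."earth2mip.data_sources"]` table in `pyproject.toml`), and a CLI could look the source up by name without the main code base knowing about the file format.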

