Gridded Data redesign #712
Replies: 4 comments
Grid 2.0
The next version of Grid (ucar.ft2.coverage) explored doing away with index-based access and using only coordinate access,
but some index-based operations remained:
A change in schema creates a separate Group in the NetcdfDataset, which becomes a separate CoverageCollection. So a dataset turns into a FeatureDatasetCoverage, which contains one or more CoverageCollections, and within each collection all coverages share the same schema. The main problem is that things are complicated, and simple things aren't simple.
Grid 3.0
Now in version 7, the next version of Grid (ucar.nc2.grid) gets rid of the legacy packages ucar.nc2.time and ucar.ma2. Grid is not built on top of NetcdfDataset, which is important for scaling to large collections. A NetcdfDataset can be materialized, but only when needed. Consider a large collection that just stores time start/stop and resolution, and materializes the coordinate values when read.
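A minimal sketch of the "store start/stop and resolution, materialize on read" idea. This is not the ucar.nc2.grid API: the class `RegularTimeAxis` and its methods are hypothetical names, and the axis is assumed to be regularly spaced with times held as plain numeric offsets.

```java
import java.util.Arrays;

/** Hypothetical sketch: a regularly spaced time axis kept as (start, resolution, count). */
final class RegularTimeAxis {
  private final double start;      // first time value, e.g. hours since a reference date
  private final double resolution; // spacing between consecutive times
  private final int count;         // number of time steps

  RegularTimeAxis(double start, double resolution, int count) {
    if (resolution <= 0 || count <= 0) {
      throw new IllegalArgumentException("resolution and count must be positive");
    }
    this.start = start;
    this.resolution = resolution;
    this.count = count;
  }

  int size() {
    return count;
  }

  /** Computes one coordinate value on demand, without allocating the whole axis. */
  double value(int index) {
    if (index < 0 || index >= count) {
      throw new IndexOutOfBoundsException("index " + index + " not in [0, " + count + ")");
    }
    return start + index * resolution;
  }

  /** Materializes every coordinate value; called only when a reader needs the full array. */
  double[] materialize() {
    double[] values = new double[count];
    for (int i = 0; i < count; i++) {
      values[i] = value(i);
    }
    return values;
  }

  public static void main(String[] args) {
    // Five 6-hourly steps starting at hour 0 (illustrative numbers).
    RegularTimeAxis axis = new RegularTimeAxis(0.0, 6.0, 5);
    System.out.println(axis.value(3));
    System.out.println(Arrays.toString(axis.materialize()));
  }
}
```

The memory cost of the axis is constant regardless of how many time steps the collection covers, which is the property that matters for large collections.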
Changes from coverage:
In particular, maybe add a TimeCoordinateSystem that handles the details of the time coordinate, which can be complicated in an FMRC.
GRIB Collections
See discussion #732.
I'm doing a pretty complete Grid 3.0 redesign in PR #739. The main driver is a GRIB implementation that is not built on a netCDF IOSP. The IOSP forces you to work in rectangular arrays and materialized coordinate systems (i.e. ones where all the coordinates and their sizes are known). So this is an experiment to see what an API could look like free of those constraints. Does it solve some of the impedance mismatch of GRIB? There are 2 other implementations that put forces on the design: 1) nc2.internal.grid is based on a NetcdfDataset, and 2) GcdmGrid provides remote access to a GridDataset. Those are important use cases. It would be nice to also have a large netCDF dataset collection to test with. At our last meeting we mentioned a CMIPS dataset.
Background
The essence of NetcdfFile is index space subsetting, using Fortran-style array manipulation.
A Variable in a NetcdfFile has a list of Dimensions that define its shape. The order of
the Dimensions and the returned Array simply reflect whatever was stored into the netCDF file:
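As an illustration of index space subsetting, here is a minimal sketch that reads a rectangular section from a 2D variable. It does not use the netCDF-Java API: `IndexSubset.readSection` is a hypothetical helper, and the variable is assumed to be stored as a flat row-major array.

```java
/** Hypothetical sketch of index-space subsetting on a row-major 2D array stored flat. */
final class IndexSubset {
  /** Returns the section starting at origin with the given size, in row-major order. */
  static double[] readSection(double[] data, int[] shape, int[] origin, int[] size) {
    for (int d = 0; d < 2; d++) {
      if (origin[d] < 0 || size[d] < 1 || origin[d] + size[d] > shape[d]) {
        throw new IllegalArgumentException("section out of bounds on dimension " + d);
      }
    }
    double[] out = new double[size[0] * size[1]];
    int k = 0;
    for (int row = 0; row < size[0]; row++) {
      for (int col = 0; col < size[1]; col++) {
        // Row-major offset: the last index varies fastest.
        out[k++] = data[(origin[0] + row) * shape[1] + (origin[1] + col)];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    double[] data = new double[12]; // a 3 x 4 variable holding 0..11
    for (int i = 0; i < data.length; i++) {
      data[i] = i;
    }
    double[] section = readSection(data, new int[] {3, 4}, new int[] {1, 1}, new int[] {2, 2});
    System.out.println(java.util.Arrays.toString(section));
  }
}
```

The caller has to know the dimension order and sizes in advance; nothing in the request refers to real-world coordinates.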
A NetcdfDataset identifies the coordinate axes of the Variable, which together are called its coordinate system. A CoordinateAxis that is
one dimensional and monotonic is a coordinate variable. A CoordinateAxis1D allows lookup from a coordinate value to an index:
CoordinateAxis1D.findCoordElement(double coordVal) -> index
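The coordinate-to-index lookup can be sketched as a binary search over a monotonic axis. This is not the library's implementation of findCoordElement: `CoordLookup.findNearestIndex` is a hypothetical name, it works on point values rather than cell edges, and it returns the nearest endpoint for out-of-range targets, which is a choice made for this sketch only.

```java
/** Hypothetical sketch: coordinate value -> index on a strictly monotonic 1D axis. */
final class CoordLookup {
  /** Returns the index of the coordinate nearest to target (ascending or descending axis). */
  static int findNearestIndex(double[] coords, double target) {
    if (coords.length == 0) {
      throw new IllegalArgumentException("empty axis");
    }
    boolean ascending = coords.length < 2 || coords[1] > coords[0];
    int lo = 0;
    int hi = coords.length - 1;
    // Narrow [lo, hi] down to the two neighbors that bracket target.
    while (hi - lo > 1) {
      int mid = (lo + hi) >>> 1;
      boolean targetIsPastMid = ascending ? target >= coords[mid] : target <= coords[mid];
      if (targetIsPastMid) {
        lo = mid;
      } else {
        hi = mid;
      }
    }
    // Pick whichever neighbor is closer; ties go to the lower index.
    return Math.abs(coords[lo] - target) <= Math.abs(coords[hi] - target) ? lo : hi;
  }

  public static void main(String[] args) {
    double[] lat = {0.0, 10.0, 20.0, 30.0};
    System.out.println(findNearestIndex(lat, 12.0));
  }
}
```

Monotonicity is what makes the search valid, which is why the lookup is tied to one-dimensional monotonic axes.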
A georeferencing coordinate system allows geolocating a data value to the real world in (lat, lon, vertical, time) coordinates. The vertical coordinate may be either altitude or pressure. This underpins the feature type datasets, of which FeatureType = Grid is the most common and the most extensively developed.
Grid 1.0
The early version of Grid (ucar.nc2.dt) has various bridges between index and coordinate:
While data reading is still done in index space, the returned data is placed into canonical order:
where the indices are associated with coordinates, and the returned Array is placed into canonical row-major order based on a standard order of coordinates (runtime, ensemble, time, vertical, y/lat, x/lon).
This assumes that 1) the coordinates are all one dimensional, and 2) the shape is constant across time.
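Placing data into canonical order amounts to a permutation of the stored axes. A minimal sketch, assuming each stored dimension has already been labeled with one of the roles listed above; the class `CanonicalOrder`, its methods, and the role strings are hypothetical and not part of ucar.nc2.dt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: reorder a row-major array into a canonical axis order. */
final class CanonicalOrder {
  // Canonical order quoted from the text; the role names are illustrative labels.
  static final List<String> CANONICAL =
      Arrays.asList("runtime", "ensemble", "time", "vertical", "y", "x");

  /** perm[k] is the stored position of the k-th axis in canonical order. */
  static int[] permutation(List<String> storedRoles) {
    List<Integer> perm = new ArrayList<>();
    for (String role : CANONICAL) {
      int pos = storedRoles.indexOf(role);
      if (pos >= 0) {
        perm.add(pos);
      }
    }
    if (perm.size() != storedRoles.size()) {
      throw new IllegalArgumentException("unknown or duplicate axis role in " + storedRoles);
    }
    int[] result = new int[perm.size()];
    for (int k = 0; k < result.length; k++) {
      result[k] = perm.get(k);
    }
    return result;
  }

  /** Copies a row-major array into a new row-major array whose axes follow perm. */
  static double[] permute(double[] data, int[] shape, int[] perm) {
    int rank = shape.length;
    int[] stride = new int[rank]; // row-major strides of the stored array
    int s = 1;
    for (int d = rank - 1; d >= 0; d--) {
      stride[d] = s;
      s *= shape[d];
    }
    int[] outShape = new int[rank];
    for (int k = 0; k < rank; k++) {
      outShape[k] = shape[perm[k]];
    }
    double[] out = new double[data.length];
    int[] idx = new int[rank]; // odometer over the output index space
    for (int n = 0; n < out.length; n++) {
      int src = 0;
      for (int k = 0; k < rank; k++) {
        src += idx[k] * stride[perm[k]];
      }
      out[n] = data[src];
      // Advance the odometer, last axis fastest.
      for (int k = rank - 1; k >= 0; k--) {
        if (++idx[k] < outShape[k]) {
          break;
        }
        idx[k] = 0;
      }
    }
    return out;
  }
}
```

For example, a variable stored as (x, y) would get the permutation {1, 0} and be returned as (y, x).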
The common case in atmospheric science data is that x and y are not lat and lon, but either Cartesian coordinates in some projection (typical for model data), or simple indices in an image (satellite or remote imaging data). In this case there is a complicated 2D Projection function from (x,y) <--> (lat,lon).
For model data where the Projection is not specified, or for satellite data where there may not be an easily computable function, one can supply the lat, lon coordinate arrays:
and assume that one can interpolate between values. These coordinates are called curvilinear. Satellite data may be even more complicated, in which case it is no longer handled as a Grid, but as an Image feature type.
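One way to "interpolate between values" in a 2D coordinate array is bilinear interpolation at a fractional index position. This is a sketch under that assumption, not how the library handles curvilinear grids: `Curvilinear.interpolate` is a hypothetical helper, and it ignores longitude wrap-around at the dateline.

```java
/** Hypothetical sketch: bilinear interpolation inside a 2D (curvilinear) coordinate array. */
final class Curvilinear {
  /**
   * Interpolates coord at fractional index position (fy, fx), where coord[j][i]
   * holds the latitude (or longitude) of grid point (j, i).
   */
  static double interpolate(double[][] coord, double fy, double fx) {
    if (coord.length < 2 || coord[0].length < 2) {
      throw new IllegalArgumentException("need at least a 2 x 2 coordinate array");
    }
    int ny = coord.length;
    int nx = coord[0].length;
    if (fy < 0 || fy > ny - 1 || fx < 0 || fx > nx - 1) {
      throw new IllegalArgumentException("position outside the grid");
    }
    // Lower corner of the enclosing cell, clamped so the upper corner stays in range.
    int j = (int) Math.min(Math.floor(fy), ny - 2);
    int i = (int) Math.min(Math.floor(fx), nx - 2);
    double dy = fy - j;
    double dx = fx - i;
    double lower = coord[j][i] * (1 - dx) + coord[j][i + 1] * dx;
    double upper = coord[j + 1][i] * (1 - dx) + coord[j + 1][i + 1] * dx;
    return lower * (1 - dy) + upper * dy;
  }

  public static void main(String[] args) {
    double[][] lat = {{40.0, 40.5}, {41.0, 41.5}}; // illustrative values
    System.out.println(interpolate(lat, 0.5, 0.5));
  }
}
```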
Often the vertical coordinates for model data are pressure-based rather than altitudes. This is handled by a vertical transform function: VT(z) -> altitude or pressure. For some vertical coordinates, e.g. terrain-following sigma coordinates, the transformation requires VT(x,y,z) -> pressure.
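As an example of a VT(x,y,z) -> pressure transform, the sketch below uses the formula that the CF conventions give for the atmosphere sigma coordinate, p(k,y,x) = ptop + sigma(k) * (ps(y,x) - ptop). The class `SigmaTransform` and its signature are hypothetical, and time dependence of the surface pressure is left out.

```java
/** Hypothetical sketch of a vertical transform for an atmosphere sigma coordinate. */
final class SigmaTransform {
  /** Returns pressure[k][y][x] = ptop + sigma[k] * (ps[y][x] - ptop). */
  static double[][][] pressure(double[] sigma, double[][] ps, double ptop) {
    int nz = sigma.length;
    int ny = ps.length;
    int nx = ps[0].length;
    double[][][] p = new double[nz][ny][nx];
    for (int k = 0; k < nz; k++) {
      for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
          // The surface pressure varies with (y, x), so the result is 3D.
          p[k][y][x] = ptop + sigma[k] * (ps[y][x] - ptop);
        }
      }
    }
    return p;
  }

  public static void main(String[] args) {
    double[] sigma = {1.0, 0.5, 0.0};      // illustrative levels
    double[][] ps = {{1000.0, 900.0}};     // surface pressure, hPa
    double[][][] p = pressure(sigma, ps, 100.0);
    System.out.println(p[1][0][1]);
  }
}
```

Because the surface pressure depends on horizontal position, a single sigma level maps to a different pressure at each grid point, which is why the transform needs (x,y,z) rather than z alone.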
When we tried to work with large GRIB data collections, for example forecast model runs, we found that the GRIB model data schema keeps changing, because variables are added and deleted, vertical levels are added and deleted, and model reference time spacing and forecast times are changed. So the shape of the dataset constantly changes. This makes index access impossible.
Although GRIB is notorious for not storing its data schema, one can mostly reconstruct it by examining the data. Or you could say that the GRIB model schema is a collection of 2D slices, and that it's a mistake to try to fit it into a more structured schema, because of the constant changes.
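The "collection of 2D slices" view can be sketched as a map keyed by coordinate values instead of array indices. This only illustrates the idea and is not the GRIB collection implementation: `SliceCollection`, `Key`, and the choice of key fields are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical sketch: 2D slices addressed by coordinate values, not by indices. */
final class SliceCollection {
  /** Coordinates that identify one 2D (y, x) slice; the field choice is illustrative. */
  static final class Key {
    final String variable;
    final String runtime;   // model reference time, e.g. "2021-03-01T00:00Z"
    final int forecastHour; // offset from the runtime
    final double level;     // vertical coordinate value

    Key(String variable, String runtime, int forecastHour, double level) {
      this.variable = Objects.requireNonNull(variable);
      this.runtime = Objects.requireNonNull(runtime);
      this.forecastHour = forecastHour;
      this.level = level;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Key)) {
        return false;
      }
      Key k = (Key) o;
      return forecastHour == k.forecastHour
          && Double.compare(level, k.level) == 0
          && variable.equals(k.variable)
          && runtime.equals(k.runtime);
    }

    @Override
    public int hashCode() {
      return Objects.hash(variable, runtime, forecastHour, level);
    }
  }

  private final Map<Key, double[][]> slices = new HashMap<>();

  void add(Key key, double[][] slice) {
    slices.put(key, slice);
  }

  /** Empty when the slice is missing, e.g. a level that a later model run dropped. */
  Optional<double[][]> find(Key key) {
    return Optional.ofNullable(slices.get(key));
  }
}
```

Adding or dropping a level or forecast time changes which keys exist but does not shift the address of any other slice, whereas an index-based address would have to be recomputed.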
In principle, the same problem may occur for long collections of data stored in netCDF files.
There are more complicated datasets, but we assume that these require specialized domain-specific code, and are out of scope for a general data handling package.