Gridded Data redesign #712

JohnLCaron · 2021-06-13T23:54:40Z

Background

The essence of NetcdfFile is index space subsetting, using Fortran-style array manipulation.

A Variable in a NetcdfFile has a list of Dimensions that define its shape. The order of
the Dimensions and the returned Array simply reflect whatever was stored into the netCDF file:

  Array<?> data = variable.readArray() 
  variable.getShape() == data.getShape()
  Dimensions.makeShape(variable.getDimensions()) == data.getShape()

A NetcdfDataset identifies the coordinate axes of the Variable, called the coordinate system. A CoordinateAxis that is
one-dimensional and monotonic is a coordinate variable. A CoordinateAxis1D allows looking up an index from a coordinate value:

CoordinateAxis1D.findCoordElement(double coordVal) -> index
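
As a rough illustration of what such a coordinate-to-index lookup involves, here is a minimal sketch. `Axis1DSketch` is a hypothetical stand-in, not the netCDF-Java class, and it assumes strictly increasing coordinate values and nearest-value matching (an implementation could equally work with cell edges).

```java
import java.util.Arrays;

/** Hypothetical stand-in for a 1D monotonic coordinate axis; not the netCDF-Java class. */
final class Axis1DSketch {
  private final double[] coords; // assumed strictly increasing

  Axis1DSketch(double[] coords) {
    this.coords = coords.clone();
  }

  /** Returns the index of the coordinate nearest to coordVal, or -1 if outside the axis range. */
  int findCoordElement(double coordVal) {
    int n = coords.length;
    if (n == 0 || coordVal < coords[0] || coordVal > coords[n - 1]) {
      return -1;
    }
    int pos = Arrays.binarySearch(coords, coordVal);
    if (pos >= 0) {
      return pos; // exact hit
    }
    int upper = -pos - 1; // first index whose coordinate is greater than coordVal
    int lower = upper - 1;
    return (coordVal - coords[lower] <= coords[upper] - coordVal) ? lower : upper;
  }

  public static void main(String[] args) {
    Axis1DSketch lat = new Axis1DSketch(new double[] {10.0, 20.0, 30.0, 40.0});
    System.out.println(lat.findCoordElement(26.0)); // prints 2
  }
}
```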

A georeferencing coordinate system allows geolocating a data value to the real world in (lat, lon, vertical, time) coordinates. The vertical coordinate may be either altitude or pressure. This underpins the feature type datasets, of which FeatureType = Grid is the most common and the most extensively developed.

Grid 1.0

The early version of Grid (ucar.nc2.dt) has various bridges between index and coordinate:

  GridCoordSystem.findXYindexFromLatLon(lat, lon) -> (xindex, yindex)
  GridCoordSystem.getLatLon(xindex, yindex) -> (lat, lon)
  GridCoordSystem.getRangesFromLatLonRect(LatLonRect) -> List<Range>
  
  CoordinateAxis1D.findCoordElement(coordVal) -> index

While data reading is still done in index space, the returned data is placed into canonical order:

  GridDatatype.readDataSlice(int rt_index, int e_index, int t_index, int z_index, int y_index, int x_index) -> Array
  GridDatatype.readVolumeData(int t_index) -> Array

where the indices are associated with coordinates, and the returned Array is placed into canonical row-major order based on a standard order of coordinates (runtime, ensemble, time, vertical, y/lat, x/lon).
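
To make "placed into canonical order" concrete, the following sketch permutes a flat row-major array from its stored dimension order into a requested order. `CanonicalOrderSketch` and its `permute` method are hypothetical names for illustration only; this is not how GridDatatype is implemented.

```java
import java.util.Arrays;

/** Hypothetical helper: permutes a row-major array into a requested dimension order. */
final class CanonicalOrderSketch {

  /**
   * @param data flat row-major array laid out with the stored shape
   * @param shape stored shape, for example {nx, ny}
   * @param perm perm[i] is the stored dimension that becomes output dimension i
   */
  static double[] permute(double[] data, int[] shape, int[] perm) {
    int rank = shape.length;
    int[] stride = new int[rank]; // row-major strides of the stored layout
    int s = 1;
    for (int d = rank - 1; d >= 0; d--) {
      stride[d] = s;
      s *= shape[d];
    }
    int[] outShape = new int[rank];
    for (int i = 0; i < rank; i++) {
      outShape[i] = shape[perm[i]];
    }
    double[] out = new double[data.length];
    int[] idx = new int[rank]; // odometer over the output index space
    for (int o = 0; o < out.length; o++) {
      int src = 0;
      for (int i = 0; i < rank; i++) {
        src += idx[i] * stride[perm[i]];
      }
      out[o] = data[src];
      // advance the odometer, last output dimension varying fastest
      for (int i = rank - 1; i >= 0; i--) {
        if (++idx[i] < outShape[i]) {
          break;
        }
        idx[i] = 0;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // stored as (x=2, y=3); canonical order wants (y, x)
    double[] stored = {0, 1, 2, 3, 4, 5};
    System.out.println(Arrays.toString(permute(stored, new int[] {2, 3}, new int[] {1, 0})));
  }
}
```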

This assumes that 1) the coordinates are all one dimensional, and 2) the shape is constant across time.

Coordinate variables that are not 1D.

The common case in atmospheric science data is that x and y are not lat and lon, but either Cartesian coordinates in some projection (typical for model data), or simple indices in an image (satellite or remote imaging data). In this case there is a complicated 2D Projection function (x,y) <--> (lat,lon).

For model data where the Projection is not specified, or for satellite data where there may not be an easily computable function, one can supply the lat, lon coordinate arrays:

  double lat(x,y);
  double lon(x,y);

and assume that one can interpolate between values. These coordinates are called curvilinear. Satellite data may be even more complicated, in which case it is no longer handled as a Grid, but as an Image feature type.
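
One way to picture "interpolate between values" is bilinear interpolation of the 2D coordinate array at a fractional index position. The sketch below assumes the array is indexed as `coord[x][y]` to match the declarations above, has at least 2 x 2 points, and ignores longitude wraparound; `CurvilinearSketch` is a hypothetical name.

```java
/** Hypothetical sketch of interpolating a curvilinear coordinate array such as lat(x,y). */
final class CurvilinearSketch {

  /** Bilinear interpolation of coord[x][y] at a fractional (x, y) index position. */
  static double interpolate(double[][] coord, double x, double y) {
    int x0 = (int) Math.floor(x);
    int y0 = (int) Math.floor(y);
    // clamp so that the upper neighbours (x0 + 1, y0 + 1) always exist
    x0 = Math.max(0, Math.min(x0, coord.length - 2));
    y0 = Math.max(0, Math.min(y0, coord[0].length - 2));
    double fx = x - x0;
    double fy = y - y0;
    double low = coord[x0][y0] * (1 - fx) + coord[x0 + 1][y0] * fx;
    double high = coord[x0][y0 + 1] * (1 - fx) + coord[x0 + 1][y0 + 1] * fx;
    return low * (1 - fy) + high * fy;
  }

  public static void main(String[] args) {
    double[][] lat = {{40.0, 41.0}, {42.0, 43.0}};
    System.out.println(interpolate(lat, 0.5, 0.5)); // prints 41.5
  }
}
```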

Often the vertical coordinate for model data is pressure-based rather than an altitude. This is handled by a vertical transform function: VT(z) -> altitude or pressure. For some vertical coordinates, e.g. terrain-following sigma coordinates, the transformation requires VT(x,y,z) -> pressure.
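
As an example of a transform that needs all three indices, the CF atmosphere sigma coordinate defines pressure as ptop + sigma(z) * (ps(y,x) - ptop). The sketch below applies that formula at a single time step; `SigmaTransformSketch` and its array layout `[z][y][x]` are assumptions for illustration, not the library's vertical transform classes.

```java
/** Hypothetical sketch of a sigma-to-pressure vertical transform at one time step. */
final class SigmaTransformSketch {

  /** Computes pressure[z][y][x] = ptop + sigma[z] * (surfacePressure[y][x] - ptop). */
  static double[][][] pressure(double[] sigma, double[][] surfacePressure, double ptop) {
    int nz = sigma.length;
    int ny = surfacePressure.length;
    int nx = surfacePressure[0].length;
    double[][][] result = new double[nz][ny][nx];
    for (int z = 0; z < nz; z++) {
      for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
          result[z][y][x] = ptop + sigma[z] * (surfacePressure[y][x] - ptop);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    double[][][] p = pressure(new double[] {1.0, 0.5}, new double[][] {{1000.0}}, 100.0);
    System.out.println(p[1][0][0]); // prints 550.0
  }
}
```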

Inconsistent array shape across time.

When we tried to work with large GRIB data collections, for example forecast model runs, we found that the GRIB model data schema keeps changing, because variables are added and deleted, vertical levels are added and deleted, and model reference time spacing and forecast times are changed. So the shape of the dataset constantly changes. This makes index access impossible.

Although GRIB is notorious for not storing its data schema, one can mostly reconstruct it by examining the data. Or you could say that the GRIB model schema is a collection of 2D slices, and that it's a mistake to try to fit it into a more structured schema, because of the constant changes.
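
The "collection of 2D slices" view can be sketched as a lookup keyed by coordinate values rather than by array indices, so that slices may come and go without changing any array shape. Everything below (`SliceIndexSketch`, `Key`, the choice of key fields, the file-position value) is hypothetical and only illustrates the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical index over a collection of 2D slices, keyed by coordinate values. */
final class SliceIndexSketch {

  /** The coordinate values that identify one horizontal slice. */
  static final class Key {
    final String variable;
    final String refTime; // model run time as ISO-8601 text
    final int offsetHours; // forecast offset from refTime
    final double level; // vertical coordinate value

    Key(String variable, String refTime, int offsetHours, double level) {
      this.variable = variable;
      this.refTime = refTime;
      this.offsetHours = offsetHours;
      this.level = level;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key k = (Key) o;
      return variable.equals(k.variable) && refTime.equals(k.refTime)
          && offsetHours == k.offsetHours && Double.compare(level, k.level) == 0;
    }

    @Override
    public int hashCode() {
      return Objects.hash(variable, refTime, offsetHours, level);
    }
  }

  private final Map<Key, Long> recordPositions = new HashMap<>();

  void add(Key key, long filePosition) {
    recordPositions.put(key, filePosition);
  }

  /** Returns the record position for the slice, or -1 if the collection has no such slice. */
  long find(Key key) {
    return recordPositions.getOrDefault(key, -1L);
  }

  public static void main(String[] args) {
    SliceIndexSketch index = new SliceIndexSketch();
    index.add(new Key("Temperature", "2021-06-13T00:00Z", 6, 850.0), 4096L);
    System.out.println(index.find(new Key("Temperature", "2021-06-13T00:00Z", 6, 850.0)));
  }
}
```

A missing slice is simply an absent key, which is why schema changes do not break this kind of access the way they break index access.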

In principle, the same problem may occur for long collections of data stored in netCDF files.

Other complications.

There are more complicated datasets, but we assume that these require specialized domain-specific code, and are out of scope for a general data handling package.

JohnLCaron · 2021-06-13T23:54:50Z

Grid 2.0

The next version of Grid (ucar.nc2.ft2.coverage) explored doing away with index-based access and using only coordinate-based access:

	GeoReferencedArray Coverage.readData(SubsetParams subset)
	CoverageCoordSys CoverageCoordSys.subset(SubsetParams params)

but some index-based operations remained:

	HorizCoordSys2D.findXYindexFromCoord(double xcoord, double ycoord) -> (xindex, yindex)
	HorizCoordSys2D.getLatLon(xindex, yindex) -> (lat, lon)

A change in schema creates a separate Group in the NetcdfDataset, which becomes a separate CoverageCollection. So a dataset turns into a FeatureDatasetCoverage, which contains one or more CoverageCollections, where all coverages in a collection share the same schema.

The main problem is that things are complicated, and simple things aren't simple.

JohnLCaron · 2021-06-14T00:02:29Z

Grid 3.0

Now in version 7, the next version of Grid (ucar.nc2.grid) gets rid of the legacy packages ucar.nc2.time and ucar.ma2.

Grid is not built on top of NetcdfDataset; this is important for scaling to large collections. A NetcdfDataset can be materialized, but only when needed. Consider a large collection that just stores time start/stop and resolution, and materializes the coordinate values when read.
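
The idea of storing only start/stop and resolution and materializing values on demand might look roughly like the sketch below. `RegularTimeAxisSketch` is a hypothetical class that assumes a regularly spaced axis with coordinates held as plain doubles; it is not part of ucar.nc2.grid.

```java
/** Hypothetical regularly spaced axis that computes coordinate values only when asked. */
final class RegularTimeAxisSketch {
  private final double start;
  private final double resolution;
  private final int ncoords;
  private double[] values; // stays null until the full array is requested

  RegularTimeAxisSketch(double start, double stop, double resolution) {
    this.start = start;
    this.resolution = resolution;
    this.ncoords = (int) Math.round((stop - start) / resolution) + 1;
  }

  int size() {
    return ncoords;
  }

  /** Computes a single coordinate without materializing the whole axis. */
  double coordAt(int index) {
    return start + index * resolution;
  }

  boolean isMaterialized() {
    return values != null;
  }

  /** Builds the full coordinate array on first use and caches it. */
  double[] materialize() {
    if (values == null) {
      values = new double[ncoords];
      for (int i = 0; i < ncoords; i++) {
        values[i] = coordAt(i);
      }
    }
    return values;
  }

  public static void main(String[] args) {
    RegularTimeAxisSketch hours = new RegularTimeAxisSketch(0.0, 24.0, 6.0);
    System.out.println(hours.size() + " coords, materialized=" + hours.isMaterialized());
  }
}
```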

/** A georeferenced Field of data. */
public interface Grid {
  // Data is read through a reader obtained from the grid:
  //   GridReferencedArray data = grid.getReader().setXXX().read();
}

Changes from coverage:

Don't allow multiple schemas in the same dataset. Other than "Best", that may be an edge case that should be handled in a special way.
Explore replacing SubsetParams with a fluent API:

GridReferencedArray data = grid.getReader().setXXX().read();
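
A chained reader of this kind could be sketched as follows. The setter names here are made up for illustration (the line above only says `setXXX`), and `read()` just returns the accumulated subset request rather than real data; `GridReaderSketch` is not the actual reader class.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical chained reader; each setter records one subset criterion and returns this. */
final class GridReaderSketch {
  private final Map<String, Object> request = new LinkedHashMap<>();

  GridReaderSketch setTimeOffsetHours(int hours) {
    request.put("timeOffsetHours", hours);
    return this;
  }

  GridReaderSketch setVerticalCoord(double level) {
    request.put("verticalCoord", level);
    return this;
  }

  GridReaderSketch setLatLonPoint(double lat, double lon) {
    request.put("lat", lat);
    request.put("lon", lon);
    return this;
  }

  /** Stand-in for the real read: returns the subset request that would be executed. */
  Map<String, Object> read() {
    return Collections.unmodifiableMap(new LinkedHashMap<>(request));
  }

  public static void main(String[] args) {
    Map<String, Object> subset =
        new GridReaderSketch().setTimeOffsetHours(6).setVerticalCoord(850.0).read();
    System.out.println(subset);
  }
}
```

Compared with filling in a SubsetParams object, each criterion is a typed method call, so unsupported or misspelled keys are caught at compile time.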

Clarify components of a GridCoordinateSystem:

HorizCoordSystem  (x,y) <--> (lat, lon)
VerticalAxis      z     <--> height or pressure
TimeCoordSystem   reftime, validtime, offsetFromReference
EnsembleAxis      e

In particular, maybe add a TimeCoordinateSystem that handles the details of the time coordinate, which can be complicated in an FMRC.
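
The relationship such a time coordinate system has to manage is, at its simplest, validtime = reftime + offsetFromReference. The sketch below shows that for a single reference time with hour offsets; `TimeCoordSketch` is a hypothetical class, and a real FMRC would involve many reference times whose offsets may differ.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical time coordinate for one model run: a reference time plus forecast offsets. */
final class TimeCoordSketch {
  private final Instant refTime;
  private final Duration[] offsets;

  TimeCoordSketch(Instant refTime, long... offsetHours) {
    this.refTime = refTime;
    this.offsets = new Duration[offsetHours.length];
    for (int i = 0; i < offsetHours.length; i++) {
      offsets[i] = Duration.ofHours(offsetHours[i]);
    }
  }

  /** validtime = reftime + offsetFromReference */
  Instant validTime(int index) {
    return refTime.plus(offsets[index]);
  }

  /** Returns the index whose valid time equals the argument, or -1 if there is none. */
  int findIndex(Instant validTime) {
    Duration wanted = Duration.between(refTime, validTime);
    for (int i = 0; i < offsets.length; i++) {
      if (offsets[i].equals(wanted)) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    TimeCoordSketch run = new TimeCoordSketch(Instant.parse("2021-06-13T00:00:00Z"), 0, 6, 12);
    System.out.println(run.validTime(2)); // prints 2021-06-13T12:00:00Z
  }
}
```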

Also explore adding convenience methods to make it easier to use.

JohnLCaron · 2021-06-20T20:34:35Z

GRIB Collections

see discussion #732

JohnLCaron · 2021-06-25T21:21:57Z

I'm doing a pretty complete Grid 3.0 redesign in PR #739.

The main driver is a GRIB implementation that is not built on a netCDF IOSP. The IOSP forces you to work in rectangular arrays and materialized coordinate systems (i.e. ones where all the coordinates and their sizes are known). So this is an experiment to see what an API could look like free of those constraints. Does it solve some of the impedance mismatch of GRIB?

There are 2 other implementations that also shape the design: 1) nc2.internal.grid is based on a NetcdfDataset, and 2) GcdmGrid provides remote access to a GridDataset. Those are important use cases.

It would be nice to also have a large netCDF dataset collection to test. At our last meeting we mentioned a CMIPS dataset.
