Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design docs #20

Merged
merged 14 commits into from
Nov 13, 2023
274 changes: 274 additions & 0 deletions design_doc.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,274 @@
# XDGGS - Design document

Xarrays extension for DGGS. Technical specifications.

## Goals

The goal of the `xdggs` library is to facilitate working with multiple Discrete Global Grid Systems (DGGSs) via a unified, high-level and user-friendly API that is deeply integrated with [Xarray](https://xarray.dev).

Examples of common DGGS features that `xdggs` should provide or facilitate:

- convert a DGGS from/to another grid (e.g., a DGGS, a latitude/longitude rectilinear grid, a raster grid, an unstructured mesh)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the discussion on Thursday, we pretty much agreed that this doesn't need to live in xdggs (though as part of the work on xdggs we need to ensure they exist). xesmf and pyresample seem like good candidates.

Edit: maybe that's what you meant with the section in the "non-goals"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what I mean here is that xdggs should provide the API for doing these operations (or at least some core API that should facilitate it). xdggs does not necessarily need to implement the logic: if there is a well-maintained 3rd party library that does that well, it is preferable to rely on it instead of reinventing the wheel.

About xesmf and pyresample, do they handle complex cell geometries (non-rectilinear) on the sphere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 1st sentence of the "goals" section is particularly important: "unified, high-level and user-friendly API that is deeply integrated with Xarray". It is all about the API so I expect that xdggs would mainly consist of adapter code that extend the Xarray API and calls routines from 3rd party libs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly a API design issue: does it make sense to provide a interpolate accessor method if we simply forward to another library? Oftentimes, using that library directly is more efficient, since at least in the case of xesmf it allows applying the same multiple times without a recomputation of weights. (I don't think there's a universal answer here, it depends on the library and how it works)

About xesmf and pyresample, do they handle complex cell geometries (non-rectilinear) on the sphere?

not sure, but the point is that if they don't we could (try to) extend them instead of implementing this ourselves. Note that those two libraries were examples, if there's a different library that does this already I wouldn't mind using that instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is all about the API so I expect that xdggs would mainly consist of adapter code that extend the Xarray API and calls routines from 3rd party libs.

this would certainly be very convenient, and I didn't think of alternatives, yet (just to have more options to choose from). That will take me some time, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly a API design issue: does it make sense to provide a interpolate accessor method if we simply forward to another library?

Perhaps not an interpolate accessor but conversion methods in a dggs accessor yes. I think it often makes sense since we're not only regridding but we're also conforming the output to some conventional (or standard?) DGGS representation used by xdggs (+ building/setting an index).

It is also about discoverability. We could certainly add tutorials showing how to regrid lat/lon 2D grids to DGGS by directly calling xesmf, pyreseample, etc., and then set the DGGS index. But I find it quite conventional to have some high-level API layer in xdggs to do the regridding and conversion all at once, since it is a very common operation.

Want to create a new DGGS Dataset from another Dataset? Just type ds.dggs.from_<tab-completion> and pick up the conversion method that is relevant for your dataset :).

at least in the case of xesmf it allows applying the same multiple times without a recomputation of weights.

I think we could arrange the xdggs API so that it allows reusing interpolation weights. That doesn't seem a big issue to me.

if there's a different library that does this already I wouldn't mind using that instead.

Totally agreed, we are on the same page!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I also tried to state in the doc (non-goals section): "Likewise, there may be many ways of resampling a grid to a DGGS ; xdggs should support the most common methods but not try to support all of them."

While I think we should really have conversion methods for converting lat/lon 2D grid to DGGS and vice-versa that handle 80% of the use-cases, it shouldn't prevent reusing directly xesmf, pyresample, etc. directly for the more advanced use cases.

- convert a DGGS from/to vector data (points, lines, polygons, envelopes)
- convert between different cell id representations of a same DGGS (e.g., uint64 vs. string)
- select data on a DGGS by cell ids or by geometries (spatial indexing)
- change DGGS resolution (upgrade or downgrade)
- operations between similar DGGS (with auto-alignment)
- re-organize cell ids (e.g., spatial shuffling / partitioning)
- plotting

Conversion between DGGS and other grids or vector features may requires specific interpolation or regridding methods.

`xdggs` should leverage the current recommended Xarray extension mechanisms ([apply_ufunc](https://docs.xarray.dev/en/stable/examples/apply_ufunc_vectorize_1d.html), [accessors](https://docs.xarray.dev/en/stable/internals/extending-xarray.html), [custom indexes](https://docs.xarray.dev/en/stable/internals/how-to-create-custom-index.html)) and possibly the future ones (e.g., variable encoders/decoders) to provide DGGS-specific functionality on top of Xarray core features.

`xdggs` should facitiltate interoperability with other existing Xarray extensions (e.g., [xvec](https://github.com/xarray-contrib/xvec) for vector data or [uxarray](https://github.com/UXARRAY/uxarray) for unstructured grids).

`xdggs` should also leverage the existing implementation of various DGGSs exposed to Python via 3rd-party libraries, here below referred as "backends". Preferrably, those backends would expose DGGS functionality in an efficient way as vectorized functions (e.g., working with NumPy arrays).

`xdggs` should try to follow standards and/or conventions defined for DGGS (see below). However, we may need to depart from them for practical reasons (e.g., common practices in popular DGGS libraries that do not fit well with the proposed standards). Strict adherence to a standard is welcome but shouldn't be enforced by all means.

`xdggs` should also try to support applications in both GIS and Earth-System communities, which may each use DGGS in slightly different ways (see examples below).

When possible, `xdggs` operations should scale to fine DGGS resolutions (millions of cells). This can be done vertically using backends with vectorized bindings of DGGS implementations written in low-level languages and/or horizontally leveraging Xarray interoperability with Dask. Some operations like spatial indexing may be hard to scale horizontally, though. For the latter, we should probably focus `xdggs` development first towards good vertical scaling before figuring out how they can be scaled horizontally (for reference, see [dask-geopandas](https://github.com/geopandas/dask-geopandas) and [spatialpandas](https://github.com/holoviz/spatialpandas)).

## Non-Gloals

`xdggs` should focus on providing the core DGGS functionality and operations that are listed above. Higher-level operations that can be implemented by combining together those core operations are out-of-scope and should be implemented in downstream libraries. Likewise, there may be many ways of resampling a grid to a DGGS ; `xdggs` should support the most common methods but not try to support _all of them_.

`xdggs` should try not re-inventing the wheel and delegate to Xarray API when possible.

`xdggs` does not implement any particular DGGS from scratch. `xdggs` does not aim at providing _all the functionality provided by each grid_ (e.g., some functionality may be very specific to one DGGS and not supported by other DGGSs, or some functionality may not be available yet in one DGGS Python backend).

Although some DGGS may handle both the spatial and temporal domains in a joint fashion, `xdggs` focuses primarily on the spatial domain. The temporal domain is considered as orthogonal and already benefits from many core features provided by Xarray.

## Discrete Global Grid Systems

A Discrete Global Grid System (DGGS) can be roughly defined as a partitioning or tessellation of the entire Earth's surface into a finite number of "cells" or "zones". The shape and the properties of these cells generally vary from one DGGS to another. Most DGGSs are also hierarchical, i.e., the cells are arranged on recursively on multiple levels or resolutions. Follow the links in the subsection below for a more strict and detailed definition of a DGGS.

DGGSs may be used in various ways, e.g.,

- Applications in Earth-system modelling seem to use DGGS as grids of contiguous, fixed-resolution cells covering the entire Earth's surface or a region of interest (figure 1). This makes the analysis of simulation outputs on large extents of the Earth's surface easier. DGGS may also be used as pyramid data (multiple stacked datasets at different resolutions)
- Applications in GIS often consist of using DGGS to display aggregated (vector) data as a collection of cells with a more complex spatial distribution (sparse) and sometimes with mixed resolutions (figures 2 and 3).

![figure1](https://user-images.githubusercontent.com/4160723/281698490-31cb5ce8-64db-4bbf-a0d9-a8d6597bb2df.png)
Figure 1: DGGS data as contiguous cells of fixed resolution ([source](https://danlooo.github.io/DGGS.jl/))

![figure2](https://github.com/benbovy/xdggs/assets/4160723/430fd646-220a-4027-8212-1d927bb339ba)

Figure 2: Data aggreated on DGGS (H3) sparsely distributed cells of fixed resolution ([source](https://medium.com/@jesse.b.nestler/how-to-convert-h3-cell-boundaries-to-shapely-polygons-in-python-f7558add2f63)).

![image](https://github.com/benbovy/xdggs/assets/4160723/f2e4ec02-d88e-475e-9067-e93cf185923e)

Figure 3: Raster data converted as DGGS (H3) cells of mixed resolutions ([source](https://github.com/nmandery/h3ronpy)).

### Standards and Conventions

There is no released standard yet regarding DGGS. However, there is a group working on a draft of OGC API for DGGS: https://github.com/opengeospatial/ogcapi-discrete-global-grid-systems.

Another draft of DGGS specification can be found here: https://github.com/danlooo/dggs-data-spec.

There are some discrepancies between the proposed standards and popular DGGS libraries (H3, S2, HealPIX). For example regarding the term used to define a grid unit: The two specifications above use "zone", S2/H3 use "cell" and HealPIX uses "pixel". Although in this document we use "cell", the term to choose for `xdggs` is still open for discussion.

### Backends (Python)

Several Python packages are currently available for handling certain DGGSs. They mostly consist of Python bindings of DGGS implementations written in C/C++/Rust. Here is a list (probably incomplete):

- [healpy](https://github.com/healpy/healpy): Python bindings of [HealPix](https://healpix.sourceforge.io/)
- mostly vectorized
- [rhealpixdggs-py](https://github.com/manaakiwhenua/rhealpixdggs-py): Python/Numpy implementation of rHEALPix
- [h3-py](https://github.com/uber/h3-py): "official" Python bindings of [H3](https://h3geo.org/)
- experimental and incomplete vectorized version of H3's API (removed in the forthcoming v4 release?)
- [h3pandas](https://github.com/DahnJ/H3-Pandas): integration of h3-py (non-vectorized) with pandas and geopandas
- [h3ronpy](https://github.com/nmandery/h3ronpy): Python bindings of [h3o](https://github.com/HydroniumLabs/h3o) (Rust implementation of H3)
- provides high-level features (conversion, etc.) working with arrow, numpy (?), pandas/geopandas and polars
- [s2geometry](https://github.com/google/s2geometry): Python bindings generated with SWIG
- not vectorized nor very "pythonic"
- plans to switch to pybind11 (no time frame given)
- [spherely](https://github.com/benbovy/spherely): Python bindings of S2, mostly copying shapely's API
- provides numpy-like universal functions
- not yet ready for use
- [dggrid4py](https://github.com/allixender/dggrid4py): Python wrapper for [DGGRID](https://github.com/sahrk/DGGRID)
- DGGRID implements many DGGS variants!
- DGGRID current design makes it hardly reusable from within Python in an optimal way (the dggrid wrapper communicates with DGGRID through OS processes and I/O generated files)

## Representation of DGGS Data in Xdggs

`xdggs` represents a DGGS as an Xarray Dataset or DataArray containing a 1-dimensional coordinate with cell ids as labels and with grid name, resolution & parameters (optional) as attributes. This coordinate is indexed using a custom, Xarray-compatible `DGGSIndex`.

`xdggs` does not support a Dataset or DataArray with multiple coordinates indexed with a `DGGSIndex` (only one DGGS per object is supported).

The cell ids in the 1-dimensional coordinate are all relative to the _exact same_ grid, i.e., same grid system, same grid parameter values and same grid resolution! For simplicity, `xdggs` does not support cell ids of mixed-resolutions in the same coordinate.

### DGGSIndex

`xdggs.DGGSIndex` is the base class for all Xarray DGGS-aware indexes. It inherits from `xarray.indexes.Index` and has the following specifications:

- It encapsulates an `xarray.indexes.PandasIndex` built from cell ids so that selection and alignment by cell id is possible
Copy link
Collaborator

@keewis keewis Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need to change that for hierarchical access by cell id

Edit: this is probably discussed immediately below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean exactly by "hierarchical access by cell id"? Do you have an example in mind?

In the "selection by cell ids" section there's a small reference to selection using ids on another resolution. Maybe we could make it work with a PandasIndex? (i.e., get parent/child cell ids before/after passing ids as query to the pandas index).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, that's what I meant (hence the edit).

- It might also eventually encapsulate a spatial index (RTree, KDTree) to enable data selection by geometries, e.g., find nearest cell centroids, extract all cells intersecting a polygon, etc.
- Alternatively, spatial indexing might be enabled by explicit conversion of cells to vector geometries and then by reusing the Xarray spatial indexes available in [xvec](https://github.com/xarray-contrib/xvec)
- It partially implements the Xarray Index API to enable DGGS-aware alignment and selection
- Calls are most often redirected to the encapsulated `PandasIndex`
- Some pre/post checks or processing may be done, e.g., to prevent the alignment of two indexes that are not on the exact same grid.
- The `DGGSIndex.__init__()` constructor only requires cell ids and the name of the cell (array) dimension
- The `DGGSIndex.from_variables()` factory method parses the attributes of the given cell ids coordinates and creates the right index object (subclass) accordingly
- It declares a few abstract methods for grid-aware operations (e.g., convert between cell id and lat/lon point or geometry, etc.)
- They can be implemented in subclasses (see below)
- They are either called from within the DGGSIndex or from the `.dggs` Dataset/DataArray accessors

Each DGGS supported in `xdggs` has its own subclass of `DGGSIndex`, e.g.,

- `HealpixIndex` for Healpix
- `H3Index` for H3
- ...

A DGGSIndex can be set directly from a cell ids coordinate using the Xarray API:

```python
import xarray as xr
import xdggs

ds = xr.Dataset(
coords={"cell": ("cell", [...], {"grid_name": "h3", "resolution": 3})}
)

# auto-detect grid system and parameters
ds.set_xindex("cell", xdggs.DGGSIndex)

# set the grid system and parameters manually
ds.set_xindex("cell", xdggs.H3Index, resolution=3)
```

The DGGSIndex is set automatically when converting a gridded or vector dataset to a DGGS dataset (see below).

## Conversion from/to DGGS

DGGS data may be created from various sources, e.g.,

- regridded from a latitude/longitude rectilinear grid
- regridded from an unstructured grid
- regridded and reprojected from a raster
- aggregated from vector point data
- filled from polygon data

Conversely, DGGS data may be converted to various forms, e.g.,

- regridded on a latitude/longitude rectilinear grid
- rasterized (resampling / projection)
- conversion to vector point data (cell centroids)
- conversion to vector polygon data (cell boundaries)

Here is a tentative API based on Dataset/DataArray `.dggs` accessors (note: other options are discussed in [this issue](https://github.com/benbovy/xdggs/issues/13)):

```python
# "convert" directly from existing cell ids coordinate to DGGS
# basically an alias to ds.set_xindex(..., DGGSIndex)
ds.dggs.from_cell_ids(...)

# convert from lat/lon grid
ds.dggs.from_latlon_grid(...)

# convert from raster
ds.dggs.from_raster(...)

# convert from point data
ds.dggs.from_points(...)

# convert from point data (with aggregation)
ds.dggs.from_points_aggregate(...)

# convert from point data (with aggregation using Xarray API)
ds.dggs.from_points(...).groupby(...).mean()

# convert to lat/lon grid
ds.dggs.to_latlon_grid(...)

# convert to raster
ds.dggs.to_raster(...)

# convert to points (cell centroids)
ds.dggs.to_points(...)

# convert to polygons (cell boundaries)
ds.dggs.to_polygons(...)
```

In the API methods above, the "dggs" accessor name serves as a prefix.

Those methods are all called from an existing xarray Dataset (DataArray) and should all return another Dataset (DataArray):

- Xarray has built-in support for regular grids
- for rasters, we could return objects that are [rioxarray](https://github.com/corteva/rioxarray)-friendly
- for vector data, we could return objects that are [xvec](https://github.com/xarray-contrib/xvec)-friendly (coordinate of shapely objects)
- etc.

## Extracting DGGS Cell Geometries

DGGS cell geometries could be extracted using the conversion methods proposed above. Alternatively, it would be convenient to get them directly as xarray DataArrays so that we can for example manually assign them as coordinates.

The API may look like:

```python
# return a DataArray of DGGS cell centroids as shapely.POINT objects
ds.dggs.cell_centroids()

# return two DataArrays of DGGS cell centroids as lat/lon coordinates
ds.dggs.cell_centroids_coords()

# return a DataArray of DGGS cell boundaries as shapely.POLYGON objects
ds.dggs.cell_boundaries()

# return a DataArray of DGGS cell envelopes as shapely.POLYGON objects
ds.dggs.cell_envelopes()
```

## Indexing and Selecting DGGS Data

### Selection by Cell IDs

The simplest way to select DGGS data is by cell ids. This can be done directly using Xarray's API (`.sel()`):

```python
ds.sel(cell=value)
```

where `value` can be a single cell id (integer or string/token?) or an array-like of cell ids. This is easily supported by the DGGSIndex encapsulating a PandasIndex. We might also want to support other `value` types, e.g.,

- assuming that DGGS cell ids are defined such that contiguous cells in space have contiguous id values, we could provide a `slice` to define a range of cell ids.
- DGGSIndex might implement some DGGS-aware logic such that it auto-detects if the given input cells are parent cells (lower DGGS resolution) and then selects all child cells accordingly.

We might want to select cell neighbors (i.e., return a new Dataset/DataArray with a new neighbor dimension), probably via a specific API (`.dggs` accessors).

### Selection by Geometries (Spatial Indexing)

Another useful way of selecting DGGS data is from input geometries (spatial queries), e.g.,

- Select all cells that are the closest to a collection of data points
- Select all cells that intersects with or are fully contained in a polygon

This kind of selection requires spatial indexes as this can not be done with a pandas index (see [this issue](https://github.com/benbovy/xdggs/issues/16)).

If we support spatial indexing directly in `xdggs`, we can hardly reuse Xarray's `.sel()` for spatial queries as `ds.sel(cell=shapely.Polygon(...))` would look quite confusing. Perhaps better would be to align with [xvec](https://github.com/xarray-contrib/xvec) and have a separate `.dggs.query()` method.

Alternatively, we could just get away with the conversion and cell geometry extraction methods proposed above and leave spatial indexes/queries to [xvec](https://github.com/xarray-contrib/xvec).

## Handling hierarchical DGGS

Even though the DGGS coordinate of a Dataset (DataArray) is limited to cell ids of same resolution (no mixed-resolutions), `xdggs` can still provide functionality to deal with the hierarchical aspect of DGGSs.

Selection by parent cell ids may be in example (see section above). Another example would be to have utility methods to explicitly change the grid resolution (see [issue #18](https://github.com/benbovy/xdggs/issues/18) for more details and discussion).

## Operations between similar DGGS (alignment)

Computation involving multiple DGGS datasets (or dataarrays) often requires to align them together. Sometimes this can be trivial (same DGGS with same resolution and parameter values) but in other cases this can be complex (requires regridding or a change of DGGS resolution).

In Xarray, alignment of datasets (dataarrays) is done primarily via their indexes. Since a DGGSIndex wraps a PandasIndex, it is easy to support alignment by cells ids (trivial case). At the very least, a DGGSIndex should raise an error when trying to align cell ids that do not refer to the exact same DGGS (i.e., same system, resolution and parameter values). For the complex cases, it may be preferable to handle them manually instead of trying to make the DGGSIndex perform the alignment automatically. Regridding and/or changing the resolution of a DGGS (+ data aggregation) often highly depend on the use-case so it might be hard to find a default behavior. Also performing those operations automatically and implicitly would probably feel too magical. That being said, in order to help alignment `xdggs` may provide some utility methods to change the grid resolution (see section above) and/or to convert from one DGGS to another.

## Plotting

Three approaches are possible (non-mutually exclusive):

1. convert cell data into gridded or raster data (choose grid/raster resolution depending on the resolution of the rendered figure) and then reuse existing python plotting libraries (matplotlib, cartopy) maybe through xarray plotting API
2. convert cell data into vector data and plot the latter via, e.g., [xvec](https://github.com/xarray-contrib/xvec) or [geopandas](https://github.com/geopandas/geopandas) API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might become sufficiently performant (I think James Bednar had a demo of hvplot doing something like this somewhere), so we might also want to consider it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean hvplot plotting vector data like https://hvplot.holoviz.org/reference/geopandas/polygons.html?

Copy link
Collaborator

@keewis keewis Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant this: https://discourse.holoviz.org/t/dealing-high-resolution-vector-plots-xarray-holoviz/6310

Edit: but in that case it's a vector plot, not arbitrary geometries, so maybe not quite there yet?
Edit2: and this (scatter plot): https://discourse.holoviz.org/t/show-individual-points-when-zoomed-else-datashade/2204/14?u=ahuang11

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the links! Yes it looks like a similar approach to what I suggest by using Datashader with rasterization of DGGS to a dynamic resolution and extent. This could actually be applied for polygon plots: downgrade the DGGS resolution (#18), then aggregate data, then convert it to vector data, then plot it. It would be nice to have some high-level API to do that interactively instead of requiring the user doing all of those steps manually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased the paragraph below in 5a97e2c to reflect that comment.

3. leverage libraries that support plotting DGGS data, e.g., [lonboard](https://github.com/developmentseed/lonboard) enables interactive plotting in Jupyter via deck.gl, which has support of H3 and S2 cell data.

The 3rd approach (lonboard) is efficient for plotting large DGGS data: we would only need to transfer cell ids (tokens) and cell data and then let deck.gl render the cells efficiently in the web browser using the GPU. For approach 1, we might want to investigate using [datashader](https://github.com/holoviz/datashader) to set both the resolution and raster extent dynamically. Likewise for approach 2, we could dynamically downgrade the DGGS resolution and aggregate the data before converting it into vector data in order to allow (interactive) plotting of large DGGS data.