Add Xarray accessor mirroring Raster class #446

Open
wants to merge 13 commits into
base: main

rhugonnet
Member

@rhugonnet rhugonnet commented Jan 24, 2024

This PR adds the rst Xarray accessor mirroring the Raster class.

The accessor makes it possible to access all attributes and run all methods already implemented for rasters in GeoUtils from an xarray.DataArray object (e.g., ds.rst.reproject()), while still giving easy access to the rest of Xarray's functionality through ds (such as plotting) and to its low-level behaviour (such as implicit loading, cached loading, chunked loading).
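For readers unfamiliar with Xarray accessors, here is a minimal sketch of the mechanism, using Xarray's public `register_dataarray_accessor` decorator. The accessor name `rst_demo`, the class `DemoRasterAccessor` and its `count_valid()` method are hypothetical and only illustrate the pattern; they are not the code added in this PR.

```python
import numpy as np
import xarray as xr


# "rst_demo" is a hypothetical name, chosen to avoid clashing with the real
# "rst" accessor registered by this PR.
@xr.register_dataarray_accessor("rst_demo")
class DemoRasterAccessor:
    def __init__(self, xarray_obj: xr.DataArray) -> None:
        # Xarray passes in the DataArray the accessor is attached to.
        self._obj = xarray_obj

    @property
    def shape(self) -> tuple:
        # Attributes forward to (or derive from) the wrapped object.
        return self._obj.shape

    def count_valid(self) -> int:
        # Methods can run any array logic on the underlying data.
        return int(np.isfinite(self._obj.values).sum())


ds = xr.DataArray(np.array([[1.0, np.nan], [3.0, 4.0]]), dims=("y", "x"))
print(ds.rst_demo.shape)          # accessor attribute
print(ds.rst_demo.count_valid())  # accessor method
print(float(ds.mean()))           # regular Xarray functionality stays available
```

The accessor instance is created lazily on first access and receives the DataArray itself, which is what lets raster methods sit next to native Xarray ones on the same object.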

It also opens up the opportunity to easily add Dask support to run our functionalities out-of-memory. This requires our functionalities to support a da.Array as input. This is not mandatory (if a functionality only runs with a NumPy array, it will simply load the da.Array in memory immediately), but is very practical when it is supported.
We already did a lot of work to implement the most complex Dask functionalities in #537 (namely reproject(), subsample() and interp_points()). Most other functionalities are much easier to support (using dask.array.map_blocks or equivalent).
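As a rough illustration of the "easy" case, the sketch below dispatches an element-wise operation through `dask.array.map_blocks` when the input is a Dask array and runs it directly otherwise. The names `_double` and `apply_blockwise` are hypothetical stand-ins, not functions from GeoUtils.

```python
import dask.array as da
import numpy as np


def _double(block: np.ndarray) -> np.ndarray:
    # Stand-in for any element-wise raster operation written for NumPy arrays.
    return block * 2


def apply_blockwise(array):
    # Hypothetical dispatcher: stay lazy for Dask input, run directly otherwise.
    if isinstance(array, da.Array):
        # Each chunk is passed to _double as a plain NumPy array.
        return da.map_blocks(_double, array, dtype=array.dtype)
    return _double(np.asarray(array))


lazy = apply_blockwise(da.ones((4, 6), chunks=(2, 3), dtype="float32"))
eager = apply_blockwise(np.ones((4, 6), dtype="float32"))
print(type(lazy).__name__, type(eager).__name__)
```

Nothing is computed for the Dask input until `.compute()` is called, so the same function body serves both in-memory and out-of-memory use. Operations that need neighbouring pixels or change the output shape need more than this.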

Resolves #383
Resolves #567

Facilitated by recent code re-structuring

Recent code re-structuring moved methods out of the Raster or Vector class into separate modules. In some cases, this changed the arguments of those (non-public) wrapper methods to accept base inputs (array, transform, crs) instead of the class object itself. Among other things, this facilitates the transition to using our functions with an accessor that has a slightly different object type (classic NumPy array instead of a NumPy masked-array).
See #624.

Summary of changes

Most of the content of the Raster class was moved into a non-public RasterBase parent class, containing all attributes and methods shared by the Raster and rst accessor classes.
Raster and rst are subclasses of RasterBase, and only implement methods specific to their object type (such as set_mask() for Raster, which uses a NumPy masked-array, or its __array_interface__ specific to masked arrays).

Remaining in Raster are only functionalities specific to the Raster object itself:

  • Methods related to object creation, instantiation and loading: __init__, load(), is_loaded, from_array(), copy(),
  • Methods related to NumPy masked-array: set_mask(), etc,
  • Methods related to array interfacing/arithmetic: __array_interface__, __add__, etc.

Added in the rst accessor are only functionalities specific to the accessor: __init__, from_array(), copy().

A new _is_xr boolean attribute identifies whether RasterBase.data is an Xarray object or not.
This allows branching where necessary, which is only needed to return the main attributes stored in the object itself: data, crs, transform and nodata.

All methods returning a raster object (like reproject() or crop()) now use from_array(), which is overridden in Raster and rst to ensure they return the same type as the input: a Raster input returns a Raster, and an xarray.DataArray input returns an xarray.DataArray.
All other attributes and methods return exactly the same non-raster output for both object types.
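The type-preserving pattern can be sketched in a few lines. All class and method names below (`RasterBaseSketch`, `RasterSketch`, `AccessorSketch`, `scaled()`) are hypothetical and heavily simplified: for instance, the real accessor returns an xarray.DataArray, whereas this sketch returns the wrapper class itself to stay free of dependencies beyond NumPy.

```python
import numpy as np


class RasterBaseSketch:
    """Hypothetical stand-in for the shared parent class."""

    def __init__(self, data, transform, crs, nodata=None):
        self.data, self.transform, self.crs, self.nodata = data, transform, crs, nodata

    @classmethod
    def from_array(cls, data, transform, crs, nodata=None):
        # Subclasses override this to build their own object type.
        return cls(data, transform, crs, nodata)

    def scaled(self, factor):
        # Shared logic lives once in the base class, and from_array()
        # rebuilds whichever type the method was called on.
        return self.from_array(self.data * factor, self.transform, self.crs, self.nodata)


class RasterSketch(RasterBaseSketch):
    @classmethod
    def from_array(cls, data, transform, crs, nodata=None):
        # Raster-like object: stores a masked array.
        return cls(np.ma.masked_invalid(data), transform, crs, nodata)


class AccessorSketch(RasterBaseSketch):
    @classmethod
    def from_array(cls, data, transform, crs, nodata=None):
        # Accessor-like object: stores a plain NaN array.
        return cls(np.asarray(data, dtype="float32"), transform, crs, nodata)


values = np.array([[1.0, np.nan], [3.0, 4.0]])
rast = RasterSketch.from_array(values, transform=None, crs=None)
acc = AccessorSketch.from_array(values, transform=None, crs=None)
print(type(rast.scaled(2)).__name__, type(acc.scaled(2)).__name__)
```

The base-class method never needs to know which subclass it is running on, which is the property that keeps the shared methods free of type checks.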

A new function geoutils.open_raster is added to open a raster as a xr.DataArray (built on top of rioxarray.open_rasterio()). The difference is that our open_raster forces the data type to be float32 at minimum, and replaces nodata values with NaNs to natively support most NumPy array operations with nodata propagation.
This seemed required because Rioxarray does not mask nodata values while also preserving the nodata value in its metadata, which is incompatible with the behaviour we need. (To give a clear example: with Rioxarray, either you load the array with -9999 in it and ds.rio.nodata is -9999, or you load the array with NaNs in it and ds.rio.nodata is NaN.)
I did not find another way to do this here...
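The array-level conversion described above can be sketched with NumPy alone. `to_nan_array` is a hypothetical helper written for illustration; it is not the actual open_raster implementation and ignores file reading and metadata handling entirely.

```python
import numpy as np


def to_nan_array(array: np.ndarray, nodata=None) -> np.ndarray:
    # Hypothetical helper, not the actual open_raster code.
    # Keep float32 and wider floats as-is, upcast everything else to float32.
    is_wide_float = np.issubdtype(array.dtype, np.floating) and array.dtype.itemsize >= 4
    out = array.astype(array.dtype if is_wide_float else np.float32)  # astype copies
    if nodata is not None:
        # Compare on the original values, then overwrite in the float copy.
        out[array == nodata] = np.nan
    return out


raw = np.array([[1, -9999], [3, 4]], dtype="int16")
converted = to_nan_array(raw, nodata=-9999)
print(converted.dtype)
print(converted)
```

The original nodata value (-9999 here) would then need to be stored separately from the array, which is the part the PR description says Rioxarray's metadata does not cover.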

New tests

Adding new tests is simple: we only need to check that all functions give exactly the same result for a raster opened as a Raster or as a xr.DataArray.

For this, the new tests introduce a function to check the equality of a Raster and xarray.DataArray.
Then, they check that all common attributes and methods of RasterBase run and return exactly the same output (or equal to the other object type when output is a Raster/xarray.DataArray).
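At the array level, such an equality check could look like the sketch below, where masked pixels on the Raster side must line up with NaN pixels on the Xarray side. `same_values` is a hypothetical helper; the actual test function in this PR may also compare transform, CRS and nodata.

```python
import numpy as np


def same_values(masked: np.ma.MaskedArray, nan_array: np.ndarray) -> bool:
    # Hypothetical test helper: masked pixels must match NaN pixels,
    # and all remaining values must be exactly equal.
    filled = np.ma.filled(masked.astype("float64"), np.nan)
    return bool(np.array_equal(filled, nan_array.astype("float64"), equal_nan=True))


masked = np.ma.masked_equal(np.array([[1, -9999], [3, 4]], dtype="int16"), -9999)
nan_array = np.array([[1, np.nan], [3, 4]], dtype="float32")
print(same_values(masked, nan_array))
```

Casting both sides to float64 before comparing avoids false mismatches caused only by the integer-versus-float dtype difference.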

Discussion of core differences

The problem with the rst accessor object is that, if I'm not mistaken, we won't have access to functionalities that are not explicit such as __array_interface__. Thus, we likely cannot mirror the entire behaviour of the Raster class (for instance, no overloading_check to verify that the georeferencing is the same during an array or arithmetic operation). We can look more into it to be sure, but I don't think it is possible...

Thankfully, Xarray generally behaves similarly to our Raster class, from implicit loading to array interfacing. We might want to adjust our functionalities to mirror that behaviour when possible, so that code is written the same way for both.

The main difference is that Xarray does not natively support nodata in its operations for integer arrays (there is no masked-array support in Xarray), so those need to be converted to NaN arrays, which increases RAM usage significantly for datasets of integer type. Here again, thankfully, chunked Dask support can compensate for this and process NaN arrays of any size.
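To give a sense of scale, the following NumPy-only example compares the in-memory size of a 1000 x 1000 uint8 array stored as-is, as a masked array (data plus boolean mask), and as a float32 NaN array. The array size and dtype are arbitrary choices for illustration.

```python
import numpy as np

shape = (1000, 1000)
as_uint8 = np.zeros(shape, dtype="uint8")
as_masked = np.ma.masked_array(as_uint8, mask=np.zeros(shape, dtype=bool))
as_float32 = as_uint8.astype("float32")

print(as_uint8.nbytes)                                # 1000000
print(as_masked.data.nbytes + as_masked.mask.nbytes)  # 2000000
print(as_float32.nbytes)                              # 4000000
```

For 8-bit data, the NaN array is four times the size of the raw values and twice the size of the masked array; the gap narrows for wider integer types.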

So there are pros and cons to using the Raster or the rst accessor. We can try to reconcile differences where possible, and for those that are structural to the data objects, we should simply explain them clearly on a documentation page and leave the choice to users! 😄

TO-DO

Code

  • Add equivalent of Raster.from_array() to RasterAccessor class (or RasterBase class?) and individual setting operations (for transform, crs, nodata, and area_or_point) to make all methods (reproject, etc) naturally work on both Raster and ds.rst,
  • Ensure dual support of masked-array & NaN arrays for all methods (_reproject, etc),
  • Link to existing delayed functions (in _reproject, _interp_points and _subsample) by automatically detecting whether the input array is a Dask array,
  • Add tests specific to RasterBase,
  • Add tests specific to RasterAccessor functionalities (comparing to ds.rio),
  • Add tests using xr.DataArray objects as match-reference input.

Documentation

  • Add rst accessor to "The georeferenced raster" page,
  • Update all feature pages to list Xarray accessor option: Raster.reproject() or ds.rst.reproject(),
  • Add page "Using Xarray and Pandas accessors" explaining pros and cons in "Fundamentals"?,
  • Add page "Dask support" listing all functionalities supported out-of-memory, with tips,
  • Update API with rst accessor.

Other Dask support to add (will be moved as issues for later PRs)

  • The reduce_points function can copy the same logic as interp_points,
  • The crop function using isel of Rioxarray,
  • Potentially look at geocube for rasterize/polygonize support,
  • But the proximity function would be a bit of work...

@rhugonnet rhugonnet marked this pull request as draft January 24, 2024 12:34
@rhugonnet rhugonnet changed the title Add Xarray accessor mirroring Raster API Add Xarray accessor mirroring Raster class Jan 24, 2024
@rhugonnet rhugonnet marked this pull request as ready for review December 7, 2024 02:03
@rhugonnet
Member Author

@adehecq @atedstone @erikmannerfelt This PR is also ready for your first review! It is not finalized, but at a good stage to hear your feedback, questions, recommendations, and then move forward to finalize it.
I have described the general concepts in the description, and the tests comparing both types of outputs should give you a good idea of how this works in practice in the code.

Once you are done with this one, you should look at the one for the dem accessor in xDEM (which depends on this PR).
Far fewer changes are needed, but it has a slightly more complex class structure: GlacioHack/xdem#656

Successfully merging this pull request may close these issues.

Add set_crs and set_transform to force re-set geospatial attributes?
Add Xarray accessor gu
1 participant