Add Xarray accessor mirroring `Raster` class #446

rhugonnet · 2024-01-24T12:33:54Z

This PR adds the rst Xarray accessor mirroring the Raster class.

The accessor gives access to all attributes and methods already implemented for rasters in GeoUtils directly from an xarray.DataArray object (e.g., ds.rst.reproject()). Users can therefore combine them with the rest of the Xarray functionalities available through ds (such as plotting) and with its low-level behaviour (such as implicit loading, cached loading, chunked loading).

It also opens up the opportunity to easily add Dask support to run our functionalities out-of-memory. This requires our functionalities to support a da.Array as input. This is not mandatory (if a functionality only runs with NumPy arrays, it will simply load the da.Array in memory immediately), but is very practical when it is supported.
We already did a lot of work to implement the most complex Dask functionalities in #537 (namely reproject(), subsample() and interp_points()). Most other functionalities are much easier to support (using dask.map_blocks or equivalent).
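
To illustrate the dispatch described above, here is a minimal sketch, not the GeoUtils implementation. The names apply_blockwise and add_offset are hypothetical, and the sketch assumes a purely element-wise operation, so chunks need no overlap. Dask is treated as an optional import: if the input is a Dask array the function is mapped lazily over its chunks, otherwise it runs in memory on a NumPy array.

```python
import numpy as np

try:
    import dask.array as da  # optional dependency
except ImportError:
    da = None


def add_offset(block: np.ndarray) -> np.ndarray:
    """Hypothetical element-wise operation standing in for a raster functionality."""
    return block + 10.0


def apply_blockwise(arr, func):
    """Apply `func` chunk by chunk for a Dask array, or in memory otherwise."""
    if da is not None and isinstance(arr, da.Array):
        # Lazy path: nothing is computed until .compute() is called
        return da.map_blocks(func, arr, dtype=arr.dtype)
    # NumPy-only path: the full array is held in memory
    return func(np.asarray(arr))


in_memory = apply_blockwise(np.zeros((4, 4), dtype="float32"), add_offset)
```

Functionalities that need neighbouring pixels (such as reproject()) cannot be expressed this simply, which is why they were handled separately in #537.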

Resolves #383
Resolves #567

Facilitated by recent code re-structuring

Recent code re-structuring moved methods out of the Raster or Vector class into separate modules. In some cases, this changed the arguments of those (non-public) wrapper methods to accept base inputs (array, transform, crs) instead of the class object itself. Among other things, this facilitates the transition to using our functions with an accessor that has a slightly different object type (classic NumPy array instead of a NumPy masked-array).
See #624.

Summary of changes

Most content of the Raster class was moved into a non-public RasterBase parent class, containing all attributes and methods shared by the Raster and rst accessor classes.
The Raster and rst classes are subclasses of RasterBase, and only implement methods specific to their object type (such as set_mask() for Raster, which uses a NumPy masked-array, or its __array_interface__ specific to masked arrays).

Remaining in Raster are only functionalities specific to the Raster object itself:

Methods related to object creation, instantiation and loading: __init__, load(), is_loaded, from_array(), copy(),
Methods related to NumPy masked-array: set_mask(), etc,
Methods related to array interfacing/arithmetic: __array_interface__, __add__, etc.

Added in the rst accessor are only functionalities specific to the accessor: __init__, from_array(), copy().

A new _is_xr boolean attribute identifies whether RasterBase.data is an Xarray object or not.
This allows branching where necessary, which is only needed to return the main attributes stored in the object itself: data, crs, transform and nodata.

All methods returning a raster object (like reproject() or crop()) now use from_array(), which is overridden in Raster and rst to ensure they return the same type as the input: a Raster input returns a Raster, and an xarray.DataArray input returns an xarray.DataArray.
All other attributes and methods return exactly the same non-raster output for both object types.
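
The type-preserving pattern described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of this PR: ToyRasterBase, ToyRaster and ToyNanRaster are hypothetical classes, and a plain NaN array stands in for the xarray.DataArray so that the example only needs NumPy.

```python
import numpy as np


class ToyRasterBase:
    """Hypothetical stand-in for RasterBase: shared logic lives here."""

    def __init__(self, data, nodata=None):
        self.data = data
        self.nodata = nodata

    @classmethod
    def from_array(cls, data, nodata=None):
        raise NotImplementedError  # overridden by each subclass

    def crop(self, rows: slice, cols: slice):
        # Shared method: object creation is delegated to from_array(),
        # so the output type always matches the input type
        return self.from_array(self.data[rows, cols], nodata=self.nodata)


class ToyRaster(ToyRasterBase):
    """Stores a NumPy masked array (stand-in for Raster)."""

    @classmethod
    def from_array(cls, data, nodata=None):
        return cls(np.ma.masked_array(data), nodata=nodata)


class ToyNanRaster(ToyRasterBase):
    """Stores a plain NaN array (stand-in for the accessor's DataArray)."""

    @classmethod
    def from_array(cls, data, nodata=None):
        return cls(np.asarray(data, dtype="float32"), nodata=nodata)


values = np.arange(16).reshape(4, 4)
masked = ToyRaster.from_array(np.ma.masked_equal(values, 5), nodata=5)
plain = ToyNanRaster.from_array(values)

cropped_masked = masked.crop(slice(0, 2), slice(0, 2))  # a ToyRaster
cropped_plain = plain.crop(slice(0, 2), slice(0, 2))    # a ToyNanRaster
```

Because crop() never names a concrete class, adding a new object type only requires overriding from_array().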

A new function geoutils.open_raster is added to open a raster as an xr.DataArray (built on top of rioxarray.open_rasterio()). The difference is that our open_raster forces the data type to be float32 at minimum, and replaces nodata values with NaNs to natively support most NumPy array operations with nodata propagation.
This seemed required because Rioxarray does not mask nodata values while preserving the nodata value in its metadata, which is incompatible with the behaviour we need. (To give a clear example: with Rioxarray, either you load the array with -9999 in it and ds.rio.nodata is -9999, or you load the array with NaNs in it and ds.rio.nodata is NaN.)
I did not find another way to do this here...
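
As a rough NumPy-only sketch of the conversion described above (not the actual open_raster code): the helper name nodata_to_nan is hypothetical, and the sketch assumes the caller keeps the original nodata value separately in the metadata.

```python
import numpy as np


def nodata_to_nan(values: np.ndarray, nodata):
    """Return a float copy (float32 at minimum) with nodata replaced by NaN."""
    # Promote integers and small floats to at least float32; float64 stays float64
    dtype = np.promote_types(values.dtype, np.float32)
    out = values.astype(dtype)
    if nodata is not None:
        out[values == nodata] = np.nan
    return out


raw = np.array([[1, -9999], [3, 4]], dtype="int16")
converted = nodata_to_nan(raw, nodata=-9999)
# `converted` holds NaN where `raw` held -9999; -9999 is still known to the caller
```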

New tests

Adding new tests is simple: we only need to check that all functions give the exact same result for a raster opened as a Raster or as an xr.DataArray.

For this, the new tests introduce a function to check the equality of a Raster and an xarray.DataArray.
Then, they check that all common attributes and methods of RasterBase run and return exactly the same output (or equal to the other object type when output is a Raster/xarray.DataArray).
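
The core of such an equality check could look like the sketch below. It is an assumption about the approach, not the test helper added in this PR: same_values is a hypothetical name, only the array values are compared (not transform, crs or nodata), and masked cells are assumed to correspond to NaN cells.

```python
import numpy as np


def same_values(masked: np.ma.MaskedArray, nan_array: np.ndarray) -> bool:
    """True if masked cells match NaN cells and valid cells hold equal values."""
    # Convert the masked array to a float array with NaN at masked positions
    filled = np.ma.filled(masked.astype("float64"), np.nan)
    return bool(np.array_equal(filled, nan_array.astype("float64"), equal_nan=True))


from_raster = np.ma.masked_equal(np.array([1, -9999, 3], dtype="int16"), -9999)
from_xarray = np.array([1.0, np.nan, 3.0], dtype="float32")
is_equal = same_values(from_raster, from_xarray)
```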

Discussion of core differences

The problem with the rst accessor object is that, if I'm not mistaken, we won't have access to functionalities that are not explicit such as __array_interface__. Thus, we likely cannot mirror the entire behaviour of the Raster class (for instance, no overloading_check to verify that the georeferencing is the same during an array or arithmetic operation). We can look more into it to be sure, but I don't think it is possible...

Thankfully, Xarray generally behaves similarly to our Raster class, from implicit loading to array interfacing. We might want to adjust our functionalities to ensure we mirror that behaviour when possible, so that the code is written the same.

The main difference is that Xarray does not natively support nodata in its operations for integer arrays (there is no masked-array support in Xarray). Integer arrays therefore need to be converted to NaN arrays, which increases RAM usage significantly for datasets of integer type. Here again, thankfully, chunked Dask support can compensate for this and handle NaN arrays of any size.
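
To put a number on the RAM increase, here is a small worked example. The 10,000 x 10,000 raster size is an arbitrary assumption chosen for illustration, and array_nbytes is a hypothetical helper.

```python
import math

import numpy as np


def array_nbytes(shape, dtype) -> int:
    """Memory footprint in bytes of a dense array, without allocating it."""
    return math.prod(shape) * np.dtype(dtype).itemsize


shape = (10_000, 10_000)
as_uint8 = array_nbytes(shape, "uint8")      # integer raster as stored on disk
as_float32 = array_nbytes(shape, "float32")  # same raster as a NaN array
ratio = as_float32 // as_uint8
```

For a uint8 raster the float32 NaN array is four times larger; for int16 data the factor would be two.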

So there are pros and cons to using the Raster or the rst accessor. We can try to reconcile differences where possible, and for those that are structural to the data objects, we should simply explain them clearly on a documentation page and leave the choice to users! 😄

TO-DO

Code

Add equivalent of Raster.from_array() to RasterAccessor class (or RasterBase class?) and individual setting operations (for transform, crs, nodata, and area_or_point) to make all methods (reproject, etc) naturally work on both Raster and ds.rst,
Ensure dual support of masked-array & NaN arrays for all methods (_reproject, etc),
Link to the existing delayed functions (in _reproject, _interp_points and _subsample) by automatically detecting whether the input array is a Dask array,
Add tests specific to RasterBase,
Add tests specific to RasterAccessor functionalities (comparing to ds.rio),
Add tests using xr.DataArray objects as match-reference input.

Documentation

Add rst accessor to "The georeferenced raster" page,
Update all feature pages to list Xarray accessor option: Raster.reproject() or ds.rst.reproject(),
Add page "Using Xarray and Pandas accessors" explaining pros and cons in "Fundamentals"?,
Add page "Dask support" listing all functionalities supported out-of-memory, with tips,
Update API with rst accessor.

Other Dask support to add (will be moved as issues for later PRs)

The reduce_points function can copy the same logic as interp_points,
The crop function using isel of Rioxarray,
Potentially look at geocube for rasterize/polygonize support,
But the proximity function would be a bit of work...

rhugonnet · 2024-12-07T02:07:28Z

@adehecq @atedstone @erikmannerfelt This PR is also ready for your first review! It is not finalized, but at a good stage to hear your feedback, questions, recommendations, and then move forward to finalize it.
I have described the general concepts in the description, and the tests comparing both types of outputs should give you a good idea of how this works in practice in the code.

Once you are done with this one, you should look at the one for the dem accessor in xDEM (which depends on this PR).
Far fewer changes are needed, but it has a slightly more complex class structure: GlacioHack/xdem#656

rhugonnet added 4 commits August 9, 2023 16:53

Start Xarray accessor structure

525ab94

Merge remote-tracking branch 'upstream/main' into add_xarray_accessor

648b869

Incremental commit on Xarray accessor

c5ba304

Merge remote-tracking branch 'upstream/main' into add_xarray_accessor

2c14d80

rhugonnet marked this pull request as draft January 24, 2024 12:34

rhugonnet changed the title ~~Add Xarray accessor mirroring Raster API~~ Add Xarray accessor mirroring Raster class Jan 24, 2024

rhugonnet mentioned this pull request Mar 17, 2024

Add Raster.from_xarray() to create raster from a xr.DataArray #521

Merged

rhugonnet mentioned this pull request Apr 28, 2024

Add Dask-delayed raster subsample(), reproject() and interp_points() #537

Merged

5 tasks

rhugonnet added 9 commits November 13, 2024 21:18

Merge remote-tracking branch 'upstream/main' into add_xarray_accessor

4bc27ca

Linting

9a9f8a6

Incremental commit on Xarray accessor

7ef89d5

Minimal linting

4243dd8

Incremental commit on Xarray accessor

c75aa07

Incremental commit on accessor

f680eaa

Incremental commit on accessor

886884e

Merge remote-tracking branch 'upstream/main' into add_xarray_accessor

24494cf

Incremental commit on accessor

2222171

rhugonnet mentioned this pull request Dec 7, 2024

Add dem Xarray accessor mirroring DEM class GlacioHack/xdem#656

Draft

11 tasks

rhugonnet marked this pull request as ready for review December 7, 2024 02:03

remi-braun mentioned this pull request Dec 9, 2024

Daskify 'rasters' module: roadmap sertit/sertit-utils#27

Open

7 tasks
