Add Dask-delayed raster `subsample()`, `reproject()` and `interp_points()` #537
Conversation
@adehecq @ameliefroessl I think this PR is essentially finished and ready for your review 😄. (EDIT: Adding @erikmannerfelt too, as this joins some of the […].)
For tests, monitoring the memory usage during the […].
Finally, the remaining issue is the discrepancy comparing the delayed […].
I have also opened issues or continued discussions in Rioxarray (maybe moving the Dask reprojection there? corteva/rioxarray#119 (comment)), Xarray (maybe covering the case of regular grid interpolation directly there? pydata/xarray#6799 (comment)) and Dask (maybe having a […]).
Really nice tests!! Looking very good :)
Allowing boolean array in `delayed_subsample`
Summary
(Moved from GlacioHack/xdem#508. Some functions were written with help from @ameliefroessl, after discussions that stemmed from GlacioHack/xdem#501.)
This PR adds three out-of-memory raster operations: `subsample()`, `reproject()` and `interp_points()` (related to `xarray.interp()`).

Summary of reasoning behind implementations
For both subsampling and point interpolation, we had to deal with the difficulty of ragged outputs (varying 1D length, different from the 2D input chunks). See this blogpost for a nice example of why this can be tricky: https://blog.dask.org/2021/07/02/ragged-output, and the issue it originated from: dask/dask#7589.
The `map_blocks(drop_axis=)` solution proposed in the blogpost unfortunately loads many of the chunks in memory at once by dropping the axis (essentially rechunking to 1D), which is why we took inspiration from the `dask.delayed` solution instead.

For subsampling: Using Dask's `vindex` (or vectorized Xarray indexing) allows pointwise sampling, but we cannot know the nodata values in advance. In this PR, we compute the number of valid values per chunk in advance, then create a random subsample over the total number of valid values of the array (no matter which chunk they fall in), identify which chunks the sampled values fall in, and sample them out-of-memory chunk by chunk.
The only downside of the method is that the random subsample is chunksize-dependent (for a fixed random state, it will give a different random subsample for different chunk sizes). If the user has multiple arrays to sample at the same points, the `return_indices` argument can solve this, so this seems fine as long as it is clear in the function description.
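A minimal sketch of this chunk-wise subsampling logic, assuming a NaN-based nodata mask and using hypothetical helper names (the PR's actual functions differ):

```python
import numpy as np
import dask
import dask.array as da

def _count_valid(chunk: np.ndarray) -> int:
    # Number of finite (valid) values in one chunk.
    return int(np.isfinite(chunk).sum())

def _sample_chunk(chunk: np.ndarray, indices: np.ndarray) -> np.ndarray:
    # Sample the valid values of one chunk at the given indices
    # (indices are relative to the valid values of this chunk only).
    return chunk[np.isfinite(chunk)][indices]

def delayed_subsample_sketch(darr: da.Array, n: int, seed: int = 42) -> np.ndarray:
    # 1) Count valid values per chunk in advance (small integers, cheap to compute).
    chunk_list = darr.to_delayed().ravel()
    counts = dask.compute(*[dask.delayed(_count_valid)(c) for c in chunk_list])
    # 2) Draw a random subsample over the total number of valid values, whichever chunk they fall in.
    rng = np.random.default_rng(seed)
    total = sum(counts)
    flat_idx = rng.choice(total, size=min(n, total), replace=False)
    # 3) Map each sampled index to its chunk and to an index within that chunk.
    bounds = np.cumsum([0, *counts])
    out = []
    for i, c in enumerate(chunk_list):
        in_chunk = flat_idx[(flat_idx >= bounds[i]) & (flat_idx < bounds[i + 1])] - bounds[i]
        if len(in_chunk) > 0:
            # 4) Sample each chunk out-of-memory: only chunks containing sampled values are loaded.
            out.append(dask.delayed(_sample_chunk)(c, in_chunk))
    return np.concatenate(dask.compute(*out))
```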
For interpolation:
Interpolation along chunked dimensions was recently added to Xarray (pydata/xarray#4155), but the memory usage is very large (pydata/xarray#6799; we checked this ourselves too, see GlacioHack/xdem#501 (reply in thread)) because the grid can be rectilinear.
In this PR, we rely on the fact that a raster grid is regular (equal spacing) along each dimension, which allows us both to map the location of the points to interpolate using less memory and, for per-chunk interpolation, to pass only a couple of values (the X/Y step and the starting X/Y coordinate) instead of a full coordinate vector.
The memory usage is bigger than a single chunk because we transform the input delayed array into overlapping arrays (with a depth of overlap depending on the interpolation method), so overlapping chunks have to be loaded as well. This is fine as long as the chunks are small enough.
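As a rough illustration of these two points (hypothetical helper names, assuming ascending coordinates; not the PR's actual code), both the point-to-chunk mapping and the per-chunk interpolation only need the grid origin and step:

```python
import numpy as np
from scipy.interpolate import interpn

def map_points_to_chunks_sketch(xs, ys, x0, y0, dx, dy, chunks_y, chunks_x):
    # Regular grid: fractional pixel indices follow directly from origin and step,
    # so the full coordinate vectors never need to be built in memory.
    cols = (np.asarray(xs) - x0) / dx
    rows = (np.asarray(ys) - y0) / dy
    # Cumulative chunk boundaries (chunks_y/chunks_x are the per-chunk sizes, e.g. darr.chunks).
    row_bounds = np.cumsum([0, *chunks_y])
    col_bounds = np.cumsum([0, *chunks_x])
    # Index of the chunk each point falls into, along each axis.
    chunk_i = np.searchsorted(row_bounds, rows, side="right") - 1
    chunk_j = np.searchsorted(col_bounds, cols, side="right") - 1
    return chunk_i, chunk_j

def interp_in_chunk_sketch(chunk, x_start, y_start, dx, dy, xs, ys, method="linear"):
    # Rebuild this chunk's 1D coordinates from just two scalars per axis
    # (assuming coordinates ascend with the row/column index for simplicity).
    yy = y_start + dy * np.arange(chunk.shape[0])
    xx = x_start + dx * np.arange(chunk.shape[1])
    return interpn((yy, xx), chunk, np.column_stack((ys, xs)), method=method)
```

In the actual implementation, the overlapping chunks mentioned above additionally provide the few extra pixels needed near chunk edges for higher-order interpolation methods.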
For reprojection:
The only implementation of this to my knowledge is in opendatacube/odc-geo#88 (and maybe also in `pyresample` and `geowombat`?), but it depends on a long chain of classes in `odc-geo` (starting with `GeoBox` and `VariableSizedTiles`). As it didn't seem like too much work, I made a concise stand-alone implementation, with the idea that it could potentially be moved to Rioxarray, or be used in GeoUtils's Xarray accessor without adding any new dependency.
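A loose sketch of the geometry step such an implementation can rely on (hypothetical helper, assuming a north-up grid; not the PR's actual code): express the footprint of each source chunk in the destination CRS with Rasterio and Shapely, so that each destination chunk only depends on the source chunks whose footprints intersect it.

```python
import rasterio.warp
from shapely.geometry import box

def chunk_footprint_sketch(src_transform, src_crs, dst_crs, row_off, col_off, height, width):
    """Footprint of one source chunk in the destination CRS (hypothetical helper).

    `src_transform` is the raster's affine transform, `src_crs`/`dst_crs` are CRS objects,
    and the chunk is located by its pixel offsets and size; a north-up grid is assumed.
    """
    # Corner coordinates of the chunk in the source CRS.
    left, top = src_transform * (col_off, row_off)
    right, bottom = src_transform * (col_off + width, row_off + height)
    # Reproject the bounds (densified along the edges) and keep their bounding box.
    out_bounds = rasterio.warp.transform_bounds(src_crs, dst_crs, left, bottom, right, top, densify_pts=21)
    return box(*out_bounds)
```

Intersecting these footprints with those of the destination chunks gives, for each destination chunk, the subset of source chunks it depends on; each destination chunk can then be produced by a delayed call to `rasterio.warp.reproject` on that subset only.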
Possible contribution to upstream packages
I will start discussions in Rioxarray/Xarray/Dask to see if it makes sense to integrate some of these there directly. I was thinking:
- Rioxarray: `reproject()`, but it would require Shapely as an optional dependency with the current implementation;
- Xarray: `interp()`, on the assumption that the grid is not rectilinear but equal/regular along the interpolated dimensions, and it would have to extend to N-D!
- Dask: `subsample()`, to mirror the `dask.dataframe.sample()` functionality, while `dask.array` doesn't have one.

Next steps
This is a big step forward for #383.
We should not describe these functions too much in the API documentation just yet, but wait until the Xarray accessor is out (as it will greatly facilitate reading the chunked array properly through Rioxarray).
TODO
- `test_delayed` to consistently check out-of-memory operations work as intended,
- `rio.warp.reproject()` fails with an error during multi-threading only? (Forced to single-thread)
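On the first TODO item, a hedged example of one way to check memory usage during an out-of-memory computation (not necessarily what `test_delayed` does; requires `psutil` for Dask's `ResourceProfiler`):

```python
import dask.array as da
from dask.diagnostics import ResourceProfiler

# Array that would take ~3.2 GB if fully loaded, split into ~8 MB chunks.
darr = da.random.random((20000, 20000), chunks=(1000, 1000))

# Sample process memory (in MB) every 0.1 s while the graph executes.
with ResourceProfiler(dt=0.1) as rprof:
    darr.mean().compute(scheduler="threads")

peak_mb = max(r.mem for r in rprof.results)
chunk_mb = 1000 * 1000 * 8 / 1e6
# Illustrative threshold: peak memory should stay within a handful of chunks
# plus the interpreter/library baseline, far below the full array size.
assert peak_mb < 10 * chunk_mb + 300
```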