Support chunking the time/band dimensions #116

Merged: 22 commits merged into main on Feb 2, 2022
Conversation

gjoseph92
Owner

Closes #106, #105, progress towards #26.

This lets you specify, for example, chunksize=(20, 1, 2048, 2048), or even chunksize="auto". I'm hoping this will help get graph size under control for large arrays. Particularly when you're doing composites, chunking a few assets together will greatly reduce graph size.

The neat thing is that through just this parameter, you can totally alter the parallelism strategy for a workflow. Are you doing something that's spatially embarrassingly parallel, where intuitively you'd just do a for-loop over tiles? Use chunksize=(-1, 1, 1024, 1024) to process every image within a 1024x1024 tile. Is it spatially parallel, but you want to batch by time as well to keep peak memory down? Try chunksize=(40, 1, 1024, 1024) to keep the 1024x1024 tiles, but only in stacks of 40 images at a time. Or even just chunksize="200MiB" to get reasonably-sized chunks every time.
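For concreteness, here's roughly what those calls look like (a sketch; it assumes a `pystac` item collection named `items` is already in scope):

```python
import stackstac

# All items stacked along time, one asset per band chunk, 1024x1024 spatial tiles:
# good for spatially embarrassingly-parallel work over a long time series.
data = stackstac.stack(items, chunksize=(-1, 1, 1024, 1024))

# Batch the time axis in groups of 40 to cap peak memory per chunk:
data = stackstac.stack(items, chunksize=(40, 1, 1024, 1024))

# Or just ask for reasonably-sized chunks and let dask pick the shape:
data = stackstac.stack(items, chunksize="200MiB")
```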

Implementation-wise, when there are multiple assets in a single chunk, we just load them serially within the task. I think this deserves some research; I have a feeling total runtime might be slower with fewer multi-asset chunks than with more single-asset chunks. I could imagine a small threadpool to overlap IO would be reasonable.
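If the serial loop does turn out to be slower, a small per-task threadpool is one option. A rough sketch, not the actual reader interface — the `readers` objects with a `.read()` method and the preallocated output are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fetch_chunk(readers, spatial_shape, dtype, fill_value, max_threads=4):
    # One output band per reader; start from the fill value so anything a
    # reader doesn't cover stays at fill_value (e.g. NaN).
    out = np.full((len(readers),) + spatial_shape, fill_value, dtype=dtype)

    # The reads are IO-bound, so a handful of threads can overlap the network
    # latency that a serial loop would pay one request at a time.
    with ThreadPoolExecutor(max_workers=min(max_threads, len(readers))) as pool:
        futures = {pool.submit(reader.read): i for i, reader in enumerate(readers)}
        for future, i in futures.items():
            out[i] = future.result()
    return out
```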

This also further speeds up array generation and reduces graph memory usage by using a BlockwiseDep, which will soon be available from dask/dask#7417, to represent the spatial component of the array, instead of the materialized slices_from_chunks graph we had before (for spatially-enormous arrays, this could be rather large).
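To see why the materialized approach gets heavy, here's what `slices_from_chunks` produces for a large spatial grid (illustration only; the PR replaces this with a symbolic `BlockwiseDep` that yields each slice on demand instead of storing them all in the graph):

```python
from dask.array.core import slices_from_chunks

# Roughly 100k pixels per side, in 1024x1024 tiles.
chunks = ((1024,) * 98, (1024,) * 98)

slices = slices_from_chunks(chunks)
print(len(slices))   # 9604 slice tuples, all held in the graph up front
print(slices[0])     # (slice(0, 1024, None), slice(0, 1024, None))
```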

And, finally, this adds some tests! They're not enabled in CI yet (will be a follow-up PR), but there are Hypothesis tests for the core to_dask routine.
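The PR's Hypothesis tests target `to_dask` directly; as a flavor of the approach (not the actual test code), a property-based test generates arbitrary shapes and chunkings and asserts invariants that must always hold:

```python
import dask.array as da
import hypothesis.strategies as st
from hypothesis import given


@given(
    shape=st.tuples(*(st.integers(min_value=1, max_value=40) for _ in range(4))),
    chunk=st.one_of(st.just("auto"), st.integers(min_value=1, max_value=40)),
)
def test_chunks_always_cover_shape(shape, chunk):
    # Whatever chunking is requested, the per-axis chunk sizes must sum back
    # to the array's shape, and no chunk may be empty.
    arr = da.zeros(shape, chunks=chunk)
    assert tuple(sum(c) for c in arr.chunks) == shape
    assert all(size > 0 for axis in arr.chunks for size in axis)
```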

cc @TomAugspurger

Still pretty complicated. Probably need some helpers in order to test the more interesting stuff. Would also like to use Hypothesis.

Opening the asset table probably works; reads are still TODO. A few missteps along the way:
* `np.vectorize` on `asset_table_to_reader_and_window` is segfaulting. May have to do with all the extra arguments? I don't care to gdb it; I just moved the vectorization into the function.
* Blockwise wants to fuse the reader table into the reads. I'm not sure it's good that blockwise optimization tries to fuse such a large expansion, but it probably kind of makes sense. Unfortunately the way to disable it (annotations) feels like a bit of a hack.

Passing dtype and fill_value into `fetch_raster_window`, instead of storing them in the reader table, simplifies things a lot. It also lets us get rid of those fields on the reader protocol.

[PEP 585](https://www.python.org/dev/peps/pep-0585/) is only available in 3.9. We're still supporting 3.8 for now; Planetary Computer uses it, as probably do many others.
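A quick illustration of the constraint (nothing PR-specific): on 3.8, builtin generics can't be evaluated in annotations, so you either keep the `typing` aliases or defer annotation evaluation:

```python
# Must come before any other statements: makes annotations lazy strings,
# so `list[tuple[slice, slice]]` parses fine even on Python 3.8.
from __future__ import annotations

from typing import List, Tuple


def pep585_style(windows: list[tuple[slice, slice]]) -> None:
    ...


def typing_style(windows: List[Tuple[slice, slice]]) -> None:
    # The 3.8-safe spelling without relying on the __future__ import.
    ...
```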
gjoseph92 marked this pull request as ready for review on February 1, 2022, 00:25
@gjoseph92
Owner Author

Update: I've tested this branch out for making moderate-sized median composites, and it does help, but it's not a panacea.

I'm doing stack(..., chunksize=(-1, 1, "128MiB", "128MiB")).median("time") on a small cluster (4 workers, 16 CPUs, 128GiB memory), loading Sentinel-2 from PC.

Note that the -1 chunksize in the first dimension basically makes "spatial skewers": each chunk covers a small spatial tile and loads every single item into one tall array for that tile. This is great specifically for medians, since it's the chunk pattern you need for median("time") anyway, so it saves an expensive rechunking step.
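In xarray/dask terms, the skewer layout means the time axis is already a single chunk, so the reduction never has to rechunk (a sketch, assuming the same `items` collection as above):

```python
import stackstac

# One chunk spans the whole time axis for each ~128 MiB spatial tile.
data = stackstac.stack(items, chunksize=(-1, 1, "128MiB", "128MiB"))
composite = data.median("time")

# With temporal chunks of 1, the equivalent computation would first need
# something like data.chunk({"time": -1}) -- an expensive all-to-all shuffle.
```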

Here's cluster memory use while computing this median composite (dropping each chunk immediately after it's computed, simulating a .to_zarr at the end or something):

[screenshot: cluster memory over time for both runs]

Blue is this PR with -1 temporal chunks; orange is the current behavior with temporal chunks of 1.

Clearly cluster memory usage is way lower, and runtime is about the same—great!

The dashboard also looks a bit cleaner (fewer transfers). Plus, there are 10x fewer tasks with -1 chunks!

-1 chunks with this PR:
[dashboard screenshot: task stream with the "skewer" chunking]

Chunks of 1, current behavior:
[dashboard screenshot: task stream with the current spatio-temporal chunking]


BUT as I said, it's not a panacea. I've tried doing some actually big composites, like processing 30TiB, and the graph is still just way too big. When your input data is 200k chunks, you do three operations and you have a million tasks. After uploading 800MiB of graph to the scheduler (!!), I'm still waiting 20 minutes later for anything to happen.

Additionally, probably 95% of the data is NaN generated by stackstac for the pixels out-of-bounds of each asset. stackstac is very efficient about generating these NaNs—it's super fast and takes up almost no space—but in the end, those fake broadcast-trick arrays will likely get expanded into full-size NumPy arrays by downstream operations. And they represent tons of useless tasks which just bog down the scheduler.
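The broadcast trick itself is plain NumPy: a single NaN viewed with zero strides, which only becomes real memory once something downstream writes to or copies it:

```python
import numpy as np

# Virtually a 2048x2048 array of NaN, but backed by a single 8-byte value.
fill = np.broadcast_to(np.float64("nan"), (2048, 2048))
print(fill.strides)   # (0, 0) -- no per-pixel memory is allocated
print(fill.nbytes)    # 33554432 reported, but the real buffer is 8 bytes

# Any downstream op that needs a writable or contiguous array materializes it:
dense = fill + 0.0
print(dense.nbytes)   # now a real 32 MiB of NaNs
```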

In the end, I just don't believe that (dask) arrays / DataArrays are a good data structure for geospatial operations at a large scale. At small to medium scale they're decent, since they're intuitive and have so much functionality built in. But at larger scales, the lack of sparseness becomes problematic. And you really need to push any compositing/filtering/grouping into the data-loading step. In theory this is the sort of thing optimizations should be good for, but in reality it would be very hard to express currently (dask/dask#7933).

gjoseph92 merged commit 0bc305f into main on Feb 2, 2022
gjoseph92 deleted the from-array branch on February 2, 2022, 05:41