Support chunking the time/band dimensions #116
Conversation
Still pretty complicated. Probably need some helpers in order to test the more interesting stuff. Would also like to use Hypothesis?
Opening the asset table probably works; reads are still TODO. A few missteps along the way:

* `np.vectorize` on `asset_table_to_reader_and_window` is segfaulting. It may have to do with all the extra arguments? I don't care to gdb it; I just moved the vectorization into the function (a rough sketch of that pattern is below).
* Blockwise wants to fuse the reader table into the reads. I'm not sure it's good that blockwise optimization tries to fuse such a large expansion, but it probably kinda makes sense. Unfortunately, the way to disable it (annotations) feels like a bit of a hack.
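For reference, a minimal sketch of that workaround pattern; the function body and the `_open_one` helper here are stand-ins, not the actual stackstac internals:

```python
import numpy as np

def _open_one(entry, extra_arg):
    # Stand-in for the real per-asset work (opening a reader, computing a window).
    return {"entry": entry, "extra_arg": extra_arg}

def asset_table_to_reader_and_window(asset_table, extra_arg):
    # Do the element-wise loop inside the function instead of wrapping it with
    # np.vectorize, so the extra arguments are just ordinary parameters.
    out = np.empty(asset_table.shape, dtype=object)
    for idx in np.ndindex(asset_table.shape):
        out[idx] = _open_one(asset_table[idx], extra_arg)
    return out

# Tiny usage example with a fake 3x2 asset table:
readers = asset_table_to_reader_and_window(np.empty((3, 2), dtype=object), extra_arg=42)
```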
Passing dtype and fill_value into `fetch_raster_window`, instead of storing them in the reader table, simplifies things a lot. Also lets us get rid of those fields on the reader protocol.
small steps!
[PEP 585](https://www.python.org/dev/peps/pep-0585/) is only available in 3.9. We're still supporting 3.8 for now; the Planetary Computer uses it, as many others probably do.
Update: I've tested this branch out for making moderate-sized median composites, and it does help, but it's not a panacea.

[Dashboard screenshots lost in extraction: cluster memory while computing this median composite, dropping each chunk immediately after it's computed; blue is the "with" series.]

Clearly cluster memory usage is way lower, and runtime is about the same. Great! The dashboard also looks a bit cleaner (fewer transfers). Plus, there are 10x fewer tasks.

BUT, as I said, it's not a panacea. I've tried doing some actually-big composites, like processing 30TiB, and the graph is still just way too big. When your input data is 200k chunks, you do three operations and you have 1 million tasks. After uploading 800MiB of graph to the scheduler (!!), I'm still waiting 20min later for anything to happen.

Additionally, probably 95% of the data is NaN, generated by stackstac for the pixels out of bounds of each asset. stackstac is very efficient about generating these NaNs (it's super fast and takes up almost no space), but in the end, those fake broadcast-trick arrays will likely get expanded into full-size NumPy arrays by downstream operations. And they represent tons of useless tasks which just bog down the scheduler.

In the end, I just don't believe that (dask) arrays / DataArrays are a good data structure for geospatial operations at a large scale. At small to medium scale they're decent, since they're intuitive and have so much functionality built in. But at larger scales, the lack of sparse-ness becomes problematic, and you really need to push any compositing/filtering/grouping into the data-loading step. In theory this is the sort of thing optimizations should be good for, but in reality it would be very hard to express currently (dask/dask#7933).
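As a small aside, here's roughly what those broadcast-trick NaN fills look like in plain NumPy; this is just an illustration of the mechanism, not stackstac's actual code:

```python
import numpy as np

# A 2048x2048 "array" of NaNs that is really one float broadcast to that shape.
fake = np.broadcast_to(np.nan, (2048, 2048))
print(fake.strides)   # (0, 0): no 32 MiB buffer actually exists
print(fake.nbytes)    # still reports the full logical size

# The moment a downstream operation touches it, a real array gets allocated.
real = fake + 0
print(real.strides)   # normal strides now; ~32 MiB of real memory
```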
Closes #106 and #105; progress towards #26.

This lets you specify, for example, `chunksize=(20, 1, 2048, 2048)`, or even `chunksize="auto"`. I'm hoping this will help with getting graph size under control for large arrays. Particularly when you're doing composites, chunking a few assets together will greatly reduce graph size.

The neat thing is that through just this parameter, you can totally alter the parallelism strategy for a workflow. Are you doing something that's spatially embarrassingly parallel, where intuitively you'd just do a for-loop over tiles? Use `chunksize=(-1, 1, 1024, 1024)` to process every image within a 1024x1024 tile. Is it spatially parallel, but you want to batch by time as well to keep peak memory down? Try `chunksize=(40, 1, 1024, 1024)` to keep the 1024x1024 tiles, but only in stacks of 40 images at a time. Or even just `chunksize="200MiB"` to get reasonably-sized chunks every time.
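For concreteness, here's roughly how those strategies look at the call site, assuming `items` is an existing sequence of STAC items (e.g. the result of a pystac-client search):

```python
import stackstac

# `items` is assumed to already exist (e.g. from a pystac-client search).

# Spatially embarrassingly parallel: every image for a 1024x1024 tile lands in one chunk.
tiled = stackstac.stack(items, chunksize=(-1, 1, 1024, 1024))

# Still tile-parallel, but batch the time axis in stacks of 40 to cap peak memory.
batched = stackstac.stack(items, chunksize=(40, 1, 1024, 1024))

# Or just target a byte size per chunk and let the chunk boundaries fall out of that.
sized = stackstac.stack(items, chunksize="200MiB")
```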
Implementation-wise, when there are multiple assets in a single chunk, we just load them serially within the task. I think this deserves some research; I have a feeling total runtime might be slower with fewer multi-asset chunks than with more single-asset chunks. I could imagine a small threadpool to overlap the IO might be reasonable.
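Purely as an illustration of that threadpool idea (not something this PR implements), the per-chunk read could look roughly like this, where `reader.read(window)` stands in for whatever the reader protocol exposes:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def fetch_chunk(readers_and_windows, max_threads=4):
    # Overlap the per-asset reads within one chunk instead of looping over them serially.
    def fetch_one(reader_and_window):
        reader, window = reader_and_window
        return reader.read(window)  # stand-in for the actual reader protocol

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        arrays = list(pool.map(fetch_one, readers_and_windows))
    return np.stack(arrays)
```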
This also further speeds up array generation and reduces graph memory usage by using a `BlockwiseDep`, which will soon be available from dask/dask#7417, to represent the spatial component of the array, instead of the materialized `slices_from_chunks` graph we had before (for spatially-enormous arrays, this could be rather large).
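To sketch the difference with a toy class (this only mimics the `BlockwiseDep` interface from dask/dask#7417, it is not stackstac's real implementation): the slice for each spatial block is computed on demand from its block index, instead of materializing every slice up front the way a `slices_from_chunks`-based graph does.

```python
import numpy as np

class LazySlices:
    """Compute the (y, x) slice for a block index on demand.

    Mimics the BlockwiseDep interface (a `numblocks` tuple plus indexing
    by block index) rather than storing every slice in the graph.
    """

    def __init__(self, chunks):
        self.chunks = chunks  # e.g. ((1024, 1024), (1024, 512))
        self.numblocks = tuple(len(c) for c in chunks)
        self.offsets = tuple(np.concatenate([[0], np.cumsum(c)]) for c in chunks)

    def __getitem__(self, block_idx):
        return tuple(
            slice(off[i], off[i + 1]) for off, i in zip(self.offsets, block_idx)
        )

slices = LazySlices(((1024, 1024), (1024, 512)))
print(slices[(1, 1)])  # (slice(1024, 2048), slice(1024, 1536))
```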
And, finally, this adds some tests! They're not enabled in CI yet (that will be a follow-up PR), but there are Hypothesis tests for the core `to_dask` routine.

cc @TomAugspurger
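To give a flavor of the property-based style those tests use (a made-up example, not the actual test suite), one might check that any requested chunksize normalizes into chunks that tile the array exactly:

```python
import hypothesis.strategies as st
from dask.array.core import normalize_chunks
from hypothesis import given

@given(
    shape=st.tuples(*[st.integers(min_value=1, max_value=50)] * 4),
    chunksize=st.tuples(*[st.integers(min_value=1, max_value=50)] * 4),
)
def test_chunks_tile_the_array(shape, chunksize):
    chunks = normalize_chunks(chunksize, shape)
    # Whatever chunksize was requested, the normalized chunks must cover
    # each dimension exactly.
    assert tuple(sum(c) for c in chunks) == shape
```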