stackstac.stack to support one time coordinate per unique datetime? #66
As a bit of context, the STAC items in this global IO-LULC dataset all have the same timestamp. AFAICT, if you wanted to load 100 of these items with stackstac, memory usage would be ~100x larger than necessary, since all but one of the 100 time layers at any given pixel would just be NaN padding.
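A rough back-of-the-envelope illustration of that blow-up (the sizes below are made up for the example; only the ~100x ratio comes from the point above):

```python
# Back-of-the-envelope only: stacking N items that share a single timestamp
# yields a (time=N, band, y, x) array, but each pixel gets data from at most
# one item, so all but ~1/N of the cube is NaN padding.
n_items, n_band, ny, nx = 100, 1, 4096, 4096             # assumed example sizes
bytes_per_value = 8                                        # float64

stacked = n_items * n_band * ny * nx * bytes_per_value     # one layer per item
needed = 1 * n_band * ny * nx * bytes_per_value            # one layer per unique datetime

print(stacked / needed)  # -> 100.0, i.e. ~100x more memory than necessary
```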
What about stackstac/to_dask.py, lines 176–178 (at commit 2c799e3)?
I haven't tested the memory usage; I'd be curious what actually happens.
We could think about it, but it might be tricky to implement during the stacking itself. If a chunk of the array could be sourced from multiple GDAL datasets, that would require rethinking a lot of the dask graph generation logic. I might prefer to implement this through optimized mosaic logic instead. The thing we want to skip is generating that all-NaN array for out-of-bounds chunks when we know it will immediately get dropped by a mosaic. There's already logic that avoids even opening an asset for spatial chunks that don't overlap with that asset: see stackstac/prepare.py, lines 233–238 (at commit 2c799e3).
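For illustration, here is a minimal sketch of that kind of bounds short-circuit (not the actual code in prepare.py; the function names and signatures are made up):

```python
def bboxes_intersect(a, b):
    """a and b are (xmin, ymin, xmax, ymax) bounds in the same CRS."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def maybe_read_window(asset_bounds, chunk_bounds):
    # If the asset can't contribute any pixels to this spatial chunk, skip
    # opening the GDAL dataset entirely; the caller can substitute a cheap
    # all-NaN placeholder (e.g. the broadcast trick discussed below) instead.
    if not bboxes_intersect(asset_bounds, chunk_bounds):
        return None
    ...  # otherwise open the dataset and read just the overlapping window
```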
If we did #13, I have a feeling mosaic could look at each spatial chunk, pluck out only the assets that overlap with that spatial chunk, and mosaic only those together. However, I'd want to test the current mosaic performance, because between the broadcast trick and short-circuiting the dataset open, performance might already be pretty good.
So if mosaic performance is good (or can be good), I might be in favor of a …
👍 Either of those sounds perfect to me.
@TomAugspurger or @thuydotm, any interest in trying out stackstac.mosaic?
I think we can conclude that memory usage is fine here. There is a decent amount of communication, but I'm not sure how concerned to be about that (IIUC). Here's the code I ran:

import stackstac
from pystac_client import Client
import planetary_computer as pc

bounds = (120.9, 14.8, 120.91, 14.81)

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["io-lulc"], bbox=bounds)

# Check how many items were returned
items = list(search.get_items())
print(f"Returned {len(items)} Items")

signed_items = [pc.sign(item).to_dict() for item in items]

ds = stackstac.stack(signed_items, epsg=32650, chunksize=4096)

# cluster
from dask_gateway import GatewayCluster

cluster = GatewayCluster()
cluster.scale(28)
client = cluster.get_client()

ds2 = stackstac.mosaic(ds).persist()
Nice to not see any spilling to disk in that performance report!
It definitely would, just like any other reduction (mean, sum, etc.). I haven't looked at the graphs it makes yet, but they may be a little weird, since it's not a tree-reduce graph but conceptually more like a sequential, first-valid-value-wins fill.
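As a rough illustration of that shape (a sketch of the idea only, not stackstac's implementation): the combine step keeps the first non-NaN value per pixel, so the result depends on layer order, unlike a sum or mean that can be tree-reduced in any order.

```python
import numpy as np

def mosaic_fold(layers):
    """Fold layers front to back, keeping the first non-NaN value per pixel."""
    out = np.asarray(layers[0], dtype=float)
    for layer in layers[1:]:
        # Only pixels still missing after earlier layers get filled here,
        # so the result is order-dependent -- unlike sum/mean.
        out = np.where(np.isnan(out), layer, out)
    return out

# Example: two half-empty tiles combine into one full layer.
a = np.array([[1.0, np.nan], [np.nan, np.nan]])
b = np.array([[9.0, 2.0], [3.0, 4.0]])
print(mosaic_fold([a, b]))  # [[1., 2.], [3., 4.]]
```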
Seems like it does!

In [1]: import numpy as np

In [2]: from dask.sizeof import sizeof

In [3]: trick = np.broadcast_to(np.nan, (2048, 2048))

In [4]: sizeof(trick)
Out[4]: 8

In [5]: full = np.full_like(trick, np.nan)

In [6]: sizeof(full)
Out[6]: 33554432

So hopefully dask treats those broadcast all-NaN placeholders as essentially free. One interesting thing to look at would be how the initial chunks of the dataset get prioritized by the scheduler.
Thanks! I had been looking at that landcover dataset a few weeks ago and wondering about an approach.
Following up: is there anything we want to do here? Do we want to add any convenience functionality, or maybe some documentation about this? Or just close?
Convenience functions are good, I think, Gabe.
+1 for a convenience helper.
I'm leaning towards that too. I think that with #116, some of the hard part is already done. The process would basically be to group the layers by unique datetime, mosaic the items that share a timestamp, and concatenate the results back along time.
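A minimal sketch of that idea (flatten_times is a hypothetical name, not an existing stackstac function; it assumes stackstac.mosaic reduces away the time dimension, as discussed above):

```python
import stackstac
import xarray as xr

def flatten_times(arr: xr.DataArray) -> xr.DataArray:
    """Hypothetical helper: one output layer per unique datetime.

    Groups the stacked layers by their time coordinate, mosaics the items
    that share a timestamp, and concatenates the results back along time.
    """
    flattened = [
        stackstac.mosaic(group)
        # Drop any leftover scalar time coord before re-adding time as a
        # length-1 dimension labelled with this group's datetime.
        .drop_vars("time", errors="ignore")
        .expand_dims(time=[t])
        for t, group in arr.groupby("time")
    ]
    return xr.concat(flattened, dim="time")

# e.g. flat = flatten_times(stackstac.stack(signed_items, epsg=32650, chunksize=4096))
```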
Please see below an example running on the Planetary Computer using Esri 10m Land Cover data, where each STAC item is derived from a mosaic of many images. The output is a 4D cube whose time dimension has length 4, with the time coordinate just 2020-06-01T00:00:00 repeated. The documentation clearly states that
``time`` will be equal in length to the number of items you pass in, and indexed by STAC Item datetime.
But it would feel more natural for the DataArray to have one time coordinate per unique datetime in the STAC items. Would stackstac.stack support this feature?

Output:
((4, 1, 185098, 134598), array(['2020-06-01T00:00:00.000000000'], dtype='datetime64[ns]'))
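For context, a hedged reconstruction of the kind of snippet that produces a (shape, unique-times) tuple like the one above; the collection id, bbox, and other parameters are assumptions borrowed from the rest of this thread rather than the exact original example:

```python
import numpy as np
import planetary_computer as pc
import stackstac
from pystac_client import Client

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["io-lulc"], bbox=(120.9, 14.8, 120.91, 14.81))
items = [pc.sign(item).to_dict() for item in search.get_items()]

arr = stackstac.stack(items, epsg=32650, chunksize=4096)

# Every item carries the same datetime, so `time` has one entry per item
# while np.unique collapses them to a single value.
print((arr.shape, np.unique(arr.time.values)))
```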