-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficiency of item_collection() #659
Comments
I'm going to move this question to the pystac-client repo since it is primarily about the API. But my sense is that the performance will depend on the STAC API that you are accessing. If it is public then it would be helpful to have a functioning example. |
Some providers will offer bulk access to STAC collections through geoparquet datasets (for example https://planetarycomputer.microsoft.com/docs/quickstarts/stac-geoparquet/). Otherwise, you'll be limited by the speed of your network and the server (and pystac / pystac-client is currently sequential & syncrhonous. See #4 and #77 ). You might able to speed things up a bit by increasing the |
@TomAugspurger Very useful information thank you! Your links, especially this from #77 is close to what spurred me to ask this question. I can provide some more context... I have been interested in getting a better understanding of cloud-free data availability through time from HLS across large areas (e.g., make sure there is enough clear observations during a given year/time of year to make a project feasible). To do this across Canada, I created a 120 km fishnet and got points representing the centroid of each square (~900 points). Then I pretty much created a for loop that for each point...
There is also a simple try/except bit to avoid server issues stopping the process (can come back to those points at the end). For a large number of points this can be a pretty slow process, taking anywhere from a minute or so (e.g., southern point in cloudy area with a few hundred time-steps to consider) to almost an hour (e.g., northern Arctic point with tons of orbit overlap with thousands of time-steps to consider). For ~900 points, it takes several days to complete. This is usually a one-off thing I would run at the beginning of a project to explore the imagery time-series, but it serves as a good opportunity to find bottlenecks in my data access pipeline for when I attempt analyses that involve working with every pixel through time. For these 1 pixel x 1 band x many time-step x many points example, the pystac-client portion eats up a significant amount of time since so many images need to be considered (e.g., I probably "touched" almost 3 HLS million images to get those Canada-wide numbers). All this is to say I am still pretty new to working with imagery through STAC and these other tools, and learning about best practices for given workflows by checking out GitHub discussions has been very helpful in my learning. I will consider the options you noted above and test out the point-based example you linked to above. If you have any other tips related to this type of approach, feel free to share! I won't be able to do more testing on this until next week at the earliest, so if you have nothing else to add, you can close this and I can come back to it later. |
This simple change (setting limit to 250, the max value allowed) led to a big speed up by itself! For a year of HLS data over a 60 x 60 km square I can now build a dask xarray with stackstac in under 10 seconds, when with the default limit (100?) it is closer to 30 seconds. Also just an FYI that the pystac documentation for search is out-of-date: https://pystac-client.readthedocs.io/en/stable/api.html#pystac_client.Client.search It still says that max_items defaults to 100. |
After a little more testing, I think the current limit for However, when I set limit to 10, I got similar search times as the default (around 30 seconds). Also note I ran these tests 5 times each to confirm the numbers, while restarting my kernel each time to make sure they are independent (no cached data). |
There might be some server-side settings that affect the default and maximum value for |
Got it thanks, I will need to play around with this a bit more - since I started getting 502 Bad Gateway errors with limit = 250 in the search over larger grid squares (120 x 120 km in this case) - but not with limit = 100. I may need a little system where I change the limit value based on the size of the area I am pulling data from to get best speed while avoiding server-related errors. |
Hello,
I use pystac_client as my window into STACs, from which I load satellite imagery items into stackstac (or odc-stac) to create xarray datacubes. I have been attempting to increase the efficiency of my processing workflows since I want to use it for some wider area/dense in time analysis.
One bottleneck I currently have with pystac_client is item_collection(). My typical workflow:
With search parameters that find 300 items, item_collection() requires about 50 seconds. If I try to grab more items (e.g., all HLS observations over 5-10 years) - this can take several minutes. I just wanted to understand if I am being as efficient as I can here, or if there is something I can do to reduce the time it takes to build the collection. Note that I am on a government network that could be the culprit causing some slow down.
Thank you for your time!
The text was updated successfully, but these errors were encountered: