Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STAC search pagination is applied before query filtering #579

Closed
christophfriedrich opened this issue Mar 7, 2024 · 6 comments · Fixed by #591
Closed

STAC search pagination is applied before query filtering #579

christophfriedrich opened this issue Mar 7, 2024 · 6 comments · Fixed by #591
Assignees

Comments

@christophfriedrich
Copy link

Regarding STAC API extensions, the README says:

The STAC endpoint implements the query, filter, fields, and sort extensions, all of which are bound to the search endpoint as used with POST requests, with fields and sort additionally bound to the features endpoint.

Fields contained in the item properties must be prefixed with properties., ex properties.dea:dataset_maturity.

From this information, I figured that to limit my results by maximum cloud cover, I need to POST to /stac/search with a body like this:

{
    "query": {
        "properties.eo:cloud_cover": {
            "lte": 10
        }
    }
}

But the numberMatched of the returned result is the same as when I omit this query. This leaves me wondering that maybe my JSON is wrong? Adding an example to the README explaining how to use the query API would really help to rule this out, as right now I'm not sure whether it's a problem with my request or my data. I tried various variants of the request (e.g. adding "gte": 0), but none seems to do anything. But also my data does have the eo:cloud_cover attribute, which also is included in the properties of the returned features, so I don't really see a problem there either...

@Ariana-B Ariana-B self-assigned this Mar 8, 2024
@Ariana-B
Copy link
Contributor

Ariana-B commented Mar 8, 2024

Hi @christophfriedrich,
This is due to an issue with how our STAC API is currently implemented; we are aware of the issue and looking to fix it.
In short, first all datasets that match the time/collection/bbox/etc parameters are being retrieved, and then the query extension is filtering the result. Hence why it doesn't affect numberMatched; however, the datasets actually being returned will all correctly have eo:cloud_cover <= 10.
The best workaround for the time being would probably be to set limit=1000 (to avoid/reduce pagination) and check the length of features rather than the value of numberMatched.

@christophfriedrich
Copy link
Author

christophfriedrich commented Mar 8, 2024

Hi Ariana, thanks for the quick reply and explanation! Indeed I only looked at the numberReturned/numberMatched/context section and didn't pay attention to the actual results, which in fact fulfill the query. However, what wasn't clear to me initially after reading your comment, is that the filtering currently takes place after pagination. So e.g. for my test case, the 20 results matched initially happened to have a cloud cover between 76 and 99%, so that filtering for <10% gives an empty result (and for <90% only 3 results).

That means that I can't be sure that all possible results are included, unless I increase the limit to the total number of datasets. I tested your suggestion of limit=1000, but that request already took 30 seconds to complete and resulted in some 10 MB of data being returned, so is absolutely not feasible for the interactive web application where I want to use this. The whole collection I'm dealing with has some 50,000 datasets, and even smaller bboxes still include 3,000 to 10,000 datasets...

It's nice to hear you're already aware of the issue and keen to fix it -- is there any timeframe when this will be corrected? I wouldn't even care that much about the numberMatched stuff, but moving the filtering before the pagination is crucial, as the current implementation makes the query extension quite unusable for me and even produces answers that look wrong for the uninformed (e.g. apparently empty result when querying for <10% even though there are datasets with <10% in the database).

Is sort applied earlier so that I could make the least cloud-covered images come to the top? 🤔 Alternatively the only workaround I see for me right now is submitting requests with e.g. limit=100 and using the next links until at some point there were 10 actual results included, but that is rather cumbersome...

@christophfriedrich christophfriedrich changed the title Add example for STAC search query extension to README STAC search pagination is applied before query filtering Mar 8, 2024
@Ariana-B
Copy link
Contributor

Ariana-B commented Mar 8, 2024

Sorry, I should have been clearer in my explanation - but yes, that's the essence of the problem. If you paginate through the results you will eventually manage to retrieve all matching datasets but obviously that's not ideal. And unfortunately, all the extensions are essentially broken in the same way for the exact same reason.
Fixing this is my top priority but unfortunately I can't give you an exact timeframe as I anticipate it might require some larger changes to the underlying design/logic that I'll have to look into - but hopefully it won't take more than a week or two.
Apologies for the inconvenience and thank you for raising!

@christophfriedrich
Copy link
Author

No worries, after I realised it I kind of read it between the lines too. And regarding the timeframe you gave quite a good answer already, it's nice to hear that it's the next thing being worked on and I absolutely understand that changing something like this might induce bigger changes to the architecture that are not done in a day. Better do it thoroughly than rushing it. For me it's not super urgent, just wanted to know whether it's likely to stay like this for months to come or fixed soon™️Looking forward to it! :)

@Ariana-B Ariana-B linked a pull request Apr 12, 2024 that will close this issue
@christophfriedrich
Copy link
Author

Hi, thanks for working on this @Ariana-B -- is there a roadmap on when a release can be expected that incorporates your fixes? :)

@Ariana-B
Copy link
Contributor

Unfortunately I'm not certain - that's dependent on when @omad is able to provide a review, as well as other fixes needing to be incorporated into the next release.

@omad omad closed this as completed in #591 Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants