-
Notifications
You must be signed in to change notification settings - Fork 416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize topn requests #5075
optimize topn requests #5075
Conversation
add logic to detect which splits will deliver the top n results for requests. This is only supported for match_all requests, with optional sort_by on timestamp sorting. start_timestamp, end_timestamp as well as a filter on the timestamp field is not supported currently but could be.
// we want to detect cases where we can convert some split queries to count only queries | ||
let num_requested_docs = request.start_offset + request.max_hits; | ||
match self { | ||
CanSplitDoBetter::SplitIdHigher(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we apply the logic on CanSplitDoBetter
instead of the actual sort order here?
Can't we have CanSplitDoBetter set to informative even though docs are requested to be ordered by timestamp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah answering my own question. if sorted by timestmap, but we don't have a bound, then CanSplitDoBetter
is set to Higher or Lower but with None
quickwit/quickwit-search/src/leaf.rs
Outdated
// | ||
// Calculate the number of splits which are guaranteed to deliver enough documents. | ||
let num_splits = count_required_splits(&split_with_req, num_requested_docs); | ||
assert!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the justification of this assert? Why is this true? Where is the code that enforces it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was there to safeguard the code below for changes later, but I changed the algorithm to handle num_splits=0
.unwrap(); | ||
for (split, ref mut request) in split_with_req.iter_mut().skip(num_splits) { | ||
if split.timestamp_end() < smallest_start_timestamp { | ||
disable_search_request_hits(request); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't request count, the resulting disable_search_request
does nothing.
Do we have a pass after that to entirely remove this split / request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strangely currently we don't have a no count parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already have an CountHits enum. We could add that or use the Option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't request count, the resulting disable_search_request does nothing.
No, it disables the returning of hits from that split. That leaves only the count, and in this simple query case the count can be served from metadata and is basically free.
Initially we don't know which split may contain the best splits so we return hits from all splits.
You already have an CountHits enum. We could add that or use the Option?
Yes, but that would be relevant for cases where we can't serve counts from metadata. But in that case we have already the run_all_splits
optimization, which means we probably run num_cpu splits before early exiting.
"We should always have at least one split to search" | ||
); | ||
// | ||
// If we know that some splits will deliver enough documents, we can convert the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great explanation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments
On SSD:ERROR: Not the same queries, cannot compare, difference: {'big_term_query_count_only', 'match_all_count_only', 'last_6_hours_sort_timestamp_2', 'last_6_days_sort_timestamp', 'last_6_hours_sort_timestamp'} On GCS:ERROR: Not the same queries, cannot compare, difference: {'match_all_count_only', 'last_6_hours_sort_timestamp', 'last_6_days_sort_timestamp', 'big_term_query_count_only', 'last_6_hours_sort_timestamp_2'} |
add logic to detect which splits will deliver the top n results for
requests. This is only supported for match_all requests, with optional
sort_by on timestamp sorting.
The change extends the python tests to distribute ndjson to random splits
start_timestamp, end_timestamp as well as a filter on the timestamp field
is not supported currently but could be.
search_after is also not supported currently
https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1861,1862&search_metric=engine_duration
Compare S3 Fetch Requests:
https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1799,1864&search_metric=object_storage_fetch_requests
Addresses #5032