Coordinator can return partial results after the timeout when allow_partial_search_results is true #16681

Open
wants to merge 1 commit into
base: main
from

Conversation

kkewwei
Contributor

@kkewwei kkewwei commented Nov 19, 2024

Description

In the query phase, the coordinator searches each shard concurrently. If any shard is blocked or responds very slowly, the coordinating node stays stuck even if a timeout is set.

This PR adds a timed wait: if the timeout is exceeded, the coordinator treats the shard as failed and goes on to the fetch phase.
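
To illustrate the idea only, here is a minimal sketch of a coordinator that waits on per-shard futures against a single deadline and records shards that miss it as failed. It is not the code in this PR: `PartialResultsSketch`, `QueryPhaseResult`, and `awaitShards` are hypothetical names, and shard responses are simplified to strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration; none of these names come from OpenSearch. */
final class PartialResultsSketch {

    /** Collected shard responses plus the indexes of shards that were given up on. */
    static final class QueryPhaseResult {
        public final List<String> hits = new ArrayList<>();
        public final List<Integer> failedShards = new ArrayList<>();
    }

    /**
     * Waits for every shard future, but never past one shared deadline.
     * A shard that times out or throws is recorded as failed instead of
     * blocking the caller, so the next phase can start with partial results.
     */
    public static QueryPhaseResult awaitShards(List<CompletableFuture<String>> shardFutures, long timeoutMillis) {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        QueryPhaseResult result = new QueryPhaseResult();
        for (int shard = 0; shard < shardFutures.size(); shard++) {
            // Time left until the shared deadline; zero means "do not wait any longer".
            long remainingNanos = Math.max(0L, deadlineNanos - System.nanoTime());
            try {
                result.hits.add(shardFutures.get(shard).get(remainingNanos, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException e) {
                result.failedShards.add(shard);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and treat the shard as failed.
                Thread.currentThread().interrupt();
                result.failedShards.add(shard);
            }
        }
        return result;
    }
}
```

In this sketch the caller decides what to do with `failedShards`. A setting such as `allow_partial_search_results` would determine whether a non-empty list fails the whole request or lets it continue to the fetch phase with the hits that did arrive.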

Related Issues

Resolves #817 (comment)

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added the enhancement Enhancement or improvement to existing feature or request label Nov 19, 2024
@kkewwei kkewwei changed the title opensearch should returns partial results after the timeout in coordinate node when allow_partial_search_results is true Coordinator can return partial results after the timeout when allow_partial_search_results is true Nov 19, 2024
Contributor

❌ Gradle check result for 61d84d1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

long leftTimeMills;
if (queryPhase) {
    // it's costly in query phase.
    leftTimeMills = task.queryPhaseTimeout() - (System.currentTimeMillis() - task.startTimeMills());
Collaborator

What's the motivation behind the queryPhaseTimeoutPercentage concept? I think whether the query or the fetch phase takes longer depends on the query and the setup, and it doesn't seem intuitive for a user to understand how to use this. For example, a query that matches a lot of sparse documents using searchable snapshots might spend much longer in the fetch phase, while a query that performs complex aggregations might spend much longer in the query phase.

Contributor Author

I want to reserve some time for the subsequent phase as a safeguard, so that each phase is guaranteed a share of the overall timeout. If the query phase finishes quickly, the time it did not use remains available to the later phases.

Without such a reservation, a shard that is blocked in the query phase can use up the entire timeout. Even if the query phase returns once the timeout expires, no time is left for the subsequent phases, and the timeout becomes meaningless.
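
As a rough sketch of that reservation, assuming the percentage is an integer between 0 and 100: the query phase gets a fixed share of the total timeout, and the time left in the phase is the cap minus the elapsed time, as in the snippet under review. `PhaseBudgetSketch` and its methods are hypothetical names, not the PR's API.

```java
/** Hypothetical illustration of splitting one timeout between search phases. */
final class PhaseBudgetSketch {

    /** Share of the total timeout that the query phase may consume. */
    public static long queryPhaseTimeoutMillis(long totalTimeoutMillis, int queryPhasePercent) {
        if (queryPhasePercent < 0 || queryPhasePercent > 100) {
            throw new IllegalArgumentException("percent must be within [0, 100]");
        }
        return totalTimeoutMillis * queryPhasePercent / 100;
    }

    /** Time left in a phase: its cap minus the time elapsed since the task started, never negative. */
    public static long remainingMillis(long phaseTimeoutMillis, long startMillis, long nowMillis) {
        return Math.max(0L, phaseTimeoutMillis - (nowMillis - startMillis));
    }
}
```

With a 10,000 ms timeout and 80 percent, the query phase is capped at 8,000 ms, so at least 2,000 ms stays available for the fetch phase even when a shard blocks for the whole query budget.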

Collaborator

Hmm, should we have separate timeouts for the coordinator and the shard-level search tasks then? I still think it's pretty unintuitive to use a percentage like this.

Contributor

❌ Gradle check result for 5172db6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 5638e3c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 17cef4f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…artial_search_results is true

Signed-off-by: kkewwei <[email protected]>
Signed-off-by: kkewwei <[email protected]>
Contributor

❌ Gradle check result for f2cb9f7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels
enhancement Enhancement or improvement to existing feature or request
Development

Successfully merging this pull request may close these issues.

Support timeout based search request cancellation
2 participants