[BUG] Health-checks is not sufficient to detect network degradation #17236

Open
kkhatua opened this issue Feb 3, 2025 · 0 comments
Labels
bug Something isn't working Cluster Manager untriaged

Comments

@kkhatua
Member

kkhatua commented Feb 3, 2025

Describe the bug

Sometimes, when the network for one or more data nodes has degraded, the existing leader-follower health-check mechanism appears to be insufficient. This is partly because the default approach allows 3 retries with a 30-second timeout. As a result, the first 2 attempts can time out and the third can succeed just before its own timeout expires, so the node is still treated as healthy.
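To make the failure mode concrete, here is a minimal sketch of the retry arithmetic. It assumes each of the 3 attempts gets its own 30-second timeout; the function and constant names are hypothetical and are not OpenSearch APIs.

```python
RETRY_COUNT = 3          # assumed: attempts allowed before a node is marked faulty
TIMEOUT_SECONDS = 30.0   # assumed: timeout applied to each attempt


def evaluate_health_check(response_times):
    """Return (healthy, elapsed_seconds) for one round of attempts.

    response_times holds the hypothetical time each attempt would need to
    get a response. An attempt at or above the timeout counts as failed.
    """
    elapsed = 0.0
    for rtt in response_times[:RETRY_COUNT]:
        if rtt < TIMEOUT_SECONDS:
            # First attempt that answers in time ends the round as healthy.
            return True, elapsed + rtt
        # A failed attempt costs the full timeout before the next retry.
        elapsed += TIMEOUT_SECONDS
    return False, elapsed


# Two timeouts followed by a response that barely beats the third timeout.
healthy, elapsed = evaluate_health_check([45.0, 40.0, 29.5])
print(healthy, elapsed)  # True 89.5
```

Under these assumptions the node passes the check even though it needed almost 90 seconds to answer once.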

Reducing the timeout window is an afterthought. Measuring sliding-window latency between the nodes during health checks can be a good way to detect network issues early on.
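A rough sketch of what a sliding-window latency measurement could look like is below. The class name, window size, and threshold are all assumptions chosen for illustration, not an existing OpenSearch component or setting.

```python
from collections import deque


class SlidingWindowLatency:
    """Track the most recent health-check latencies for one peer node."""

    def __init__(self, window_size=20, threshold_ms=500.0):
        # deque(maxlen=...) silently drops the oldest sample once full.
        self.samples = deque(maxlen=window_size)
        self.threshold_ms = threshold_ms  # hypothetical alerting threshold

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def average(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def is_degraded(self):
        # Only judge once the window is full, to avoid reacting to one outlier.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.average() > self.threshold_ms


tracker = SlidingWindowLatency(window_size=3, threshold_ms=1000.0)
for latency_ms in (20.0, 25_000.0, 29_000.0):
    tracker.record(latency_ms)
print(tracker.is_degraded())  # True
```

Every sample here would still pass a 30-second timeout, yet the window average flags the node long before a health check actually fails.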

Adaptive Replica Selection is not a reliable signal for this, because the workload and payload vary with the shards each node hosts, so inferring network degradation from it is difficult. Health checks, on the other hand, are more uniform in size and payload and are agnostic to the data hosted by the data nodes.

Related component

Cluster Manager

To Reproduce

  1. Set up a multi-node cluster on Linux.
  2. Log on to one of the data nodes and use the tc (traffic control) utility to simulate delays that are long, but still short enough that the health check does not time out on all 3 retries.
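For step 2, one possible way to add the delay is the netem queueing discipline. The sketch below only builds and prints the command line; the interface name `eth0` and the delay and jitter values are assumptions to adjust for the actual node and timeout settings.

```python
import shlex


def netem_delay_command(interface, delay_ms, jitter_ms=0):
    """Build a `tc` command that delays outgoing packets on one interface."""
    argv = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
            "delay", f"{delay_ms}ms"]
    if jitter_ms:
        # An optional second value adds random variation around the delay.
        argv.append(f"{jitter_ms}ms")
    return argv


# Assumed values: a 25 s delay with 2 s jitter stays under a 30 s timeout.
print(shlex.join(netem_delay_command("eth0", 25000, 2000)))
# tc qdisc add dev eth0 root netem delay 25000ms 2000ms
```

The printed command needs root privileges to run on the data node, and `tc qdisc del dev eth0 root` removes the delay afterwards.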

Expected behavior

None in particular. It would probably help to expose a metric that shows the average delay in response times.

Additional Details

Plugins
Please list all plugins currently enabled.
# N/A

Screenshots
If applicable, add screenshots to help explain your problem.
# N/A

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]
    # N/A

Additional context
Add any other context about the problem here.
# N/A
