Describe the bug
Sometimes, when the network for one or more data nodes has degraded, the existing leader-follower health-check mechanism appears to be insufficient. This is partly because the default behavior is 3 retries with a 30-second timeout: two checks can time out, and the third retry may succeed just before it is about to time out, so the degraded node is never marked as failed.
Reducing the timeout window is only an afterthought; measuring sliding-window latency between the nodes during health-checks would be a better way to detect network issues early on (see the sketch below).
Adaptive Replica Selection is not a reliable signal, because the workload and payload vary with the shards each node hosts, which makes it difficult to infer network degradation from it. Health-checks, on the other hand, are uniform in size and payload and agnostic to the data hosted by the data nodes.
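As a rough illustration of the sliding-window idea, the leader could keep a small window of recent health-check round-trip times per follower. This is a minimal sketch, assuming a hypothetical tracker fed from the health-check path; the class and method names below are illustrative, not existing OpenSearch APIs:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Minimal sketch of a sliding-window latency tracker for leader-follower
 * health-checks. All names here are hypothetical, not OpenSearch APIs.
 */
public class HealthCheckLatencyWindow {
    private final int windowSize;
    private final Deque<Long> samplesMillis = new ArrayDeque<>();
    private long sumMillis = 0;

    public HealthCheckLatencyWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Record the round-trip time of one health-check (timed-out checks included). */
    public synchronized void record(long rttMillis) {
        samplesMillis.addLast(rttMillis);
        sumMillis += rttMillis;
        if (samplesMillis.size() > windowSize) {
            sumMillis -= samplesMillis.removeFirst();  // evict the oldest sample
        }
    }

    /** Average round-trip time over the current window, in milliseconds. */
    public synchronized double averageMillis() {
        return samplesMillis.isEmpty() ? 0.0 : (double) sumMillis / samplesMillis.size();
    }

    /** True once a full window's average exceeds the given threshold. */
    public synchronized boolean isDegraded(long thresholdMillis) {
        return samplesMillis.size() == windowSize && averageMillis() > thresholdMillis;
    }
}
```

The leader would keep one such window per follower, record every check's RTT whether it succeeded or timed out, and raise a degradation signal once the average drifts toward the timeout, well before three consecutive checks have to fail.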
Related component
Cluster Manager
To Reproduce
Set up a multi-node cluster on Linux.
Log onto one of the data nodes and use the tc (traffic control) utility to inject delays large enough to slow the health-checks without exhausting all 3 retries (see the example below).
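For example, using the kernel's netem qdisc (the interface name and delay values below are assumptions; tune the delay to sit just below the configured health-check timeout):

```sh
# Assumes eth0 is the node's transport interface; adjust to your setup.
# 9s +/- 3s of jitter makes some checks time out while others barely pass.
sudo tc qdisc add dev eth0 root netem delay 9000ms 3000ms
# ... observe the leader-follower checks, then remove the rule:
sudo tc qdisc del dev eth0 root netem
```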
Expected behavior
None today; ideally, expose a metric that shows the average delay in health-check response times.
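Reusing the hypothetical HealthCheckLatencyWindow sketched above, the exposed metric could simply be the window average per follower:

```java
public class LatencyMetricDemo {
    public static void main(String[] args) {
        HealthCheckLatencyWindow window = new HealthCheckLatencyWindow(5);
        // Simulated RTTs: three healthy checks, then two near-timeout ones.
        for (long rttMillis : new long[] {40, 45, 38, 9500, 9800}) {
            window.record(rttMillis);
        }
        // This average is what the proposed metric would expose per follower.
        System.out.printf("avg health-check RTT: %.1f ms%n", window.averageMillis());
        System.out.println("degraded (>1000 ms avg): " + window.isDegraded(1000));
    }
}
```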
Additional Details
Plugins
N/A
Screenshots
N/A
Host/Environment
N/A
Additional context
N/A