Describe the bug
Sometimes, when the network for one or more data nodes has degraded, the existing leader-follower health-check mechanism appears to be insufficient. This is partly because the default behavior is 3 retries with a 30-second timeout: two checks can time out, and the third retry may succeed just before it is about to time out, so the degraded node is never marked as failed.
Reducing the timeout window is only an afterthought; measuring sliding-window latency between the nodes during health-checks would be a better way to detect network issues early on (see the sketch below).
Adaptive Replica Selection is not a reliable signal, because the workload and payload vary with the shards each node hosts, which makes it difficult to infer network degradation from it. Health-checks, on the other hand, are uniform in size and payload and agnostic to the data hosted by the data nodes.
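As a rough illustration of the sliding-window idea, the leader could keep a small window of recent health-check round-trip times per follower. This is a minimal sketch, assuming a hypothetical tracker fed from the health-check path; the class and method names below are illustrative, not existing OpenSearch APIs:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Minimal sketch of a sliding-window latency tracker for leader-follower
 * health-checks. All names here are hypothetical, not OpenSearch APIs.
 */
public class HealthCheckLatencyWindow {
    private final int windowSize;
    private final Deque<Long> samplesMillis = new ArrayDeque<>();
    private long sumMillis = 0;

    public HealthCheckLatencyWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Record the round-trip time of one health-check (timed-out checks included). */
    public synchronized void record(long rttMillis) {
        samplesMillis.addLast(rttMillis);
        sumMillis += rttMillis;
        if (samplesMillis.size() > windowSize) {
            sumMillis -= samplesMillis.removeFirst();  // evict the oldest sample
        }
    }

    /** Average round-trip time over the current window, in milliseconds. */
    public synchronized double averageMillis() {
        return samplesMillis.isEmpty() ? 0.0 : (double) sumMillis / samplesMillis.size();
    }

    /** True once a full window's average exceeds the given threshold. */
    public synchronized boolean isDegraded(long thresholdMillis) {
        return samplesMillis.size() == windowSize && averageMillis() > thresholdMillis;
    }
}
```

The leader would keep one such window per follower, record every check's RTT whether it succeeded or timed out, and raise a degradation signal once the average drifts toward the timeout, well before three consecutive checks have to fail.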
Related component
Cluster Manager
To Reproduce
Set up a multi-node cluster on Linux.
Log onto one of the data nodes and use the tc (traffic control) utility to inject delays large enough to slow the health-checks without exhausting all 3 retries (see the example below).
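For example, using the kernel's netem qdisc (the interface name and delay values below are assumptions; tune the delay to sit just below the configured health-check timeout):

```sh
# Assumes eth0 is the node's transport interface; adjust to your setup.
# 9s +/- 3s of jitter makes some checks time out while others barely pass.
sudo tc qdisc add dev eth0 root netem delay 9000ms 3000ms
# ... observe the leader-follower checks, then remove the rule:
sudo tc qdisc del dev eth0 root netem
```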
Expected behavior
None today; ideally, expose a metric that shows the average delay in health-check response times.
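Reusing the hypothetical HealthCheckLatencyWindow sketched above, the exposed metric could simply be the window average per follower:

```java
public class LatencyMetricDemo {
    public static void main(String[] args) {
        HealthCheckLatencyWindow window = new HealthCheckLatencyWindow(5);
        // Simulated RTTs: three healthy checks, then two near-timeout ones.
        for (long rttMillis : new long[] {40, 45, 38, 9500, 9800}) {
            window.record(rttMillis);
        }
        // This average is what the proposed metric would expose per follower.
        System.out.printf("avg health-check RTT: %.1f ms%n", window.averageMillis());
        System.out.println("degraded (>1000 ms avg): " + window.isDegraded(1000));
    }
}
```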
Additional Details
Plugins
N/A
Screenshots
N/A
Host/Environment
N/A
Additional context
N/A