kv/kvserver: ambiguous errors returned from a cluster with only 1 unhealthy node and no leases #137717
Comments
Generally, the contract of DistSender.Send is to retry failures and to return an ambiguous result in only one specific case: we sent a modifying request to a node, have not received a response from it, and cannot retry on other stores without knowing the outcome of that request. There are two cases in which we decide that we won't get a response back:
In all other cases we should (and mostly do) return a different, non-ambiguous error. We recently added a case where the connection is still valid but the circuit breaker is tripped on a replica. However, this was intended to handle cases where the leaseholder was stuck, not a non-leaseholder replica. In these two cases an ambiguous response is the correct response, but in any other case we don't need to return it.

In this specific case, the range was healthy on two of its three replicas, and the leaseholder was on one of the healthy nodes. The single-replica case is not handled any differently, and typically we still only return an ambiguous error for the two cases described above. The error that happens with this request is specific to the structure of DistSender and when it retries. For the given request, there is a valid leaseholder, and the request would have succeeded if the client had either the correct cache or no cache at all. It failed because of a specific type of stale cache.
Sorry, never mind my comment -- I misunderstood the number of nodes (and therefore the number of replicas).
That checks out from the error snippet above. However, I can't think of a way around this -- is there a proposed solution?
The solution would be to handle the retry logic in the case of circuit breakers differently. The "hole" has to do with when we exit the
The fourth case is the one that is handled incorrectly. The signature of The easiest short-term solution would be to look at the eviction token before and after calling sendToReplicas and if it has changed and we have a The better fix would be to return a struct from
Describe the problem
During a test run where a single node was overloaded and had no leases, a request failed with an ambiguous error. This should have been retried internally and handled by dist_sender.
To Reproduce
I ran the backfill perturbation test with the following command on a 30-node cluster using expiration leases.
However, it didn't even get to the backfill, so this should be reproducible on a 30-node cluster with the following:
Expected behavior
The kv run failed with this error:
Note that at this time there were no leases on n30, both due to the constraints and because it had entered IO overload about 10 minutes earlier.
From a range log perspective on this range, there are the following relevant logs:
So it entered a tripped state at 20:34 and never left it.
On n7:
On n12:
Full kv-distribution log across all nodes for r345
Environment:
Jira issue: CRDB-45717