Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster resolver should produce better errors when there are no healthy endpoints #7749

Open
dfawley opened this issue Oct 16, 2024 · 9 comments
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. P2 Type: Bug

Comments

@dfawley
Copy link
Member

dfawley commented Oct 16, 2024

EDIT: See #7749 (comment) -- (1) below is not correct. (2) should still be fixed.

If there are no healthy addresses, clutserresolver is just passing an empty address list to the child LB policy:

for _, endpoint := range locality.Endpoints {
// Filter out all "unhealthy" endpoints (unknown and healthy are
// both considered to be healthy:
// https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/health_check.proto#envoy-api-enum-core-healthstatus).
if endpoint.HealthStatus != xdsresource.EndpointHealthStatusHealthy && endpoint.HealthStatus != xdsresource.EndpointHealthStatusUnknown {
continue
}
addr := resolver.Address{Addr: endpoint.Address}
addr = hierarchy.Set(addr, []string{priorityName, localityStr})
addr = internal.SetLocalityID(addr, locality.ID)
// "To provide the xds_wrr_locality load balancer information about
// locality weights received from EDS, the cluster resolver will
// populate a new locality weight attribute for each address The
// attribute will have the weight (as an integer) of the locality
// the address is part of." - A52
addr = wrrlocality.SetAddrInfo(addr, wrrlocality.AddrInfo{LocalityWeight: lw})
var ew uint32 = 1
if endpoint.Weight != 0 {
ew = endpoint.Weight
}
addr = weightedroundrobin.SetAddrInfo(addr, weightedroundrobin.AddrInfo{Weight: lw * ew})
addrs = append(addrs, addr)
}
}
return &clusterimpl.LBConfig{
Cluster: mechanism.Cluster,
EDSServiceName: mechanism.EDSServiceName,
LoadReportingServer: mechanism.LoadReportingServer,
MaxConcurrentRequests: mechanism.MaxConcurrentRequests,
TelemetryLabels: mechanism.TelemetryLabels,
DropCategories: drops,
ChildPolicy: xdsLBPolicy,
}, addrs, nil

This means two things:

  1. The child policy will just ignore this update if it previously had healthy addresses, and
  2. The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".

Both of these should be considered bugs.

cc @markdroth @ejona86

@markdroth
Copy link
Member

markdroth commented Oct 16, 2024

The child policy will just ignore this update if it previously had healthy addresses

Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.

The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".

Yeah, the more useful we can make the error message, the better. I did a bunch of work for cases like this in C-core as part of grpc/grpc#22883. The way we're handling this is:

  • The resolver API allows the resolver to return a resolution_note string that provides human-readable context that can be used in cases where it returns a valid address list and service config that the LB policy may not like in combination (e.g., if the address list is empty).
  • For EDS, we set the resolution_note to indicate when there's a problem. For example, we set if when the EDS resource contains no localities or when we get an error for the EDS resource.
  • All of our parent policies propagate this field through updates to children as appropriate. (Prior to implementing gRFC A75, there was a case where we needed to concatenate resolution_note strings from different clusters for aggregate clusters, but that's no longer necessary after the changes from that design.)
  • In both pick_first and our petiole policies (e.g., round_robin), if we get a valid but empty address list, we include the resolution_note in the error message we return.

@dfawley
Copy link
Member Author

dfawley commented Oct 16, 2024

Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.

PF will return a non-OK result and that will trigger a re-resolution, but the point here is that the policy will keep using the old addresses, which is the defined behavior of PF.

@dfawley
Copy link
Member Author

dfawley commented Oct 16, 2024

The way we're handling this is:

I was expecting this needs to be handled by some custom logic in clusterresolver that detects when there are no addresses and substitutes the child picker for another one with a "no endpoints" error.

If you're pushing the EDS resolver update to the child policy, then how do you avoid the same situation where we keep using old addresses when zero address errors are encountered?

@dfawley
Copy link
Member Author

dfawley commented Oct 16, 2024

I think the child policy should report TF

Oops, I guess I misread this part the first time. I don't think the child policy is supposed to report TF in these cases, right? IIUC, PF is supposed to treat a resolver result of 0 addresses the same as a resolver error, which we ignore if we were already READY.

@markdroth
Copy link
Member

If you remember back when we agreed on the "always trust the control plane" principle (internally, see go/grpc-client-channel-principles-revamp), we decided that if the control plane sends us zero addresses, then we need to honor that immediately. Our currently agreed behavior of record is that PF should report TF in this case. It should not ignore the update if it was already READY -- we do that only if we can an error (as opposed to a valid but empty address list).

As you know, I'm not a big fan of the "always trust the control plane" principle, and I would still like us to reevaluate whether that's the right thing, especially now that we have more o11y infrastructure in place in OSS. But until we do that, the above is the behavior that we should be providing in all languages.

@purnesh42H purnesh42H added the Area: xDS Includes everything xDS related, including LB policies used with xDS. label Oct 21, 2024
@easwars
Copy link
Contributor

easwars commented Oct 21, 2024

The child policy will just ignore this update if it previously had healthy addresses, and

clusterresolver's child policy is priority, and if priority receives an update with no addresses, it should error saying no priority is provided, all priorities are removed.

Is that not what are you are seeing?

Where did you run into this? And what exact error were you seeing?

I did check our codebase, and looks like we do not have an e2e style test for a scenario where the clusterresolver receives endpoints, all of which are unhealthy. We should add a test for it, for sure.

@dfawley
Copy link
Member Author

dfawley commented Oct 22, 2024

Where did you run into this? And what exact error were you seeing?

It's a theoretical case that I quickly modified the existing clusterresolver e2e test of TestEDS_EndpointsHealth to stimulate. The error it gives when you set all addresses to UNHEALTHY is rpc error: code = Unavailable desc = last resolver error: produced zero addresses, which is what I mention in (2) and implies (1) is likely true, but it's technially possible something else happens.

@dfawley
Copy link
Member Author

dfawley commented Oct 22, 2024

Also, an update: we may actually want to change the expected behavior of some of these things. I'll follow up here when I know move.

@dfawley
Copy link
Member Author

dfawley commented Nov 1, 2024

It looks like my understanding of our PF implementation was wrong, and that (1) is false. We will accept a zero-address update from the resolver and apply it, making future RPCs fail. This was implemented in #5274. Sorry for the mistake.

We still need to improve our error messages when a cluster has no routable endpoints, however.

@dfawley dfawley changed the title No healthy EDS addresses should result in picker errors Cluster resolver should produce better errors when there are no healthy endpoints Nov 19, 2024
@dfawley dfawley added the P2 label Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. P2 Type: Bug
Projects
None yet
Development

No branches or pull requests

4 participants