Cluster resolver should produce better errors when there are no healthy endpoints #7749

dfawley · 2024-10-16T18:11:47Z

EDIT: See #7749 (comment) -- (1) below is not correct. (2) should still be fixed.

If there are no healthy addresses, clutserresolver is just passing an empty address list to the child LB policy:

grpc-go/xds/internal/balancer/clusterresolver/configbuilder.go

Lines 258 to 290 in 4544b8a

    
           	for _, endpoint := range locality.Endpoints { 
        
           		// Filter out all "unhealthy" endpoints (unknown and healthy are 
        
           		// both considered to be healthy: 
        
           		// https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/health_check.proto#envoy-api-enum-core-healthstatus). 
        
           		if endpoint.HealthStatus != xdsresource.EndpointHealthStatusHealthy && endpoint.HealthStatus != xdsresource.EndpointHealthStatusUnknown { 
        
           			continue 
        
           		} 
        
           		addr := resolver.Address{Addr: endpoint.Address} 
        
           		addr = hierarchy.Set(addr, []string{priorityName, localityStr}) 
        
           		addr = internal.SetLocalityID(addr, locality.ID) 
        
           		// "To provide the xds_wrr_locality load balancer information about 
        
           		// locality weights received from EDS, the cluster resolver will 
        
           		// populate a new locality weight attribute for each address The 
        
           		// attribute will have the weight (as an integer) of the locality 
        
           		// the address is part of." - A52 
        
           		addr = wrrlocality.SetAddrInfo(addr, wrrlocality.AddrInfo{LocalityWeight: lw}) 
        
           		var ew uint32 = 1 
        
           		if endpoint.Weight != 0 { 
        
           			ew = endpoint.Weight 
        
           		} 
        
           		addr = weightedroundrobin.SetAddrInfo(addr, weightedroundrobin.AddrInfo{Weight: lw * ew}) 
        
           		addrs = append(addrs, addr) 
        
           	} 
        
           } 
        
           return &clusterimpl.LBConfig{ 
        
           	Cluster:               mechanism.Cluster, 
        
           	EDSServiceName:        mechanism.EDSServiceName, 
        
           	LoadReportingServer:   mechanism.LoadReportingServer, 
        
           	MaxConcurrentRequests: mechanism.MaxConcurrentRequests, 
        
           	TelemetryLabels:       mechanism.TelemetryLabels, 
        
           	DropCategories:        drops, 
        
           	ChildPolicy:           xdsLBPolicy, 
        
           }, addrs, nil

This means two things:

The child policy will just ignore this update if it previously had healthy addresses, and
The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".

Both of these should be considered bugs.

cc @markdroth @ejona86

markdroth · 2024-10-16T19:19:40Z

The child policy will just ignore this update if it previously had healthy addresses

Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.

The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".

Yeah, the more useful we can make the error message, the better. I did a bunch of work for cases like this in C-core as part of grpc/grpc#22883. The way we're handling this is:

The resolver API allows the resolver to return a resolution_note string that provides human-readable context that can be used in cases where it returns a valid address list and service config that the LB policy may not like in combination (e.g., if the address list is empty).
For EDS, we set the resolution_note to indicate when there's a problem. For example, we set if when the EDS resource contains no localities or when we get an error for the EDS resource.
All of our parent policies propagate this field through updates to children as appropriate. (Prior to implementing gRFC A75, there was a case where we needed to concatenate resolution_note strings from different clusters for aggregate clusters, but that's no longer necessary after the changes from that design.)
In both pick_first and our petiole policies (e.g., round_robin), if we get a valid but empty address list, we include the resolution_note in the error message we return.

dfawley · 2024-10-16T19:27:31Z

Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.

PF will return a non-OK result and that will trigger a re-resolution, but the point here is that the policy will keep using the old addresses, which is the defined behavior of PF.

dfawley · 2024-10-16T19:31:37Z

The way we're handling this is:

I was expecting this needs to be handled by some custom logic in clusterresolver that detects when there are no addresses and substitutes the child picker for another one with a "no endpoints" error.

If you're pushing the EDS resolver update to the child policy, then how do you avoid the same situation where we keep using old addresses when zero address errors are encountered?

dfawley · 2024-10-16T19:47:41Z

I think the child policy should report TF

Oops, I guess I misread this part the first time. I don't think the child policy is supposed to report TF in these cases, right? IIUC, PF is supposed to treat a resolver result of 0 addresses the same as a resolver error, which we ignore if we were already READY.

markdroth · 2024-10-16T21:11:18Z

If you remember back when we agreed on the "always trust the control plane" principle (internally, see go/grpc-client-channel-principles-revamp), we decided that if the control plane sends us zero addresses, then we need to honor that immediately. Our currently agreed behavior of record is that PF should report TF in this case. It should not ignore the update if it was already READY -- we do that only if we can an error (as opposed to a valid but empty address list).

As you know, I'm not a big fan of the "always trust the control plane" principle, and I would still like us to reevaluate whether that's the right thing, especially now that we have more o11y infrastructure in place in OSS. But until we do that, the above is the behavior that we should be providing in all languages.

easwars · 2024-10-21T16:53:42Z

The child policy will just ignore this update if it previously had healthy addresses, and

clusterresolver's child policy is priority, and if priority receives an update with no addresses, it should error saying no priority is provided, all priorities are removed.

Is that not what are you are seeing?

Where did you run into this? And what exact error were you seeing?

I did check our codebase, and looks like we do not have an e2e style test for a scenario where the clusterresolver receives endpoints, all of which are unhealthy. We should add a test for it, for sure.

dfawley · 2024-10-22T15:44:36Z

Where did you run into this? And what exact error were you seeing?

It's a theoretical case that I quickly modified the existing clusterresolver e2e test of TestEDS_EndpointsHealth to stimulate. The error it gives when you set all addresses to UNHEALTHY is rpc error: code = Unavailable desc = last resolver error: produced zero addresses, which is what I mention in (2) and implies (1) is likely true, but it's technially possible something else happens.

dfawley · 2024-10-22T15:45:23Z

Also, an update: we may actually want to change the expected behavior of some of these things. I'll follow up here when I know move.

dfawley · 2024-11-01T22:54:20Z

It looks like my understanding of our PF implementation was wrong, and that (1) is false. We will accept a zero-address update from the resolver and apply it, making future RPCs fail. This was implemented in #5274. Sorry for the mistake.

We still need to improve our error messages when a cluster has no routable endpoints, however.

dfawley added the Type: Bug label Oct 16, 2024

purnesh42H added the Area: xDS Includes everything xDS related, including LB policies used with xDS. label Oct 21, 2024

dfawley changed the title ~~No healthy EDS addresses should result in picker errors~~ Cluster resolver should produce better errors when there are no healthy endpoints Nov 19, 2024

dfawley added the P2 label Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cluster resolver should produce better errors when there are no healthy endpoints #7749

Cluster resolver should produce better errors when there are no healthy endpoints #7749

dfawley commented Oct 16, 2024 •

edited

Loading

markdroth commented Oct 16, 2024 •

edited

Loading

dfawley commented Oct 16, 2024

dfawley commented Oct 16, 2024

dfawley commented Oct 16, 2024

markdroth commented Oct 16, 2024

easwars commented Oct 21, 2024

dfawley commented Oct 22, 2024

dfawley commented Oct 22, 2024

dfawley commented Nov 1, 2024 •

edited

Loading

Cluster resolver should produce better errors when there are no healthy endpoints #7749

Cluster resolver should produce better errors when there are no healthy endpoints #7749

Comments

dfawley commented Oct 16, 2024 • edited Loading

markdroth commented Oct 16, 2024 • edited Loading

dfawley commented Oct 16, 2024

dfawley commented Oct 16, 2024

dfawley commented Oct 16, 2024

markdroth commented Oct 16, 2024

easwars commented Oct 21, 2024

dfawley commented Oct 22, 2024

dfawley commented Oct 22, 2024

dfawley commented Nov 1, 2024 • edited Loading

dfawley commented Oct 16, 2024 •

edited

Loading

markdroth commented Oct 16, 2024 •

edited

Loading

dfawley commented Nov 1, 2024 •

edited

Loading