Locality Based Routing Support #1909

Open
tanujd11 opened this issue Sep 27, 2023 · 28 comments
Labels
kind/enhancement New feature or request
Milestone

Comments

@tanujd11
Member

Description:
Implement locality-based routing support by default in EG. Now that we can have individual endpoints as backends to EG, can we support region/zone/subzone-based routing using EndpointSlice information, node labels, etc.?

@tanujd11 tanujd11 added the kind/enhancement New feature or request label Sep 27, 2023
@arkodg
Contributor

arkodg commented Sep 27, 2023

Hey @tanujd11, from a user perspective, can you share what you would like to happen on the data plane (from the gateway to multiple backend endpoints with different topology info)?

@arkodg
Contributor

arkodg commented Sep 27, 2023

I understand this is very useful for optimizing east-west traffic within a cluster. Is that also the case for north-south?

@tanujd11
Member Author

I think an Envoy gateway running in us-east-1/us-east-1a should prefer backends in the same zone to prevent cross-zone traffic. I think this behaviour could be made the default, as cross-zone communication is obviously costly. WDYT?

@arkodg
Contributor

arkodg commented Sep 27, 2023

thanks, here's something more to think about

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Oct 28, 2023
@tanujd11 tanujd11 removed the stale label Oct 29, 2023
@tanujd11 tanujd11 self-assigned this Nov 2, 2023

github-actions bot commented Dec 2, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days.


This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@arkodg
Contributor

arkodg commented May 23, 2024

there's a new field in the Service spec (trafficDistribution: PreferClose) https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution that we could consider using to automate priority amongst endpoints within a Service

@github-actions github-actions bot removed the stale label May 23, 2024
@aoledk
Contributor

aoledk commented May 23, 2024

there's a new field in the Service spec (trafficDistribution: PreferClose) https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution that we could consider using to automate priority amongst endpoints within a Service

That could be an option once the new field is stable and the corresponding K8s version is widely adopted.

Before that, IMO it's better to do load balancing across endpoints in the cluster via Envoy's own capabilities.

Currently EG implements locality weighted load balancing [1]: each BackendRef is translated to one LocalityLbEndpoints.

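// Excerpt: one LocalityLbEndpoints is built per backend. The "region" is a
// synthetic per-backend name, not the zone the endpoints actually run in.
locality := &endpointv3.LocalityLbEndpoints{
	Locality: &corev3.Locality{
		Region: fmt.Sprintf("%s/backend/%d", clusterName, i),
	},
	LbEndpoints: endpoints,
	Priority:    0,
}

// Set locality weight, defaulting to 1 when ds.Weight is unset
var weight uint32
if ds.Weight != nil {
	weight = *ds.Weight
} else {
	weight = 1
}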

However, the endpoints inside one LocalityLbEndpoints may be running in different zones, so cross-zone cost can't be saved this way.
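
To make the point concrete, here is a minimal sketch of regrouping one backend's endpoints by zone, so that each group maps to a single real zone. The `endpoint` and `localityGroup` types and the `groupByZone` helper are hypothetical, simplified stand-ins for EndpointSlice entries and Envoy's LocalityLbEndpoints; they are not EG code.

```go
package main

import (
	"fmt"
	"sort"
)

// endpoint is a hypothetical, simplified stand-in for an EndpointSlice entry.
type endpoint struct {
	Address string
	Zone    string // zone reported for the endpoint; may be empty
}

// localityGroup is a simplified stand-in for Envoy's LocalityLbEndpoints.
type localityGroup struct {
	Zone      string
	Addresses []string
}

// groupByZone splits endpoints into one group per zone. Groups are returned
// sorted by zone name so the result is deterministic.
func groupByZone(eps []endpoint) []localityGroup {
	byZone := map[string][]string{}
	for _, ep := range eps {
		byZone[ep.Zone] = append(byZone[ep.Zone], ep.Address)
	}
	zones := make([]string, 0, len(byZone))
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones)

	groups := make([]localityGroup, 0, len(zones))
	for _, z := range zones {
		groups = append(groups, localityGroup{Zone: z, Addresses: byZone[z]})
	}
	return groups
}

func main() {
	groups := groupByZone([]endpoint{
		{Address: "10.0.0.1", Zone: "us-east-1a"},
		{Address: "10.0.1.1", Zone: "us-east-1b"},
		{Address: "10.0.0.2", Zone: "us-east-1a"},
	})
	for _, g := range groups {
		fmt.Println(g.Zone, g.Addresses)
	}
}
```

Endpoints without a zone end up in a group with an empty zone name; how to treat that group is a design decision this sketch leaves open.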


With Envoy's capabilities, either priority levels [2] or zone aware routing [3] [4] can achieve the goal of saving cross-zone cost.

priority levels

  1. Each backend endpoint must carry the correct zone. It can be read from the EndpointSlice, which inherits it from the Node's topology.kubernetes.io/zone label.
  2. Envoy must be started with the --service-zone command-line option, which states the zone the Envoy Pod is running in.
  3. EG rearranges the EDS resources for each Envoy: endpoints in the same zone as that Envoy get priority 0, all others priority 1.
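
Step 3 can be sketched as follows. This is only an illustration under stated assumptions: `zoneGroup` and `assignPriorities` are hypothetical names, and the fallback to priority 0 when no group is local is an added guard (as far as I know, Envoy expects priority levels to start at 0 without gaps), not something described in the steps above.

```go
package main

import "fmt"

// zoneGroup is a hypothetical stand-in for one LocalityLbEndpoints entry
// whose endpoints all run in the same zone.
type zoneGroup struct {
	Zone     string
	Priority uint32
}

// assignPriorities returns a copy of groups in which groups located in
// proxyZone get priority 0 and all other groups get priority 1. If proxyZone
// is unknown or no group is local, every group stays at priority 0 so that
// the priority levels never start above 0.
func assignPriorities(proxyZone string, groups []zoneGroup) []zoneGroup {
	hasLocal := false
	for _, g := range groups {
		if proxyZone != "" && g.Zone == proxyZone {
			hasLocal = true
			break
		}
	}

	out := make([]zoneGroup, len(groups))
	for i, g := range groups {
		g.Priority = 0
		if hasLocal && g.Zone != proxyZone {
			g.Priority = 1
		}
		out[i] = g
	}
	return out
}

func main() {
	groups := []zoneGroup{{Zone: "us-east-1a"}, {Zone: "us-east-1b"}}
	for _, g := range assignPriorities("us-east-1a", groups) {
		fmt.Printf("%s -> priority %d\n", g.Zone, g.Priority)
	}
}
```

Because the result depends on the proxy's zone, the same backend yields a different EDS payload for Envoy Pods in different zones.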

zone aware routing

This approach is mutually exclusive with locality weighted load balancing, since in the case of locality aware LB, we rely on the management server to provide the locality weighting, rather than the Envoy-side heuristics used in zone aware routing.

  1. Each backend endpoint must carry the correct zone. It can be read from the EndpointSlice, which inherits it from the Node's topology.kubernetes.io/zone label.
  2. Envoy must be started with the --service-zone command-line option, which states the zone the Envoy Pod is running in.
  3. Envoy's bootstrap config must set cluster_manager.local_cluster_name, which names the fleet the Envoy Pod belongs to. In the implementation it would be the irKey.
  4. Add the cluster named by cluster_manager.local_cluster_name to the CDS resources.
  5. Design a mechanism to discover the Envoy Pods that belong to cluster_manager.local_cluster_name as endpoints, and add them to the EDS resources.
  6. Neither the Envoy (local) cluster nor the backend cluster may be in panic mode [5].

personal preference

Steps 1 and 2 are required by both approaches. Priority levels can work alongside the locality weighted load balancing that is already implemented, but zone aware routing can't. Priority levels also appear easier to implement. They do, however, require that EDS resources be arranged per individual Envoy in the xds/cache module, whether EG does this itself or exposes a new xDS hook API, such as PostEndpointModify(ClusterLoadAssignment, Node), that lets an extension server do it.

Footnotes

  1. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/locality_weight

  2. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/priority

  3. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/zone_aware

  4. https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing

  5. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/panic_threshold#arch-overview-load-balancing-panic-threshold

@arkodg
Contributor

arkodg commented May 23, 2024

thanks for outlining the steps @aoledk ! we currently have #3055 open to get explicit priority per backendRef and program that into the xds cluster resource.

In the future, we can use this issue to make sure we track the auto priority work, the field in k8s preferClose could be the knob for users to say they want to opt in to this feature

@guydc
Contributor

guydc commented Jun 6, 2024

Hi @aoledk, regarding:

priority levels
[...]
EG rearranges EDS resources for each Envoy, if Envoy and Backend endpoint are in same zone, priority as 0, else 1.

Is this option viable? Can our XDS server produce different EDS for different envoy pods that are part of the same Envoy deployment?

@modatwork

I think it's possible. The xDS server can read the locality info of the Envoy node.

The cache will be keyed based on a pre-defined hash function whose keys are based on the Node information.

// Identifies a specific Envoy instance. Remote server may have per Envoy configuration.
message Node {
  // An opaque node identifier for the Envoy node. This must be set.
  string id = 1;
  // The cluster that the Envoy node belongs to. This must be set.
  string cluster = 2;
  google.protobuf.Struct metadata = 3;
  Locality locality = 4;
  // This is motivated by informing a management server during canary which
  // version of Envoy is being tested in a heterogeneous fleet.
  string build_version = 5;
}
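
As a rough illustration of keying the cache on Node information, the sketch below derives a cache key from the node's cluster and zone, so that Envoy Pods of the same fleet in the same zone share one snapshot. The `node` and `locality` structs mirror only the fields of the message above that the sketch needs, and `snapshotKey` is a hypothetical helper, not an existing EG or go-control-plane function.

```go
package main

import "fmt"

// locality mirrors the zone-related fields of Envoy's Locality message.
type locality struct {
	Region, Zone, SubZone string
}

// node mirrors only the fields of the Node message used in this sketch.
type node struct {
	ID       string
	Cluster  string
	Locality *locality
}

// snapshotKey returns a cache key made of the node's cluster and zone.
// Nodes without locality information share the key "<cluster>/".
func snapshotKey(n *node) string {
	if n == nil {
		return ""
	}
	zone := ""
	if n.Locality != nil {
		zone = n.Locality.Zone
	}
	return n.Cluster + "/" + zone
}

func main() {
	a := &node{ID: "envoy-1", Cluster: "gw", Locality: &locality{Zone: "us-east-1a"}}
	b := &node{ID: "envoy-2", Cluster: "gw", Locality: &locality{Zone: "us-east-1b"}}
	fmt.Println(snapshotKey(a))
	fmt.Println(snapshotKey(b))
}
```

Keying on zone rather than on node ID keeps the number of snapshots bounded by the number of zones instead of the number of Envoy Pods.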

@guydc
Contributor

guydc commented Jun 7, 2024

Thanks for pointing that out @modatwork. My other concerns wrt. this approach are:

  • Conflict with general-purpose use cases of priorities (e.g. to support things like active/passive failover)
  • Possible impact on memory consumption if we have to maintain a copy of the cache for each locality. Not sure if that's already the situation today. @arkodg - do you know?

In general:

  • I'm +1 to supporting zone-aware routing in EG.
  • I would avoid using priorities for this feature in EG's built-in feature set.
  • The extension-server approach to EP priority manipulation could work. If we don't have a per-locality cache, maybe that should be an opt-in feature.

Is there a reason to prefer the Priority-based approach? I'm not sure that it's significantly simpler than enabling zone-aware routing.

@arkodg
Contributor

arkodg commented Jun 7, 2024

is @modatwork the same person as @aoledk :) ?

Possible impact on memory consumption if we have to maintain a copy of the cache for each locality. Not sure if that's already the situation today. @arkodg - do you know?

@guydc we are demuxing on gateway/IR; with locality it would add another lookup dimension and would increase memory by num localities * total (xds per gateway * gateway resources)
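
The multiplication above can be written out as a small sketch. It assumes one reading of the formula, in which "gateway resources" means the number of gateways; `estimateCachedResources` and all the counts are hypothetical and are not measurements from EG.

```go
package main

import "fmt"

// estimateCachedResources applies the formula
// localities * (xDS resources per gateway * gateways).
// All inputs are hypothetical counts chosen for illustration.
func estimateCachedResources(localities, xdsPerGateway, gateways int) int {
	return localities * xdsPerGateway * gateways
}

func main() {
	without := estimateCachedResources(1, 200, 10)
	with := estimateCachedResources(3, 200, 10)
	fmt.Printf("single cache: %d resources, three zones: %d resources\n", without, with)
}
```

Under this reading the cache grows linearly with the number of localities, which is the memory concern raised in the thread.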

@aoledk
Contributor

aoledk commented Jun 8, 2024

@arkodg I work together with @modatwork


github-actions bot commented Jul 8, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Jul 8, 2024
@arkodg arkodg removed the stale label Jul 31, 2024
@arkodg arkodg assigned aoledk and unassigned tanujd11 Jul 31, 2024
@arkodg arkodg modified the milestones: Backlog, v1.2.0-rc1 Jul 31, 2024
@arkodg
Contributor

arkodg commented Jul 31, 2024

hey @aoledk , adding this issue to the v1.2 milestone, is this something you can help with ?

  1. Let's configure zone aware routing in Envoy by default: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing
  2. If a Service has trafficDistribution set to PreferClose (https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution), let's rearrange the EDS endpoints (so the Service opts in)
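
The opt-in check in item 2 could look roughly like this. The sketch models the Service spec with a local `serviceSpec` struct instead of importing the Kubernetes API types; the struct and `wantsZonePreference` are hypothetical names, while the optional string field and the `PreferClose` value follow the Kubernetes documentation linked above.

```go
package main

import "fmt"

// preferClose is the trafficDistribution value that opts a Service in.
const preferClose = "PreferClose"

// serviceSpec is a hypothetical stand-in for the part of the Kubernetes
// Service spec that matters here. The field is optional, hence the pointer.
type serviceSpec struct {
	TrafficDistribution *string
}

// wantsZonePreference reports whether the Service opted in to preferring
// topologically close endpoints.
func wantsZonePreference(spec serviceSpec) bool {
	return spec.TrafficDistribution != nil && *spec.TrafficDistribution == preferClose
}

func main() {
	value := preferClose
	optedIn := serviceSpec{TrafficDistribution: &value}
	notSet := serviceSpec{}
	fmt.Println(wantsZonePreference(optedIn), wantsZonePreference(notSet))
}
```

Treating an unset field as "no preference" keeps the current behaviour for existing Services, so only Services that set the field explicitly would have their endpoints rearranged.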

@aoledk
Contributor

aoledk commented Aug 1, 2024

hey @aoledk , adding this issue to the v1.2 milestone, is this something you can help with ?

  1. Let's configure zone aware routing in Envoy by default: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing
  2. If a Service has trafficDistribution set to PreferClose (https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution), let's rearrange the EDS endpoints (so the Service opts in)

@arkodg I can help.

@arkodg
Contributor

arkodg commented Aug 1, 2024

awesome thanks @aoledk !


This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Aug 31, 2024
@arkodg
Contributor

arkodg commented Sep 19, 2024

hey @aoledk still planning on working on this one for v1.2 ?

@arkodg arkodg removed the stale label Sep 19, 2024
@aoledk
Contributor

aoledk commented Sep 20, 2024

hey @aoledk still planning on working on this one for v1.2 ?

Hi @arkodg, these days I'm working on bringing in EG v1.1. Next month I will continue on this feature, but I'm not sure whether it can be merged into v1.2 (due by October 30, 2024); maybe v1.3.

@arkodg
Contributor

arkodg commented Sep 20, 2024

thanks for the update @aoledk, let us know if you hit any issues while running EG v1.1
moving this issue into backlog

@arkodg arkodg modified the milestones: v1.2.0-rc1, Backlog Sep 20, 2024
@aoledk
Contributor

aoledk commented Sep 23, 2024

@arkodg LGTM.


This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Oct 23, 2024
@flyik

flyik commented Dec 4, 2024

Hi @aoledk! Are you still looking into implementing it yourself? If not, I’m interested in this feature and can work on bringing it to life.

@aoledk
Contributor

aoledk commented Dec 4, 2024

@flyik recently I've been busy bringing in EG, so you can go ahead.

@aoledk aoledk removed their assignment Dec 4, 2024
@aoledk
Contributor

aoledk commented Dec 4, 2024

@flyik I've unassigned myself, so you can assign it to yourself.

@github-actions github-actions bot removed the stale label Dec 5, 2024

7 participants