
Not able to proceed with consolidation on big cluster #1970

Open
Hawksbk opened this issue Feb 6, 2025 · 6 comments
Labels
  • performance: Issues relating to performance (memory usage, cpu usage, timing)
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments


Hawksbk commented Feb 6, 2025

Hey, we are facing an issue where Karpenter is not handling consolidation well on a big cluster (~700 nodes):

[screenshot]

This behavior coincided with consolidation timeouts:

[screenshot]

Based on where the timeout comes from: https://github.com/kubernetes-sigs/karpenter/blob/8d819c1ddaea095d4fa9abb53b20b536c0c45802/pkg/controllers/disruption/singlenodeconsolidation.go#L53-L52

It looks like Karpenter is not able to work through a large queue of disruption candidates fast enough.
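
For illustration, here is a minimal sketch of the pattern that link points at: a hard per-pass deadline wrapped around the loop over disruption candidates. The identifiers and the 3-minute value are assumptions for the sketch, not copied from the Karpenter source; the point is that with ~700 nodes' worth of candidates and an expensive per-candidate simulation, a pass can run out of time before it gets through the queue.

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// Assumed timeout value for the sketch; the real constant lives in the linked file.
const singleNodeConsolidationTimeout = 3 * time.Minute

type candidate struct{ name string }

// simulateRemoval stands in for the expensive per-candidate scheduling simulation.
func simulateRemoval(ctx context.Context, c candidate) bool {
	time.Sleep(500 * time.Millisecond) // placeholder cost
	return false
}

func computeCommand(ctx context.Context, candidates []candidate) error {
	deadline := time.Now().Add(singleNodeConsolidationTimeout)
	for i, c := range candidates {
		if time.Now().After(deadline) {
			// Only the first i candidates were considered in this pass.
			return fmt.Errorf("abandoning single-node consolidation: timed out after %d of %d candidates", i, len(candidates))
		}
		_ = simulateRemoval(ctx, c)
	}
	return nil
}
```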

What also bothers me is that Karpenter runs on dedicated nodes with enough compute (GOMAXPROCS correctly picks up all 8 cores).

[screenshot]

But it doesn't seem to parallelize the work much, since CPU usage sits steadily at about 3 cores and never goes higher:

[screenshot]
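
(For reference, the values the Go runtime actually sees inside the controller pod can be confirmed with a generic snippet like the one below; this is plain Go, not Karpenter code.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it;
	// NumCPU reports how many CPUs the OS exposes to the process.
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", runtime.GOMAXPROCS(0), runtime.NumCPU())
}
```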


We temporarily mitigated this by moving Karpenter to nodes with a faster CPU (c7g -> c8g) and by limiting NodePools to bigger instance types to reduce the overall number of hosts, but that is only a stopgap.


So, a couple of questions:

  • Is it possible to improve Karpenter's performance here, for example through more parallelism, or is there anything else that can be done in such a situation?
  • Are there any recommendations for understanding how close we are to the red line?
@k8s-ci-robot added the needs-triage label Feb 6, 2025
@nantiferov

I think that if proposal #1733 is implemented, it might help with cases like this.


Hawksbk commented Feb 6, 2025

Just in case:

EKS version:

  • v1.31.4-eks-2d5f260
  • public.ecr.aws/karpenter/controller:1.1.1@sha256:fe383abf1dbc79f164d1cbcfd8edaaf7ce97a43fbd6cb70176011ff99ce57523

@jonathan-innis

The other thing here is improving scheduling in general so that it is parallelized a bit and we can use more cores. I'd agree that the timeout is probably something we should either consider parameterizing or just remove entirely.
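
As a rough illustration of that direction, the sketch below fans per-candidate evaluation out across up to GOMAXPROCS workers instead of walking the queue serially. The names are hypothetical, and the real scheduling simulation shares cluster state, so an actual change would need more care than this:

```go
package sketch

import (
	"runtime"
	"sync"
)

type candidate struct{ name string }

// evaluate stands in for the per-candidate scheduling simulation.
func evaluate(c candidate) bool { return false }

// evaluateInParallel bounds concurrency to the available cores with a
// semaphore channel and collects one result per candidate.
func evaluateInParallel(candidates []candidate) []bool {
	results := make([]bool, len(candidates))
	sem := make(chan struct{}, runtime.GOMAXPROCS(0))
	var wg sync.WaitGroup
	for i, c := range candidates {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, c candidate) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = evaluate(c)
		}(i, c)
	}
	wg.Wait()
	return results
}
```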

@jonathan-innis added the consolidation and performance labels Feb 6, 2025
@jonathan-innis

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Feb 6, 2025
@jonathan-innis

cc: @rschalo

I know that he has some thoughts on how we might be able to make this better.

@jonathan-innis

/priority important-longterm

@k8s-ci-robot added the priority/important-longterm label Feb 13, 2025