
Not able to proceed with consolidation on big cluster #1970

Open
Hawksbk opened this issue Feb 6, 2025 · 6 comments
Labels
  • performance: Issues relating to performance (memory usage, cpu usage, timing)
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments


Hawksbk commented Feb 6, 2025

Hey, we are facing an issue where Karpenter is not handling consolidation well on a big cluster (~700 nodes):

[screenshot]

This behavior coincided with consolidation timeouts:

[screenshot]

Based on where the timeout comes from: https://github.com/kubernetes-sigs/karpenter/blob/8d819c1ddaea095d4fa9abb53b20b536c0c45802/pkg/controllers/disruption/singlenodeconsolidation.go#L53-L52

It looks like Karpenter is not able to work through a large queue of disruption candidates fast enough.
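
For illustration, here is a minimal sketch of the pattern that link points at: a hard per-pass deadline wrapped around the loop over disruption candidates. The identifiers and the 3-minute value are assumptions for the sketch, not copied from the Karpenter source; the point is that with ~700 nodes' worth of candidates and an expensive per-candidate simulation, a pass can run out of time before it gets through the queue.

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// Assumed timeout value for the sketch; the real constant lives in the linked file.
const singleNodeConsolidationTimeout = 3 * time.Minute

type candidate struct{ name string }

// simulateRemoval stands in for the expensive per-candidate scheduling simulation.
func simulateRemoval(ctx context.Context, c candidate) bool {
	time.Sleep(500 * time.Millisecond) // placeholder cost
	return false
}

func computeCommand(ctx context.Context, candidates []candidate) error {
	deadline := time.Now().Add(singleNodeConsolidationTimeout)
	for i, c := range candidates {
		if time.Now().After(deadline) {
			// Only the first i candidates were considered in this pass.
			return fmt.Errorf("abandoning single-node consolidation: timed out after %d of %d candidates", i, len(candidates))
		}
		_ = simulateRemoval(ctx, c)
	}
	return nil
}
```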

What also bothers me is that Karpenter runs on dedicated nodes with enough compute (GOMAXPROCS correctly picks up all 8 cores).

[screenshot]

But it doesn't seem to parallelize the work much, since CPU usage sits steadily at about 3 cores and never goes higher:

[screenshot]
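
(For reference, the values the Go runtime actually sees inside the controller pod can be confirmed with a generic snippet like the one below; this is plain Go, not Karpenter code.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it;
	// NumCPU reports how many CPUs the OS exposes to the process.
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", runtime.GOMAXPROCS(0), runtime.NumCPU())
}
```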


We temporarily mitigated this by moving Karpenter to nodes with a faster CPU (c7g -> c8g) and by limiting NodePools to bigger instance types to reduce the overall number of hosts, but that is only a stopgap.


So, a couple of questions:

  • Is it possible to improve Karpenter's performance here, for example through more parallelism, or is there anything else that can be done in such a situation?
  • Are there any recommendations for understanding how close we are to the red line?
@k8s-ci-robot added the needs-triage label Feb 6, 2025
@nantiferov

I think that if proposal #1733 is implemented, it might help with cases like this.


Hawksbk commented Feb 6, 2025

Just in case:

EKS version:

  • v1.31.4-eks-2d5f260
  • public.ecr.aws/karpenter/controller:1.1.1@sha256:fe383abf1dbc79f164d1cbcfd8edaaf7ce97a43fbd6cb70176011ff99ce57523

@jonathan-innis

The other thing here is improving scheduling in general so that it is parallelized a bit and we can use more cores. I'd agree that the timeout is probably something we should either consider parameterizing or just remove entirely.
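
As a rough illustration of that direction, the sketch below fans per-candidate evaluation out across up to GOMAXPROCS workers instead of walking the queue serially. The names are hypothetical, and the real scheduling simulation shares cluster state, so an actual change would need more care than this:

```go
package sketch

import (
	"runtime"
	"sync"
)

type candidate struct{ name string }

// evaluate stands in for the per-candidate scheduling simulation.
func evaluate(c candidate) bool { return false }

// evaluateInParallel bounds concurrency to the available cores with a
// semaphore channel and collects one result per candidate.
func evaluateInParallel(candidates []candidate) []bool {
	results := make([]bool, len(candidates))
	sem := make(chan struct{}, runtime.GOMAXPROCS(0))
	var wg sync.WaitGroup
	for i, c := range candidates {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, c candidate) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = evaluate(c)
		}(i, c)
	}
	wg.Wait()
	return results
}
```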

@jonathan-innis added the consolidation and performance labels Feb 6, 2025
@jonathan-innis

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Feb 6, 2025
@jonathan-innis

cc: @rschalo

I know that he has some thoughts on how we might be able to make this better.

@jonathan-innis

/priority important-longterm

@k8s-ci-robot added the priority/important-longterm label Feb 13, 2025