Not able to proceed consolidation on big cluster #1970
Labels: performance, priority/important-longterm, triage/accepted
Hey, we are facing an issue where Karpenter struggles with consolidation on a big cluster (~700 nodes):
This behavior came together with consolidation timeouts:
Based on where the timeout comes from: https://github.com/kubernetes-sigs/karpenter/blob/8d819c1ddaea095d4fa9abb53b20b536c0c45802/pkg/controllers/disruption/singlenodeconsolidation.go#L53-L52
It looks like Karpenter is not able to proceed fast enough when there is a big queue of disruption candidates.
What also bothers me is that Karpenter runs on dedicated nodes with enough compute (GOMAXPROCS correctly picked up 8 cores).
But it doesn't seem to parallelize the work much, since usage stays quite stable at about 3 cores and does not go higher:
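For anyone who wants to double-check the GOMAXPROCS side, a small standalone program like the one below prints what the Go runtime sees. `schedulerLimits` is just a helper name for this example. Note that GOMAXPROCS is only an upper bound on the OS threads running Go code at the same time, so a mostly sequential loop will stay well below it no matter how many cores are available.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerLimits returns the number of logical CPUs usable by the process
// and the current GOMAXPROCS value. Calling runtime.GOMAXPROCS with 0 only
// queries the setting and does not change it.
func schedulerLimits() (numCPU, maxProcs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	numCPU, maxProcs := schedulerLimits()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", numCPU, maxProcs)
}
```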
We mitigated it for now by moving Karpenter to nodes with faster CPUs (c7g -> c8g) and by limiting nodepools to bigger hosts to reduce the overall number of hosts, but this is only a temporary solution.
So the following questions: