Karpenter cannot provision node for application when use Volcano #4030

tieungao88 · 2025-02-21T02:29:50Z

Hi everyone,
I have a problem: Karpenter cannot provision a node for my application when I use Volcano.
Details:
Volcano version: volcano-1.11.0
EKS version: 1.30.0

I deployed a Deployment. At startup, Karpenter worked and provisioned one node for me. After 2 minutes, I scaled the Deployment from one replica to two while the node's CPU was at ~100%. The new Pod then stays Pending indefinitely.
PodGroup and Deployment:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: cpu-hog-group
  namespace: default
  annotations:
    scheduling.volcano.sh/pod-group-type: "deployment"
spec:
  minMember: 1
  queue: default
  # priorityClassName: high-priority  # Optional
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-hog-active-01
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-hog-active-01
  template:
    metadata:
      labels:
        app: cpu-hog-active-01
      annotations:
        scheduling.k8s.io/group-name: cpu-hog-group
        scheduling.volcano.sh/pod-group-type: "deployment"
    spec:
      schedulerName: volcano    # Use the Volcano scheduler
      # priorityClassName: high-priority    # Optional
      nodeSelector:
        workload-type: app-schplugin
      tolerations:
        - key: workload-type
          operator: Equal
          value: app-schplugin
          effect: NoSchedule
      containers:
      - name: cpu-hog-active-01
        image: busybox
        resources:
          requests:
            cpu: "100m"
          # limits:
          #   cpu: "1500m"
        command: ["/bin/sh", "-c"]
        args:
        - |
          N=$(nproc)
          for i in $(seq 1 $N); do
            yes > /dev/null &
          done
          wait

volcano-scheduler-configmap:

actions: "enqueue, allocate, backfill"  
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: usage  # usage based scheduling plugin
        enablePredicate: true  # If the value is false, new pod scheduling is not disabled when the node load reaches the threshold. If the value is true or left blank, new pod scheduling is disabled.
        arguments:
          usage.weight: 5
          cpu.weight: 1
          memory.weight: 1
          thresholds:
            cpu: 80    # The actual CPU load of a node reaches 80%, and the node cannot schedule new pods.
            prometheusMetrics:
              - name: "cpu_usage"
                query: "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
                step: 5
  - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
metrics:                               # metrics server related configuration
  type: prometheus                     # Optional, The metrics source type, prometheus by default, support "prometheus", "prometheus_adaptor" and "elasticsearch"
  address: http://dev-ext-prometheus-kube-pr-prometheus.prometheus:9090    # Mandatory, The metrics source address
  interval: 30s                        # Optional, The scheduler pull metrics from Prometheus with this interval, 30s by default

Please help me!

Monokaix · 2025-02-21T06:38:44Z

Hi, have you reported this to the Karpenter community or AWS customer service?

Monokaix · 2025-02-21T08:50:02Z

It seems Karpenter does not support the Volcano scheduler yet.

Vacant2333 · 2025-02-21T08:52:27Z

@tieungao88 Karpenter provisions nodes based on the resource requests of Pods. In other words, it only adds nodes to your cluster when there are Pods pending due to insufficient resources. However, your issue is that you haven't set a CPU limit, and the request value is too small. Karpenter does not scale nodes based on actual resource usage.

tieungao88 · 2025-02-22T04:18:30Z

Hi @Vacant2333 ,

I scaled the next pod when the CPU of the current node was at 100% (2vcpu).

First time:
I tried to adjust the CPU request to 500m. At this level, the system did not trigger any event for Karpenter to scale the node, but only generated this event:

I0222 04:01:41.172368       1 predicate_helper.go:81] Predicates failed: task default/cpu-hog-active-01-585b44bcb7-h677f on node ip-10-26-5-108.ap-southeast-1.compute.internal fit failed: the CPU load of the node exceeds the upper limit.

Second time:
I adjusted the CPU request to 900m. At this level, the system triggered the event Insufficient cpu => Karpenter provisioned a new node:

I0222 04:05:41.591894       1 predicate_helper.go:81] Predicates failed: task default/cpu-hog-active-01-7bf5f5c4d7-dmmm5 on node ip-10-26-18-179.ap-southeast-1.compute.internal fit failed: Insufficient cpu

I do not understand why there is this difference. Why was the Insufficient cpu event not triggered in the first case?

Thanks.

Vacant2333 · 2025-02-24T02:34:07Z

@tieungao88
Karpenter scales capacity based on how much of a node's resources is allocated through Pod requests, not on the node's actual usage. Set your requests and limits to reasonable values so that Karpenter can provision new nodes.

Monokaix · 2025-02-24T03:19:41Z

You have enabled the usage plugin in Volcano, which schedules pods based on actual node load. Your Deployment consumes a lot of CPU because it runs a busy loop with no CPU limit, so the new pod fails to be scheduled. Karpenter, however, does not know about Volcano's usage plugin; it only scales nodes when the CPU requests cannot be satisfied, not when the load is high.
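
The difference between the two log lines reported earlier in the thread can be sketched in a few lines of Python. This is only an illustrative model, not Volcano's or Karpenter's actual code: the class and function names are hypothetical, and the 1930m allocatable CPU for a 2 vCPU node and the 200m already requested by system pods are assumed figures. The order of the two checks is also an assumption, chosen to match the logs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_cpu_m: int  # CPU available to pod requests, in millicores (assumed figure)
    requested_cpu_m: int    # sum of CPU requests of pods already on the node (assumed figure)
    cpu_usage_pct: float    # measured CPU load, as the usage plugin would read it from Prometheus


def fits_by_requests(node: Node, pod_request_m: int) -> bool:
    """Request-based fit: the signal a request-driven autoscaler can reason about."""
    return node.requested_cpu_m + pod_request_m <= node.allocatable_cpu_m


def passes_usage_threshold(node: Node, threshold_pct: float = 80.0) -> bool:
    """Load-based check modelled on the usage plugin's cpu threshold (80 in the ConfigMap)."""
    return node.cpu_usage_pct < threshold_pct


def explain(node: Node, pod_request_m: int) -> str:
    """Return the reason the second replica is rejected on the existing node, if any."""
    if not fits_by_requests(node, pod_request_m):
        return "Insufficient cpu"            # visible to a request-based autoscaler
    if not passes_usage_threshold(node):
        return "CPU load exceeds threshold"  # visible only to the Volcano usage plugin
    return "schedulable"


SYSTEM_REQUESTS_M = 200  # hypothetical requests from daemonsets and system pods

# First attempt: both replicas request 500m; the node is at 100% load.
case_500 = explain(Node(1930, SYSTEM_REQUESTS_M + 500, 100.0), 500)

# Second attempt: both replicas request 900m; the node is still at 100% load.
case_900 = explain(Node(1930, SYSTEM_REQUESTS_M + 900, 100.0), 900)
```

Under these assumptions, the 500m case still fits by requests (700m + 500m is below 1930m), so only the load check rejects the pod and a request-based autoscaler sees no shortage. The 900m case exceeds the allocatable CPU (1100m + 900m), which is the "Insufficient cpu" condition that led Karpenter to add a node.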

tieungao88 · 2025-02-24T04:19:50Z

Hi,
I want to set it up so that if a Node has 500m CPU left, it won't allow scheduling.

Thanks.

Monokaix · 2025-03-04T08:03:42Z

> Hi, I want to set it up so that if a Node has 500m CPU left, it won't allow scheduling.
>
> Thanks.

As said before, you can construct a case in which the new pod fails with insufficient CPU, so that Karpenter becomes aware of it and can scale nodes.
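
One way to construct such a case is to pick a CPU request large enough that a second replica cannot fit on the same node by requests alone. Below is a rough sketch of the arithmetic. The function name is hypothetical, and the example figures (1930m allocatable, 200m already requested by system pods) are assumptions that should be replaced with the values reported by `kubectl describe node`.

```python
def min_request_for_one_pod_per_node(allocatable_cpu_m: int, other_requests_m: int) -> int:
    """Smallest CPU request (millicores) such that two replicas cannot share one node."""
    free_m = allocatable_cpu_m - other_requests_m
    if free_m <= 0:
        raise ValueError("node has no CPU left for application pods")
    # Two pods fit when 2 * r <= free_m, so the smallest r that prevents
    # this is floor(free_m / 2) + 1. A single pod with that request still fits.
    return free_m // 2 + 1


# Hypothetical 2 vCPU node: 1930m allocatable, 200m requested by system pods.
example_request_m = min_request_for_one_pod_per_node(1930, 200)
```

With these assumed figures the result is 866m, which is consistent with what was observed in the thread: a 500m request let the second pod fit by requests, while a 900m request produced "Insufficient cpu".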

Monokaix · 2025-03-05T06:37:25Z

Hi, the Karpenter community is willing to solve gang-related issues and support custom schedulers. Feel free to give feedback to the Karpenter community to help make progress! kubernetes-sigs/karpenter#742 (comment)

Monokaix mentioned this issue Mar 7, 2025

Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano kubernetes-sigs/karpenter#742

Open
