
Capacity plugin does not reclaim #3990

Open
qGentry opened this issue Jan 30, 2025 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

qGentry commented Jan 30, 2025

Description

Hi, I'm digging into the new capacity plugin and following the capacity plugin user guide tutorial.
I'm using the latest Volcano release (1.11.0), but for some reason my setup does not reclaim from the overcommitted queue.

Steps to reproduce the issue

I've created the following queues:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved: # set the deserved field.
    nvidia.com/gpu: 16
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue2
spec:
  reclaimable: true
  deserved: # set the deserved field.
    nvidia.com/gpu: 24

This is the volcano-scheduler-configmap. After editing it, I restarted all the Volcano system pods (scheduler, admission, controllers) and confirmed in the scheduler logs that the capacity plugin is enabled:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2025-01-30T10:41:15Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "61231342"
  uid: 63e4af02-7a50-4425-ae2c-e5ffc8df9d4f

Then I created deployment demo-1, which creates 3 pods, each requiring 8 GPUs.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-1
spec:
  selector:
    matchLabels:
      app: demo-1
  replicas: 3
  template:
    metadata:
      labels:
        app: demo-1
      annotations:
        scheduling.volcano.sh/queue-name: "queue1" # set the queue
    spec:
      schedulerName: volcano
      containers:
      - name: nginx
        image: nginx:1.14.2
        resources:
          limits:
            nvidia.com/gpu: 8
        ports:
        - containerPort: 80
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

My cluster has 4 free 8-GPU nodes (and also 4 busy ones, scheduled without Volcano), so all of these pods were scheduled:

kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
demo-1-69b88856fd-5889g                                   1/1     Running   0          5m26s
demo-1-69b88856fd-588s2                                   1/1     Running   0          5m26s
demo-1-69b88856fd-gx6tr                                   1/1     Running   0          5m26s
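
(At this point queue1 is using 3 × 8 = 24 GPUs against its deserved 16, i.e. it is overcommitted by 8 GPUs. As a quick check, sketched under the assumption that your Volcano build populates the Queue status and using the v1beta1 field names, you can confirm this directly:)

kubectl get queues.scheduling.volcano.sh queue1 -o jsonpath='{.spec.deserved}{"\n"}{.status.allocated}{"\n"}'
# spec.deserved should list nvidia.com/gpu: 16, while status.allocated should show
# 24 GPUs once all three demo-1 pods are running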

After that, I created another deployment, also with 3 pods requiring 8 GPUs each:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-2
spec:
  selector:
    matchLabels:
      app: demo-2
  replicas: 3
  template:
    metadata:
      labels:
        app: demo-2
      annotations:
        scheduling.volcano.sh/queue-name: "queue2" # set the queue
    spec:
      schedulerName: volcano
      containers:
      - name: nginx
        image: nginx:1.14.2
        resources:
          limits:
            nvidia.com/gpu: 8
        ports:
        - containerPort: 80
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

But for some reason it does not reclaim resources from queue1:

kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
demo-1-69b88856fd-5889g                                   1/1     Running   0          7m
demo-1-69b88856fd-588s2                                   1/1     Running   0          7m
demo-1-69b88856fd-gx6tr                                   1/1     Running   0          7m
demo-2-c98bdd458-5szhc                                    0/1     Pending   0          6m56s
demo-2-c98bdd458-jv7rz                                    1/1     Running   0          6m56s
demo-2-c98bdd458-kzxv4                                    0/1     Pending   0          6m56s
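
(When a pod such as demo-2-c98bdd458-5szhc stays Pending like this, the scheduler's reasoning usually shows up in the pod events; this is a generic check and the exact message depends on the Volcano version:)

kubectl describe pod demo-2-c98bdd458-5szhc
# the Events section should contain the volcano scheduler's explanation for why
# the pod could not be placed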

Describe the results you received and expected

At least 2 pods from queue2 should be scheduled: 1 pod on the free 8-GPU node, and 1 should reclaim resources from the overcommitted queue1 (since queue1's deserved is only 16 GPUs while it is using 24).

What version of Volcano are you using?

1.11.0

Any other relevant information

No response

qGentry added the kind/bug label on Jan 30, 2025
Monokaix (Member) commented Feb 5, 2025

It seems that in v1.11 reclaim only happens when a job is starving; a job is starving when its number of non-pending pods is less than minAvailable. Because the deployment's default minAvailable is 1, and there is already a running pod in demo-2, demo-2 is not starving, hence reclaim won't happen.
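
(You can see this by inspecting the PodGroup that vc-controller generated for the deployment's pods; the object name is derived from the owning workload, so the commands below are a sketch rather than a copy-paste recipe:)

kubectl get podgroups.scheduling.volcano.sh -n default
kubectl get podgroups.scheduling.volcano.sh -n default <podgroup-name> -o yaml
# spec.minMember will be 1 for an auto-created PodGroup, which is why demo-2
# stops counting as starving as soon as one of its pods is Running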

Monokaix (Member) commented Feb 5, 2025

ref: #3951

JesseStutler (Member) commented:

Correct. This is because vc-controller creates a PodGroup with a default MinAvailable=1 for the deployment. Currently we don't have an API to specify this MinAvailable for non vc-job workloads; we already have a feature issue to track it: #3970. If you really need this feature urgently (based on v1.11), we can release it as a patch later.
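
(In the meantime, one possible workaround, sketched here on the assumption that you can pre-create the PodGroup yourself and point the pods at it via the scheduling.k8s.io/group-name annotation, is to set the minMember you actually want; the PodGroup name demo-2-pg below is just an example:)

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: demo-2-pg        # example name, not created by the controller
spec:
  queue: queue2
  minMember: 3           # demo-2 only counts as satisfied once all 3 pods can run

# and in the Deployment's pod template metadata:
#   annotations:
#     scheduling.k8s.io/group-name: "demo-2-pg"   # reference the pre-created PodGroup

(With minMember above the number of running pods, the podgroup should be treated as starving, so reclaim would at least be considered; whether it then succeeds still depends on the capacity plugin's deserved accounting.)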

qGentry (Author) commented Feb 5, 2025

Oh, I see, that makes sense then. Looking forward to the patch.
