xpk Cluster Queue resource group "cpu" resource quota incorrect for a CPU-only cluster #158

bernardhan33 · 2024-07-14T22:45:31Z

We have a CPU-only cluster, n2-standard-32-1024, that has 1024 n2-standard-32 nodes. It should therefore have roughly 1024 * 32 CPUs of quota, but I'm seeing a nominal quota of 1024 in the output of kubectl describe clusterqueue cluster-queue:

Name:         cluster-queue
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         ClusterQueue
Metadata:
  Creation Timestamp:  2024-07-02T17:34:06Z
  Finalizers:
    kueue.x-k8s.io/resource-in-use
  Generation:        2
  Resource Version:  116469727
  UID:               2db06149-9a17-45a2-bae0-5ada11705b68
Spec:
  Flavor Fungibility:
    When Can Borrow:   Borrow
    When Can Preempt:  TryNextFlavor
  Namespace Selector:
  Preemption:
    Borrow Within Cohort:
      Policy:               Never
    Reclaim Within Cohort:  Never
    Within Cluster Queue:   LowerPriority
  Queueing Strategy:        BestEffortFIFO
  Resource Groups:
    Covered Resources:
      cpu
    Flavors:
      Name:  1xn2-standard-32-1024
      Resources:
        Name:           cpu
        Nominal Quota:  1024
  Stop Policy:          None
Status:
  Admitted Workloads:  0
  Conditions:
    Last Transition Time:  2024-07-02T17:34:06Z
    Message:               Can admit new workloads
    Reason:                Ready
    Status:                True
    Type:                  Active
  Flavors Reservation:
    Name:  1xn2-standard-32-1024
    Resources:
      Borrowed:  0
      Name:      cpu
      Total:     0
  Flavors Usage:
    Name:  1xn2-standard-32-1024
    Resources:
      Borrowed:         0
      Name:             cpu
      Total:            0
  Pending Workloads:    0
  Reserving Workloads:  0
Events:                 <none>
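
For reference, the quota expected above can be worked out as follows. This is a minimal sketch: the node count and vCPUs per node are read off the cluster name n2-standard-32-1024, the function name is hypothetical, and the result is only an upper bound, since the allocatable CPU on each node is normally a little below its capacity.

```python
# Values assumed from the cluster name n2-standard-32-1024.
NODES = 1024
VCPUS_PER_NODE = 32  # an n2-standard-32 VM exposes 32 vCPUs


def expected_cpu_quota(nodes: int, vcpus_per_node: int) -> int:
    """Upper bound on the cluster's CPU quota, ignoring per-node reservations."""
    return nodes * vcpus_per_node


print(expected_cpu_quota(NODES, VCPUS_PER_NODE))  # 32768, versus the 1024 shown above
```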


Obliviour · 2024-07-16T17:12:41Z

@RoshaniN curious if you can confirm my understanding?

Does nominal quota here represent the number of VMs/GKE nodes, which in this case is 1024, or should it equal the number of CPUs (chips/cores?)? From my understanding, Kueue would want the number of VMs/GKE nodes (1024) in order to schedule workloads, right? Or should Kueue be getting the 1024 * 32 value?

bernardhan33 · 2024-07-16T17:18:43Z

Per https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/, I would think it's the CPU resources within that resource group, so it should be roughly 1024 * 32.

It seems that XPK assigns it here. That is probably correct when the resource type is TPU or GPU, where the number of chips represents the resources, but for CPU it is different?

RoshaniN · 2024-07-16T17:37:25Z

I believe this implementation for CPUs is working as intended. Based on the examples in https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#resources, for CPUs the nominal quota amounts to the number of CPUs (or number of VMs).

There is no correlation between the type of the CPUs (n2-standard-32 (32 vCPUs) or e2-standard-4 (4 vCPUs)) and the nominal quota.

TPUs and GPUs have physical chips and the resources can be more granularly partitioned, if required.

Is there a problem that we are seeing, or is Kueue accepting and queueing CPU requests as intended?

bernardhan33 · 2024-07-16T17:40:15Z

Shouldn't "number of CPUs" be equal to the number of VMs * the CPUs of each VM?

> Is there a problem that we are seeing, or is Kueue accepting and queueing CPU requests as intended?

Correct, there is a problem. Taking n2-standard-32-1024 as an example, a resource request of

resources:
  requests:
    cpu: 20000m

fails to get accepted, although it should be: each pod asks for 20 CPUs and can fit on a single node.
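
One way to see why the queue would reject such a job is sketched below. It assumes that Kueue admits a workload only if the sum of its pods' CPU requests fits within the nominal quota, with no borrowing; the thread does not say how many pods the job had, so the helper name and the pod counts here are purely illustrative.

```python
def max_pods_within_quota(nominal_quota_cpu: int, cpu_per_pod: int) -> int:
    """Largest number of identical pods whose total CPU request fits in the quota."""
    return nominal_quota_cpu // cpu_per_pod


# With the quota xpk configured (1024), only a small job of 20-CPU pods fits.
print(max_pods_within_quota(1024, 20))       # 51
# With a quota of nodes * vCPUs (1024 * 32), far more pods fit.
print(max_pods_within_quota(1024 * 32, 20))  # 1638
```

Under these assumptions, any job with more than 51 such pods would stay pending in the queue even though each pod fits on a single node.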

bernardhan33 · 2024-07-16T17:42:33Z

Also https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu.

RoshaniN · 2024-07-16T17:51:27Z

I found this - https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu

Edit: refreshed this issue to see that Bernard also posted this link :)
It looks like, in the Kubernetes notion of CPU, number of CPUs = number of VMs * the number of vCPUs on each VM.

Requests and limits (mentioned in the workload) should then also follow this notation.

bernardhan33 · 2024-07-16T17:59:34Z

yep, thanks @RoshaniN! So the nominal quota should be equal to the number of VMs * the number of vCPUs on each VM.

I guess I was also sure about it because submitting the job with

resources:
  requests:
    cpu: 20000m

through the queue fails but without the queue it schedules successfully ;)

RoshaniN · 2024-07-16T18:04:12Z

Thanks @bernardhan33 for checking that it fails with cpu: 20000m. Could you check whether this passes or fails with cpu: 20? I'm trying to understand if the notions / quantities are different.

bernardhan33 · 2024-07-16T18:06:00Z

yeah 20 fails too. 20000m == 20, so it will be evaluated to an equivalent resource.
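
The equivalence of the two spellings can be checked with a small parser. This is a simplified sketch that handles only plain numbers and the milli-CPU suffix "m" described in the Kubernetes docs linked above; parse_cpu is a hypothetical helper, not a Kubernetes or Kueue API.

```python
from fractions import Fraction


def parse_cpu(quantity: str) -> Fraction:
    """Convert a CPU quantity such as '20000m' or '20' into whole CPU units."""
    q = quantity.strip()
    if q.endswith("m"):
        # The 'm' suffix means milli-CPU: 1000m is one CPU.
        return Fraction(int(q[:-1]), 1000)
    return Fraction(q)


print(parse_cpu("20000m") == parse_cpu("20"))  # True
```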

bernardhan33 · 2024-07-16T18:06:31Z

btw I'm not blocked on this -- I can kubectl edit the queue configuration. But this ticket is for future usage of the CPU-only cluster spun up by xpk ;)
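
As an alternative to editing interactively, the same workaround could be scripted along these lines. This is a sketch under assumptions: the JSON Patch path presumes the ClusterQueue has a single resource group, flavor, and resource (as in the describe output above), and nominal_quota_patch is a hypothetical helper that only builds the patch and prints a kubectl patch command rather than running it.

```python
import json
import shlex


def nominal_quota_patch(new_quota: int) -> str:
    """Build a JSON Patch that replaces the first resource's nominalQuota."""
    ops = [{
        "op": "replace",
        # Index 0 everywhere: assumes one resource group, one flavor, one resource.
        "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota",
        "value": str(new_quota),
    }]
    return json.dumps(ops)


patch = nominal_quota_patch(1024 * 32)
print("kubectl patch clusterqueue cluster-queue --type=json -p " + shlex.quote(patch))
```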

RoshaniN · 2024-07-16T18:07:44Z

> yeah 20 fails too. 20000m == 20, so it will be evaluated to an equivalent resource.

Yes, wanted to be sure of that. Thanks @bernardhan33

Obliviour assigned RoshaniN Jul 16, 2024
