Device Resources Not Reflected in Allocatable Count After Pod Creation #11377

Closed
RonaldFletcher opened this issue Nov 27, 2024 · 2 comments

RonaldFletcher commented Nov 27, 2024

Environmental Info:

K3s Version:

k3s version v1.28.14+k3s1 (3ef2bdb)

Environment="K3S_ARGS= --cluster-cidr=10.44.0.0/16 --disable-network-policy --disable-cloud-controller --disable=local-storage,servicelb,traefik --kubelet-arg=eviction-hard=memory.available<5%,nodefs.available<10% --kubelet-arg=kube-reserved=cpu=200m,memory=250Mi,ephemeral-storage=1Gi --kubelet-arg=system-reserved=cpu=200m,memory=250Mi,ephemeral-storage=1Gi --kube-proxy-arg=proxy-mode=ipvs

Describe the bug:

After creating and running a Pod with resource requests, the allocatable device count on the node does not change as expected, even though the Pod is running successfully. Additionally, when attempting to create another Pod requesting the same resources, the new Pod remains in a Pending state.

pod resource details:

    imagePullPolicy: IfNotPresent
    name: compute
    resources:
      limits:
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        nvidia.com/TU104GL_TESLA_T4: "1"
      requests:
        cpu: 200m
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        ephemeral-storage: 50M
        memory: "5439885824"
        nvidia.com/TU104GL_TESLA_T4: "1"

node details:

    allocatable:
      cpu: 47600m
      memory: "127036041936"
      nvidia.com/TU104GL_TESLA_T4: "1"

    capacity:
      cpu: "48"
      memory: 131126984Ki
      nvidia.com/TU104GL_TESLA_T4: "1"

Steps To Reproduce:

  • Installed K3s:

Expected behavior:

    allocatable:
      cpu: 47600m
      memory: "127036041936"
      nvidia.com/TU104GL_TESLA_T4: "0"

    capacity:
      cpu: "48"
      memory: 131126984Ki
      nvidia.com/TU104GL_TESLA_T4: "1"

Actual behavior:
The allocatable resources on the node did not change after the first Pod was running successfully. The second Pod, which requested the same resources, remained in a Pending state, suggesting that the resources were not available, but the allocatable count did not reflect this.
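
One way to confirm what the scheduler is actually tracking (a rough sketch; the pod and node names below are placeholders) is to check the Pending Pod's scheduling events and the node's "Allocated resources" section, which sums the requests of running pods, including extended resources:

    # The scheduler records a FailedScheduling event naming the resource it
    # could not satisfy (e.g. "Insufficient nvidia.com/TU104GL_TESLA_T4"):
    kubectl describe pod <pending-pod-name>

    # "Allocated resources" in the node description shows what running pods
    # have already requested, which is where the consumed GPU should appear:
    kubectl describe node <node-name>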

brandond (Member) commented Nov 27, 2024

I don't see any evidence that this is a k3s issue. None of the affected code lives in this project. Have you verified that this only affects k3s?

Most likely you should report this to the upstream Kubernetes project or the NVIDIA operator, probably the latter, since that is what is responsible for tracking the GPU resource.

brandond (Member) commented Nov 27, 2024

That said, I don't think that is how allocatable resources work. If I remember correctly, it is not a measure of what remains after the resources currently in use by pods are subtracted. It is the total usable by the kubelet, after the kube-reserved and system-reserved resources are subtracted.

You should read the Kubernetes docs for more information on what these fields indicate.
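
As a rough sketch of that accounting (using the kube-reserved and system-reserved values from the k3s args above; the node name is a placeholder):

    # Allocatable is derived once from capacity, not decremented as pods schedule:
    #   allocatable = capacity - kube-reserved - system-reserved - eviction-thresholds
    # For CPU on this node: 48 cores - 200m (kube-reserved) - 200m (system-reserved)
    # = 47600m, matching the node status above. Extended resources such as
    # nvidia.com/TU104GL_TESLA_T4 remain at the count advertised by the device plugin.
    # Requests consumed by running pods show up under "Allocated resources" instead:
    kubectl describe node <node-name>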

brandond closed this as not planned on Nov 27, 2024
The github-project-automation bot moved this from New to Done Issue in K3s Development on Nov 27, 2024