Insufficient memory for daemonset - no new node #7406

Open
JulesClaussen opened this issue Nov 19, 2024 · 6 comments
Labels
bug (Something isn't working), needs-triage (Issues that need to be triaged)

Comments

@JulesClaussen

Description

Observed Behavior:

Sometimes (I can't tell when), some daemonset pods cannot be scheduled due to "Insufficient memory".
Karpenter doesn't start a new (bigger) node that would allow the daemonset pod to be scheduled.
This is not the same issue as kubernetes-sigs/karpenter#731: it occurs for older daemonsets as well, and for recently created nodes.

Expected Behavior:
Karpenter should start a new node that would allow all daemonsets and workload pods to be scheduled.

Reproduction Steps (Please include YAML):
I don't know how to reproduce this. It happens often, but I can't pinpoint why it does or doesn't.
An affected node contains some kube-system pods (aws-node, kube-proxy, secrets-store) as well as some workload pods (which have a PDB).

Versions:

  • Chart Version: 1.0.8
  • Kubernetes Version (kubectl version): 1.31.0
@JulesClaussen JulesClaussen added the bug and needs-triage labels Nov 19, 2024
@mariuskimmina

Just a daemonset? Karpenter will not create a new node for a daemonset alone; that would be the same as an empty node. Also, I believe daemonsets should land on a node before any workload, so insufficient memory for one sounds very odd. It would be great if you could capture some logs and events the next time this happens.
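For example, something along these lines (the pod name and namespaces are placeholders; adjust them to your setup):

  # events for the pending daemonset pod
  kubectl describe pod <pending-pod> -n <namespace>

  # all FailedScheduling events in the cluster
  kubectl get events -A --field-selector reason=FailedScheduling

  # Karpenter controller logs (deployment name/namespace depend on how it was installed)
  kubectl logs -n kube-system deployment/karpenter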

@JulesClaussen
Author

JulesClaussen commented Nov 19, 2024

I agree that a daemonset on its own is pointless. But the daemonset here is used to collect instance network metrics (it's ethtool-exporter), so when it's not running on a node we miss those metrics there. I believe Karpenter should calculate the requests required for all pods to be scheduled (daemonsets and others) when sizing the instance it selects.

Not sure what kind of logs you would like to see. These are the events from my failed pods:

  Warning  FailedScheduling  57m (x161 over 85m)    default-scheduler  0/23 nodes are available: 1 Insufficient memory. preemption: 0/23 nodes are available: 1 No preemption victims found for incoming pod, 22 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  2m54s (x315 over 86m)  default-scheduler  0/22 nodes are available: 1 Insufficient memory. preemption: 0/22 nodes are available: 1 No preemption victims found for incoming pod, 21 Preemption is not helpful for scheduling.

Karpenter does not show anything. Those events might not be picked up by Karpenter, so unfortunately I don't have many logs from the Karpenter side.
I have looked through the Karpenter logs over the last few days, and I only seem to have entries about nodes being registered and initialized.
Should I look at something more specific, @mariuskimmina?

@mariuskimmina

mariuskimmina commented Nov 19, 2024

You could try adding priorityClassName: system-node-critical to your daemonset; this way it should evict other workloads and make room for the daemonset. The evicted workload being unschedulable would then trigger Karpenter to create new nodes (whereas a pending daemonset pod does not).
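Roughly like this (the daemonset name, image, and requests below are just placeholders):

  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: ethtool-exporter
  spec:
    selector:
      matchLabels:
        app: ethtool-exporter
    template:
      metadata:
        labels:
          app: ethtool-exporter
      spec:
        # built-in priority class; lets the pod preempt lower-priority workloads
        priorityClassName: system-node-critical
        containers:
          - name: exporter
            image: example.com/ethtool-exporter:latest  # placeholder image
            resources:
              requests:
                cpu: 50m
                memory: 64Mi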

@JulesClaussen
Author

Yes, this is indeed a good workaround, and we implemented it for other daemonsets as they're more critical. But shouldn't Karpenter be able to handle that? I'm surprised there are no logs in Karpenter; it looks like the event is not processed (or it is, but not verbosely).

@gnuletik

gnuletik commented Nov 21, 2024

@mariuskimmina Thanks for the fast feedback!

I work with @JulesClaussen and we haven't been able to find a good solution for this issue, which has been happening quite often recently.

Regarding the solution you mentioned (setting a priorityClassName), I checked the issue you linked before the edit, and multiple people agree that this is not an appropriate solution for this.

We should NOT have to set priority classes on every single resource in the cluster.

kubernetes-sigs/karpenter#731 (comment)

I think that this comment also applies here: forcing the scheduler to reschedule our pods with a priorityClassName is not a reliable solution.

Is it possible that Karpenter incorrectly computes the daemonset resources? Is there a way to check that?
All resource requests are properly set on our daemonsets.
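(We checked them with something like the command below; the daemonset name and namespace are just the ones from our setup.)

  # dump the requests on the daemonset's pod template
  kubectl get daemonset ethtool-exporter -n monitoring \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources.requests}{"\n"}{end}'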

But we have multiple tolerations on the daemonset that triggers the issue:

  tolerations:
    - key: some-custom-key
      operator: Equal
      value: some-value
      effect: NoSchedule
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
    - key: node.kubernetes.io/disk-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/memory-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/pid-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/unschedulable
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/network-unavailable
      operator: Exists
      effect: NoSchedule

Is that something Karpenter takes into account when it computes the resources required for an upcoming node?
Some nodes are created by a NodePool with the taint some-custom-key=some-value:NoSchedule.

Another thing that could trigger the issue: we set a startupTaint when the node is created.
This startupTaint is removed by a controller once our monitoring daemonset's pods are ready.
I guess that could interfere with the way Karpenter computes the daemonset resources.
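For what it's worth, the tolerations listed above don't include one for that startup taint; if Karpenter's simulation expects daemonset pods to tolerate it, the toleration would look like this (key and value taken from the NodePool spec below):

  tolerations:
    - key: node-starting
      operator: Equal
      value: "true"
      effect: NoSchedule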

Here's a sample of our NodePool spec:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: custom-pool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: tasks

      expireAfter: Never

      taints:
        - key: some-custom-key
          value: some-value
          effect: NoSchedule

      startupTaints:
        # this taint is removed by nodetaint once critical pods (agents that report logs) are ready
        - key: node-starting
          value: "true"
          effect: NoSchedule

WDYT?

@ebizboy

ebizboy commented Nov 26, 2024

I also encountered an issue where pods couldn't be scheduled on nodes after assigning resources to daemonsets. To avoid this problem, I didn't assign resources to the daemonsets and instead allocated slightly more generous resources to deployments and statefulsets. It would be great if Karpenter could handle this.
