NFD master logging gets spammed "failed to update node" #573

Open
bobbeeke opened this issue Sep 1, 2023 · 3 comments
@bobbeeke

bobbeeke commented Sep 1, 2023

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04 LTS
  • Kernel Version: 5.19.0-1025-aws
  • Container Runtime: Containerd v1.7.2
  • K8s Flavor/Version: K8s v1.27.5
  • GPU Operator Version: 23.6.1
  • GPU Operator NFD Version: v0.13.1
  • Helm deploy

2. Issue or feature description

Issue:

(Not sure whether this issue belongs here or in https://github.com/kubernetes-sigs/node-feature-discovery — please let me know if it should be moved.)

Deployed gpu-operator with Helm on a k8s cluster that (currently) has no GPU nodes.
Supplied no special Helm values.

  • gpu-operator deployment deploys
  • gpu-operator-node-feature-discovery-master deployment deploys
  • gpu-operator-node-feature-discovery-worker daemonset deploys

However, the gpu-operator-node-feature-discovery-master pod keeps spamming log lines referencing what appear to be old nodes that no longer exist in my cluster:

[...]
E0901 14:06:18.039362 1 nfd-master.go:782] failed to update node "i-XXXXXXXXXXXXXXXXX": nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.039381 1 nfd-master.go:344] nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.044052 1 nfd-master.go:782] failed to update node "i-YYYYYYYYYYYYYYYYY": nodes "i-YYYYYYYYYYYYYYYYY" not found
E0901 14:06:18.044073 1 nfd-master.go:344] nodes "i-YYYYYYYYYYYYYYYYY" not found
E0901 14:06:18.047933 1 nfd-master.go:782] failed to update node "i-ZZZZZZZZZZZZZZZZZ": nodes "i-ZZZZZZZZZZZZZZZZZ" not found
E0901 14:06:18.047963 1 nfd-master.go:344] nodes "i-ZZZZZZZZZZZZZZZZZ" not found
[...]

Up to version v23.3.2 (NFD v0.12.1) I got this list of old nodes just once, after startup.
Starting with v23.6.0 (NFD v0.13.1) it keeps spamming the complete list of nodes almost every second, which floods my logging system.

3. Steps to reproduce the issue

Described above

4. Information to attach

@shivamerla
Contributor

@bobbeeke you need to run kubectl get nodefeatures -n gpu-operator and clean up the objects belonging to stale nodes.
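The cleanup suggested above can be sketched as follows. The diff logic is demonstrated with fixed sample data (the node names are placeholders, not from this issue); the commented kubectl commands are what you would run against a real cluster, assuming the gpu-operator namespace mentioned above.

```shell
# Sketch: find NodeFeature objects with no matching Node (sample data stands in for kubectl output).
# In a real cluster you would populate the two files with, e.g.:
#   kubectl get nodes -o name | cut -d/ -f2 | sort > /tmp/nodes
#   kubectl get nodefeatures -n gpu-operator -o name | cut -d/ -f2 | sort > /tmp/nf
printf 'node-a\nnode-b\n' | sort > /tmp/nodes       # current cluster nodes
printf 'i-XXXX\nnode-a\nnode-b\n' | sort > /tmp/nf  # NodeFeature object names
# Names present in /tmp/nf but absent from /tmp/nodes are stale:
comm -13 /tmp/nodes /tmp/nf > /tmp/stale
cat /tmp/stale
# Each stale name could then be deleted with:
#   xargs -r -n1 kubectl delete nodefeature -n gpu-operator < /tmp/stale
```

With the sample data above, only i-XXXX ends up in /tmp/stale; comm requires both inputs sorted, hence the explicit sort.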

@shivamerla
Contributor

@ArangoGutierrez looks like a possible improvement to NFD: ignore NodeFeature objects belonging to stale nodes.

@chiragjn

chiragjn commented Sep 13, 2023

I am facing similar issues: NFD master memory usage keeps increasing and it becomes very unstable, sometimes making existing GPU nodes unusable when it dies.

My current hypothesis is that it enters a relabelling iteration on a node, removes the old labels and marks gpu.present=false, then dies before it gets to relabel the node correctly. That ultimately kills the device plugin, because its node selector matches the gpu.deploy.device-plugin label that was just removed, making every nvidia.com/gpu device unhealthy.
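For context, the dependency described above comes from the device plugin daemonset's node selector. A hedged sketch of the relevant fragment (field shape assumed from the usual gpu-operator labelling convention, not copied from this cluster):

```yaml
# Illustrative fragment only: if NFD removes this label from a node,
# the device-plugin pod no longer matches and is evicted from it.
kind: DaemonSet
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.deploy.device-plugin: "true"
```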

Perhaps upgrading to NFD v0.14.0 might help? It has implemented garbage collection for NodeFeature objects
https://github.com/kubernetes-sigs/node-feature-discovery/releases/tag/v0.14.0

I have upgraded mine, will report back in a few days 🤞

Ah, I just realised gpu-operator vendors its own version of the node-feature-discovery chart 🙃, so I'll open a separate issue there.

For now, I'll try deleting the nodefeatures objects manually and check whether memory usage decreases.
