NFD master logging gets spammed "failed to update node" #573
Comments
@bobbeeke you need to run
@ArangoGutierrez looks like an improvement to NFD to ignore those for stale nodes.
I am facing similar issues: NFD master memory usage keeps increasing, and it sometimes gets very unstable, making existing GPU nodes unusable if it dies. My current hypothesis is that it enters a relabelling iteration on a node, removes older labels, and marks

Perhaps upgrading to NFD v0.14.0 might help? It has implemented garbage collection for NodeFeature objects.

Ah, I just realised, gpu-operator vendors its own version of the feature discovery chart 🙃, so I'm opening a new issue then. For now, I'll try deleting the NodeFeature objects manually and check whether memory usage decreases.
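For the manual cleanup mentioned above, here is a rough sketch of how one might pick out NodeFeature objects whose node no longer exists. It assumes the two name lists have already been fetched separately (for example with kubectl) and that each NodeFeature object is named after its node; that naming is an assumption here and is not confirmed in this thread. The function name `stale_node_features` is made up for illustration.

```python
def stale_node_features(node_feature_names, node_names):
    """Return NodeFeature names that have no matching node.

    Assumption (not confirmed in this thread): a NodeFeature object
    carries the same name as the node it describes.
    """
    existing = set(node_names)
    # Keep only the NodeFeature names with no corresponding node, sorted for stable output.
    return sorted(name for name in node_feature_names if name not in existing)


if __name__ == "__main__":
    # Placeholder names; in practice these would come from the cluster.
    features = ["node-a", "node-b", "node-c"]
    nodes = ["node-b"]
    for name in stale_node_features(features, nodes):
        print(name)
```

The result is only a list of candidates; it would be worth checking each one by hand before deleting anything.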
1. Quick Debug Information
2. Issue or feature description
Issue:
(Not sure if this issue should be posted here or in 'https://github.com/kubernetes-sigs/node-feature-discovery', please let me know if so.)
Deployed gpu-operator with helm on a k8s cluster (currently) without GPU nodes.
Supplied no special helm values.
gpu-operator deployment deploys
gpu-operator-node-feature-discovery-master deployment deploys
gpu-operator-node-feature-discovery-worker daemonset deploys
However, the gpu-operator-node-feature-discovery-master pod keeps spamming log lines about what seem to be old nodes that no longer exist in my cluster:
[...]
E0901 14:06:18.039362 1 nfd-master.go:782] failed to update node "i-XXXXXXXXXXXXXXXXX": nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.039381 1 nfd-master.go:344] nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.044052 1 nfd-master.go:782] failed to update node "i-YYYYYYYYYYYYYYYYY": nodes "i-YYYYYYYYYYYYYYYYY" not found
E0901 14:06:18.044073 1 nfd-master.go:344] nodes "i-YYYYYYYYYYYYYYYYY" not found
E0901 14:06:18.047933 1 nfd-master.go:782] failed to update node "i-ZZZZZZZZZZZZZZZZZ": nodes "i-ZZZZZZZZZZZZZZZZZ" not found
E0901 14:06:18.047963 1 nfd-master.go:344] nodes "i-ZZZZZZZZZZZZZZZZZ" not found
[...]
Up to version v23.3.2 (NFD version: v0.12.1), I got this list of old nodes just once after startup.
Starting with v23.6.0 (NFD version: v0.13.1), it keeps spamming this complete list of nodes almost every second, which floods my logging system.
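To put a number on the flooding, a small sketch like the following could count the "failed to update node" lines per node in a saved copy of the nfd-master log. The regular expression is based only on the log excerpt above; the function name `count_failed_updates` and the sample text are illustrative assumptions.

```python
import re
from collections import Counter

# Matches the error message format seen in the log excerpt in this issue.
FAILED_UPDATE = re.compile(r'failed to update node "([^"]+)"')


def count_failed_updates(log_text):
    """Count 'failed to update node' lines per node name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_UPDATE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the excerpt in this issue (node names are redacted there).
SAMPLE = '''\
E0901 14:06:18.039362 1 nfd-master.go:782] failed to update node "i-XXXXXXXXXXXXXXXXX": nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.039381 1 nfd-master.go:344] nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.044052 1 nfd-master.go:782] failed to update node "i-YYYYYYYYYYYYYYYYY": nodes "i-YYYYYYYYYYYYYYYYY" not found
E0901 14:06:18.044073 1 nfd-master.go:344] nodes "i-YYYYYYYYYYYYYYYYY" not found
'''

if __name__ == "__main__":
    for node, n in count_failed_updates(SAMPLE).most_common():
        print(node, n)
```

Running this over logs from both versions for the same time window would show whether each stale node is reported once after startup or repeatedly.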
3. Steps to reproduce the issue
Described above
4. Information to [attach]