NFD master logging gets spammed "failed to update node" #573

Open
bobbeeke opened this issue Sep 1, 2023 · 3 comments
@bobbeeke

bobbeeke commented Sep 1, 2023

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04 LTS
  • Kernel Version: 5.19.0-1025-aws
  • Container Runtime: Containerd v1.7.2
  • K8s Flavor/Version: K8s v1.27.5
  • GPU Operator Version: 23.6.1
  • GPU Operator NFD Version: v0.13.1
  • Helm deploy

2. Issue or feature description

Issue:

(Not sure whether this issue belongs here or in https://github.com/kubernetes-sigs/node-feature-discovery — please let me know if it should be moved.)

Deployed gpu-operator with Helm on a k8s cluster that (currently) has no GPU nodes.
Supplied no special Helm values.

  • gpu-operator deployment deploys
  • gpu-operator-node-feature-discovery-master deployment deploys
  • gpu-operator-node-feature-discovery-worker daemonset deploys

However, the gpu-operator-node-feature-discovery-master pod keeps spamming log lines referencing what appear to be old nodes that no longer exist in my cluster:

[...]
E0901 14:06:18.039362 1 nfd-master.go:782] failed to update node "i-XXXXXXXXXXXXXXXXX": nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.039381 1 nfd-master.go:344] nodes "i-XXXXXXXXXXXXXXXXX" not found
E0901 14:06:18.044052 1 nfd-master.go:782] failed to update node "i-YYYYYYYYYYYYYYYYY": nodes "i-YYYYYYYYYYYYYYYYY" not found
E0901 14:06:18.044073 1 nfd-master.go:344] nodes "i-YYYYYYYYYYYYYYYYY" not found
E0901 14:06:18.047933 1 nfd-master.go:782] failed to update node "i-ZZZZZZZZZZZZZZZZZ": nodes "i-ZZZZZZZZZZZZZZZZZ" not found
E0901 14:06:18.047963 1 nfd-master.go:344] nodes "i-ZZZZZZZZZZZZZZZZZ" not found
[...]

Up to version v23.3.2 (NFD v0.12.1) I got this list of old nodes just once, after startup.
Starting with v23.6.0 (NFD v0.13.1) it keeps spamming the complete list of nodes almost every second, which floods my logging system.

3. Steps to reproduce the issue

Described above

4. Information to attach

@shivamerla
Contributor

@bobbeeke you need to run kubectl get nodefeatures -n gpu-operator and clean up the objects belonging to stale nodes.
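The cleanup suggested above can be sketched as follows. The diff logic is demonstrated with fixed sample data (the node names are placeholders, not from this issue); the commented kubectl commands are what you would run against a real cluster, assuming the gpu-operator namespace mentioned above.

```shell
# Sketch: find NodeFeature objects with no matching Node (sample data stands in for kubectl output).
# In a real cluster you would populate the two files with, e.g.:
#   kubectl get nodes -o name | cut -d/ -f2 | sort > /tmp/nodes
#   kubectl get nodefeatures -n gpu-operator -o name | cut -d/ -f2 | sort > /tmp/nf
printf 'node-a\nnode-b\n' | sort > /tmp/nodes       # current cluster nodes
printf 'i-XXXX\nnode-a\nnode-b\n' | sort > /tmp/nf  # NodeFeature object names
# Names present in /tmp/nf but absent from /tmp/nodes are stale:
comm -13 /tmp/nodes /tmp/nf > /tmp/stale
cat /tmp/stale
# Each stale name could then be deleted with:
#   xargs -r -n1 kubectl delete nodefeature -n gpu-operator < /tmp/stale
```

With the sample data above, only i-XXXX ends up in /tmp/stale; comm requires both inputs sorted, hence the explicit sort.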

@shivamerla
Contributor

@ArangoGutierrez looks like a possible improvement to NFD: ignore NodeFeature objects belonging to stale nodes.

@chiragjn

chiragjn commented Sep 13, 2023

I am facing similar issues: NFD master memory usage keeps increasing and it becomes very unstable, sometimes making existing GPU nodes unusable when it dies.

My current hypothesis is that it enters a relabelling iteration on a node, removes the old labels and marks gpu.present=false, then dies before it gets to relabel the node correctly. That ultimately kills the device plugin, because its node selector matches the gpu.deploy.device-plugin label that was just removed, making every nvidia.com/gpu device unhealthy.
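For context, the dependency described above comes from the device plugin daemonset's node selector. A hedged sketch of the relevant fragment (field shape assumed from the usual gpu-operator labelling convention, not copied from this cluster):

```yaml
# Illustrative fragment only: if NFD removes this label from a node,
# the device-plugin pod no longer matches and is evicted from it.
kind: DaemonSet
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.deploy.device-plugin: "true"
```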

Perhaps upgrading to NFD v0.14.0 might help? It has implemented garbage collection for NodeFeature objects
https://github.com/kubernetes-sigs/node-feature-discovery/releases/tag/v0.14.0

I have upgraded mine, will report back in a few days 🤞

Ah, I just realised gpu-operator vendors its own version of the node-feature-discovery chart 🙃, so I'll open a separate issue there.

For now, I'll try deleting the nodefeatures objects manually and check whether memory usage decreases.
