Pod keeps restarting because there is no configuration file in the federated-learning-surface-defect example with image v0.5.1 #388

Open
xinzongyan opened this issue Jan 11, 2023 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@xinzongyan

xinzongyan commented Jan 11, 2023

What happened:
The federated-learning-surface-defect-detection-train worker needs a configuration file, but none is provided, and the documentation does not mention it.

Logs of the aggregation worker:

[root@board1 ~]# kubectl logs -f federated-learning-surface-defect-detection-aggregation-c67cq
2023-01-11 03:17:01.881770: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:01.881802: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-01-11 03:17:03.286409: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:03.286435: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2023-01-11 03:17:03.286457: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (federated-learning-surface-defect-detection-aggregation-c67cq): /proc/driver/nvidia/version does not exist
2023-01-11 03:17:03.286692: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-11 03:17:03.292278: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3312000000 Hz
2023-01-11 03:17:03.292504: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x224f970 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-01-11 03:17:03.292518: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "aggregate.py", line 35, in <module>
    run_server()
  File "aggregate.py", line 29, in run_server
    chooser=simple_chooser)
  File "/home/lib/sedna/service/server/aggregation.py", line 280, in __init__
    server = Config().server._asdict()
  File "/home/plato/plato/config.py", line 136, in __new__
    raise ValueError("A configuration file must be supplied.")
ValueError: A configuration file must be supplied.

Logs of the train worker:

[root@board2 ~]# docker logs -f k8s_train-worker_federated-learning-surface-defect-detection-train-94npn_default_9b39d8bf-c0bc-48b7-b929-d6a646e8b60d_2
2023-01-11 03:17:46.909824: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:46.909851: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-01-11 03:17:49.472157: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:49.472186: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2023-01-11 03:17:49.472209: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (board2): /proc/driver/nvidia/version does not exist
2023-01-11 03:17:49.472517: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-11 03:17:49.477353: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3312000000 Hz
2023-01-11 03:17:49.477533: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x267cac0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-01-11 03:17:49.477543: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-01-11 03:17:49.478930: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 102629376 exceeds 10% of free system memory.
2023-01-11 03:17:49.537921: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 102629376 exceeds 10% of free system memory.
Traceback (most recent call last):
  File "train.py", line 60, in <module>
    main()
  File "train.py", line 55, in main
    transmitter=s3_transmitter)
  File "/home/lib/sedna/core/federated_learning/federated_learning.py", line 196, in __init__
    server = Config().server._asdict()
  File "/home/plato/plato/config.py", line 136, in __new__
    raise ValueError("A configuration file must be supplied.")
ValueError: A configuration file must be supplied.
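
Both tracebacks end in the same place: Sedna constructs `Config()`, and `__new__` in `/home/plato/plato/config.py` raises because no configuration file was found. The following is a minimal sketch of that failure mode, not Plato's actual code. The `Config` class, the `DEMO_CONFIG_FILE` environment variable, and the `try_load` helper are all hypothetical and only illustrate how a singleton loader can fail at construction time when the file is missing.

```python
import os


class Config:
    """Hypothetical stand-in for a singleton config loader (not Plato's real code)."""

    _instance = None

    def __new__(cls, path=None):
        if cls._instance is None:
            # Assumption: the path comes from an argument or an environment variable.
            path = path or os.environ.get("DEMO_CONFIG_FILE")
            if not path or not os.path.isfile(path):
                # Same message as in the tracebacks above.
                raise ValueError("A configuration file must be supplied.")
            instance = super().__new__(cls)
            with open(path, encoding="utf-8") as fh:
                instance.text = fh.read()
            cls._instance = instance
        return cls._instance

    def __init__(self, path=None):
        # All loading happens in __new__; nothing to do here.
        pass


def try_load(path):
    """Return 'ok' if the config loads, otherwise the error message."""
    Config._instance = None  # reset the singleton so each call starts fresh
    try:
        Config(path)
        return "ok"
    except ValueError as exc:
        return str(exc)
```

If the real loader behaves similarly, the pod will crash on every start until a configuration file is present at the location the loader checks, which would match the restart loop described in this issue.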

What you expected to happen:
The containers run normally and train on the dataset.

How to reproduce it (as minimally and precisely as possible):
I used the image kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.5.1.
Please use this image to reproduce the problem.

Anything else we need to know?:

Environment:

Sedna Version
$ kubectl get -n sedna deploy gm -o jsonpath='{.spec.template.spec.containers[0].image}'
kubeedge/sedna-gm:v0.5.1

$ kubectl get -n sedna ds lc -o jsonpath='{.spec.template.spec.containers[0].image}'
kubeedge/sedna-lc:v0.5.1
Kubernetes Version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
KubeEdge Version
$ cloudcore --version
Version: v1.12.1

$ edgecore --version
Version: v1.12.1
@xinzongyan xinzongyan added the kind/bug Categorizes issue or PR as related to a bug. label Jan 11, 2023
@ylfbx329

Have you solved it? How did you do it? @xinzongyan

@xinzongyan
Author

No. I changed the image to v0.4.0; that version runs to completion.

@ylfbx329

ylfbx329 commented Jan 11, 2023

Did you change the image to v0.4.0 in build_image.sh and then run kubectl create -f xxx.yaml? @xinzongyan

@xinzongyan
Author

Use this image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 @ylfbx329
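
The workaround reported here is to pin the train worker image to v0.4.0. As a rough sketch, assuming the job manifest references the image as a plain `name:tag` string, a small helper can rewrite the tag before the manifest is applied. The `pin_image_tag` function and the one-line manifest fragment are hypothetical illustrations, not part of Sedna.

```python
import re


def pin_image_tag(manifest_text, image, tag):
    """Rewrite every `image:<old-tag>` reference in the text to `image:<tag>`."""
    # Match the exact image name followed by a colon and a tag-like token.
    pattern = re.compile(re.escape(image) + r":[\w.\-]+")
    return pattern.sub(lambda _match: f"{image}:{tag}", manifest_text)


TRAIN_IMAGE = "kubeedge/sedna-example-federated-learning-surface-defect-detection-train"

# Hypothetical manifest fragment; a real job manifest contains many more fields.
fragment = f"image: {TRAIN_IMAGE}:v0.5.1\n"
pinned = pin_image_tag(fragment, TRAIN_IMAGE, "v0.4.0")
```

Because the pattern requires a colon directly after the image name, other images that merely share a prefix are left untouched.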

@JoeyHwong-gk
Contributor

/cc @XinYao1994 Would you mind taking a look at this issue?
