Replies: 2 comments 1 reply
-
I think I should specify my problem a bit better. I am working with TensorFlow 2.13.0 and the scatter-and-gather workflow. The script I am working on is very similar to the hello-tf2 example; however, I use an experimental dataset that is split across 7 clients. I am running my script on a workstation with 4x NVIDIA Tesla 32 GB GPUs. When I submit my job in POC mode, even with only 4 clients, the run fails. I have already tried a few things, without success.
Is there a way to assign exactly one GPU (with a fixed memory limit) to each client in POC mode? The clients do not have to run in parallel; can I tell the server to train at most 4 clients in parallel, then train the remaining 3, and only then average the weights? As far as I understand, CUDA_VISIBLE_DEVICES is only possible in production/secure mode? I have no idea how to implement this in POC mode.
-
Hi @luiji2425, thanks for the details. Can you try running in the simulator first, with this command: `TF_FORCE_GPU_ALLOW_GROWTH=true nvflare simulator -w /tmp/nvflare/ -n 7 -t 4 -gpu 0,1,2,3 ./jobs/hello-tf2`? Please also make sure you set min_clients to 7 in meta.json and in app/config_fed_server. I will explain the resource management in POC/real-world mode a bit later.
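To make the flag semantics concrete, here is an illustrative sketch of how `-n 7 -t 4 -gpu 0,1,2,3` could map 7 simulated clients onto 4 GPUs. The round-robin policy and the `site-N` client names are assumptions for illustration only; check the NVFlare simulator documentation for the exact scheduling behavior.

```python
def assign_gpus(num_clients, gpus):
    """Round-robin mapping of simulated clients onto the given GPU indices.

    Illustrates -n 7 -t 4 -gpu 0,1,2,3: at most len(gpus) clients run at
    once, one per GPU, and the remaining clients reuse GPUs in later rounds.
    """
    return {f"site-{i + 1}": gpus[i % len(gpus)] for i in range(num_clients)}

mapping = assign_gpus(7, [0, 1, 2, 3])
for client, gpu in mapping.items():
    print(f"{client} -> GPU {gpu}")
# site-1..site-4 land on GPUs 0..3; site-5..site-7 reuse GPUs 0..2
```

The key point is that only 4 clients are ever active at once, which is why 4 GPUs suffice for 7 clients in the simulator.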
-
Python version (`python3 -V`): 3.8
NVFlare version (`python3 -m pip list | grep "nvflare"`): 2.4
NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, `git branch`): 2.3
Operating system: Ubuntu 20.04
Have you successfully run any of the following examples?
Please describe your question
Hello everyone, I'm relatively new to NVFlare and have encountered a specific challenge while trying to run my TensorFlow model on more than four GPUs. Currently, I can simulate 4 clients with automatic allocation, one client per GPU, and it works seamlessly. However, when attempting to run 5 or more clients, I get the same error as here:
https://github.com/NVIDIA/NVFlare/discussions/2159
My question is regarding the implementation of the CUDA_VISIBLE_DEVICES environment variable mentioned by @yhwen. I'm unsure about where and how to define this variable. Do I need to make any changes in the source code for this to take effect?
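For what it's worth, a common way to use CUDA_VISIBLE_DEVICES without changing any source code is to set it in the process environment before the GPU framework is imported, e.g. in a per-client launch wrapper. This is a minimal, framework-agnostic sketch; the GPU index shown is an assumed example value, not something prescribed by NVFlare:

```python
import os

# Pin this process to a single GPU. This must happen before TensorFlow
# (or any other CUDA-using library) is imported, because the visible
# device list is read at initialization time.
gpu_index = "2"  # assumed example value; in practice, pass it per client
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_index

# import tensorflow as tf  # TF would now see exactly one GPU, as device 0
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints "2"
```

Equivalently, the variable can be set on the command line when starting a client process, e.g. `CUDA_VISIBLE_DEVICES=2 python3 client_script.py` (client_script.py is a hypothetical name for whatever starts the client).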
Thank you in advance!