Replies: 2 comments 1 reply
-
I think I should specify my problem a bit better. I am working with TensorFlow 2.13.0 and the scatter-and-gather workflow. The script I am working on is very similar to the hello-tf2 example; however, I use an experimental dataset that is split across 7 clients. I am running my script on a workstation with 4x NVIDIA Tesla 32 GB GPUs. When I submit my job in POC mode, even with only 4 clients, the run fails. I have already tried a few things, without success.
Is there a way to assign exactly one GPU (with a fixed memory limit) to each client in POC mode? The clients do not have to run in parallel; can I tell the server to train at most 4 clients in parallel, then train the remaining 3, and only then average the weights? As far as I understand, CUDA_VISIBLE_DEVICES is only possible in production/secure mode? I have no idea how to implement this in POC mode.
-
Hi @luiji2425, thanks for the details. Can you try running in the simulator first, with this command: `TF_FORCE_GPU_ALLOW_GROWTH=true nvflare simulator -w /tmp/nvflare/ -n 7 -t 4 -gpu 0,1,2,3 ./jobs/hello-tf2`? Please also make sure you set min_clients to 7 in meta.json and in app/config_fed_server. I will explain the resource management in POC/real-world mode a bit later.
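To make the flag semantics concrete, here is an illustrative sketch of how `-n 7 -t 4 -gpu 0,1,2,3` could map 7 simulated clients onto 4 GPUs. The round-robin policy and the `site-N` client names are assumptions for illustration only; check the NVFlare simulator documentation for the exact scheduling behavior.

```python
def assign_gpus(num_clients, gpus):
    """Round-robin mapping of simulated clients onto the given GPU indices.

    Illustrates -n 7 -t 4 -gpu 0,1,2,3: at most len(gpus) clients run at
    once, one per GPU, and the remaining clients reuse GPUs in later rounds.
    """
    return {f"site-{i + 1}": gpus[i % len(gpus)] for i in range(num_clients)}

mapping = assign_gpus(7, [0, 1, 2, 3])
for client, gpu in mapping.items():
    print(f"{client} -> GPU {gpu}")
# site-1..site-4 land on GPUs 0..3; site-5..site-7 reuse GPUs 0..2
```

The key point is that only 4 clients are ever active at once, which is why 4 GPUs suffice for 7 clients in the simulator.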
-
Python version (`python3 -V`): 3.8
NVFlare version (`python3 -m pip list | grep "nvflare"`): 2.4
NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, `git branch`): 2.3
Operating system: Ubuntu 20.04
Have you successfully run any of the following examples?
Please describe your question
Hello everyone, I'm relatively new to NVFlare and have encountered a specific challenge while trying to run my TensorFlow model on more than four GPUs. Currently, I can simulate 4 clients with automatic allocation, one client per GPU, and it works seamlessly. However, when attempting to run 5 or more clients, I get the same error as here:
https://github.com/NVIDIA/NVFlare/discussions/2159
My question is regarding the implementation of the CUDA_VISIBLE_DEVICES environment variable mentioned by @yhwen. I'm unsure about where and how to define this variable. Do I need to make any changes in the source code for this to take effect?
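For what it's worth, a common way to use CUDA_VISIBLE_DEVICES without changing any source code is to set it in the process environment before the GPU framework is imported, e.g. in a per-client launch wrapper. This is a minimal, framework-agnostic sketch; the GPU index shown is an assumed example value, not something prescribed by NVFlare:

```python
import os

# Pin this process to a single GPU. This must happen before TensorFlow
# (or any other CUDA-using library) is imported, because the visible
# device list is read at initialization time.
gpu_index = "2"  # assumed example value; in practice, pass it per client
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_index

# import tensorflow as tf  # TF would now see exactly one GPU, as device 0
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints "2"
```

Equivalently, the variable can be set on the command line when starting a client process, e.g. `CUDA_VISIBLE_DEVICES=2 python3 client_script.py` (client_script.py is a hypothetical name for whatever starts the client).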
Thank you in advance!