@Gozilasim thanks for the interest and the question. Our GPUResourceManager implementation uses this method to determine the number of GPUs: https://github.com/NVIDIA/NVFlare/blob/main/nvflare/fuel/utils/gpu_utils.py#L60
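For context, here is a minimal sketch of how GPU-count detection via `nvidia-smi` typically works. The exact NVFlare implementation is in the linked `gpu_utils.py`; the command, parsing, and fallback behavior below are illustrative assumptions, not the library's actual code:

```python
import subprocess

def count_gpus_from_listing(listing):
    # "nvidia-smi --list-gpus" prints one "GPU <n>: ..." line per device.
    return len([ln for ln in listing.splitlines() if ln.strip().startswith("GPU")])

def detect_gpu_count():
    """Number of GPUs visible to nvidia-smi, or 0 if the tool is absent or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # On platforms where nvidia-smi is missing or unsupported (reportedly
        # the case on some Jetson setups), this style of detection yields 0 GPUs.
        return 0
    return count_gpus_from_listing(out.stdout)
```

If detection returns 0 on a host, configuring `num_of_gpus: 1` there would naturally trigger a "num_of_gpus specified (1) exceeds available GPUs: 0" style error.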
Python version (python3 -V): 3.8.10
NVFlare version (python3 -m pip list | grep "nvflare"): 2.5.0
NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, git branch): 2.5
Operating system: Ubuntu 20.04.6 LTS
Have you successfully run any of the following examples?
Please describe your question
Hello, I am trying to run the cifar10-real-world example. I successfully connected the server with the clients, but when I submit the job via the admin console or the .sh script, the server prints this message:
DefaultJobScheduler - INFO - [identity=secure_project, run=?]: Try to schedule job c8c22eab-9c9c-404a-9716-ae8389948583, get result: (not enough sites have enough resources (ok sites 0 < min sites 1)).
My device is a Jetson AGX Orin. When I set resources.json for each site with num_of_gpus = 1, it fails with this error:
ConfigError: ConfigError: Error processing '/mnt/5958b632-9b58-4b1e-9b06-aaa3d99c9dcd/cifar10/cifar10-real-world/workspaces/secure_workspace/site-1/startup/../startup/fed_client.json' in element '{"id": "resource_manager", "path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager", "args": {"num_of_gpus": 1, "mem_per_gpu_in_GiB": 1}}': path: 'components.#1', exception: 'ValueError: num_of_gpus specified (1) exceeds available GPUs: 0.'
Should num_of_gpus be the GPU id or the number of available GPUs? I also asked ChatGPT for help; it said the GPUResourceManager's GPU detection method is not suitable for Jetson and suggested a workaround. With that workaround I can set num_of_gpus = 1 in each site's resources.json, but when I submit the job again I still get the same result: no site has enough resources. Why is that?
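For what it's worth, the scheduler message above indicates that no connected site's advertised resources satisfied the job's resource_spec. A hedged sketch of that kind of per-site check (illustrative only, not NVFlare's actual DefaultJobScheduler logic):

```python
def site_satisfies(required, available):
    """A site qualifies only if, for every requirement in the job's
    resource_spec, it advertises at least that much of the resource."""
    return all(available.get(key, 0) >= value for key, value in required.items())

# site-1's spec asks for 1 GPU; if GPU detection on that client reported 0,
# the check fails for every site -> "ok sites 0 < min sites 1".
required = {"num_of_gpus": 1, "mem_per_gpu_in_GiB": 1}
available = {"num_of_gpus": 0}
print(site_satisfies(required, available))  # False
```

Under this reading, the job stays unscheduled as long as the client-side resource manager reports fewer GPUs than the job's resource_spec demands.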
This is my meta.json:
{
"name": "cifar10_fedavg_stream_tb_alpha1.0",
"resource_spec": {
"site-1": {
"num_of_gpus": 1,
"mem_per_gpu_in_GiB": 1
},
"site-2": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-3": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-4": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-5": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-6": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-7": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-8": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
}
},
"deploy_map": {
"cifar10_fedavg_stream_tb": [
"@ALL"
]
},
"min_clients": 1
}
And this is my resources.json:
{
"format_version": 2,
"client": {
"retry_timeout": 30,
"compression": "Gzip"
},
"components": [
{
"id": "resource_manager",
"path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager",
"args": {
"num_of_gpus": 1,
"mem_per_gpu_in_GiB": 8
}
},
{
"id": "resource_consumer",
"path": "nvflare.app_common.resource_consumers.gpu_resource_consumer.GPUResourceConsumer",
"args": {}
}
]
}
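As a quick sanity check, you can load a resources.json-style config and compare the configured num_of_gpus against whatever GPU count detection reports on that host. The helper below is a hypothetical sketch (the function name and the detection value you pass in are assumptions, not NVFlare APIs):

```python
import json
from typing import List

def check_resource_config(config_text, detected_gpus) -> List[str]:
    """Return a list of problems found in a resources.json-style config,
    given the number of GPUs actually detected on this host."""
    cfg = json.loads(config_text)
    problems = []
    for comp in cfg.get("components", []):
        args = comp.get("args", {})
        if "num_of_gpus" in args and args["num_of_gpus"] > detected_gpus:
            problems.append(
                "%s: num_of_gpus=%d exceeds detected GPUs (%d)"
                % (comp.get("id", "?"), args["num_of_gpus"], detected_gpus)
            )
    return problems
```

Running this against the resources.json above with 0 detected GPUs would flag the resource_manager component, matching the ValueError in the question; num_of_gpus is a count of GPUs to manage, not a GPU id.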
Please help, thank you very much.