@Gozilasim thanks for the interest and the question. Our GPUResourceManager implementation uses this method to determine the number of GPUs: https://github.com/NVIDIA/NVFlare/blob/main/nvflare/fuel/utils/gpu_utils.py#L60
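For context, here is a minimal sketch of how GPU-count detection via `nvidia-smi` typically works. The exact NVFlare implementation is in the linked `gpu_utils.py`; the command, parsing, and fallback behavior below are illustrative assumptions, not the library's actual code:

```python
import subprocess

def count_gpus_from_listing(listing):
    # "nvidia-smi --list-gpus" prints one "GPU <n>: ..." line per device.
    return len([ln for ln in listing.splitlines() if ln.strip().startswith("GPU")])

def detect_gpu_count():
    """Number of GPUs visible to nvidia-smi, or 0 if the tool is absent or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # On platforms where nvidia-smi is missing or unsupported (reportedly
        # the case on some Jetson setups), this style of detection yields 0 GPUs.
        return 0
    return count_gpus_from_listing(out.stdout)
```

If detection returns 0 on a host, configuring `num_of_gpus: 1` there would naturally trigger a "num_of_gpus specified (1) exceeds available GPUs: 0" style error.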
Python version (python3 -V): 3.8.10
NVFlare version (python3 -m pip list | grep "nvflare"): 2.5.0
NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, git branch): 2.5
Operating system: Ubuntu 20.04.6 LTS
Have you successfully run any of the following examples?
Please describe your question
Hello, I am trying to run the cifar10-real-world example. I successfully connected the server with the clients, but when I submit the job via the admin console or the .sh script, the server prints this message:
DefaultJobScheduler - INFO - [identity=secure_project, run=?]: Try to schedule job c8c22eab-9c9c-404a-9716-ae8389948583, get result: (not enough sites have enough resources (ok sites 0 < min sites 1)).
My device is a Jetson AGX Orin. When I set resources.json for each site with num_of_gpus = 1, it fails with this error:
ConfigError: ConfigError: Error processing '/mnt/5958b632-9b58-4b1e-9b06-aaa3d99c9dcd/cifar10/cifar10-real-world/workspaces/secure_workspace/site-1/startup/../startup/fed_client.json' in element '{"id": "resource_manager", "path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager", "args": {"num_of_gpus": 1, "mem_per_gpu_in_GiB": 1}}': path: 'components.#1', exception: 'ValueError: num_of_gpus specified (1) exceeds available GPUs: 0.'
Should num_of_gpus be the GPU id or the number of available GPUs? I also asked ChatGPT for help; it said the GPUResourceManager's GPU detection method is not suitable for Jetson and suggested a workaround. With that workaround I can set num_of_gpus = 1 in each site's resources.json, but when I submit the job again I still get the same result: no site has enough resources. Why is that?
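For what it's worth, the scheduler message above indicates that no connected site's advertised resources satisfied the job's resource_spec. A hedged sketch of that kind of per-site check (illustrative only, not NVFlare's actual DefaultJobScheduler logic):

```python
def site_satisfies(required, available):
    """A site qualifies only if, for every requirement in the job's
    resource_spec, it advertises at least that much of the resource."""
    return all(available.get(key, 0) >= value for key, value in required.items())

# site-1's spec asks for 1 GPU; if GPU detection on that client reported 0,
# the check fails for every site -> "ok sites 0 < min sites 1".
required = {"num_of_gpus": 1, "mem_per_gpu_in_GiB": 1}
available = {"num_of_gpus": 0}
print(site_satisfies(required, available))  # False
```

Under this reading, the job stays unscheduled as long as the client-side resource manager reports fewer GPUs than the job's resource_spec demands.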
This is my meta.json:
{
"name": "cifar10_fedavg_stream_tb_alpha1.0",
"resource_spec": {
"site-1": {
"num_of_gpus": 1,
"mem_per_gpu_in_GiB": 1
},
"site-2": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-3": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-4": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-5": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-6": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-7": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
},
"site-8": {
"num_of_gpus": 0,
"mem_per_gpu_in_GiB": 1
}
},
"deploy_map": {
"cifar10_fedavg_stream_tb": [
"@ALL"
]
},
"min_clients": 1
}
And this is my resources.json:
{
"format_version": 2,
"client": {
"retry_timeout": 30,
"compression": "Gzip"
},
"components": [
{
"id": "resource_manager",
"path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager",
"args": {
"num_of_gpus": 1,
"mem_per_gpu_in_GiB": 8
}
},
{
"id": "resource_consumer",
"path": "nvflare.app_common.resource_consumers.gpu_resource_consumer.GPUResourceConsumer",
"args": {}
}
]
}
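As a quick sanity check, you can load a resources.json-style config and compare the configured num_of_gpus against whatever GPU count detection reports on that host. The helper below is a hypothetical sketch (the function name and the detection value you pass in are assumptions, not NVFlare APIs):

```python
import json
from typing import List

def check_resource_config(config_text, detected_gpus) -> List[str]:
    """Return a list of problems found in a resources.json-style config,
    given the number of GPUs actually detected on this host."""
    cfg = json.loads(config_text)
    problems = []
    for comp in cfg.get("components", []):
        args = comp.get("args", {})
        if "num_of_gpus" in args and args["num_of_gpus"] > detected_gpus:
            problems.append(
                "%s: num_of_gpus=%d exceeds detected GPUs (%d)"
                % (comp.get("id", "?"), args["num_of_gpus"], detected_gpus)
            )
    return problems
```

Running this against the resources.json above with 0 detected GPUs would flag the resource_manager component, matching the ValueError in the question; num_of_gpus is a count of GPUs to manage, not a GPU id.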
Please help, thank you very much.