[Q&A] Problems with 'config_fed_client.json' processing! #2710
-
@nunziosorrentino thanks for your interest. Are your server and client processes running on the same machine? How do you run your simulator command? Can you share it, along with the command and steps that produce the result above? It does seem that a missing path or a missing dependency is causing the problem.
-
Dear @YuanTingHsieh, thank you again. In my job directory there are 5 apps, some previously used for other tests and real applications. These are shown by typing ls -all in the job directory:
(base) nsorrentino@rampage:~/work/nvflare-for-federated-learning/DEMO/chest-xray_classifier/poc/jobs$ ls -all
The configuration files that I sent to you are in the chest-xray_fedavg directory, and this is the structure:
(base) nsorrentino@rampage:
During the real federation, I explicitly send the job/chest-xray_fedavg directory, which is also located on the client and server. No, the code of "chest-xray_classifier" is placed outside the job/chest-xray_fedavg directory, and its structure is shown here:
(base) nsorrentino@rampage:~/work/nvflare-for-federated-learning/DEMO/chest-xray_classifier$ ls -all
As you can see, the structure follows the example provided in https://github.com/NVIDIA/NVFlare/tree/2.2/examples/cifar10. The jobs directory is inside the poc directory.
Yes, I tried "nvflare simulator jobs/chest_x-ray_fedavg --workspace output_1client_test --threads 1 --n_clients 1" (sorry for the previous error, the real name of the job is chest_x-ray_fedavg, not chest_x-ray_binary_class) on both the client and the server, and both runs work perfectly. I attach a file with the output of the real federation (the one that does not read the client configuration file correctly) and the outputs of the simulations on the client and server sites, which show no client initialization error.
Do you think this could be a machine configuration problem? In that case I should also have seen an error in the simulation run on the server, but instead the simulation runs as smoothly as on the client.
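Since the simulator runs succeed while the deployed client fails, one thing worth comparing between the two environments is which file Python actually resolves for the learner module. The following is a minimal sketch, assuming the module name chest_xray_learner taken from the "path" value in config_fed_client.json; the helper name locate_module is hypothetical and not part of NVFlare:

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file that `import name` would load, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the module has no usable spec.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Module name assumed from the "path" value in config_fed_client.json.
    print("learner module:", locate_module("chest_xray_learner"))
    print("sys.path:", sys.path)
```

Running this once in the shell used for "nvflare simulator" and once in the environment that starts the real client would show whether both resolve the same file, or whether a different module with the same name is picked up first.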
-
Python version (python3 -V): 3.8
NVFlare version (python3 -m pip list | grep "nvflare"): 2.2.5
NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, git branch): 2.2
Operating system: Ubuntu 18.04
Have you successfully run any of the following examples?
Please describe your question
I was running a job on a real federation with 1 server and 1 client (just for testing; this will be extended to other clients) and an error occurred whose cause I do not understand.
At the start of the first round, the client executor raises an error while processing the client configuration JSON file. This is the log:
2024-07-17 11:42:01,756 - ClientEngine - INFO - Starting client app. rank: 0
2024-07-17 11:42:01,765 - ProcessExecutor - INFO - Worker child process ID: 31787
2024-07-17 11:42:02,984 - worker_process - INFO - Worker_process started.
2024-07-17 11:42:04,457 - worker_process - ERROR - FL client execution exception: Config error in ['/workspace/DEMO/chest-xray_classifier/poc/workspaces/mise_federation_project/iit_1/startup/../e8f15c02-1dc3-4fe3-93c2-819ebb22d147/app_iit_1/config/config_fed_client.json']:
Error processing ['/workspace/DEMO/chest-xray_classifier/poc/workspaces/mise_federation_project/iit_1/startup/../e8f15c02-1dc3-4fe3-93c2-819ebb22d147/app_iit_1/config/config_fed_client.json'] in JSON element {'id': 'chest-xray-learner', 'path': 'chest_xray_learner.Xray_Chests_Learner', 'args': {'aggregation_epochs': 4, 'lr': 0.001, 'batch_size': 64, 'unbalanced': True, 'sigmoid': True}}: path: components.#1, exception: too many values to unpack (expected 2)
2024-07-17 11:42:04,457 - FederatedClient - INFO - Shutting down client: iit_1
2024-07-17 11:42:16,787 - ProcessExecutor - INFO - run (e8f15c02-1dc3-4fe3-93c2-819ebb22d147): waiting for child worker process to finish.
2024-07-17 11:42:16,788 - ProcessExecutor - INFO - run (e8f15c02-1dc3-4fe3-93c2-819ebb22d147): child worker process finished with execution code: -9
and so the federation stops. It seems to be a decoding error related to the "path" argument of the learner that I wrote for local training. This is strange because when I run the same configuration file, with the same learner and the same "path" value, using "nvflare simulator", no error is returned, so I think the problem is not related to the PYTHONPATH definition (I checked it and it is correct). I saw that "worker_process" is not used during simulation, so could this be related to a bug in that NVFlare function? Has this problem been solved in a newer release? Any further help will be appreciated.
I attach the configuration files that I use for the federation; let me know if you need more information.
config_fed_client.json
config_fed_server.json
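The log reports only the final message ("too many values to unpack (expected 2)") and not where it was raised. One way to see the full traceback might be to build each component by hand in the client's Python environment. The sketch below is an assumption-laden stand-in, not NVFlare's own component builder: it assumes the config has a top-level "components" list whose entries carry "id", "path" and "args" (as the JSON element in the log suggests), and the function names are hypothetical. Note that it really instantiates each class, so any side effects of the learner's constructor will occur.

```python
import importlib
import json
import sys
import traceback


def split_class_path(path):
    """Split 'pkg.module.ClassName' into ('pkg.module', 'ClassName')."""
    module_name, _, class_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {path!r}")
    return module_name, class_name


def try_build_components(config):
    """Instantiate every component; map its id to None (ok) or a traceback string."""
    results = {}
    for comp in config.get("components", []):
        comp_id = comp.get("id", "<no id>")
        try:
            module_name, class_name = split_class_path(comp["path"])
            cls = getattr(importlib.import_module(module_name), class_name)
            cls(**comp.get("args", {}))
            results[comp_id] = None
        except Exception:
            # Keep the whole traceback so the failing line is visible.
            results[comp_id] = traceback.format_exc()
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py path/to/config_fed_client.json
    with open(sys.argv[1]) as f:
        for comp_id, error in try_build_components(json.load(f)).items():
            print(comp_id, "->", "ok" if error is None else error)
```

If the traceback points into the learner's own module, the unpacking error comes from that code running in the deployed environment; if it points elsewhere, that narrows the search toward the environment or the framework.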