[Q&A] Problems with 'config_fed_client.json' processing! #2710

nunziosorrentino · 2024-07-17T10:23:40Z

nunziosorrentino
Jul 17, 2024

Python version (`python3 -V`)

3.8

NVFlare version (`python3 -m pip list | grep "nvflare"`)

2.2.5

NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, `git branch`)

2.2

Operating system

Ubuntu 18.04

Have you successfully run any of the following examples?

hello-numpy-sag with simulator
hello-pt with simulator
hello-numpy-sag with POC
hello-pt with POC

Please describe your question

I was running a job on a real federation with 1 server and 1 client (just for testing; this will be extended to other clients) when an error occurred whose cause I do not understand.

At the start of the first round, the client raises an error while processing the client configuration JSON file. This is the log:

2024-07-17 11:42:01,756 - ClientEngine - INFO - Starting client app. rank: 0
2024-07-17 11:42:01,765 - ProcessExecutor - INFO - Worker child process ID: 31787
2024-07-17 11:42:02,984 - worker_process - INFO - Worker_process started.
2024-07-17 11:42:04,457 - worker_process - ERROR - FL client execution exception: Config error in ['/workspace/DEMO/chest-xray_classifier/poc/workspaces/mise_federation_project/iit_1/startup/../e8f15c02-1dc3-4fe3-93c2-819ebb22d147/app_iit_1/config/config_fed_client.json']:
Error processing ['/workspace/DEMO/chest-xray_classifier/poc/workspaces/mise_federation_project/iit_1/startup/../e8f15c02-1dc3-4fe3-93c2-819ebb22d147/app_iit_1/config/config_fed_client.json'] in JSON element {'id': 'chest-xray-learner', 'path': 'chest_xray_learner.Xray_Chests_Learner', 'args': {'aggregation_epochs': 4, 'lr': 0.001, 'batch_size': 64, 'unbalanced': True, 'sigmoid': True}}: path: components.#1, exception: too many values to unpack (expected 2)
2024-07-17 11:42:04,457 - FederatedClient - INFO - Shutting down client: iit_1
2024-07-17 11:42:16,787 - ProcessExecutor - INFO - run (e8f15c02-1dc3-4fe3-93c2-819ebb22d147): waiting for child worker process to finish.
2024-07-17 11:42:16,788 - ProcessExecutor - INFO - run (e8f15c02-1dc3-4fe3-93c2-819ebb22d147): child worker process finished with execution code: -9

and so the federation stops. It seems to be a decoding error related to the "path" argument of the learner that I wrote for local training. This is strange, because when I run the same configuration file with the same learner and the same "path" value using "nvflare simulator", it does not return any error, so I do not think the problem is related to the PYTHONPATH definition (I checked it and it is correct). I saw that "worker_process" is not used during simulation, so could this be related to a bug in that NVFlare module? Has this problem been fixed in a newer release? Any help would be appreciated.

I attach the configuration files that I use for the federation. Let me know if you need more information.

config_fed_client.json
config_fed_server.json

YuanTingHsieh · 2024-07-17T20:07:00Z

YuanTingHsieh
Jul 17, 2024
Maintainer

@nunziosorrentino thanks for the interest.

Are your server and client processes running on the same machine?

How do you run your simulator command? Can you share it?

Can you also share the commands / steps you used to get this result?

It does seem that a missing path or a missing dependency is causing the problem.


nunziosorrentino Jul 18, 2024
Author

@YuanTingHsieh thank you again for the quick response.

No, the server and client are on two different machines.

The command that I use for simulation is:
nvflare simulator jobs/chest_x-ray_binary_class --workspace output_1client_test --threads 1 --n_clients 1

where chest_x-ray_binary_class is the simulation job directory containing config_fed_client.json and config_fed_server.json.

In the real federation, I first start the server with the command:
"./${workspace}/${servername}/startup/start.sh"

where workspace is the directory containing the server and admin console startup kits, and servername is the server's DNS name.

Then, on the client machine, I run:
"${workspace}/${site_name}/startup/start.sh"
Here workspace contains the client startup kits, and site_name is the name under which the client machine was registered on the NVFlare dashboard.

After these two steps, the tokens are exchanged without any error. This is the output log of the client:
PYTHONPATH is /workspace/DEMO/chest-xray_classifier:
FED_PYPATH IS /workspace/DEMO/chest-xray_classifier
STARTING iit_1 CLIENT
WORKSPACE set to /workspace/DEMO/chest-xray_classifier/poc/workspaces//<site_name>/startup/..
PYTHONPATH is /local/custom:/workspace/DEMO/chest-xray_classifier:
start fl because of no pid.fl
new pid 21315
Waiting for SP....
2024-07-18 09:34:38,340 - FederatedClient - INFO - Got the new primary SP: :8002
2024-07-18 09:34:39,406 - FederatedClient - INFO - Successfully registered client:<site_name> for project . Token:1cadbb61-797a-460e-af8c-cc38dc28d500 SSID:ebc6125d-0a56-4688-9b08-355fe9e4d61a

This is what appears on the server:
2024-07-18 07:34:39,396 - ClientManager - INFO - Client: New client <site_name>@<client_ip> joined. Sent token: 1cadbb61-797a-460e-af8c-cc38dc28d500. Total clients: 1

Then, once I run the job, the client returns the log I gave you in the first message. Now I notice that PYTHONPATH is set to /local/custom:/workspace/DEMO/chest-xray_classifier: when I start the client, but if I type "echo $PYTHONPATH" it returns "/workspace/DEMO/chest-xray_classifier:"!

Could it be that the PYTHONPATH variable is modified while the job is running?

Let me know if you need other information and thank you so much for the help.

YuanTingHsieh Jul 18, 2024
Maintainer

@nunziosorrentino thanks for more info.

How many "apps" are in your job? Are the server and client using the same "app"?
Did you put your code "chest-xray_classifier" in the app's custom folder, or did you put it on your system?
Do that same code "chest-xray_classifier" and the network exist on both your server machine and your client machine?
Can you try "nvflare simulator jobs/chest_x-ray_binary_class --workspace output_1client_test --threads 1 --n_clients 1" on the server machine and see if it runs? Also try the same command on the client machine.
Can you run ls -all in your job directory so we can see its structure?

When NVFlare runs a job, it inserts "/local/custom" and the job's custom folder into PYTHONPATH so that it is easier for users to import their own code from the custom folder.
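
To illustrate why the PYTHONPATH printed by the client can differ from the one shown in the shell, here is a minimal sketch of prepending directories to PYTHONPATH for a child process only. It is not NVFlare's implementation: `child_env` is a hypothetical helper, and the values are taken from the log above.

```python
import os


def child_env(parent_env, extra_dirs):
    """Return a copy of parent_env with extra_dirs prepended to PYTHONPATH."""
    env = dict(parent_env)  # copy, so the parent's environment is left untouched
    existing = env.get("PYTHONPATH", "")
    parts = list(extra_dirs) + ([existing] if existing else [])
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env


# Value as seen in the user's shell (assumed from the log above).
shell_env = {"PYTHONPATH": "/workspace/DEMO/chest-xray_classifier:"}
job_env = child_env(shell_env, ["/local/custom"])

print(job_env["PYTHONPATH"])    # extended value, visible only to the child
print(shell_env["PYTHONPATH"])  # original value, unchanged
```

A dictionary like `job_env` could be passed as `env=` to `subprocess.Popen`. The shell that launched the client keeps its own value, which would be consistent with `echo $PYTHONPATH` showing the shorter path.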

nunziosorrentino · 2024-07-22T09:31:15Z

nunziosorrentino
Jul 22, 2024
Author

Dear @YuanTingHsieh, thank you again.

In my jobs directory there are 5 apps, some previously used for other tests and real applications. Running ls -all in the jobs directory shows them:

(base) nsorrentino@rampage:~/work/nvflare-for-federated-learning/DEMO/chest-xray_classifier/poc/jobs$ ls -all
total 28
drwxr-xr-x 7 nsorrentino domain^users 4096 giu 28 12:27 .
drwxr-xr-x 5 nsorrentino domain^users 4096 lug 22 10:36 ..
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 4 17:41 chest-xray_4-labels_fedavg
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 17 11:52 chest-xray_fedavg
drwxr-xr-x 3 nsorrentino domain^users 4096 ago 1 2023 chest-xray_fedavg_he
drwxr-xr-x 3 nsorrentino domain^users 4096 mag 3 10:49 gan_trained_aggregation
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 14 2023 test

The configuration files that I sent you are in the chest-xray_fedavg directory, which has this structure:

(base) nsorrentino@rampage:~/work/nvflare-for-federated-learning/DEMO/chest-xray_classifier/poc/jobs/chest-xray_fedavg$ ls -all
total 20
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 17 11:52 .
drwxr-xr-x 7 nsorrentino domain^users 4096 giu 28 12:27 ..
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 14 2023 chest-xray_fedavg
-rw-r--r-- 1 nsorrentino domain^users 467 lug 17 11:52 meta.json
-rw-r--r-- 1 nsorrentino domain^users 463 lug 14 2023 meta.json.default
(base) nsorrentino@rampage:~/work/nvflare-for-federated-learning/DEMO/chest-xray_classifier/poc/jobs/chest-xray_fedavg/chest-xray_fedavg$ ls -all
total 12
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 14 2023 .
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 17 11:52 ..
drwxr-xr-x 2 nsorrentino domain^users 4096 lug 17 11:52 config
(base) nsorrentino@rampage:~/work/nvflare-for-federated-learning/DEMO/chest-xray_classifier/poc/jobs/chest-xray_fedavg/chest-xray_fedavg/config$ ls -all
total 16
drwxr-xr-x 2 nsorrentino domain^users 4096 lug 17 11:52 .
drwxr-xr-x 3 nsorrentino domain^users 4096 lug 14 2023 ..
-rw-r--r-- 1 nsorrentino domain^users 1709 lug 17 11:52 config_fed_client.json
-rw-r--r-- 1 nsorrentino domain^users 2527 lug 17 11:52 config_fed_server.json

During the real federation, I explicitly submit the jobs/chest-xray_fedavg directory, which is also present on the client and server.

No, the "chest-xray_classifier" code is located outside the jobs/chest-xray_fedavg directory, and its structure is shown here:

(base) nsorrentino@rampage:~/work/nvflare-for-federated-learning/DEMO/chest-xray_classifier$ ls -all
total 553204
drwxr-xr-x 12 nsorrentino domain^users 4096 lug 22 10:36 .
drwxr-xr-x 4 nsorrentino domain^users 4096 lug 3 15:54 ..
-rw-r--r-- 1 nsorrentino domain^users 843 mar 13 18:01 check.py
-rw-r--r-- 1 nsorrentino domain^users 5034 lug 4 17:17 chest_xray_cnn.py
-rw-r--r-- 1 nsorrentino domain^users 12693 lug 17 10:26 chest_xray_datamanager.py
-rw-r--r-- 1 nsorrentino domain^users 23915 lug 17 10:32 chest_xray_learner.py
-rw-r--r-- 1 nsorrentino domain^users 7554 lug 9 11:45 chest_xray_splitter.py
drwxr-xr-x 2 nsorrentino domain^users 4096 lug 14 2023 compare_methods
-rw-r--r-- 1 nsorrentino domain^users 115 lug 22 10:36 dirs_list.obj
-rw-r--r-- 1 nsorrentino domain^users 2074 mar 13 12:20 distribution_builder.py
-rw-r--r-- 1 nsorrentino domain^users 459 lug 17 10:05 __init__.py
-rw-r--r-- 1 nsorrentino domain^users 283134292 mag 27 16:21 init-weights-lo.pkl
-rw-r--r-- 1 nsorrentino domain^users 283134295 mag 27 16:21 init-weights-vp.pkl
-rw-r--r-- 1 nsorrentino domain^users 1068 mag 21 09:35 main.py
-rw-r--r-- 1 nsorrentino domain^users 8200 lug 9 11:00 pid_list.obj
drwxr-xr-x 5 nsorrentino domain^users 4096 lug 22 10:36 poc
drwxr-xr-x 2 root root 4096 lug 1 13:48 __pycache__
-rw-r--r-- 1 nsorrentino domain^users 411 lug 14 2023 README.md
drwxr-xr-x 5 nsorrentino domain^users 4096 giu 10 14:34 run
drwxr-xr-x 5 nsorrentino domain^users 4096 mag 8 18:18 run_site-1
drwxr-xr-x 5 nsorrentino domain^users 4096 apr 16 17:35 run_site-2
drwxr-xr-x 5 nsorrentino domain^users 4096 apr 16 17:39 run_site-3
drwxr-xr-x 5 nsorrentino domain^users 4096 apr 16 17:42 run_site-4
-rw-r--r-- 1 nsorrentino domain^users 10819 nov 28 2023 save_model.py
-rw-r--r-- 1 nsorrentino domain^users 187 mar 7 11:04 script.py

As you can see the structure follows the example provided in: https://github.com/NVIDIA/NVFlare/tree/2.2/examples/cifar10. In the poc directory there is the jobs directory!

Yes, I tried "nvflare simulator jobs/chest_x-ray_fedavg --workspace output_1client_test --threads 1 --n_clients 1" (sorry for the previous error; the real name of the job is chest_x-ray_fedavg, not chest_x-ray_binary_class) on both the client and the server, and it works perfectly on both. I attach a file with the output of the real federation (the run that does not correctly read the client configuration file) and the outputs of the simulations on the client and server sites, which show no client initialization error.
real_and_sim_output_federation.txt

Do you think that could be a machine configuration problem? In this case, I should have also seen an error in the simulation run on the server, but instead the simulation runs as smoothly as on the client.


nunziosorrentino Jul 22, 2024
Author

PS @YuanTingHsieh To make the outputs of ls -all more readable, I created a file with all these outputs in order:
ls-all.docx

YuanTingHsieh Jul 22, 2024
Maintainer

@nunziosorrentino thanks for these logs.

Could you try this:

Inside your app create a custom folder
Put all the custom code inside that custom folder

So the job folder will be something like:

chest-xray_fedavg\
        chest-xray_fedavg\
            config\
                config_fed_server.json
                config_fed_client.json
            custom\
                chest_xray_learner.py
                chest_xray_cnn.py
        meta.json

Then try both the simulator and the real-world federation with this job.
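
Before submitting, a quick check that every custom component referenced in config_fed_client.json has a matching file under custom/ can catch layout mistakes. This is a rough sketch, assuming the config has a top-level "components" list whose entries carry a "path" (as in the error log earlier in this thread). `missing_custom_modules` is a hypothetical helper, not part of NVFlare; it only looks for single-file modules and skips anything under the `nvflare.` package.

```python
import json
from pathlib import Path


def missing_custom_modules(app_dir):
    """List component paths whose module file is not found in <app_dir>/custom."""
    app_dir = Path(app_dir)
    config_file = app_dir / "config" / "config_fed_client.json"
    config = json.loads(config_file.read_text())

    missing = []
    for component in config.get("components", []):
        path = component.get("path", "")
        if not path or path.startswith("nvflare."):
            continue  # built-in components are not expected in custom/
        # "pkg.module.ClassName" -> "pkg/module.py"
        module_name = path.rsplit(".", 1)[0]
        module_file = app_dir / "custom" / (module_name.replace(".", "/") + ".py")
        if not module_file.exists():
            missing.append(path)
    return missing
```

For the layout above, `missing_custom_modules("jobs/chest-xray_fedavg/chest-xray_fedavg")` should return an empty list once chest_xray_learner.py is in the custom folder.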

nunziosorrentino Jul 29, 2024
Author

Dear @YuanTingHsieh, thank you for your help, and sorry for the late response.

In the meantime, I was modifying the chest_xray_learner.Xray_Chests_Learner class and, unexpectedly, after these changes the real federation no longer raised the unpacking error while reading the client configuration file. I only modified some SQL queries to a database of data pointers, so I do not understand why it works now. The only other thing I noticed, looking at the git history, is that the server modified the config file when transmitting it to the client, removing spaces to fix the indentation. It doesn't seem like a big deal to me, but maybe you can tell me whether this could be the cause.

YuanTingHsieh Jul 29, 2024
Maintainer

Hi @nunziosorrentino thanks for getting back.

The removal of spaces should not be a problem.

My guess is that the problem was a logic error somewhere in the constructor of this chest_xray_learner.Xray_Chests_Learner class.
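
As an illustration of how a constructor bug can surface as a configuration error, the sketch below uses a simplified, hypothetical loader (not NVFlare's actual code) that resolves a component's "path" to a class, instantiates it with "args", and wraps any exception. `ConfigError`, `build_component`, `BrokenLearner`, the `demo_learner` module, and the `db_row` argument are all invented for the example.

```python
import importlib
import sys
import types


class ConfigError(Exception):
    """Hypothetical stand-in for a framework's configuration error."""


def build_component(element):
    """Resolve element["path"] to a class and instantiate it with element["args"]."""
    module_name, class_name = element["path"].rsplit(".", 1)
    try:
        cls = getattr(importlib.import_module(module_name), class_name)
        return cls(**element.get("args", {}))
    except Exception as exc:
        # Any failure, including one raised inside __init__, is reported
        # against the JSON element, which can look like a parsing problem.
        raise ConfigError(
            f"Error processing JSON element {element}: exception: {exc}"
        ) from exc


class BrokenLearner:
    """Hypothetical learner whose constructor has an unpacking bug."""

    def __init__(self, lr=0.001, db_row=("id", "pointer", "label")):
        self.lr = lr
        # Bug: the row has three fields but only two names are unpacked.
        self.key, self.pointer = db_row


# Register the class under an importable module name so the example is self-contained.
_demo = types.ModuleType("demo_learner")
_demo.BrokenLearner = BrokenLearner
sys.modules["demo_learner"] = _demo

element = {"id": "learner", "path": "demo_learner.BrokenLearner", "args": {"lr": 0.001}}
try:
    build_component(element)
except ConfigError as err:
    error_message = str(err)
    print(error_message)
```

Instantiating the class directly in a Python shell, for example `BrokenLearner(lr=0.001)`, raises the original `ValueError` with a full traceback that points at the offending line in the constructor.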
