-
Hi @m-volk, thanks for the question! Your observation is correct: the reason is that the workflow ends BEFORE all records have been sent.
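For example (this is only a generic illustration of the race, not NVFlare's actual code): an asynchronous sender whose queue is abandoned when the main workflow returns will silently drop whatever is still queued.

```python
# Generic illustration only -- NOT NVFlare's implementation.
import queue
import threading
import time

event_queue: "queue.Queue[int]" = queue.Queue()

def background_sender():
    """Drain the queue slowly, simulating per-record network latency."""
    while True:
        record = event_queue.get()
        time.sleep(0.01)
        print(f"delivered record {record}")

threading.Thread(target=background_sender, daemon=True).start()

# The "workflow": enqueue 100 records and finish immediately.
for i in range(100):
    event_queue.put(i)

# The process exits here; the daemon sender thread dies with it and any
# records still sitting in the queue are lost.
```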
We will work on an enhancement later that will make this easier for all users.
-
When using experiment tracking, I notice that every once in a while messages get lost. The underlying problem seems to be that events are not propagated. The problem exists in the latest main (commit d152605). Can you advise on a good way to avoid this? My actual intent is to send log messages from the clients to the server (my code is similar to the experiment-tracking code), and I want to make sure the messages arrive.
To reproduce the problem, add the following at line 154 of NVFlare/examples/advanced/experiment-tracking/pt/learner_with_tb.py.
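Something along these lines (a minimal sketch; the tag name "lost_messages_test" is my own placeholder, and self.writer is the analytics writer the learner already uses for add_scalar):

```python
# Pasted into the learner's training code (learner_with_tb.py, line 154):
# send 100 scalar events back to back with no delay between them.
for i in range(100):
    self.writer.add_scalar("lost_messages_test", i, i)
```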
Reduce the number of epochs to 1 at line 30 of examples/advanced/experiment-tracking/tensorboard/jobs/tensorboard-streaming/app/config/config_fed_client.json. Run the example NVFlare/examples/advanced/experiment-tracking/tensorboard/jobs/tensorboard-streaming in the simulator (see the example's instructions). Then explore the TensorBoard results from Python, e.g. with the sketch below.
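To read the streamed scalars back, something like this works (the log directory is an assumption; point it at wherever the simulator wrote the server-side tb_events):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Path is an example -- adjust to the simulator workspace used for the run.
ea = EventAccumulator("/tmp/nvflare/simulate_job/tb_events")
ea.Reload()

for tag in ea.Tags()["scalars"]:
    steps = sorted(e.step for e in ea.Scalars(tag))
    print(tag, steps)
```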
In an example run, I would expect to find all numbers from 0 to 99, but many are missing.
When I wait a bit between submissions of analytics information (e.g. 0.01 s), no messages get lost. I did so by adding the following to learner_with_tb.py, line 154.
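Roughly like this (same sketch as above, with a short pause after each event; 0.01 s is the value I tested with):

```python
import time

for i in range(100):
    self.writer.add_scalar("lost_messages_test", i, i)
    time.sleep(0.01)  # brief pause so each event can be propagated before the next
```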