Fixed gather method to work with distributed training in metrics.py and loss.py #450

Closed
wants to merge 2 commits

Conversation

@mstapelberg commented Jul 16, 2024

Description

In the DDP branch, the gather method calls a .get_state() method that does not exist in the torch_runstats package. I implemented a simple workaround that directly accesses the state of the running statistics (_state) and the number of samples (_n).

In the metrics.py version of gather, I also explicitly ensure the tensors are on the same device before accumulating the state.
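
For reference, here is a minimal sketch of the idea (not the literal diff). It assumes the torch_runstats RunningStats object keeps its running value in `_state` and its sample counts in `_n`, and that the two broadcast against each other so per-rank states can be combined as a count-weighted sum:

```python
# Illustrative sketch only, not the actual patch. `stat` is assumed to be a
# torch_runstats RunningStats whose private buffers `_state` (running value)
# and `_n` (sample counts) hold the per-rank state.
import torch
import torch.distributed as dist


def gather_stat(stat, device: torch.device) -> torch.Tensor:
    # Move both buffers onto a common device before reducing, since different
    # ranks may hold them on different CUDA devices (or on the CPU).
    state = stat._state.to(device)
    n = stat._n.to(device)

    # Count-weighted sum across ranks, then renormalize to a global mean.
    weighted = state * n
    dist.all_reduce(weighted, op=dist.ReduceOp.SUM)
    dist.all_reduce(n, op=dist.ReduceOp.SUM)
    return weighted / n.clamp(min=1)
```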

Motivation and Context

Resolves: #449

How Has This Been Tested?

I tested my changes by comparing the output metrics from minimal.yaml and minimal_distributed.yaml.
aspirin_pr_test.txt
distributed_aspirin_pr_test.txt

Training time with 2 GPUs (Quadro RTX6000) was 248 seconds with minimal_distributed.txt:

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  248.091     0.01      0.00573      0.00573         1.76          2.4
! Validation         50  248.091     0.01      0.00808      0.00808         2.01         2.85
Wall time: 248.0915699005127
! Best model       50    0.008
! Stop training: max epochs
Wall time: 248.14249654859304
Cumulative wall time: 248.14249654859304

Training time with 1 GPU (Quadro RTX6000) was 566 seconds with minimal.txt:

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  566.047     0.01      0.00684      0.00684         1.95         2.62
! Validation         50  566.047     0.01      0.00963      0.00963         2.23         3.11
Wall time: 566.0472663491964
! Stop training: max epochs
Wall time: 566.0749344825745
Cumulative wall time: 566.0749344825745

I tried another training run with 1 GPU (aspirin_pr_test_t2.txt):

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  461.492     0.01      0.00684      0.00684         1.95         2.62
! Validation         50  461.492     0.01      0.00963      0.00963         2.23         3.11
Wall time: 461.4917606860399
! Stop training: max epochs
Wall time: 461.5175565779209
Cumulative wall time: 461.5175565779209

Using the state-reduce branch of torch_runstats (minimal_distributed_fixed_runstats.txt):

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  258.864     0.01      0.00573      0.00573         1.76          2.4
! Validation         50  258.864     0.01      0.00808      0.00808         2.01         2.85
Wall time: 258.8642472475767
! Best model       50    0.008
! Stop training: max epochs
Wall time: 258.9035141095519
Cumulative wall time: 258.9035141095519

The configs were set up such that distributed_batch_size / world_size = batch_size. I'm not sure what changed between the first and second single-GPU tests; perhaps I should repeat the trainings a few times to compare the losses and wall times.
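
For concreteness, the relationship between the two configs, with hypothetical numbers (the real values are in the attached yaml files):

```python
# Hypothetical values for illustration; the actual numbers live in the
# attached minimal.yaml / minimal_distributed.yaml configs.
world_size = 2            # number of GPUs (ranks)
batch_size = 5            # per-rank batch size in the single-GPU run
distributed_batch_size = batch_size * world_size  # global batch in the DDP run

# The condition the configs were set up to satisfy:
assert distributed_batch_size // world_size == batch_size
```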

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds or improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

  • My code follows the code style of this project and has been formatted using black.
  • All new and existing tests passed, including on GPU (if relevant).
  • I have added tests that cover my changes (if relevant).
  • The option documentation (docs/options) has been updated with new or changed options.
  • I have updated CHANGELOG.md.
  • I have updated the documentation (if relevant).

@kavanase (Contributor)

Hi @mstapelberg!
For the DDP version of nequip, the state-reduce branch of torch_runstats needs to be used, as it has the get_state() methods defined. This should definitely be made clearer; I ran into the same issue too (the ddp branch is still under development, so the documentation/instructions are still a work in progress).

I'm not sure whether this is a better way of implementing it, though. @Linux-cpp-lisp will know!

@mstapelberg (Author)

Hi @kavanase, thanks for your reply and help! I'll give that a go; it's likely a much better implementation than what I hacked together. I'll try the new pytorch_runstats branch now.

@mstapelberg (Author) commented Jul 16, 2024

Hi @kavanase, I gave the updated pytorch_runstats a go, and it works (I've updated my initial pull request). However, when I try it on a more realistic problem, the following error occurs (full log in distributed_error_log_2gpus_vcrti_100.txt; config in distributed_vcrti_config.txt):

training
# Epoch batch         loss       loss_f       loss_e        f_mae       f_rmse     Ti_f_mae      V_f_mae     Cr_f_mae  psavg_f_mae    Ti_f_rmse     V_f_rmse    Cr_f_rmse psavg_f_rmse        e_mae      e/N_mae
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 119, in main
    trainer.train()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 841, in train
    self.epoch_step()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 997, in epoch_step
    self.batch_step(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 874, in batch_step
    out = self.model(data_for_loss)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 10
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 119, in main
    trainer.train()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 841, in train
    self.epoch_step()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 997, in epoch_step
    self.batch_step(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 874, in batch_step
    out = self.model(data_for_loss)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 10
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1196340) of binary: /home/myless/.mambaforge/envs/allegro-ddp/bin/python3.10
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-16_14:11:51
  host      : gpu-rtx6000-04.psfc.mit.edu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1196341)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-16_14:11:51
  host      : gpu-rtx6000-04.psfc.mit.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1196340)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
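
As the traceback itself suggests, one possible mitigation (not verified here) is to enable unused-parameter detection where the trainer wraps the model in DistributedDataParallel; a minimal sketch, assuming that wrapping happens in code you control:

```python
# Minimal sketch only -- not the actual nequip trainer code.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    # find_unused_parameters=True lets DDP tolerate parameters that receive no
    # gradient in a given forward/backward pass, at some extra communication
    # cost per iteration.
    return DDP(
        model.to(local_rank),
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,
    )
```

Enabling this hides, rather than explains, why parameter index 10 received no gradient, so it is worth identifying that parameter first.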

I'm fairly new to GitHub, so would it be best to open another issue for this? Or is this something you have seen before as well?

Thanks!
Myles

@kavanase (Contributor)

Hmm, ok, I haven't seen that error yet in my tests of the ddp branch (though I am using nequip ddp rather than allegro ddp, so I'm not sure whether that also causes some differences). There's discussion in the linked issue (#210) on using the ddp branches, so that might help solve this?

Maybe it's worth trying that with the unedited ddp and state_reduce branches of nequip & torch_runstats to see if it works? Then we can close this PR and keep the discussion in #210?
