Fixed gather method to work with distributed training in metrics.py and loss.py #450

Closed
wants to merge 2 commits

Conversation

@mstapelberg commented Jul 16, 2024

Description

In the DDP branch, the gather method calls a .get_state() method that does not exist in the torch_runstats package. I implemented a simple workaround that directly accesses the state of the running statistics (_state) and the number of samples (_n).

In the metrics.py version of gather, I also explicitly ensure the tensors are on the same device before accumulating the state.
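
For reference, here is a minimal sketch of the idea (not the literal diff). It assumes the torch_runstats RunningStats object keeps its running value in `_state` and its sample counts in `_n`, and that the two broadcast against each other so per-rank states can be combined as a count-weighted sum:

```python
# Illustrative sketch only, not the actual patch. `stat` is assumed to be a
# torch_runstats RunningStats whose private buffers `_state` (running value)
# and `_n` (sample counts) hold the per-rank state.
import torch
import torch.distributed as dist


def gather_stat(stat, device: torch.device) -> torch.Tensor:
    # Move both buffers onto a common device before reducing, since different
    # ranks may hold them on different CUDA devices (or on the CPU).
    state = stat._state.to(device)
    n = stat._n.to(device)

    # Count-weighted sum across ranks, then renormalize to a global mean.
    weighted = state * n
    dist.all_reduce(weighted, op=dist.ReduceOp.SUM)
    dist.all_reduce(n, op=dist.ReduceOp.SUM)
    return weighted / n.clamp(min=1)
```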

Motivation and Context

Resolves: #449

How Has This Been Tested?

I tested my changes by comparing the output metrics from minimal.yaml and minimal_distributed.yaml.
aspirin_pr_test.txt
distributed_aspirin_pr_test.txt

Training time with 2 GPUs (Quadro RTX6000) was 248 seconds with minimal_distributed.txt:

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  248.091     0.01      0.00573      0.00573         1.76          2.4
! Validation         50  248.091     0.01      0.00808      0.00808         2.01         2.85
Wall time: 248.0915699005127
! Best model       50    0.008
! Stop training: max epochs
Wall time: 248.14249654859304
Cumulative wall time: 248.14249654859304

Training time with 1 GPU (Quadro RTX6000) was 566 seconds with minimal.txt:

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  566.047     0.01      0.00684      0.00684         1.95         2.62
! Validation         50  566.047     0.01      0.00963      0.00963         2.23         3.11
Wall time: 566.0472663491964
! Stop training: max epochs
Wall time: 566.0749344825745
Cumulative wall time: 566.0749344825745

I tried another training run with 1 GPU (aspirin_pr_test_t2.txt):

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  461.492     0.01      0.00684      0.00684         1.95         2.62
! Validation         50  461.492     0.01      0.00963      0.00963         2.23         3.11
Wall time: 461.4917606860399
! Stop training: max epochs
Wall time: 461.5175565779209
Cumulative wall time: 461.5175565779209

Using the state-reduce branch of torch_runstats (minimal_distributed_fixed_runstats.txt):

  Train      #    Epoch      wal       LR       loss_f         loss        f_mae       f_rmse
! Train              50  258.864     0.01      0.00573      0.00573         1.76          2.4
! Validation         50  258.864     0.01      0.00808      0.00808         2.01         2.85
Wall time: 258.8642472475767
! Best model       50    0.008
! Stop training: max epochs
Wall time: 258.9035141095519
Cumulative wall time: 258.9035141095519

The configs were set up such that distributed_batch_size / world_size = batch_size. I'm not sure what changed between the first and second single-GPU tests; perhaps I should repeat the trainings a few times to compare the losses and wall times.
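
For concreteness, the relationship between the two configs, with hypothetical numbers (the real values are in the attached yaml files):

```python
# Hypothetical values for illustration; the actual numbers live in the
# attached minimal.yaml / minimal_distributed.yaml configs.
world_size = 2            # number of GPUs (ranks)
batch_size = 5            # per-rank batch size in the single-GPU run
distributed_batch_size = batch_size * world_size  # global batch in the DDP run

# The condition the configs were set up to satisfy:
assert distributed_batch_size // world_size == batch_size
```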

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds or improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

  • My code follows the code style of this project and has been formatted using black.
  • All new and existing tests passed, including on GPU (if relevant).
  • I have added tests that cover my changes (if relevant).
  • The option documentation (docs/options) has been updated with new or changed options.
  • I have updated CHANGELOG.md.
  • I have updated the documentation (if relevant).

@kavanase (Contributor)

Hi @mstapelberg!
For the DDP version of nequip, the state-reduce branch of torch_runstats needs to be used, as it has the get_state() methods defined. This should definitely be made clearer; I ran into the same issue too (the ddp branch is still under development, so the documentation/instructions are still a work in progress).

I'm not sure whether this is a better way of implementing it, though. @Linux-cpp-lisp will know!

@mstapelberg (Author)

Hi @kavanase, thanks for your reply and help! I'll give that a go; it's likely a much better implementation than what I hacked together. I'll try the new pytorch_runstats branch now.

@mstapelberg (Author) commented Jul 16, 2024

Hi @kavanase, I gave the updated pytorch_runstats a go, and it works (I've updated my initial pull request). However, when I try it on a more realistic problem, the following error occurs (full log in distributed_error_log_2gpus_vcrti_100.txt; config in distributed_vcrti_config.txt):

training
# Epoch batch         loss       loss_f       loss_e        f_mae       f_rmse     Ti_f_mae      V_f_mae     Cr_f_mae  psavg_f_mae    Ti_f_rmse     V_f_rmse    Cr_f_rmse psavg_f_rmse        e_mae      e/N_mae
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 119, in main
    trainer.train()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 841, in train
    self.epoch_step()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 997, in epoch_step
    self.batch_step(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 874, in batch_step
    out = self.model(data_for_loss)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 10
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 119, in main
    trainer.train()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 841, in train
    self.epoch_step()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 997, in epoch_step
    self.batch_step(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 874, in batch_step
    out = self.model(data_for_loss)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 10
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1196340) of binary: /home/myless/.mambaforge/envs/allegro-ddp/bin/python3.10
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-16_14:11:51
  host      : gpu-rtx6000-04.psfc.mit.edu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1196341)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-16_14:11:51
  host      : gpu-rtx6000-04.psfc.mit.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1196340)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
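
As the traceback itself suggests, one possible mitigation (not verified here) is to enable unused-parameter detection where the trainer wraps the model in DistributedDataParallel; a minimal sketch, assuming that wrapping happens in code you control:

```python
# Minimal sketch only -- not the actual nequip trainer code.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    # find_unused_parameters=True lets DDP tolerate parameters that receive no
    # gradient in a given forward/backward pass, at some extra communication
    # cost per iteration.
    return DDP(
        model.to(local_rank),
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,
    )
```

Enabling this hides, rather than explains, why parameter index 10 received no gradient, so it is worth identifying that parameter first.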

I'm fairly new to GitHub, so would it be best to open another issue for this? Or is this something you have seen before as well?

Thanks!
Myles

@kavanase (Contributor)

Hmm, ok, I haven't seen that error yet in my tests of the ddp branch (though I am using nequip ddp rather than allegro ddp, so I'm not sure whether that also causes some differences). There's discussion in the linked issue (#210) on using the ddp branches, so that might help solve this?

Maybe it's worth trying that with the unedited ddp and state_reduce branches of nequip & torch_runstats to see if it works? Then we can close this PR and keep the discussion in #210?
