🐛 [BUG] 'RunningStats' has no attribute 'get_state'. Did you mean: '_state' ? in both Metrics.gather() and LossStats.gather() #449

Closed
mstapelberg opened this issue Jul 16, 2024 · 1 comment
Labels: bug

mstapelberg commented Jul 16, 2024

Describe the bug
Hi there. It seems that both Metrics and LossStats call a .get_state() method that does not exist on RunningStats. My understanding is that gather() is meant to collect and accumulate the metrics and losses on rank 0 after the optimizer step, for output/results.

In nequip/train/metrics.py:

def gather(self):
    """Use `torch.distributed` to gather and accumulate state of this Metrics across nodes to rank 0."""
    state = (
        dist.get_rank(),
        {
            k1: {k2: rs.get_state() for k2, rs in v1.items()}
            for k1, v1 in self.running_stats.items()
        },
    )
    states = [None for _ in range(dist.get_world_size())]
    dist.all_gather_object(states, state)  # list of dict
    if dist.get_rank() == 0:
        # accumulate on rank 0
        for from_rank, state in states:
            if from_rank == 0:
                # we already have this don't accumulate it
                continue
            for k1, v1 in state.items():
                for k2, rs_state in v1.items():
                    self.running_stats[k1][k2].accumulate_state(rs_state)

In nequip/train/loss.py (lines 190 to 205 at 61b525f):

def gather(self):
    """Use `torch.distributed` to gather and accumulate state of this LossStat across nodes to rank 0."""
    state = (
        dist.get_rank(),
        {k: rs.get_state() for k, rs in self.loss_stat.items()},
    )
    states = [None for _ in range(dist.get_world_size())]
    dist.all_gather_object(states, state)  # list of dict
    if dist.get_rank() == 0:
        # accumulate on rank 0
        for from_rank, state in states:
            if from_rank == 0:
                # we already have this don't accumulate it
                continue
            for k, rs_state in state.items():
                self.loss_stat[k].accumulate_state(rs_state)
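
For reference, here is a minimal sketch of the interface that both gather() implementations appear to assume. The method names get_state() and accumulate_state() are taken from the calling code above; they are not part of the current RunningStats API (which is exactly this bug), and the internal layout (a running sum plus a sample counter) is only an assumption for illustration, not the real torch_runstats implementation.

import torch

class RunningStatsWithGather:
    """Hypothetical accumulator illustrating what gather() expects:
    a picklable snapshot of the local state, and a way to merge a
    snapshot received from another rank."""

    def __init__(self, dim: int = 1):
        self._state = torch.zeros(dim)  # running sum of the tracked quantity
        self._n = 0                     # number of samples accumulated so far

    def accumulate_batch(self, batch: torch.Tensor) -> torch.Tensor:
        # accumulate a batch locally and return its mean
        self._state += batch.sum(dim=0)
        self._n += batch.shape[0]
        return batch.mean(dim=0)

    def get_state(self):
        # snapshot that can travel through dist.all_gather_object (must be picklable)
        return (self._state.clone(), self._n)

    def accumulate_state(self, other_state) -> None:
        # merge a snapshot received from another rank into the local accumulator
        other_sum, other_n = other_state
        self._state += other_sum
        self._n += other_n

    def current_result(self) -> torch.Tensor:
        # mean over everything accumulated locally plus any merged remote state
        return self._state / max(self._n, 1)

In other words, the fix presumably needs either methods like these added to RunningStats, or the gather() code rewritten against whatever state the real RunningStats actually exposes (the error message hints at its _state attribute).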

To Reproduce
When using the ddp branch of nequip:

torchrun --nnodes 1 --nproc_per_node 2 `which nequip-train` configs/minimal_distributed.yaml --distributed

Expected behavior
I would expect these methods to properly gather the loss stats and metrics after the optimizer step. This is definitely easier said than done! I'm not too sure how to validate that the approach works other than by comparing against the single-GPU results for aspirin; a rough stand-alone sanity check is sketched below.
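
As an illustration of such a check (not from the issue itself), a small torchrun script could reproduce the same gather-and-accumulate pattern with plain tensors and compare the merged result on rank 0 against a single-process reference. The script name, shapes, and tolerance below are made up for the sketch.

# check_gather.py (hypothetical test script)
# Run with: torchrun --nnodes 1 --nproc_per_node 2 check_gather.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")
    rank, world = dist.get_rank(), dist.get_world_size()

    # fixed full dataset, identical on every rank
    torch.manual_seed(0)
    full = torch.randn(1000, 3)
    shard = full.chunk(world)[rank]  # each rank accumulates a disjoint shard

    # per-rank accumulator snapshot: (sum, count), same pattern as gather()
    state = (rank, (shard.sum(dim=0), shard.shape[0]))
    states = [None for _ in range(world)]
    dist.all_gather_object(states, state)

    if rank == 0:
        total, n = shard.sum(dim=0), shard.shape[0]
        for from_rank, (s, c) in states:
            if from_rank == 0:
                continue  # rank 0's own shard is already included
            total, n = total + s, n + c
        merged_mean = total / n
        reference = full.mean(dim=0)  # single-process "ground truth"
        assert torch.allclose(merged_mean, reference, atol=1e-5)
        print("gather/accumulate matches single-process result:", merged_mean)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()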

Environment (please complete the following information):

  • OS: Ubuntu
  • python version (python --version) : 3.10.4
  • python environment (commands are given for python interpreter):
    • nequip version (import nequip; nequip.__version__) : 0.61 - ddp branch
    • e3nn version (import e3nn; e3nn.__version__) : 0.51
    • pytorch version (import torch; torch.__version__) : 1.13.0+cu117
  • (if relevant) GPU support with CUDA
    • cuda Version according to nvcc (nvcc --version) :
      Built on Tue_May__3_18:49:52_PDT_2022
      Cuda compilation tools, release 11.7, V11.7.64
      Build cuda_11.7.r11.7/compiler.31294372_0
    • cuda version according to PyTorch (import torch; torch.version.cuda) : 11.7

Additional context
Here is the error message from when I first ran torchrun:

torchrun --nnodes 1 --nproc_per_node 2 `which nequip-train` configs/minimal_distributed.yaml --distributed
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/__init__.py:20: UserWarning: !! PyTorch version 1.13.0+cu117 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/__init__.py:20: UserWarning: !! PyTorch version 1.13.0+cu117 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(
Using `torch.distributed`; this is rank 0/2 (local rank: 0)
Using `torch.distributed`; this is rank 1/2 (local rank: 1)
Torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[21000, 1], batch=[21000], cell=[1000, 3, 3], edge_cell_shift=[220186, 3], edge_index=[2, 220186], forces=[21000, 3], pbc=[1000, 3], pos=[21000, 3], ptr=[1001], total_energy=[1000, 1])
    processed data size: ~9.77 MB
Cached processed data to disk
Done!
Successfully loaded dataset `dataset` of type NpzDataset(1000)...
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "
Replace string dataset_per_atom_total_energy_mean to -19318.260077821487
Atomic outputs are scaled by: [H, C, O: None], shifted by [H, C, O: -19318.260078].
Replace string dataset_forces_rms to 31.499698153708387
Initially outputs are globally scaled by: 31.499698153708387, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 41216
Number of trainable weights: 41216
! Starting training ...

validation
# Epoch batch         loss       loss_f        f_mae       f_rmse
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 119, in main
    trainer.train()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 841, in train
    self.epoch_step()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 1006, in epoch_step
    self.metrics.gather()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 270, in gather
    {
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 271, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 271, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
AttributeError: 'RunningStats' object has no attribute 'get_state'. Did you mean: '_state'?
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 706198 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 706199) of binary: /home/myless/.mambaforge/envs/allegro-ddp/bin/python3.10
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-15_20:16:16
  host      : gpu-rtx6000-04.psfc.mit.edu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 706199)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
mstapelberg (Author)

Closed as per #450 (comment)
