🐛 [BUG] 'RunningStats' has no attribute 'get_state'. Did you mean: '_state' ? in both Metrics.gather() and LossStats.gather() #449

Closed
mstapelberg opened this issue Jul 16, 2024 · 1 comment
Labels: bug

mstapelberg commented Jul 16, 2024

Describe the bug
Hi there. It seems that both Metrics and LossStats call a .get_state() method that does not exist on RunningStats. My understanding is that gather() is meant to collect and accumulate the metrics and losses on rank 0 after the optimizer step, for output/results.

In nequip/train/metrics.py:

def gather(self):
    """Use `torch.distributed` to gather and accumulate state of this Metrics across nodes to rank 0."""
    state = (
        dist.get_rank(),
        {
            k1: {k2: rs.get_state() for k2, rs in v1.items()}
            for k1, v1 in self.running_stats.items()
        },
    )
    states = [None for _ in range(dist.get_world_size())]
    dist.all_gather_object(states, state)  # list of dict
    if dist.get_rank() == 0:
        # accumulate on rank 0
        for from_rank, state in states:
            if from_rank == 0:
                # we already have this don't accumulate it
                continue
            for k1, v1 in state.items():
                for k2, rs_state in v1.items():
                    self.running_stats[k1][k2].accumulate_state(rs_state)

In nequip/train/loss.py (lines 190 to 205 at 61b525f):

def gather(self):
    """Use `torch.distributed` to gather and accumulate state of this LossStat across nodes to rank 0."""
    state = (
        dist.get_rank(),
        {k: rs.get_state() for k, rs in self.loss_stat.items()},
    )
    states = [None for _ in range(dist.get_world_size())]
    dist.all_gather_object(states, state)  # list of dict
    if dist.get_rank() == 0:
        # accumulate on rank 0
        for from_rank, state in states:
            if from_rank == 0:
                # we already have this don't accumulate it
                continue
            for k, rs_state in state.items():
                self.loss_stat[k].accumulate_state(rs_state)
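
For reference, here is a minimal sketch of the interface that both gather() implementations appear to assume. The method names get_state() and accumulate_state() are taken from the calling code above; they are not part of the current RunningStats API (which is exactly this bug), and the internal layout (a running sum plus a sample counter) is only an assumption for illustration, not the real torch_runstats implementation.

import torch

class RunningStatsWithGather:
    """Hypothetical accumulator illustrating what gather() expects:
    a picklable snapshot of the local state, and a way to merge a
    snapshot received from another rank."""

    def __init__(self, dim: int = 1):
        self._state = torch.zeros(dim)  # running sum of the tracked quantity
        self._n = 0                     # number of samples accumulated so far

    def accumulate_batch(self, batch: torch.Tensor) -> torch.Tensor:
        # accumulate a batch locally and return its mean
        self._state += batch.sum(dim=0)
        self._n += batch.shape[0]
        return batch.mean(dim=0)

    def get_state(self):
        # snapshot that can travel through dist.all_gather_object (must be picklable)
        return (self._state.clone(), self._n)

    def accumulate_state(self, other_state) -> None:
        # merge a snapshot received from another rank into the local accumulator
        other_sum, other_n = other_state
        self._state += other_sum
        self._n += other_n

    def current_result(self) -> torch.Tensor:
        # mean over everything accumulated locally plus any merged remote state
        return self._state / max(self._n, 1)

In other words, the fix presumably needs either methods like these added to RunningStats, or the gather() code rewritten against whatever state the real RunningStats actually exposes (the error message hints at its _state attribute).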

To Reproduce
When using the ddp branch of nequip:

torchrun --nnodes 1 --nproc_per_node 2 `which nequip-train` configs/minimal_distributed.yaml --distributed

Expected behavior
I would expect these methods to properly gather the loss stats and metrics after the optimizer step. This is definitely easier said than done! I'm not too sure how to validate that the approach works other than by comparing against the single-GPU results for aspirin; a rough stand-alone sanity check is sketched below.
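
As an illustration of such a check (not from the issue itself), a small torchrun script could reproduce the same gather-and-accumulate pattern with plain tensors and compare the merged result on rank 0 against a single-process reference. The script name, shapes, and tolerance below are made up for the sketch.

# check_gather.py (hypothetical test script)
# Run with: torchrun --nnodes 1 --nproc_per_node 2 check_gather.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")
    rank, world = dist.get_rank(), dist.get_world_size()

    # fixed full dataset, identical on every rank
    torch.manual_seed(0)
    full = torch.randn(1000, 3)
    shard = full.chunk(world)[rank]  # each rank accumulates a disjoint shard

    # per-rank accumulator snapshot: (sum, count), same pattern as gather()
    state = (rank, (shard.sum(dim=0), shard.shape[0]))
    states = [None for _ in range(world)]
    dist.all_gather_object(states, state)

    if rank == 0:
        total, n = shard.sum(dim=0), shard.shape[0]
        for from_rank, (s, c) in states:
            if from_rank == 0:
                continue  # rank 0's own shard is already included
            total, n = total + s, n + c
        merged_mean = total / n
        reference = full.mean(dim=0)  # single-process "ground truth"
        assert torch.allclose(merged_mean, reference, atol=1e-5)
        print("gather/accumulate matches single-process result:", merged_mean)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()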

Environment (please complete the following information):

  • OS: Ubuntu
  • python version (python --version) : 3.10.4
  • python environment (commands are given for python interpreter):
    • nequip version (import nequip; nequip.__version__) : 0.61 - ddp branch
    • e3nn version (import e3nn; e3nn.__version__) : 0.51
    • pytorch version (import torch; torch.__version__) : 1.13.0+cu117
  • (if relevant) GPU support with CUDA
    • cuda Version according to nvcc (nvcc --version) :
      Built on Tue_May__3_18:49:52_PDT_2022
      Cuda compilation tools, release 11.7, V11.7.64
      Build cuda_11.7.r11.7/compiler.31294372_0
    • cuda version according to PyTorch (import torch; torch.version.cuda) : 11.7

Additional context
Here is the error message from when I first ran torchrun:

torchrun --nnodes 1 --nproc_per_node 2 `which nequip-train` configs/minimal_distributed.yaml --distributed
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/__init__.py:20: UserWarning: !! PyTorch version 1.13.0+cu117 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/__init__.py:20: UserWarning: !! PyTorch version 1.13.0+cu117 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(
Using `torch.distributed`; this is rank 0/2 (local rank: 0)
Using `torch.distributed`; this is rank 1/2 (local rank: 1)
Torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[21000, 1], batch=[21000], cell=[1000, 3, 3], edge_cell_shift=[220186, 3], edge_index=[2, 220186], forces=[21000, 3], pbc=[1000, 3], pos=[21000, 3], ptr=[1001], total_energy=[1000, 1])
    processed data size: ~9.77 MB
Cached processed data to disk
Done!
Successfully loaded dataset `dataset` of type NpzDataset(1000)...
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn("The TorchScript type system doesn't support "
Replace string dataset_per_atom_total_energy_mean to -19318.260077821487
Atomic outputs are scaled by: [H, C, O: None], shifted by [H, C, O: -19318.260078].
Replace string dataset_forces_rms to 31.499698153708387
Initially outputs are globally scaled by: 31.499698153708387, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 41216
Number of trainable weights: 41216
! Starting training ...

validation
# Epoch batch         loss       loss_f        f_mae       f_rmse
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 119, in main
    trainer.train()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 841, in train
    self.epoch_step()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 1006, in epoch_step
    self.metrics.gather()
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 270, in gather
    {
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 271, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 271, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
AttributeError: 'RunningStats' object has no attribute 'get_state'. Did you mean: '_state'?
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 706198 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 706199) of binary: /home/myless/.mambaforge/envs/allegro-ddp/bin/python3.10
Traceback (most recent call last):
  File "/home/myless/.mambaforge/envs/allegro-ddp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-15_20:16:16
  host      : gpu-rtx6000-04.psfc.mit.edu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 706199)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
mstapelberg (Author)

Closed as per #450 (comment)
