Describe the bug
Hi there. On the ddp branch of nequip, it seems that both `Metrics` and `LossStats` call a `.get_state()` method that does not exist on `RunningStats`. My interpretation is that the `gather()` method is trying to accumulate the metrics and losses on rank 0 after the optimizer step for output/reporting.
In `nequip/train/metrics.py`: lines 266 to 285 at commit 61b525f
In `nequip/train/loss.py`: lines 190 to 205 at commit 61b525f
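For what it's worth, below is a rough sketch of the kind of accessor pair I assume `gather()` expects. This is only a guess at the intended interface: `get_state`/`set_state` are not part of the `torch_runstats` API as far as I can tell, the only thing I'm relying on is the private `_state` attribute that the AttributeError below hints at, and I haven't checked whether the internal counts would also need to be synchronized for the reduction to be correct.

```python
# Sketch only, not the real torch_runstats API: a guess at the accessor pair that
# Metrics.gather()/LossStats.gather() seem to expect. The only grounded fact is the
# private RunningStats._state attribute hinted at by the AttributeError; whether the
# internal counts also need to be gathered, I don't know.
import torch
from torch_runstats import RunningStats


def _get_state(self: RunningStats) -> torch.Tensor:
    # Expose the internal accumulator tensor so it can be reduced to rank 0.
    return self._state


def _set_state(self: RunningStats, state: torch.Tensor) -> None:
    # Overwrite the internal accumulator with the gathered/reduced tensor.
    self._state = state


# Temporary monkeypatch until RunningStats grows these methods upstream.
RunningStats.get_state = _get_state
RunningStats.set_state = _set_state
```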
Expected behavior
I would expect these methods to properly gather the loss stats and metrics to rank 0 after the optimizer step. This is definitely easier said than done! I'm not sure how to validate that the approach works other than comparing against the single-GPU results for aspirin.
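To make that comparison concrete, something like the sketch below is what I had in mind: diff the per-epoch force metrics of a single-GPU run and a two-GPU DDP run of the same aspirin config. The run directories, the metrics_epoch.csv file name, and the column names are placeholders I haven't verified, not the trainer's actual output layout.

```python
# Rough sketch of the single-GPU vs. DDP sanity check I had in mind. The run
# directories, the metrics_epoch.csv file name, and the column names are
# placeholders; adjust them to whatever the nequip trainer writes for your runs.
import pandas as pd

single = pd.read_csv("results/aspirin/single-gpu/metrics_epoch.csv")
ddp = pd.read_csv("results/aspirin/ddp-2gpu/metrics_epoch.csv")

# Compare force MAE/RMSE (matching the f_mae / f_rmse columns printed in the
# training log) epoch by epoch.
for col in ("f_mae", "f_rmse"):
    diff = (single[col] - ddp[col]).abs()
    print(f"{col}: max per-epoch difference = {diff.max():.6g}")
```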
Environment (please complete the following information):
OS: Ubuntu
python version (python --version) : 3.10.4
python environment (commands are given for python interpreter):
nequip version (import nequip; nequip.__version__) : 0.6.1 (ddp branch)
e3nn version (import e3nn; e3nn.__version__) : 0.5.1
pytorch version (import torch; torch.__version__) : 1.13.0+cu117
(if relevant) GPU support with CUDA
CUDA version according to nvcc (nvcc --version) :
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
cuda version according to PyTorch (import torch; torch.version.cuda) : 11.7
Additional context
Here is the error message from when I first ran torchrun:
torchrun --nnodes 1 --nproc_per_node 2 `which nequip-train` configs/minimal_distributed.yaml --distributed
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/__init__.py:20: UserWarning: !! PyTorch version 1.13.0+cu117 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
warnings.warn(
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/__init__.py:20: UserWarning: !! PyTorch version 1.13.0+cu117 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
warnings.warn(
Using `torch.distributed`; this is rank 0/2 (local rank: 0)
Using `torch.distributed`; this is rank 1/2 (local rank: 1)
Torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[21000, 1], batch=[21000], cell=[1000, 3, 3], edge_cell_shift=[220186, 3], edge_index=[2, 220186], forces=[21000, 3], pbc=[1000, 3], pos=[21000, 3], ptr=[1001], total_energy=[1000, 1])
processed data size: ~9.77 MB
Cached processed data to disk
Done!
Successfully loaded dataset `dataset` of type NpzDataset(1000)...
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
warnings.warn("The TorchScript type system doesn't support "
/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/jit/_check.py:181: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
warnings.warn("The TorchScript type system doesn't support "
Replace string dataset_per_atom_total_energy_mean to -19318.260077821487
Atomic outputs are scaled by: [H, C, O: None], shifted by [H, C, O: -19318.260078].
Replace string dataset_forces_rms to 31.499698153708387
Initially outputs are globally scaled by: 31.499698153708387, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 41216
Number of trainable weights: 41216
! Starting training ...
validation
# Epoch batch loss loss_f f_mae f_rmse
Traceback (most recent call last):
File "/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 119, in main
trainer.train()
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 841, in train
self.epoch_step()
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/trainer.py", line 1006, in epoch_step
self.metrics.gather()
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 270, in gather
{
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 271, in <dictcomp>
k1: {k2: rs.get_state() for k2, rs in v1.items()}
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/nequip/train/metrics.py", line 271, in <dictcomp>
k1: {k2: rs.get_state() for k2, rs in v1.items()}
AttributeError: 'RunningStats' object has no attribute 'get_state'. Did you mean: '_state'?
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 706198 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 706199) of binary: /home/myless/.mambaforge/envs/allegro-ddp/bin/python3.10
Traceback (most recent call last):
File "/home/myless/.mambaforge/envs/allegro-ddp/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/myless/.mambaforge/envs/allegro-ddp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/myless/.mambaforge/envs/allegro-ddp/bin/nequip-train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-15_20:16:16
host : gpu-rtx6000-04.psfc.mit.edu
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 706199)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================