Multi-GPU support exists❓ [QUESTION] #210

Open · JonathanSchmidt1 opened this issue May 11, 2022 · 26 comments

Labels: enhancement (New feature or request)

Comments

@JonathanSchmidt1

We are interested in training nequip potentials on large datasets of several million structures.
Consequently, we wanted to know whether multi-GPU support exists, or whether someone knows if the networks can be integrated into PyTorch Lightning.
Best regards and thank you very much,
Jonathan
P.S.: this might be related to #126

JonathanSchmidt1 added the "question" (Further information is requested) label on May 11, 2022
@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Thanks for your interest in our code/method for your project! Sounds like an interesting application; please feel free to get in touch by email and let us know how it's going (we're always interested to hear about what people are working on using our methods).

Re multi-GPU training: I have a draft branch horovod using the Horovod distributed training framework. This is an in-progress draft, and has only been successfully tested so far for a few epochs on multiple CPUs. The branch is also a little out-of-sync with the latest version, but I will try to merge that back in in the coming days. If you are interested, you are more than welcome to use this branch, with the understanding that you would be acting as a sort of "alpha tester." If you do use the branch, please carefully check any results you get for sanity and against runs with Horovod disabled, and please report any issues/suspicions here or by email. (One disclaimer: the horovod branch is not a development priority for us this summer, so I will likely be slow to respond.) PRs are also welcome, though I appreciate people reaching out to discuss first if the PR involves major development or restructuring.
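
For anyone unfamiliar with Horovod, the standard Horovod-for-PyTorch training pattern (independent of nequip) looks roughly like the sketch below; the linear model and random dataset are toy placeholders, and this is not the branch's actual code:

import torch
import horovod.torch as hvd

hvd.init()                                      # one process per rank
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())     # pin each rank to its own GPU

# Toy stand-ins for the real model and dataset.
model = torch.nn.Linear(8, 1)
if torch.cuda.is_available():
    model = model.cuda()
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 * hvd.size())
# Allreduce gradients across ranks every step, and start all ranks from identical weights.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Each rank trains on a disjoint shard of the data.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank()
)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)

for x, y in loader:
    if torch.cuda.is_available():
        x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()

A script like this would then be launched with something like horovodrun -np 4, one rank per GPU.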

PyTorch Lightning is a lot more difficult to integrate with. Getting a simple training loop going would be easy, but it would use a different configuration file, and integrating it with the full set of important nequip features (correctly calculated and averaged metrics, careful data normalization, EMA, correct global numerical precision and JIT settings, and so on) would be difficult and involve a lot of subtle stumbling blocks we have already dealt with in the nequip code. For this reason I would really recommend against this path unless you want to deal carefully with all of this. (If you do, of course, it would be great if you could share that work!)
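
To be concrete about what "a simple training loop" means here: a bare-bones wrapper like the sketch below is the easy part (LitWrapper and the toy model/data are hypothetical, not nequip code); everything listed above is exactly what it leaves out.

import torch
import pytorch_lightning as pl

class LitWrapper(pl.LightningModule):           # hypothetical wrapper, not part of nequip
    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.model(x), y)
        self.log("train_loss", loss, sync_dist=True)   # averaged across ranks by Lightning
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    data = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = torch.utils.data.DataLoader(data, batch_size=8)
    # Lightning spawns and coordinates the DDP processes itself.
    trainer = pl.Trainer(accelerator="auto", devices="auto", strategy="ddp", max_epochs=1)
    trainer.fit(LitWrapper(torch.nn.Linear(8, 1)), loader)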

Thanks!

@Linux-cpp-lisp
Collaborator

OK, I've merged the latest develop -> horovod, see #211.

@Linux-cpp-lisp
Collaborator

If you try this, please run the Horovod unit tests in tests/integration/test_train_horovod.py and confirm that they (1) are not skipped (i.e., Horovod is installed) and (2) pass.

@JonathanSchmidt1
Author

Thank you very much. I will see how it goes.

@JonathanSchmidt1
Author

As usual, other things got in the way, but I could finally test it.
Running tests/integration/test_train_horovod.py worked.
I also confirmed that normal training on the GPU works (nequip-train configs/minimal.yaml).

Now if I run with --horovod, the training of the first epoch seems fine, but there is a problem with the metrics.
I checked the torch_runstats lib and could not find any get_state; are you maybe using a modified version?

Epoch batch loss loss_f f_mae f_rmse
0 1 1.06 1.06 24.3 32.5
Traceback (most recent call last):
  File "/home/test_user/.conda/envs/nequip2/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/raid/scratch/testuser/nequip/nequip/scripts/train.py", line 87, in main
    trainer.train()
  File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 827, in train
    self.epoch_step()
  File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 991, in epoch_step
    self.metrics.gather()
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 274, in gather
    {
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
AttributeError: 'RunningStats' object has no attribute 'get_state'

@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Surprised that the tests run if the training won't... that sounds like a sign that the tests are broken 😄

Whoops, yes, I forgot to mention: I haven't merged the code I was writing to enable multi-GPU training in torch_runstats yet; you can find it on the branch https://github.com/mir-group/pytorch_runstats/tree/state-reduce.

@JonathanSchmidt1
Author

Thank you, that fixed it for one GPU.
horovodrun -np 1 nequip-train configs/example.yaml --horovod
works now.
If I use two GPUs, we get an error message, as some tensors during the metric evaluation are on the wrong devices.
File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 993, in epoch_step
[1,0]: self.metrics.gather()
[1,0]: File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 288, in gather
[1,0]: self.running_stats[k1][k2].accumulate_state(rs_state)
[1,0]: File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch_runstats/_runstats.py", line 331, in accumulate_state
[1,0]: self._state += n * (state - self._state) / (self._n + n)
[1,0]:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I checked and "n" and "state" are on cuda:1 and "self._state", "self._n" are on cuda:0 .
Not sure how it's supposed to be. Are they all expected to be on cuda:0 for this step or all on their own gpu?
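
For what it's worth, one possible guard would be to move the incoming state onto the destination object's device before merging. A rough sketch (the attribute names _state/_n and the merge formula are taken from the traceback above and may not match the state-reduce branch exactly):

import torch

def accumulate_state_on_device(stats, state: torch.Tensor, n: torch.Tensor):
    # Hypothetical helper around RunningStats.accumulate_state: make sure the
    # gathered state lives on the same device as the local accumulator first.
    state = state.to(stats._state.device)
    n = n.to(stats._n.device)
    # Same running-mean merge as in the traceback, now on a single device.
    stats._state += n * (state - stats._state) / (stats._n + n)
    stats._n += n   # assumption: the sample count is accumulated as well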

@Linux-cpp-lisp
Collaborator

Aha... here's that "this is very untested" 😁 I think PyTorch / Horovod may be too smart for its own good and is reloading transmitted tensors onto different CUDA devices when they are all available to the same host... I will look into this when I get a chance.

@JonathanSchmidt1
Author

That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.

Linux-cpp-lisp added the "enhancement" (New feature or request) label and removed the "question" (Further information is requested) label on Feb 20, 2023
@JonathanSchmidt1
Author

JonathanSchmidt1 commented Mar 22, 2023

I thought reviving the issue might be more convenient than continuing by email.
So here are some quick notes about issues I noticed when testing the ddp branch.

  • Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

  • Sometimes there is a random crash after a few hundred epochs; I have no idea why yet, and it was not reproducible.
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215968 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215970 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215971 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -15) local_rank: 1 (pid: 215969) of binary: /home/test_user/.conda/envs/nequip2/bin/python
    Traceback (most recent call last):
    File "/home/test_user/.conda/envs/nequip2/bin/torchrun", line 8, in
    sys.exit(main())
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
    return f(*args, **kwargs)
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    /home/test_user/.conda/envs/nequip2/bin/nequip-train FAILED
    Failures:
    <NO_OTHER_FAILURES>
    Root Cause (first observed failure):
    [0]:
    time : 2023-03-21_21:38:56
    host : dgx2
    rank : 1 (local_rank: 1)
    exitcode : -15 (pid: 215969)
    error_file: <N/A>
    traceback : Signal 15 (SIGTERM) received by PID 215969
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '

  • At the moment each process seems to load the network on every GPU, e.g. running with 8 GPUs I get this output from nvidia-smi:

    | 0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
    | 0 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
    | 1 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |
    | 2 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    ......

@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Thanks!

Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

Hm yes... this one will be a little nontrivial, since we need to not only prevent wandb init on other ranks but probably also sync the wandb-updated config to the nonzero ranks.
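
A rough sketch of the usual pattern for that (rank 0 initializes wandb, then the possibly-updated config is broadcast to the other ranks; the function name and project string are placeholders, not the eventual nequip fix):

import torch.distributed as dist
import wandb

def init_wandb_rank0(config: dict) -> dict:
    # Assumes the torch.distributed process group is already initialized.
    if dist.get_rank() == 0:
        run = wandb.init(project="nequip", config=config)   # only rank 0 talks to wandb
        config = dict(run.config)                            # pick up any wandb-side updates
    payload = [config]
    dist.broadcast_object_list(payload, src=0)               # ship the final config to all ranks
    return payload[0]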

Sometimes there is a random crash after a few hundred epochs; I have no idea why yet, and it was not reproducible.

Weird... usually when we see something like this it means out-of-memory, or that the cluster's scheduler went crazy.

At the moment each process seems to load the network on every GPU, e.g. running with 8 GPUs I get this output from nvidia-smi:

Not sure exactly what I'm looking at here, but yes, every GPU will get its own copy of the model, as hinted by the name "Distributed Data Parallel".

@JonathanSchmidt1
Author

Out-of-memory errors could make sense and might be connected to the last issue, as with the same batch size per GPU I did not get OOM errors when running on a single GPU.

The output basically says that each worker process uses up memory (most likely a copy of the model) on every GPU; however, with DDP each worker is supposed to have a copy only on its own GPU, and gradient updates are then sent all-to-all. Basically, from previous experience with DDP I would expect the output to look like this:
0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |
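
For reference, the standard recipe for getting that one-process-per-GPU picture with torchrun-launched DDP is to pin each rank to its local GPU before doing any CUDA work and to pass device_ids to the DDP wrapper; a generic sketch with a toy model, not the ddp branch's actual code:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
# Pin this rank to its own GPU *before* creating any CUDA tensors; otherwise
# every rank also opens a context (and allocates memory) on cuda:0.
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 1).to(f"cuda:{local_rank}")   # toy placeholder model
model = DDP(model, device_ids=[local_rank], output_device=local_rank)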

@peastman
Contributor

I'd also be very interested in this feature. I have access to a system with four A100s on each node. Being able to use all four would make training go a lot faster.

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Apr 5, 2023

I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls, there is only one process per GPU and I can scale to 16 GPUs (before, it would run OOM because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but only on the zeroth GPU.
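
In case it is useful for whoever digs into this: one way to keep the metric reduction off GPU 0 is to all-reduce plain tensors that already live on each rank's own device, roughly like the sketch below (an illustration only, not the actual metrics.gather/loss.gather implementation):

import torch
import torch.distributed as dist

def reduce_metric(metric_sum: torch.Tensor, count: torch.Tensor) -> torch.Tensor:
    # Assumes both tensors were created on this rank's own GPU (cuda:LOCAL_RANK);
    # NCCL then communicates device-to-device without touching cuda:0.
    dist.all_reduce(metric_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return metric_sum / count.clamp(min=1)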

@rschireman

Hi all,

Any updates on this feature? I also have some rather large datasets.

@JonathanSchmidt1
Author

Just a small update: as I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small changes it ran without the issues of the ddp version. I also got decent speedups, despite using single-GPU nodes.
N_nodes (1 P100 per node): [1, 2, 4, 8, 16, 32]
Speedup: [1.0, 1.6286277105250644, 3.3867286549788127, 6.642094103901569, 9.572247883815873, 17.38443770824977]
P.S.: I have not yet confirmed whether the loss is the same for different node numbers with Horovod.

@rschireman

Hi @JonathanSchmidt1,

Did you also receive a message like this when using the horovod branch on 2 GPUs:

[1,0]<stderr>:Processing dataset...
[1,1]<stderr>:Processing dataset...

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Oct 27, 2023

The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to process the dataset beforehand and then start the training.
P.S.: I have tested some of the models now and the loss reported during training seems correct.

@sklenard

sklenard commented Feb 9, 2024

Hi,

I am also quite interested in the multi-GPU training capability. I did some tests with the ddp branch using PyTorch 2.1.1 on up to 16 GPUs (4 V100 per node) with a dataset of ~5k configurations. In all my tests I achieved the same results as a single-GPU reference. I was wondering whether this feature is still under active development and whether there is any plan to merge it into the develop branch?

@beidouamg

beidouamg commented Apr 25, 2024

Hi @sklenard,

I am trying to use the multi-GPU feature, but I am having some trouble with it.
I installed the ddp branch with PyTorch 2.1.1 by changing
"torch>=1.8,<=1.12,!=1.9.0",  # torch.fx added in 1.8
to
"torch>=1.8,<=2.1.1,!=1.9.0",  # torch.fx added in 1.8
in setup.py in the nequip folder.

This way, the ddp branch can be installed without any error.
However, when I try to run nequip-train, I get this error:

[W init.cpp:842] Warning: Use _jit_set_fusion_strategy, bailout depth is deprecated. Setting to (STATIC, 2) (function operator())
Traceback (most recent call last):
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 76, in main
    trainer = fresh_start(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 189, in fresh_start
    config = init_n_update(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/utils/wandb.py", line 17, in init_n_update
    wandb.init(
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
    raise e
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1177, in init
    wi.setup(kwargs)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 190, in setup
    self._wl = wandb_setup.setup(settings=setup_settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
    ret = _setup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
    wl = _WandbSetup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
    self._setup()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
    self._setup_manager()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
    self._manager = wandb_manager._Manager(settings=self._settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 139, in __init__
    self._service.start()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 250, in start
    self._launch_server()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 244, in _launch_server
    _sentry.reraise(e)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 242, in _launch_server
    self._wait_for_ports(fname, proc=internal_proc)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 132, in _wait_for_ports
    raise ServiceStartTimeoutError(
wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.

It seems that there is something wrong with wandb.
I wonder how you installed this branch; maybe there is some difference between the version you installed and the one I installed, since more than two months have passed. It would be great if you could recall how you installed it, or share the version you installed.
Thank you very much!

@Linux-cpp-lisp
Collaborator

@beidouamg this looks like a network error unrelated to the ddp branch, but maybe there is a race condition. Have you tried to run without wandb enabled?

@kavanase
Contributor

@JonathanSchmidt1 I'm trying to run multi-GPU testing now using the ddp branch (based on the horovod branch) as this is now under active development. For this:

I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls, there is only one process per GPU and I can scale to 16 GPUs (before, it would run OOM because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but only on the zeroth GPU.
So if you comment out these calls, is it still working as expected? Or were there other changes you made?

You mentioned that you got it working with this, the updated pytorch_runstats, and some other small changes. I'm currently trying to do this, and seem to have multi-GPU training up and running with the ddp branch, but the training seems to be going quite slowly (i.e. with 2 GPUs and batch_size: 4 it's 50% slower than 1 GPU with batch_size: 5; I had to change the batch size to make it divisible by the number of GPUs). If I run nvidia-smi on the compute node I get:

(base) Perlmutter: sean > nvidia-smi
Wed Jul 10 14:15:52 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   34C    P0    82W / 400W |   4063MiB / 40960MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   34C    P0    95W / 400W |   2735MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:82:00.0 Off |                    0 |
| N/A   35C    P0   112W / 400W |   2771MiB / 40960MiB |     70%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   35C    P0    93W / 400W |   2673MiB / 40960MiB |     78%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    736896      C   ...lti_gpu_nequip/bin/python     2740MiB |
|    0   N/A  N/A    736897      C   ...lti_gpu_nequip/bin/python      440MiB |
|    0   N/A  N/A    736898      C   ...lti_gpu_nequip/bin/python      440MiB |
|    0   N/A  N/A    736899      C   ...lti_gpu_nequip/bin/python      440MiB |
|    1   N/A  N/A    736897      C   ...lti_gpu_nequip/bin/python     2732MiB |
|    2   N/A  N/A    736898      C   ...lti_gpu_nequip/bin/python     2768MiB |
|    3   N/A  N/A    736899      C   ...lti_gpu_nequip/bin/python     2670MiB |
+-----------------------------------------------------------------------------+

which seems to be intermediate between what you posted before with horovod (X processes each on each of X GPUs) and what you said should happen (one process each): here I get X processes on GPU 0 and one process on each of the other GPUs.

I tried commenting out the metrics.gather() and loss.gather() methods as you suggested above, but this doesn't seem to have made any difference to the run times or the nvidia-smi output 🤔

@rschireman

@kavanase, I'm also involved in this issue; is there any way you could share your run (or nequip-train) command to get the ddp branch to actually work on multiple GPUs?

@kavanase kavanase self-assigned this Jul 12, 2024
@kavanase
Contributor

kavanase commented Jul 16, 2024

Hi @rschireman, sorry for the delay in replying!
This is the current job script I'm using:

#!/bin/bash
#SBATCH -J Nequip_training_
#SBATCH -C gpu
#SBATCH -q shared
#SBATCH -N 1                                # nodes
#SBATCH --ntasks-per-node=2   # one per GPU
#SBATCH -c 32
#SBATCH --gres=gpu:2              # GPUs per node
#SBATCH -t 0-02:40                          # runtime in D-HH:MM, minimum of 10 minutes
#SBATCH --output=stdout_%j.txt
#SBATCH --error=stderr_%j.txt

master_port=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export MASTER_PORT=$master_port
# - Master node address
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
world_size=$(($SLURM_NTASKS_PER_NODE * $SLURM_NNODES))
export nproc_per_node=$SLURM_NTASKS_PER_NODE

echo "MASTER_ADDR="$master_addr
echo "MASTER_PORT="$master_port
echo "WORLD_SIZE="$world_size
echo "NNODES="$SLURM_NNODES
echo "NODE LIST="$SLURM_JOB_NODELIST
echo "NPROC_PER_NODE="$nproc_per_node
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
echo "PYTHON VERSION=$(python --version)"
ngpu=2  # can later set this to an environment variable

source ~/.bashrc
export LANG=en_US.utf8
export LC_ALL=en_US.utf8
source activate multi_gpu_nequip

source export_DDP_vars.sh
export PYTORCH_VERSION_WARNING=0
torchrun --nnodes 1 --nproc_per_node $ngpu `which nequip-train` nequip*.yaml --distributed

This is running on NERSC Perlmutter, which uses Slurm as the scheduler. I'm not sure which settings here are actually necessary for the job to run, as I'm still in the trial-and-error stage and plan to prune them down to figure out which ones are actually needed once I get some consistency in the jobs running. Some of these choices were motivated by what I read here:

My export_DDP_vars.sh is (slightly modified from the nersc-dl-wandb one):

export RANK=$SLURM_PROCID
export WORLD_RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
export WORLD_SIZE=$SLURM_NTASKS
#export MASTER_PORT=29500 # default from torch launcher
export WANDB_START_METHOD="thread"

This now seems to be mostly up and running, but as mentioned above it currently seems slower than expected, and I'm not sure if the rank distribution shown in the nvidia-smi output is as it should be... Still testing this out.
As noted in #450, the state-reduce branch of pytorch_runstats also currently needs to be used with the ddp branch. In the links above it is also recommended to use srun rather than torchrun; this was causing issues for me at first, but I will try switching back to srun to see if I can get it working properly.

Currently I'm seeing some runs fail apparently at random; these are some of the error outputs I'm getting:

[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING]
[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING] *****************************************
[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, pl
ease further tune the variable for optimal performance in your application as needed.
[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING] *****************************************
Using `torch.distributed`; this is rank 0/2 (local rank: 0)
Using `torch.distributed`; this is rank 1/2 (local rank: 1)
Torch device: cuda
Number of weights: 809016
Number of trainable weights: 809016
Traceback (most recent call last):
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/scripts/train.py", line 113, in main
    trainer = restart(config)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/scripts/train.py", line 372, in restart
    trainer = Trainer.from_dict(dictionary)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 697, in from_dict
    trainer = cls(model=model, **dictionary)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 412, in __init__
    self.init()
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 785, in init
    self.init_objects()
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 431, in init_objects
    self.model = torch.nn.parallel.DistributedDataParallel(self.model)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1708025845868/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
[2024-07-11 03:48:46,175] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 947123) of binary: /global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/python
Traceback (most recent call last):
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

or

[2024-07-10 14:49:44,231] torch.distributed.run: [WARNING] *****************************************
[2024-07-10 14:49:44,231] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-10 14:49:44,231] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
    self._rendezvous(worker_group)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 542, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/__init__.py:22: UserWarning: !! PyTorch version 2.2.1 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(pytorch_version_warning)
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:18164 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:18164 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/scripts/train.py", line 79, in main
    _init_distributed(config.distributed)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/utils/_global_options.py", line 128, in _init_distributed
    dist.init_process_group(backend=distributed, timeout=timedelta(hours=2))  # TODO: Should dynamically set this, just for processing part?
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:18164 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:18164 (errno: 98 - Address already in use).
srun: error: nid001089: tasks 0,3: Exited with exit code 1
srun: Terminating StepId=27928164.0
slurmstepd: error: *** STEP 27928164.0 ON nid001089 CANCELLED AT 2024-07-11T00:33:51 ***
srun: error: nid001089: task 1: Exited with exit code 1
srun: error: nid001089: task 2: Terminated
srun: Force Terminated StepId=27928164.0

Final notes for posterity:

  • I was getting some failures when I had empty processed_data_dir_...s present (from previous crashed runs); I will try to fix this in the code in future.
  • As mentioned in the comment above, I commented out the gather() method calls in nequip (ddp) as suggested by @JonathanSchmidt1, though I'm not sure if this breaks something else? If @Linux-cpp-lisp has a chance at some point, he might be able to comment on this.

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Jul 16, 2024

Hi, I honestly forgot most of the issues with the ddp branch and would probably need a few hours of free time to figure out what was going on again, but as mentioned, with the horovod branch most of the issues went away. I got great scaling even on really outdated nodes (Piz Daint, 1 P100 per node). Is it an option for you to use the horovod branch?

This is my Slurm submission script for horovod:

#!/bin/bash -l
#SBATCH --job-name=test_pt_hvd
#SBATCH --time=02:00:00
##SBATCH --nodes=$1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --constraint=gpu
#SBATCH --account=s1128
#SBATCH --partition=normal
#SBATCH --output=test_pt_hvd_%j.out

module load daint-gpu PyTorch
cd $SLURM_SUBMIT_DIR
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=ipogif0
export NCCL_IB_CUDA_SUPPORT=1
srun nequip-train ETO_$SLURM_NNODES.yaml

@cw-tan
Collaborator

cw-tan commented Nov 22, 2024

Just a note that it is possible to do DDP training on the develop branch; the implementation is based on PyTorch Lightning and torchmetrics.
