[QUESTION] Multi-GPU support? #210

We are interested in training NequIP potentials on large datasets of several million structures. Consequently, we wanted to know whether multi-GPU support exists, or whether someone knows if the networks can be integrated into PyTorch Lightning.

Best regards and thank you very much,
Jonathan

PS: this might be related to #126

Comments
Hi @JonathanSchmidt1, thanks for your interest in our code/method for your project! It sounds like an interesting application; please feel free to get in touch by email and let us know how it's going (we're always interested to hear about what people are working on using our methods).

Re multi-GPU training: I have a draft branch using Horovod. PyTorch Lightning is a lot more difficult to integrate with: getting a simple training loop going would be easy, but it would use a different configuration file, and integrating it with the full set of important training features would be more work.

Thanks!
OK, I've merged the latest changes into the Horovod branch.
If you try this, please run the Horovod unit tests.
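A hedged sketch of how such a test run might be launched (the `tests/` path and the `-k` filter are assumptions about the branch layout, not taken from it):

```bash
# Hypothetical invocation: run the Horovod-related tests under two Horovod workers.
# The tests/ directory and the -k filter are placeholders; adjust to the branch's layout.
horovodrun -np 2 python -m pytest tests/ -k horovod
```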
Thank you very much. I will see how it goes.
As usual, other things got in the way, but I could finally test it. Now, if I run with `--horovod`, the training of the first epoch seems fine, but there is a problem with the metrics reported in the `Epoch / batch / loss / loss_f / f_mae / f_rmse` table.
Hi @JonathanSchmidt1, I'm surprised that the tests run if the training won't... that sounds like a sign that the tests are broken 😄

Whoops, yes, I forgot to mention: I haven't merged the code I was writing to enable multi-GPU support in `pytorch_runstats` yet.
Thank you, that fixed it for one GPU. I checked, and `n` and `state` are on `cuda:1`, while `self._state` and `self._n` are on `cuda:0`.
Aha... here's that "this is very untested" 😁 I think PyTorch / Horovod may be too smart for its own good and is reloading transmitted tensors onto different CUDA devices when they are all visible to the same host... I will look into this when I get a chance.
That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.
I thought reviving the issue might be more convenient than continuing by email.
Hi @JonathanSchmidt1, thanks!

Hm, yes... this one will be a little nontrivial, since we'd need to not only prevent...

Weird... usually when we see something like this, it means out-of-memory, or that the cluster's scheduler went crazy.

Not sure exactly what I'm looking at here, but yes, every GPU will get its own copy of the model, as hinted at by the name "Distributed Data Parallel".
Out-of-memory errors could make sense and might be connected to the last issue, as with the same batch size per GPU I did not produce OOM errors when running on a single GPU. The output basically says that each worker process uses up memory (most likely a copy of the model) on every GPU; however, with DDP each worker is supposed to have a copy only on its own GPU, and gradient updates are then sent all-to-all between workers. Basically, from previous experience with DDP, I would expect the output to show exactly one process per GPU.
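For illustration, a minimal sketch of the kind of single-node launch that should give that one-process-per-GPU layout with DDP (the `torchrun` launcher and the config path are assumptions; the ddp branch may be launched differently):

```bash
# Hypothetical single-node DDP launch: torchrun starts one worker per GPU,
# and each worker is expected to bind only to the device given by LOCAL_RANK.
# The nequip-train entry point is resolved via `which`; the config path is a placeholder.
torchrun --nnodes=1 --nproc_per_node=4 "$(which nequip-train)" configs/example.yaml
```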
I'd also be very interested in this feature. I have access to a system with four A100s on each node. Being able to use all four would make training go a lot faster.
I spent some time debugging the issue, and it seems that the `metrics.gather` and `loss.gather` calls cause the extra processes to spawn. If I remove these calls, there is only one process per GPU and I can scale to 16 GPUs (before, it would run OOM because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but only on the zeroth GPU.
Hi all, any updates on this feature? I also have some rather large datasets.
Just a small update: as I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small changes it ran without the issues of the ddp version. I also got decent speedups, despite using single-GPU nodes.
Did you also receive a message like this when using the horovod branch on 2 GPUs (the dataset-processing message printed once per worker)?
The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to process the dataset beforehand and then start the training.
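A hedged sketch of that two-step approach (assuming the processed dataset is cached on disk by the first run and reused afterwards; the config path and worker count are placeholders, while the `--horovod` flag is the one mentioned earlier in this thread):

```bash
# Hypothetical two-step launch: build the processed-dataset cache in a single process,
# then start the multi-GPU Horovod run so every worker only reads the existing cache.
nequip-train configs/example.yaml            # stop this run once dataset processing has finished
horovodrun -np 2 nequip-train configs/example.yaml --horovod
```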
Hi, I am also quite interested in the multi-GPU training capability. I did some tests with the ddp branch using PyTorch 2.1.1 on up to 16 GPUs (4 V100s per node) with a dataset of ~5k configurations. In all my tests I achieved the same results as a single-GPU reference. I was wondering whether this feature is still under active development and whether there is any plan to merge it into the develop branch?
Hi @sklenard, I am trying to utilize the multi-GPU feature, but I have some trouble with it. This way, the ddp branch can be installed without any error. However, it seems that there is something wrong with wandb.
@beidouamg this looks like a network error unrelated to the ddp branch.
@JonathanSchmidt1 I'm trying to run multi-GPU testing now using the ddp branch. You mentioned that you got it working with this and the updated runstats version, which seems to be intermediate between what you posted before. I tried commenting out the `metrics.gather` and `loss.gather` calls as well.
@kavanase, I'm also involved in this issue. Is there any way you could share your run (or submission) scripts?
Hi @rschireman, sorry for the delay in replying!
This is running on NERSC Perlmutter, which uses Slurm as the scheduler. I'm not sure which settings here are actually necessary for the job to run, as I'm still in the trial-and-error stage and plan to prune them down to figure out which ones are actually needed once I get some consistency in the jobs running. Some of these choices were motivated by suggestions I read elsewhere.
My Slurm submission script:
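A rough sketch of what such a Perlmutter-style multi-GPU Slurm job could look like (every directive, queue, and path below is an assumption, and how the ddp branch picks up rank information from Slurm is likewise assumed, not confirmed):

```bash
#!/bin/bash
# Hypothetical Perlmutter-style Slurm job: one task per GPU on a single 4-GPU node.
# Constraint, queue, time limit, and config path are placeholders.
#SBATCH --constraint=gpu
#SBATCH --qos=regular
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00

# Rendezvous info for PyTorch distributed (assuming env:// initialization).
export MASTER_ADDR=$(hostname)
export MASTER_PORT=29500

srun nequip-train configs/example.yaml
```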
This now seems to be mostly up and running, but as mentioned above it currently seems slower than expected, and I'm not sure whether the rank distribution shown in the logs is correct. Currently I'm seeing some runs failing apparently randomly, with a couple of different error outputs.
Final notes for posterity:
Hi, I honestly forgot most of the issues with the ddp branch and would probably need a few hours of free time to figure out what was going on again, but as mentioned, with the horovod branch most of the issues went away. I got great scaling even on really outdated nodes (Piz Daint, 1 P100 per node). Is it an option for you to use the horovod branch? This would be my submission script in Slurm for horovod:
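A reconstruction of roughly what such a script could look like (the node count, constraint, time limit, environment setup, and config path are placeholders, and launching Horovod through `srun`/MPI with one rank per single-GPU node is an assumption):

```bash
#!/bin/bash -l
# Hypothetical Slurm job for the horovod branch on a cluster with one GPU per node:
# one MPI rank per node; Horovod picks up the ranks from the MPI launch via srun.
# Node count, constraint, time limit, and config path are placeholders.
#SBATCH --job-name=nequip-hvd
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --time=12:00:00

# Environment setup (modules, conda/venv activation) is site-specific and omitted here.

srun nequip-train configs/example.yaml --horovod
```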
Just a note that it is possible to do