Can I train on multiple machines? #45

Open
ElizavetaSedova opened this issue May 12, 2023 · 2 comments
Labels: question (Further information is requested)

Comments

@ElizavetaSedova

❓ Questions

I am new to Dora. I see that I can run distributed training, but is it possible to run training on multiple machines? I don't see a way to set master_addr, master_port, or rank. Maybe that isn't supported yet, or perhaps I just missed it, but it would be very cool to have this possibility! I would be very grateful for any help and tips!
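
(For reference, the kind of manual multi-node setup I mean looks roughly like this in plain PyTorch, with explicit master_addr, master_port, and rank. This is not Dora's API, just an illustration with made-up addresses and sizes:)

```python
# Plain-PyTorch illustration (not Dora-specific): each machine runs this with
# its own rank; master_addr/master_port point at the rank-0 node.
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    init_method="tcp://10.0.0.1:29500",  # master_addr:master_port (example values)
    rank=0,          # this process's global rank
    world_size=16,   # total number of processes across all machines
)
```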

ElizavetaSedova added the question label on May 12, 2023
@adefossez
Contributor

It is possible with a Slurm cluster, but not when launching manually. It should be possible to add, but that would take a few changes in the code.

@adefossez
Contributor

I've updated the instructions to cover compatibility with torchrun for multi-node training: https://github.com/facebookresearch/dora/blob/main/README.md#multi-node-training-without-slurm
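
(For readers landing here, a minimal, Dora-agnostic sketch of what a torchrun multi-node launch looks like; the exact Dora integration is described in the README section linked above. torchrun exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK as environment variables, so the training script only needs an environment-based init:)

```python
# Launched on every node with something like (hypothetical values):
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0 or 1> \
#            --master_addr=<node0 address> --master_port=29500 train.py
import os
import torch
import torch.distributed as dist

def main():
    # With no explicit init arguments, init_process_group reads MASTER_ADDR,
    # MASTER_PORT, RANK, and WORLD_SIZE from the environment set by torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} on local GPU {local_rank}")
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```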
