Using multi-node with LocalExecutor #130
I noticed LocalExecutor has a hard-coded value for nnodes:

NeMo-Run/src/nemo_run/core/execution/local.py, lines 53 to 54 in b4e2258

Is there a reason multi-node is disabled? It feeds into torch_run, which seems to support multi-node:

NeMo-Run/src/nemo_run/run/torchx_backend/components/torchrun.py, lines 104 to 124 in b4e2258

I'm asking because I'm using this with AML, where I can usually get multi-node working with torchrun.
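For reference, the hard-coded value at that local.py permalink is approximately the following (reconstructed from memory of the repo, so treat it as an approximation rather than the verbatim lines):

```python
# nemo_run/core/execution/local.py (approximate reconstruction of the
# referenced lines): LocalExecutor pins execution to a single node,
# regardless of what torchrun itself could support.
class LocalExecutor(Executor):
    ...
    def nnodes(self) -> int:
        return 1
```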
Hey @LopezGG, thanks for the issue. Can you share a sample command you run with torchrun? We can add extra options based on that to support multi-node via the LocalExecutor.
Thank you for the quick reply, @hemildesai. Usually, with AML, I use something like the multi-node invocation described at https://pytorch.org/docs/stable/elastic/run.html.
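A representative command in that style (a sketch following the linked docs; the 2-node, 8-GPU counts and the train.py script name are placeholders, not values from this thread):

```bash
# Launch one torchrun per node; NODE_RANK, MASTER_ADDR, and MASTER_PORT
# come from the coordinating system (AML/Slurm).
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  train.py
```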
$NODE_RANK, $MASTER_ADDR, and $MASTER_PORT are set automatically by AML or Slurm, or can be set manually. The change in your script might look like the sketch below (I need to test it, though).
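A minimal sketch of that change, assuming the executor grows a num_nodes field (it does not have one at this point in the thread; adding it is what this issue asks for):

```python
import os

import nemo_run as run

# Sketch only, untested: make the node count configurable on the executor.
# ntasks_per_node and launcher follow the documented LocalExecutor usage;
# num_nodes and the NNODES variable name are assumptions.
executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
executor.num_nodes = int(os.environ.get("NNODES", "1"))
```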
Changing num_nodes will feed into NeMo-Run/src/nemo_run/core/execution/base.py, lines 183 to 188 in b4e2258, and can be called from the torchrun component referenced above.
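For context, the referenced base.py lines define the executor interface that the launcher queries; an approximate reconstruction (from memory of the repo, so treat the exact bodies as assumptions):

```python
# nemo_run/core/execution/base.py (approximate): the base Executor exposes
# the node and per-node process counts consumed by the torchrun component;
# concrete executors override these.
def nnodes(self) -> int:
    raise NotImplementedError

def nproc_per_node(self) -> int:
    raise NotImplementedError
```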
Sounds good, we will make num_nodes configurable.
@hemildesai: Please hold off on this. I had to make some changes in the torchrun file as well. I'll write back when I have it working.
@hemildesai: I finally got this working. Here are the changes I made to the torchrun file (sketched below). MASTER_ADDR is the environment variable set by AML or another coordinating system.
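A hypothetical reconstruction of the kind of change described, deriving the rendezvous endpoint from the coordinator's environment rather than a hard-coded localhost (the variable names are assumptions, not the actual diff):

```python
import os

# Sketch: build torchrun's rendezvous endpoint from the environment that
# AML (or another coordinator) populates, falling back to single-node
# defaults when the variables are absent.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = os.environ.get("MASTER_PORT", "29500")
rdzv_endpoint = f"{master_addr}:{master_port}"
node_rank = int(os.environ.get("NODE_RANK", "0"))
```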
This works with tensor parallelism and context parallelism. For pipeline parallelism, the nemo_run.core.runners.fdl_runner script passed to torchrun needs to specify how to split the model and coordinate. Would you know which parts of the code handle pipeline parallelism with Slurm? I might be missing something.
Thanks @LopezGG, I'll create a PR with the changes you suggested.
Can you try out #143?
That was fast, thank you Hemil. This week I am with family in IST; I'll try it tomorrow morning, India time.