Using multi-node with LocalExecutor #130

LopezGG · 2024-12-23T18:22:27Z

I noticed LocalExecutor has a hard-coded value for nnodes.

NeMo-Run/src/nemo_run/core/execution/local.py

Lines 53 to 54 in b4e2258

    
           def nnodes(self) -> int: 
        
               return 1

Is there a reason multi-nodes are disabled ? It feeds into torch_run which seems to support multi-nodes

NeMo-Run/src/nemo_run/run/torchx_backend/components/torchrun.py

Lines 104 to 124 in b4e2258

    
           if max_nnodes == 1: 
        
               # using port 0 makes elastic chose a free random port which is ok 
        
               # for single-node jobs since all workers run under a single agent 
        
               # When nnodes is 0 and max_nnodes is 1, it's still a single node job 
        
               # but pending until the resources become available 
        
               rdzv_endpoint = "localhost:0" 
        
               num_nodes = nnodes_rep 
        
               nproc_per_node = str(nproc_per_node) 
        
               node_rank = "0" 
        
           else: 
        
               # for multi-node, rely on the rank0_env environment variable set by 
        
               # the schedulers (see scheduler implementation for the actual env var this maps to) 
        
               # some schedulers (e.g. aws batch) make the rank0's ip-addr available on all BUT on rank0 
        
               # so default to "localhost" if the env var is not set or is empty 
        
               # rdzv_endpoint bash resolves to something to the effect of 
        
               # ${TORCHX_RANK0_HOST:=localhost}:29500 
        
               # use $$ in the prefix to escape the '$' literal (rather than a string Template substitution argument) 
        
               rdzv_endpoint = torchx_dist._noquote(f"$${ExecutorMacros.HEAD_NODE_IP_VAR}:{rdzv_port}") 
        
               num_nodes = torchx_dist._noquote(f"$${ExecutorMacros.NUM_NODES_VAR}") 
        
               nproc_per_node = str(nproc_per_node) 
        
               node_rank = torchx_dist._noquote(f"$${ExecutorMacros.NODE_RANK_VAR}")

Asking because I am using this with AML where I can usually get multi-node working with torchrun

hemildesai · 2024-12-23T19:19:14Z

Hey @LopezGG, thanks for the issue. Can you share a sample command you run with torchrun? We can add extra options based on that to support multi-node via LocalExecutor.

Currently LocalExecutor assumes you are on your local workstation which would just be a single node.

LopezGG · 2024-12-23T19:30:17Z

Thank you for the quick reply @hemildesai . Usually, with AML I use something like

 torchrun  --nproc_per_node=${{inputs.nproc_per_node}} --nnodes=${{inputs.nnodes}}
  --node_rank=$NODE_RANK
  --master_addr=$MASTER_ADDR
  --master_port=$MASTER_PORT  train.py

https://pytorch.org/docs/stable/elastic/run.html

$NODE_RANK, $MASTER_ADDR, and $MASTER_PORT are set automatically by AML or Slurm or can be set manually. Change in your script might look like ( I need to test it though)

Changing num_nodes will feed into

NeMo-Run/src/nemo_run/core/execution/base.py

Lines 183 to 188 in b4e2258

    
               def nnodes(self) -> int: 
        
                   """ 
        
                   Helper function called by torchrun component 
        
                   to determine --nnodes. 
        
                   """ 
        
                   raise NotImplementedError

and can be called from

hemildesai · 2024-12-25T03:18:20Z

Sounds good, we will make num nodes configurable

LopezGG · 2024-12-25T03:42:50Z

@hemildesai : Please Hold off on this. I had to make some changes in torchrun file as well. I'll write back when I have it working

LopezGG · 2025-01-30T09:10:46Z

@hemildesai : I finally got this working. Here are the changes I made to torchrun file. Master_addr is the environmental variable set by aml or other co-ordinating system.

else:
       master_addr = os.getenv("MASTER_ADDR" )
       master_port = os.getenv("MASTER_PORT")
        rdzv_endpoint = torchx_dist._noquote(master_addr + ":" + master_port)
        rdzv_port = master_port
        num_nodes = torchx_dist._noquote(nnodes_rep)
        node_rank = torchx_dist._noquote(os.getenv("NODE_RANK"))

    if env is None:
        env = {}

    env.setdefault("LOGLEVEL", os.getenv("LOGLEVEL", "INFO"))
    if debug:
        env.update(_TORCH_DEBUG_FLAGS)

    cmd = [
        "--rdzv-backend",
        rdzv_backend,
        "--rdzv-endpoint",
        rdzv_endpoint,
        "--rdzv-id",
        aisc_job_name, # A random int would work here
        "--nnodes",
        num_nodes,
        "--nproc-per-node",
        str(nproc_per_node),
        "--node-rank",
        node_rank,
        "--tee",
        "3", # tee: tees the specified std stream(s) to console + file. 0: none, 1: stdout, 2: stderr, 3: both
        "--role",
        "",
    ]

This works with tensor parallelism and context parallelism. For pipeline parallelism, nemo_run.core.runners.fdl_runner script passed to torchrun needs to specify how to split the model and co-ordinate. Would you know which parts of the code handle pipeline parallelism with slurm. I might be missing something.

hemildesai · 2025-01-30T11:09:58Z

Thanks @LopezGG , I'll create a PR with the changes you suggested.

hemildesai · 2025-01-30T11:37:33Z

Can you try out #143?

LopezGG · 2025-01-30T11:49:03Z

That was fast. Thank you, Hemil. This week I am with family in IST. I'll try it Indian morning tomorrow

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using multi-node with LocalExecutor #130

Using multi-node with LocalExecutor #130

LopezGG commented Dec 23, 2024

hemildesai commented Dec 23, 2024

LopezGG commented Dec 23, 2024 •

edited

Loading

hemildesai commented Dec 25, 2024

LopezGG commented Dec 25, 2024 •

edited

Loading

LopezGG commented Jan 30, 2025

hemildesai commented Jan 30, 2025

hemildesai commented Jan 30, 2025

LopezGG commented Jan 30, 2025

Using multi-node with LocalExecutor #130

Using multi-node with LocalExecutor #130

Comments

LopezGG commented Dec 23, 2024

hemildesai commented Dec 23, 2024

LopezGG commented Dec 23, 2024 • edited Loading

hemildesai commented Dec 25, 2024

LopezGG commented Dec 25, 2024 • edited Loading

LopezGG commented Jan 30, 2025

hemildesai commented Jan 30, 2025

hemildesai commented Jan 30, 2025

LopezGG commented Jan 30, 2025

LopezGG commented Dec 23, 2024 •

edited

Loading

LopezGG commented Dec 25, 2024 •

edited

Loading