distributed GPU inference across multiple machines on the same network #3313

Open
2 of 4 tasks
MonkeeMan1 opened this issue Dec 25, 2024 · 2 comments

Comments

@MonkeeMan1

MonkeeMan1 commented Dec 25, 2024

System Info

- `Accelerate` version: 1.2.1
- Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
- `accelerate` bash location: /mnt/c/Users/benn/OneDrive/Desktop/aii/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 2.2.1
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 30.92 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Hello.

I am trying to get multi-machine GPU inference working across two computers on my local network. The goal is to load 50% of a model on the RTX 4090 in one machine and the other 50% on the 4090 in the second machine, to speed up inference and possibly to load larger models.

from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
from statistics import mean
import torch, time, json

accelerator = Accelerator()

# 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all=[
    "The King is dead. Long live the Queen.",
]

# load a base model and tokenizer
model_path = "Qwen/Qwen2.5-Coder-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,    
    device_map={"": accelerator.local_process_index},  # GPU index on this machine; the global process_index can exceed the local GPU count
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)   

# sync GPUs and start the timer
accelerator.wait_for_everyone()
start=time.time()
print("STARTING")

# divide the prompt list onto the available GPUs 
with accelerator.split_between_processes(prompts_all) as prompts:
    # store output of generations in dict
    results=dict(outputs=[], num_tokens=0)

    # have each GPU do inference, prompt by prompt
    for prompt in prompts:
        prompt_tokenized=tokenizer(prompt, return_tensors="pt").to(accelerator.device)
        output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

        # remove prompt from output 
        output_tokenized=output_tokenized[len(prompt_tokenized["input_ids"][0]):]

        # store outputs and number of tokens in result{}
        results["outputs"].append( tokenizer.decode(output_tokenized) )
        results["num_tokens"] += len(output_tokenized)
        print("1here")

    results=[ results ] # transform to list, otherwise gather_object() will not collect correctly
    print("2here")

# collect results from all the GPUs
results_gathered=gather_object(results)

if accelerator.is_main_process:
    timediff=time.time()-start
    num_tokens=sum([r["num_tokens"] for r in results_gathered ])

    print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")

master machine:

accelerate launch --num_machines 2 --machine_rank 0 --main_process_ip 192.168.0.21 --main_process_port 29500 main.py

worker machine:

accelerate launch --num_machines 2 --machine_rank 1 --main_process_ip 192.168.0.21 --main_process_port 29500 main.py
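
One way to check whether a launch like the two above actually formed a multi-process group is to print the rendezvous variables from inside the script. The sketch below assumes the launcher exports the torchrun-style environment variables (`WORLD_SIZE`, `RANK`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT`); the helper name `describe_launch` is hypothetical and not part of Accelerate.

```python
import os


def describe_launch(env=None):
    """Summarise torchrun-style rendezvous variables (hypothetical helper)."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "master": f'{env.get("MASTER_ADDR", "unset")}:{env.get("MASTER_PORT", "unset")}',
        # A world size of 1 means this process is not part of a larger group.
        "distributed": world_size > 1,
    }


if __name__ == "__main__":
    print(describe_launch())
```

If both machines report a world size of 1, each launch ran as a standalone job and the multi-machine arguments were not applied.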

Expected behavior

The expected behaviour is that the two machines cooperate on the prompts and speed up inference. However, when launched, the two processes run completely independently and make no attempt to connect to each other. There is no evidence that the --num_machines and --main_process_ip arguments have any effect.
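
For reference, the script above distributes prompts, not model layers: each process loads the full model onto its own GPU and takes a share of `prompts_all`. The sketch below illustrates how an even, contiguous split would assign work. It is an assumption about how `split_between_processes` behaves without padding, not a copy of the library code, and `split_evenly` is a hypothetical helper.

```python
def split_evenly(items, num_processes, process_index):
    """Return the contiguous slice of `items` owned by one process (illustrative)."""
    per_process, extras = divmod(len(items), num_processes)
    # The first `extras` processes each take one additional item.
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    return items[start:end]


prompts_all = ["The King is dead. Long live the Queen."]
shares = [split_evenly(prompts_all, 2, rank) for rank in range(2)]
for rank, share in enumerate(shares):
    print(f"process {rank}: {len(share)} prompt(s)")
```

Under that assumption, a single prompt gives the second process nothing to do, so a speedup would only show up with a longer prompt list.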

@chiragjn

chiragjn commented Dec 31, 2024

I stumbled on this too. After reading some code, this worked for me:

accelerate launch \
--num_processes 2 \
--num_machines 2 \
--same_network \
--deepspeed_multinode_launcher standard \
--main_process_ip <my-master-ip> \
--main_process_port 23456 \
--machine_rank 0 \
--monitor_interval 30 \
--use_deepspeed \
train.py
...

All multi-node launchers other than standard expect a hostfile listing the IP and slots of each node. When this file is absent, the DeepSpeed launcher carries on assuming a world size of 1.
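
For the launchers that do need one, a DeepSpeed hostfile is a plain-text file with one `<host> slots=<gpu count>` line per node. The sketch below renders such a file from a dictionary; `render_hostfile` is a hypothetical helper, and the second address is a made-up example for a two-machine setup with one GPU each.

```python
def render_hostfile(nodes):
    """Render a {host: gpu_count} mapping as DeepSpeed-style hostfile text."""
    return "".join(f"{host} slots={slots}\n" for host, slots in nodes.items())


# Hypothetical addresses: 192.168.0.22 is assumed for the second machine.
hostfile_text = render_hostfile({"192.168.0.21": 1, "192.168.0.22": 1})
print(hostfile_text, end="")
```

The hosts listed generally need to be reachable from the main node, since those launchers start the remote processes themselves.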

@MonkeeMan1
Author

Amazing, thanks. I'm trying to do this for inference, not training, but I assume it will be somewhat similar.

However, I am having a problem with this setup: it says it cannot connect to the Docker engine, but I am not using Docker. Is there a setup guide for this? I scoured the web and found nothing about multi-node inference with Accelerate.
