One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
Hello.
I am trying to get multi-machine GPU inference working across 2 computers on my local network. The goal is to load 50% of a model on my 4090 and the other 50% on my second 4090, in order to speed up inference and possibly allow me to load larger models.
from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
from statistics import mean
import torch, time, json

accelerator = Accelerator()

# 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all = [
    "The King is dead. Long live the Queen.",
]

# load a base model and tokenizer
model_path = "Qwen/Qwen2.5-Coder-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map={"": accelerator.process_index},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# sync GPUs and start the timer
accelerator.wait_for_everyone()
start = time.time()
print("STARTING")

# divide the prompt list onto the available GPUs
with accelerator.split_between_processes(prompts_all) as prompts:
    # store output of generations in dict
    results = dict(outputs=[], num_tokens=0)

    # have each GPU do inference, prompt by prompt
    for prompt in prompts:
        prompt_tokenized = tokenizer(prompt, return_tensors="pt").to("cuda")
        output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

        # remove prompt from output
        output_tokenized = output_tokenized[len(prompt_tokenized["input_ids"][0]):]

        # store outputs and number of tokens in result{}
        results["outputs"].append(tokenizer.decode(output_tokenized))
        results["num_tokens"] += len(output_tokenized)

    print("1here")
    results = [results]  # transform to list, otherwise gather_object() will not collect correctly
    print("2here")

# collect results from all the GPUs
results_gathered = gather_object(results)

if accelerator.is_main_process:
    timediff = time.time() - start
    num_tokens = sum([r["num_tokens"] for r in results_gathered])
    print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")
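One thing worth noting about the script above: `device_map={"": accelerator.process_index}` places a full copy of the model on a single device per process, and `split_between_processes` divides the prompts, not the model. So this is data-parallel inference. It can raise throughput across many prompts, but it does not put 50% of the model on each card. A minimal pure-Python sketch of that kind of prompt splitting is below. `split_for_rank` is a hypothetical helper written for illustration, not Accelerate's actual implementation, and the contiguous-chunk policy is an assumption.

```python
def split_for_rank(items, rank, world_size):
    """Return the contiguous slice of `items` assigned to process `rank`.

    Assumption: items are split into contiguous chunks, and the first
    `len(items) % world_size` ranks each receive one extra item.
    """
    base, extra = divmod(len(items), world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return items[start:end]


# Hypothetical prompts, two processes (one per machine).
prompts = ["p0", "p1", "p2", "p3", "p4"]
for rank in range(2):
    print(rank, split_for_rank(prompts, rank, world_size=2))
```

Each process only ever sees its own slice, which is why the results have to be collected afterwards with `gather_object`.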
The expected behaviour is that the two machines will work together to improve the inference of the prompts, speeding them up. However, when starting the workers both seem to work completely independently of each other, and make no attempt at connecting to each other. There is no evidence that the num_machines and --main_process_ip args are doing anything.
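A quick way to check whether each worker was actually started as part of a multi-process job is to print what the launcher passed to it. The sketch below assumes the torchrun-style environment variables `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` are set by the launcher; `describe_launch_env` is a hypothetical helper, and whether these variables are populated depends on how the job was launched.

```python
import os


def describe_launch_env(env=None):
    """Summarise the distributed settings visible to this process.

    Assumption: the launcher exports torchrun-style variables. Missing
    values fall back to single-process defaults.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": world_size,
        "master_addr": env.get("MASTER_ADDR"),
        "master_port": env.get("MASTER_PORT"),
        "distributed": world_size > 1,
    }


print(describe_launch_env())
```

If both machines report a world size of 1, each one is running as its own independent job, which would match the behaviour described here. Printing `accelerator.num_processes` and `accelerator.process_index` inside the script gives the same information from Accelerate's side.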
All other multi-node launchers except standard expect a hostfile with the IP and slots of each node. When this file is absent, the DeepSpeed launcher chugs along assuming a world size of 1.
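For reference, a hostfile of this kind is a plain text file with one `host slots=N` line per node, where `slots` is the number of GPUs on that node. The sketch below only generates that text; `render_hostfile` is a hypothetical helper, and the IP addresses are made-up placeholders for two single-GPU machines on a local network.

```python
def render_hostfile(nodes):
    """Render (host, slots) pairs as hostfile text, one node per line."""
    return "".join(f"{host} slots={slots}\n" for host, slots in nodes)


# Placeholder addresses (assumption): two machines with one GPU each.
nodes = [("192.168.1.10", 1), ("192.168.1.11", 1)]
print(render_hostfile(nodes), end="")
```

The same file has to be reachable by the launcher on the main node, and the hosts listed in it typically need to be reachable from that node without an interactive password prompt.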
Amazing, thanks. I'm trying to do this for inference, not training, but I assume it'll be somewhat similar.
However, I am having a problem with this setup: it says it cannot connect to the Docker engine, but I am not using Docker. Is there a setup guide for doing this? I scoured the web and found nothing about multi-node inference with Accelerate.