
Can not work on multi machines with multi gpus #53

Open
Maggione opened this issue Aug 18, 2023 · 2 comments
Labels
question Further information is requested

Comments

@Maggione
❓ Questions

I am trying to run the program on multiple machines with multiple GPUs each, but at runtime the code finds all the machines yet uses only one GPU per machine. Do I need any additional configuration to run it properly?

/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Dora directory: /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[08-18 02:46:53][dora.distrib][INFO] - Distributed init: 0/2 (local 0) from env
[08-18 02:46:53][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-18 02:46:53][flashy.solver][INFO] - All XP logs are stored in /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root/xps/9521b0af
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
  warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-18 02:46:53][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/fay.cyf/weixipin.wxp/audiocraft/egs/data
[08-18 02:47:34][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-18 02:47:34][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-18 02:47:51][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-18 02:47:53][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-18 02:47:53][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:47:53][flashy.solver][INFO] - Model size: 420.37 M params
[08-18 02:47:53][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-18 02:47:53][flashy.solver][INFO] - Restoring weights and history.
[08-18 02:47:53][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-18 02:48:00][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-18 02:48:00][flashy.solver][INFO] - Ignoring keys when loading best []
damo-pod7-0129:1:1 [0] NCCL INFO Bootstrap : Using bond0:33.57.143.227<0>
damo-pod7-0129:1:1 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
damo-pod7-0129:1:1 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
[08-18 02:48:06][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-18 02:48:04][flashy.solver][INFO] - Re-initializing EMA from best state
[08-18 02:48:04][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:48:03][flashy.solver][INFO] - Loading state_dict from best state.
Maggione added the question label on Aug 18, 2023
@adefossez
Contributor

Can you check that you do indeed see all the GPUs when running Python, e.g. by checking torch.cuda.device_count()?
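
For reference, a minimal sanity check along those lines, run on each machine inside the same environment as the training job (a sketch, not part of the audiocraft codebase):

```python
# Quick check: confirm how many GPUs PyTorch can see on this machine.
import torch

print(torch.cuda.is_available())   # expect True
print(torch.cuda.device_count())   # expect the number of GPUs on this machine
```

If device_count() already reports 1, something outside the training code (e.g. a restrictive CUDA_VISIBLE_DEVICES setting) is hiding the other GPUs.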

@adefossez
Contributor

By the way, you do need to run one process per GPU, even on a single machine.
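
To illustrate the one-process-per-GPU pattern, here is a minimal sketch of what each worker process does, assuming a launcher (e.g. torchrun, or dora's own launcher) sets the standard torch.distributed environment variables (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT). This is not audiocraft's actual code path, just the general shape:

```python
# Sketch of per-process setup: each process is pinned to exactly one GPU
# via LOCAL_RANK, then joins the NCCL process group.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher, one per GPU
torch.cuda.set_device(local_rank)           # bind this process to its GPU
dist.init_process_group(backend="nccl")     # RANK/WORLD_SIZE read from the env
print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> cuda:{local_rank}")
```

Note that the log above shows "Distributed init: 0/2", which suggests a world size of 2, i.e. one process per machine rather than one per GPU, consistent with the symptom described.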
