
Can not work on multi machines with multi gpus #53

Open
Maggione opened this issue Aug 18, 2023 · 2 comments
Labels
question Further information is requested

Comments

@Maggione
❓ Questions

I am trying to run the program on multiple machines with multiple GPUs each, but at runtime the code finds all the machines yet uses only one GPU per machine. Do I need any additional configuration to run it properly?

/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Dora directory: /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[08-18 02:46:53][dora.distrib][INFO] - Distributed init: 0/2 (local 0) from env
[08-18 02:46:53][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-18 02:46:53][flashy.solver][INFO] - All XP logs are stored in /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root/xps/9521b0af
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
  warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-18 02:46:53][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/fay.cyf/weixipin.wxp/audiocraft/egs/data
[08-18 02:47:34][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-18 02:47:34][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-18 02:47:51][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-18 02:47:53][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-18 02:47:53][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:47:53][flashy.solver][INFO] - Model size: 420.37 M params
[08-18 02:47:53][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-18 02:47:53][flashy.solver][INFO] - Restoring weights and history.
[08-18 02:47:53][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-18 02:48:00][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-18 02:48:00][flashy.solver][INFO] - Ignoring keys when loading best []
damo-pod7-0129:1:1 [0] NCCL INFO Bootstrap : Using bond0:33.57.143.227<0>
damo-pod7-0129:1:1 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
damo-pod7-0129:1:1 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
[08-18 02:48:06][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-18 02:48:04][flashy.solver][INFO] - Re-initializing EMA from best state
[08-18 02:48:04][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:48:03][flashy.solver][INFO] - Loading state_dict from best state.
Maggione added the question label on Aug 18, 2023
@adefossez
Contributor

Can you check that you do indeed see all the GPUs when running Python, e.g. by checking torch.cuda.device_count()?
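
For reference, a minimal sanity check along those lines, run on each machine inside the same environment as the training job (a sketch, not part of the audiocraft codebase):

```python
# Quick check: confirm how many GPUs PyTorch can see on this machine.
import torch

print(torch.cuda.is_available())   # expect True
print(torch.cuda.device_count())   # expect the number of GPUs on this machine
```

If device_count() already reports 1, something outside the training code (e.g. a restrictive CUDA_VISIBLE_DEVICES setting) is hiding the other GPUs.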

@adefossez
Contributor

By the way, you do need to run one process per GPU, even on a single machine.
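
To illustrate the one-process-per-GPU pattern, here is a minimal sketch of what each worker process does, assuming a launcher (e.g. torchrun, or dora's own launcher) sets the standard torch.distributed environment variables (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT). This is not audiocraft's actual code path, just the general shape:

```python
# Sketch of per-process setup: each process is pinned to exactly one GPU
# via LOCAL_RANK, then joins the NCCL process group.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher, one per GPU
torch.cuda.set_device(local_rank)           # bind this process to its GPU
dist.init_process_group(backend="nccl")     # RANK/WORLD_SIZE read from the env
print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> cuda:{local_rank}")
```

Note that the log above shows "Distributed init: 0/2", which suggests a world size of 2, i.e. one process per machine rather than one per GPU, consistent with the symptom described.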
