I am trying to run the program on multiple machines with multiple GPUs, but at runtime the code detects the multiple machines yet uses only one GPU on each machine. Do I need any additional configuration to run it properly?
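For a multi-node run, one worker process per GPU has to be started on every machine, and each process needs the standard `torch.distributed` rendezvous settings. A hedged launch-configuration sketch using stock PyTorch `torchrun` (the flags are standard PyTorch; the node counts, `MASTER_ADDR`, port, and `train.py` entry point are placeholders for your own setup, not necessarily how the dora launcher is invoked internally):

```shell
#!/bin/sh
# Placeholder geometry: 2 machines x 8 GPUs each. Run this on every node,
# with NODE_RANK=0 on the first machine, NODE_RANK=1 on the second, etc.
NNODES=${NNODES:-2}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
NODE_RANK=${NODE_RANK:-0}

# Rank-0 machine's address and a free port (placeholders):
MASTER_ADDR=${MASTER_ADDR:-node0.example.com}
MASTER_PORT=${MASTER_PORT:-29500}

# torchrun spawns one process per GPU on this node and sets
# RANK / LOCAL_RANK / WORLD_SIZE in each process's environment,
# which is what the "Distributed init ... from env" code path reads.
torchrun --nnodes="$NNODES" \
         --nproc_per_node="$GPUS_PER_NODE" \
         --node_rank="$NODE_RANK" \
         --master_addr="$MASTER_ADDR" \
         --master_port="$MASTER_PORT" \
         train.py
```

If only one process is launched per node, the world size equals the number of nodes and each machine uses a single GPU, which matches the behavior described above.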
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Dora directory: /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[08-18 02:46:53][dora.distrib][INFO] - Distributed init: 0/2 (local 0) from env
[08-18 02:46:53][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-18 02:46:53][flashy.solver][INFO] - All XP logs are stored in /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root/xps/9521b0af
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-18 02:46:53][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/fay.cyf/weixipin.wxp/audiocraft/egs/data
[08-18 02:47:34][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-18 02:47:34][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-18 02:47:51][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-18 02:47:53][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-18 02:47:53][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:47:53][flashy.solver][INFO] - Model size: 420.37 M params
[08-18 02:47:53][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-18 02:47:53][flashy.solver][INFO] - Restoring weights and history.
[08-18 02:47:53][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-18 02:48:00][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-18 02:48:00][flashy.solver][INFO] - Ignoring keys when loading best []
damo-pod7-0129:1:1 [0] NCCL INFO Bootstrap : Using bond0:33.57.143.227<0>
damo-pod7-0129:1:1 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
damo-pod7-0129:1:1 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
[08-18 02:48:06][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-18 02:48:04][flashy.solver][INFO] - Re-initializing EMA from best state
[08-18 02:48:04][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:48:03][flashy.solver][INFO] - Loading state_dict from best state.
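The key line in the trace is `dora.distrib - Distributed init: 0/2 (local 0) from env`: a world size of 2 spread over 2 machines, i.e. one process (and therefore one GPU) per node. A minimal sketch for reading that launch geometry from the standard `torch.distributed` environment variables (`WORLD_SIZE` and `LOCAL_WORLD_SIZE` are the names `torchrun` sets; the helper function itself is hypothetical, for diagnosis only):

```python
import os

def describe_launch(env=os.environ):
    """Summarize distributed launch geometry from standard env vars.

    WORLD_SIZE is the total number of processes across all machines;
    LOCAL_WORLD_SIZE is the number of processes on this machine.
    """
    world = int(env.get("WORLD_SIZE", 1))
    local = int(env.get("LOCAL_WORLD_SIZE", 1))
    nodes = world // local if local else 1
    return {"world_size": world, "procs_per_node": local, "nodes": nodes}

# A run like the one logged above, with 2 machines but a world size of 2,
# corresponds to WORLD_SIZE=2, LOCAL_WORLD_SIZE=1: one GPU per machine.
```

If `procs_per_node` comes out as 1 on an 8-GPU machine, the launcher started only one worker per node, and the fix is in how the job is launched, not in the training code.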