torch.distributed.elastic.multiprocessing.errors.ChildFailedError #21

Open
fangg2000 opened this issue Jan 24, 2025 · 2 comments

@fangg2000

(tango) fangg@fangg-MS-7B78:~/other/tts/TangoFlux$ ./train.sh
01/24/2025 16:00:13 - INFO - __main__ - Distributed environment: DistributedType.MULTI_GPU Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: no

01/24/2025 16:00:13 - INFO - __main__ - Distributed environment: DistributedType.MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:0

Mixed precision type: no

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[rank1]:[W124 16:00:13.110158490 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
wandb: Currently logged in as: fangg2000 (fangg2000-123test). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.18.7
wandb: Run data is saved locally in /home/fangg/other/tts/TangoFlux/wandb/run-20250124_160014-j9yjp2yb
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run zany-sponge-5
wandb: ⭐️ View project at https://wandb.ai/fangg2000-123test/Text%20to%20Audio%20Flow%20matching
wandb: 🚀 View run at https://wandb.ai/fangg2000-123test/Text%20to%20Audio%20Flow%20matching/runs/j9yjp2yb
[rank0]:[W124 16:00:15.345231521 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Traceback (most recent call last):
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 588, in
[rank1]: main()
[rank1]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 229, in main
[rank1]: accelerator.wait_for_everyone()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/accelerator.py", line 2607, in wait_for_everyone
[rank1]: wait_for_everyone()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/utils/other.py", line 138, in wait_for_everyone
[rank1]: PartialState().wait_for_everyone()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/state.py", line 374, in wait_for_everyone
[rank1]: torch.distributed.barrier()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank1]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank1]: Last error:
[rank1]: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1c000
File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 588, in
main()
File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 229, in main
accelerator.wait_for_everyone()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/accelerator.py", line 2607, in wait_for_everyone
wait_for_everyone()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/utils/other.py", line 138, in wait_for_everyone
PartialState().wait_for_everyone()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/state.py", line 374, in wait_for_everyone
torch.distributed.barrier()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
work = group.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1c000
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 588, in
[rank0]: main()
[rank0]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 229, in main
[rank0]: accelerator.wait_for_everyone()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/accelerator.py", line 2607, in wait_for_everyone
[rank0]: wait_for_everyone()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/utils/other.py", line 138, in wait_for_everyone
[rank0]: PartialState().wait_for_everyone()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/state.py", line 374, in wait_for_everyone
[rank0]: torch.distributed.barrier()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1c000
[rank1]:[W124 16:00:16.906100177 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0124 16:00:17.241000 20671 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 20692 closing signal SIGTERM
E0124 16:00:17.407000 20671 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 20693) of binary: /home/fangg/anaconda3/envs/tango/bin/python
Traceback (most recent call last):
File "/home/fangg/anaconda3/envs/tango/bin/accelerate", line 8, in
sys.exit(main())
^^^^^^
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1165, in launch_command
multi_gpu_launcher(args)
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tangoflux/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-24_16:00:17
host : fangg-MS-7B78
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 20693)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(tango) fangg@fangg-MS-7B78:~/other/tts/TangoFlux$

Driver Version: 535.183.01 CUDA Version: 12.2
pytorch version: 2.5.1+cu121
Python 3.12.1

I don't know what caused this exception. Any help would be appreciated, thanks.
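
For context on the first failure: both processes report Device: cuda:0, and NCCL then aborts the barrier with "Duplicate GPU detected", i.e. rank 0 and rank 1 ended up on the same physical GPU. Below is a minimal standalone sketch of the mapping check that the ProcessGroupNCCL warning hints at, assuming the machine really exposes two GPUs and that the script is launched with torchrun --nproc_per_node=2; the script and variable names are illustrative and are not part of TangoFlux:

```python
# Illustrative sketch, not TangoFlux code: check and pin the rank-to-GPU mapping
# that the "Duplicate GPU detected" error complains about.
# Launch with: torchrun --nproc_per_node=2 check_mapping.py
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# If this prints 1 on a two-process launch, both ranks share a single visible GPU
# (for example because CUDA_VISIBLE_DEVICES or the accelerate config exposes only one).
print(f"local_rank={local_rank}, visible GPUs={torch.cuda.device_count()}")

# Pin this process to its own device, as the ProcessGroupNCCL warning suggests:
# either pass device_id to init_process_group() or device_ids to barrier().
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl",
                        device_id=torch.device(f"cuda:{local_rank}"))
dist.barrier(device_ids=[local_rank])
dist.destroy_process_group()
```

If the device count comes back as 1, a two-process launch cannot work as configured; the accelerate config would need num_processes=1, or the second GPU would need to be made visible.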

@fangg2000 (Author)

python ./tangoflux/train.py --checkpointing_steps="best" --save_every=5 --config='configs/tangoflux_config.yaml'

When I execute this command instead, it starts running, but then:

01/24/2025 17:28:45 - INFO - __main__ - ***** Running training *****
01/24/2025 17:28:45 - INFO - __main__ - Num examples = 10
01/24/2025 17:28:45 - INFO - __main__ - Num Epochs = 80
01/24/2025 17:28:45 - INFO - __main__ - Instantaneous batch size per device = 4
01/24/2025 17:28:45 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
01/24/2025 17:28:45 - INFO - __main__ - Gradient Accumulation steps = 1
01/24/2025 17:28:45 - INFO - __main__ - Total optimization steps = 240
0%| | 0/240 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/fangg/other/tts/TangoFlux/./tangoflux/train.py", line 591, in
main()
File "/home/fangg/other/tts/TangoFlux/./tangoflux/train.py", line 454, in main
audio_latent = unwrapped_vae.encode(
...
torch.OutOfMemoryError: CUDA out of memory...

How can I fix this?

@hungchiayu1 (Collaborator)

It seems like you ran out of CUDA memory. Try lowering the batch size in the config.yaml and increasing the gradient accumulation steps.
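
As a rough sketch of that suggestion: shrinking the per-device batch while raising gradient accumulation keeps the effective batch size roughly the same but lowers peak CUDA memory. The model, optimizer, and data below are placeholders, and the real batch-size and accumulation keys live in configs/tangoflux_config.yaml:

```python
# Illustrative only, not the TangoFlux training loop: a smaller per-device batch
# combined with gradient accumulation keeps the effective batch size
# (per_device_batch * accumulation_steps) while lowering peak CUDA memory.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)  # was effectively 1 in the log above

model = torch.nn.Linear(16, 1)                       # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=1)           # reduced per-device batch size

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    with accelerator.accumulate(model):              # gradients sync every 4 micro-batches
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

With the batch size of 4 shown in the log, something like batch 1 with 4 accumulation steps would keep the total train batch size at 4.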
