torch.distributed.elastic.multiprocessing.errors.ChildFailedError #21

Open
fangg2000 opened this issue Jan 24, 2025 · 2 comments

@fangg2000

(tango) fangg@fangg-MS-7B78:~/other/tts/TangoFlux$ ./train.sh
01/24/2025 16:00:13 - INFO - __main__ - Distributed environment: DistributedType.MULTI_GPU Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: no

01/24/2025 16:00:13 - INFO - __main__ - Distributed environment: DistributedType.MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:0

Mixed precision type: no

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[rank1]:[W124 16:00:13.110158490 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
wandb: Currently logged in as: fangg2000 (fangg2000-123test). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.18.7
wandb: Run data is saved locally in /home/fangg/other/tts/TangoFlux/wandb/run-20250124_160014-j9yjp2yb
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run zany-sponge-5
wandb: ⭐️ View project at https://wandb.ai/fangg2000-123test/Text%20to%20Audio%20Flow%20matching
wandb: 🚀 View run at https://wandb.ai/fangg2000-123test/Text%20to%20Audio%20Flow%20matching/runs/j9yjp2yb
[rank0]:[W124 16:00:15.345231521 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Traceback (most recent call last):
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 588, in
[rank1]: main()
[rank1]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 229, in main
[rank1]: accelerator.wait_for_everyone()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/accelerator.py", line 2607, in wait_for_everyone
[rank1]: wait_for_everyone()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/utils/other.py", line 138, in wait_for_everyone
[rank1]: PartialState().wait_for_everyone()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/state.py", line 374, in wait_for_everyone
[rank1]: torch.distributed.barrier()
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank1]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank1]: Last error:
[rank1]: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1c000
File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 588, in
main()
File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 229, in main
accelerator.wait_for_everyone()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/accelerator.py", line 2607, in wait_for_everyone
wait_for_everyone()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/utils/other.py", line 138, in wait_for_everyone
PartialState().wait_for_everyone()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/state.py", line 374, in wait_for_everyone
torch.distributed.barrier()
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
work = group.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1c000
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 588, in
[rank0]: main()
[rank0]: File "/home/fangg/other/tts/TangoFlux/tangoflux/train.py", line 229, in main
[rank0]: accelerator.wait_for_everyone()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/accelerator.py", line 2607, in wait_for_everyone
[rank0]: wait_for_everyone()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/utils/other.py", line 138, in wait_for_everyone
[rank0]: PartialState().wait_for_everyone()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/state.py", line 374, in wait_for_everyone
[rank0]: torch.distributed.barrier()
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1c000
[rank1]:[W124 16:00:16.906100177 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0124 16:00:17.241000 20671 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 20692 closing signal SIGTERM
E0124 16:00:17.407000 20671 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 20693) of binary: /home/fangg/anaconda3/envs/tango/bin/python
Traceback (most recent call last):
File "/home/fangg/anaconda3/envs/tango/bin/accelerate", line 8, in
sys.exit(main())
^^^^^^
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1165, in launch_command
multi_gpu_launcher(args)
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fangg/anaconda3/envs/tango/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tangoflux/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-24_16:00:17
host : fangg-MS-7B78
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 20693)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(tango) fangg@fangg-MS-7B78:~/other/tts/TangoFlux$

Driver Version: 535.183.01 CUDA Version: 12.2
pytorch version: 2.5.1+cu121
Python 3.12.1

I don't know what caused this exception. Any help would be appreciated, thanks.
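
For context on the first failure: both processes report Device: cuda:0, and NCCL then aborts the barrier with "Duplicate GPU detected", i.e. rank 0 and rank 1 ended up on the same physical GPU. Below is a minimal standalone sketch of the mapping check that the ProcessGroupNCCL warning hints at, assuming the machine really exposes two GPUs and that the script is launched with torchrun --nproc_per_node=2; the script and variable names are illustrative and are not part of TangoFlux:

```python
# Illustrative sketch, not TangoFlux code: check and pin the rank-to-GPU mapping
# that the "Duplicate GPU detected" error complains about.
# Launch with: torchrun --nproc_per_node=2 check_mapping.py
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# If this prints 1 on a two-process launch, both ranks share a single visible GPU
# (for example because CUDA_VISIBLE_DEVICES or the accelerate config exposes only one).
print(f"local_rank={local_rank}, visible GPUs={torch.cuda.device_count()}")

# Pin this process to its own device, as the ProcessGroupNCCL warning suggests:
# either pass device_id to init_process_group() or device_ids to barrier().
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl",
                        device_id=torch.device(f"cuda:{local_rank}"))
dist.barrier(device_ids=[local_rank])
dist.destroy_process_group()
```

If the device count comes back as 1, a two-process launch cannot work as configured; the accelerate config would need num_processes=1, or the second GPU would need to be made visible.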

@fangg2000 (Author)

python ./tangoflux/train.py --checkpointing_steps="best" --save_every=5 --config='configs/tangoflux_config.yaml'

When I execute this command instead, it starts running, but then:

01/24/2025 17:28:45 - INFO - __main__ - ***** Running training *****
01/24/2025 17:28:45 - INFO - __main__ - Num examples = 10
01/24/2025 17:28:45 - INFO - __main__ - Num Epochs = 80
01/24/2025 17:28:45 - INFO - __main__ - Instantaneous batch size per device = 4
01/24/2025 17:28:45 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
01/24/2025 17:28:45 - INFO - __main__ - Gradient Accumulation steps = 1
01/24/2025 17:28:45 - INFO - __main__ - Total optimization steps = 240
0%| | 0/240 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/fangg/other/tts/TangoFlux/./tangoflux/train.py", line 591, in
main()
File "/home/fangg/other/tts/TangoFlux/./tangoflux/train.py", line 454, in main
audio_latent = unwrapped_vae.encode(
...
torch.OutOfMemoryError: CUDA out of memory...

How can I fix this?

@hungchiayu1 (Collaborator)

It seems like you ran out of CUDA memory. Try lowering the batch size in the config.yaml and increasing the gradient accumulation steps.
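
As a rough sketch of that suggestion: shrinking the per-device batch while raising gradient accumulation keeps the effective batch size roughly the same but lowers peak CUDA memory. The model, optimizer, and data below are placeholders, and the real batch-size and accumulation keys live in configs/tangoflux_config.yaml:

```python
# Illustrative only, not the TangoFlux training loop: a smaller per-device batch
# combined with gradient accumulation keeps the effective batch size
# (per_device_batch * accumulation_steps) while lowering peak CUDA memory.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)  # was effectively 1 in the log above

model = torch.nn.Linear(16, 1)                       # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=1)           # reduced per-device batch size

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    with accelerator.accumulate(model):              # gradients sync every 4 micro-batches
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

With the batch size of 4 shown in the log, something like batch 1 with 4 accumulation steps would keep the total train batch size at 4.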
