hunyuanDiT PipeFusion=8 on L40 #284

Open
feifeibear opened this issue Sep 23, 2024 · 0 comments
Labels: bug

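HunyuanDiT with PipeFusion degree 8 on L40 does not work. The run stalls after the first of 3 iterations, and about 10 minutes later (the 600000 ms NCCL timeout) every rank throws the following errors.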

INFO 09-23 13:55:24 [base_pipeline.py:286] Scheduler found, paralleling scheduler...
INFO 09-23 13:55:24 [base_pipeline.py:286] Scheduler found, paralleling scheduler...
33%|███▎ | 1/3 [00:02<00:04, 2.19s/it][rank0]:[E923 14:05:28.953176853 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=0, OpType=SEND, NumelIn=7208960, NumelOut=7208960, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]:[E923 14:05:28.953368768 ProcessGroupNCCL.cpp:1664] [PG 37 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 0, last enqueued NCCL work: 0, last completed NCCL work: 0.
[rank3]:[E923 14:05:28.992395601 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank3]:[E923 14:05:28.992592726 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank7]:[E923 14:05:28.000333452 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600040 milliseconds before timing out.
[rank7]:[E923 14:05:28.000516457 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 23, last enqueued NCCL work: 23, last completed NCCL work: 22.
[rank1]:[E923 14:05:28.037161473 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=30, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600075 milliseconds before timing out.
[rank1]:[E923 14:05:28.037419896 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 30, last enqueued NCCL work: 30, last completed NCCL work: 29.
[rank4]:[E923 14:05:28.051095136 ProcessGroupNCCL.cpp:607] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
[rank4]:[E923 14:05:28.051310450 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank2]:[E923 14:05:28.059676590 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600088 milliseconds before timing out.
[rank2]:[E923 14:05:28.059882735 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank6]:[E923 14:05:28.093412822 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28, OpType=SEND, NumelIn=1441792, NumelOut=1441792, Timeout(ms)=600000) ran for 600075 milliseconds before timing out.
[rank6]:[E923 14:05:28.093665436 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 28, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank5]:[E923 14:05:28.093982237 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600087 milliseconds before timing out.
[rank5]:[E923 14:05:28.094188232 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank0]:[E923 14:05:28.099458433 ProcessGroupNCCL.cpp:1709] [PG 37 Rank 0] Timeout at NCCL work: 0, last enqueued NCCL work: 0, last completed NCCL work: 0.
[rank0]:[E923 14:05:28.099489572 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 14:05:28.099494742 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
33%|███▎ | 1/3 [10:02<20:05, 602.68s/it]
[rank0]:[E923 14:05:28.100399708 ProcessGroupNCCL.cpp:1515] [PG 37 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=0, OpType=SEND, NumelIn=7208960, NumelOut=7208960, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9fc59e7f86 in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9fc6ce48f2 in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f9fc6ceb333 in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9fc6ced71c in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fa01444cbf4 in /home/pjz/miniconda3/envs/fjr/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81ca (0x7fa022efc1ca in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa0223cd8d3 in /lib64/libc.so.6)
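
To make the log above easier to compare across ranks, here is a minimal sketch that pulls the rank, sequence number, op type, and element count out of each "Watchdog caught collective operation timeout" line. It is not part of xDiT or PyTorch; the names `TIMEOUT_RE`, `summarize_timeouts`, and `SAMPLE` are hypothetical, and the regex assumes the exact line format shown in this report. `SAMPLE` holds two lines copied verbatim from the log.

```python
import re

# Assumed format: the "Watchdog caught collective operation timeout" lines
# exactly as they appear in the log pasted in this issue.
TIMEOUT_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def summarize_timeouts(log_text):
    """Return {rank: (seq_num, op_type, numel_in)} for the first timeout seen per rank."""
    summary = {}
    for match in TIMEOUT_RE.finditer(log_text):
        rank = int(match.group("rank"))
        # setdefault keeps the first occurrence; the same timeout is often
        # printed again when the watchdog thread terminates.
        summary.setdefault(
            rank,
            (int(match.group("seq")), match.group("op"), int(match.group("numel_in"))),
        )
    return summary


# Two lines copied verbatim from the log in this issue (rank 0 and rank 6).
SAMPLE = (
    "[rank0]:[E923 14:05:28.953176853 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught "
    "collective operation timeout: WorkNCCL(SeqNum=0, OpType=SEND, NumelIn=7208960, "
    "NumelOut=7208960, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.\n"
    "[rank6]:[E923 14:05:28.093412822 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught "
    "collective operation timeout: WorkNCCL(SeqNum=28, OpType=SEND, NumelIn=1441792, "
    "NumelOut=1441792, Timeout(ms)=600000) ran for 600075 milliseconds before timing out.\n"
)

if __name__ == "__main__":
    for rank, (seq, op, numel) in sorted(summarize_timeouts(SAMPLE).items()):
        print(f"rank {rank}: SeqNum={seq} OpType={op} NumelIn={numel}")
```

Applied to the full log, this would show ranks 0 and 6 blocked in a SEND (SeqNum 0 and 28) while the other ranks are blocked in COALESCED operations with SeqNum between 23 and 30. That spread suggests the ranks are waiting at different points of the pipeline, though it does not by itself identify the cause. The value 18446744073709551615 reported as NumelIn/NumelOut for the COALESCED operations equals 2^64 - 1, so it is presumably a placeholder rather than a real element count.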

feifeibear added the bug label on Nov 26, 2024