It does not work and throws the following errors.
INFO 09-23 13:55:24 [base_pipeline.py:286] Scheduler found, paralleling scheduler...
INFO 09-23 13:55:24 [base_pipeline.py:286] Scheduler found, paralleling scheduler...
33%|███▎ | 1/3 [00:02<00:04, 2.19s/it][rank0]:[E923 14:05:28.953176853 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=0, OpType=SEND, NumelIn=7208960, NumelOut=7208960, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]:[E923 14:05:28.953368768 ProcessGroupNCCL.cpp:1664] [PG 37 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 0, last enqueued NCCL work: 0, last completed NCCL work: 0.
[rank3]:[E923 14:05:28.992395601 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank3]:[E923 14:05:28.992592726 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank7]:[E923 14:05:28.000333452 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=23, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600040 milliseconds before timing out.
[rank7]:[E923 14:05:28.000516457 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 23, last enqueued NCCL work: 23, last completed NCCL work: 22.
[rank1]:[E923 14:05:28.037161473 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=30, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600075 milliseconds before timing out.
[rank1]:[E923 14:05:28.037419896 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 30, last enqueued NCCL work: 30, last completed NCCL work: 29.
[rank4]:[E923 14:05:28.051095136 ProcessGroupNCCL.cpp:607] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
[rank4]:[E923 14:05:28.051310450 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank2]:[E923 14:05:28.059676590 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600088 milliseconds before timing out.
[rank2]:[E923 14:05:28.059882735 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank6]:[E923 14:05:28.093412822 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28, OpType=SEND, NumelIn=1441792, NumelOut=1441792, Timeout(ms)=600000) ran for 600075 milliseconds before timing out.
[rank6]:[E923 14:05:28.093665436 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 28, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank5]:[E923 14:05:28.093982237 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600087 milliseconds before timing out.
[rank5]:[E923 14:05:28.094188232 ProcessGroupNCCL.cpp:1664] [PG 35 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 29, last enqueued NCCL work: 29, last completed NCCL work: 28.
[rank0]:[E923 14:05:28.099458433 ProcessGroupNCCL.cpp:1709] [PG 37 Rank 0] Timeout at NCCL work: 0, last enqueued NCCL work: 0, last completed NCCL work: 0.
[rank0]:[E923 14:05:28.099489572 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 14:05:28.099494742 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
33%|███▎ | 1/3 [10:02<20:05, 602.68s/it]
[rank0]:[E923 14:05:28.100399708 ProcessGroupNCCL.cpp:1515] [PG 37 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=0, OpType=SEND, NumelIn=7208960, NumelOut=7208960, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9fc59e7f86 in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9fc6ce48f2 in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f9fc6ceb333 in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9fc6ced71c in /home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7fa01444cbf4 in /home/pjz/miniconda3/envs/fjr/bin/../lib/libstdc++.so.6)
frame #5: + 0x81ca (0x7fa022efc1ca in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa0223cd8d3 in /lib64/libc.so.6)
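For anyone triaging this, the trace is easier to compare across ranks once it is condensed. The following is only a rough sketch, not part of the project: `summarize_timeouts` is a hypothetical helper name, and the regular expression assumes the watchdog line format exactly as it appears in the log above.

```python
import re

# Matches the "Watchdog caught collective operation timeout" lines in the log above.
# The pattern is an assumption based on this one trace, not a stable PyTorch format.
TIMEOUT_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def summarize_timeouts(log_text):
    """Return {rank: (seq_num, op_type, numel_in)} for the first timeout seen per rank."""
    summary = {}
    for match in TIMEOUT_RE.finditer(log_text):
        rank = int(match.group("rank"))
        # Keep only the first report per rank; later lines repeat the same timeout.
        summary.setdefault(
            rank,
            (int(match.group("seq")), match.group("op"), int(match.group("numel_in"))),
        )
    return dict(sorted(summary.items()))
```

Run over the log above, this would show ranks 0 and 6 timing out on a SEND (SeqNum 0 and 28) while the other ranks time out on a COALESCED operation, which may help narrow down where the ranks stopped agreeing. The log alone does not confirm the cause.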