Training fails during validation at epoch 20 with "RuntimeError: NCCL communicator was aborted on rank"; changing the batch size does not help (PyTorch 2.0.1, CUDA 11.7, NCCL 2.14.3) #1072

zhangyuelong · 2024-11-29T08:04:30Z

Before Asking

I have read the README carefully.
I want to train my custom dataset, and I have read the tutorials for training your custom data carefully and organized my dataset correctly. (FYI: We recommend using config files such as xx_finetune.py.)
I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

I have searched the YOLOv6 issues and found no similar questions.

Question

RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26279, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807935 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 99472 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 99473) of binary: /home/yjauto/miniconda3/envs/yolov6/bin/python

How should I fix this?
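For anyone reading the log, the watchdog line carries the most useful details: which rank stalled, which collective was pending, and how long it waited. The snippet below is only a rough sketch for pulling those fields out of the message quoted in this issue. The helper name `parse_watchdog_timeout` and the regular expression are made up for illustration and are not part of YOLOv6 or PyTorch; the pattern is written against this one message and may not match other PyTorch versions.

```python
import re
from datetime import timedelta

# Hypothetical pattern, written against the watchdog message quoted in this issue.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(log_text):
    """Return the fields of the first watchdog timeout message, or None."""
    match = WATCHDOG_RE.search(log_text)
    if match is None:
        return None
    return {
        "rank": int(match["rank"]),
        "seq_num": int(match["seq"]),
        "op_type": match["op"],
        "timeout": timedelta(milliseconds=int(match["timeout_ms"])),
        "elapsed": timedelta(milliseconds=int(match["elapsed_ms"])),
    }


# The message exactly as it appears in the error log of this issue.
log = (
    "RuntimeError: NCCL communicator was aborted on rank 1. "
    "Original reason for failure was: [Rank 1] Watchdog caught collective "
    "operation timeout: WorkNCCL(SeqNum=26279, OpType=BROADCAST, "
    "Timeout(ms)=1800000) ran for 1807935 milliseconds before timing out."
)

info = parse_watchdog_timeout(log)
print("rank:", info["rank"])
print("pending collective:", info["op_type"])
print("configured timeout:", info["timeout"])
print("overrun:", info["elapsed"] - info["timeout"])
```

Read this way, rank 1 waited the full 30 minutes (1800000 ms) on a BROADCAST before the watchdog aborted the communicator. `torch.distributed.init_process_group` accepts a `timeout` argument as a `datetime.timedelta`, so a longer value could be tried as a diagnostic; whether the YOLOv6 training script exposes that setting is not confirmed here, and a longer timeout would not by itself explain why the other rank is slow or stuck during validation.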

Additional

No response

zhangyuelong added the "question" label (Further information is requested) on Nov 29, 2024
