Training fails during validation at epoch 20 with RuntimeError: NCCL communicator was aborted on rank; changing the batch size does not help (PyTorch 2.0.1, CUDA 11.7, NCCL 2.14.3) #1072

Open
4 tasks done
zhangyuelong opened this issue Nov 29, 2024 · 0 comments
Labels
question Further information is requested

Comments

@zhangyuelong

Before Asking

  • I have read the README carefully.

  • I want to train my custom dataset, and I have carefully read the tutorials for training on custom data and organized my dataset correctly. (FYI: We recommend using the xx_finetune.py config files.)

  • I have pulled the latest code from the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the YOLOv6 issues and found no similar questions.

Question

RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26279, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807935 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 99472 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 99473) of binary: /home/yjauto/miniconda3/envs/yolov6/bin/python

How should I fix this?
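For anyone reading the watchdog message above, here is a minimal sketch that pulls out the fields it reports (rank, sequence number, collective type, configured timeout, elapsed time). The helper name `parse_watchdog_timeout` and the regular expression are hypothetical, written only for illustration; they are not part of PyTorch or YOLOv6. It is a reading aid, not a fix.

```python
import re
from datetime import timedelta

# Watchdog line copied from the error message in this issue.
LOG_LINE = (
    "[Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=26279, OpType=BROADCAST, Timeout(ms)=1800000) "
    "ran for 1807935 milliseconds before timing out."
)

# Hypothetical pattern for this one message format; other PyTorch
# versions may word the message differently.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), Timeout\(ms\)=(?P<timeout>\d+)\)"
    r" ran for (?P<elapsed>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the fields of a watchdog timeout line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "rank": int(m["rank"]),
        "seq_num": int(m["seq"]),
        "op_type": m["op"],
        "timeout": timedelta(milliseconds=int(m["timeout"])),
        "elapsed": timedelta(milliseconds=int(m["elapsed"])),
    }


info = parse_watchdog_timeout(LOG_LINE)
print(info["rank"], info["op_type"], info["timeout"], info["elapsed"])
```

Read this way, rank 1 waited on a BROADCAST for the full configured timeout of 1800000 ms (30 minutes) before the watchdog aborted it. `torch.distributed.init_process_group` accepts a `timeout` argument as a `datetime.timedelta`, but whether raising it would resolve this particular failure is not established in this issue.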

Additional

No response

zhangyuelong added the question label on Nov 29, 2024