Referring to the Distributed MNIST example, I am running into an issue where the worker pods repeatedly log "call to connect returned Connection refused" before crashing with an NCCL runtime error on the dist.init_process_group call.
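For context, the failure happens during process-group initialization. Below is a minimal sketch of that setup, assuming the usual env:// rendezvous where MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are injected into each pod by the operator; this is an illustration, not the example's exact code.

```python
import os
import torch.distributed as dist

# Rendezvous info injected into each pod (assumption: standard operator setup):
#   MASTER_ADDR -> headless service pointing at the master pod
#   MASTER_PORT -> port the master listens on
#   RANK / WORLD_SIZE -> this pod's rank and the total number of replicas
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# This is the call that fails on the workers with an NCCL runtime error
# after the repeated "Connection refused" messages.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=rank,
    world_size=world_size,
)
```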
Environment
I have checked that:
- The headless service (same name as the master pod) appears to be created correctly
- All relevant env vars (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, etc.) appear to be set correctly in both the master and worker pods
- `curl -vv telnet://$MASTER_ADDR:$MASTER_PORT` succeeds from the worker pods (an in-pod check along these lines is sketched below)
- I am not using the same base image as the MNIST example (i.e. `pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime`), but instead a base image with PyTorch 1.7.1 and CUDA 11.0. I can't revert to PyTorch 1.0 because my distributed-training use cases require later PyTorch versions. If there are hard compatibility issues with later versions of PyTorch and CUDA, please let me know.
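For completeness, the connectivity check above can also be reproduced from inside a worker pod in Python; this is just a hypothetical equivalent of the curl test, not part of the example code.

```python
import os
import socket

# Hypothetical equivalent of `curl -vv telnet://$MASTER_ADDR:$MASTER_PORT`:
# open a plain TCP connection to the master's rendezvous port.
addr = os.environ["MASTER_ADDR"]
port = int(os.environ["MASTER_PORT"])

with socket.create_connection((addr, port), timeout=5):
    print(f"TCP connection to {addr}:{port} succeeded from {socket.gethostname()}")
```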
Verbose Logs
- Master pod waits idle (log screenshot)
- Worker pods repeatedly attempt to connect (log screenshot)
- Worker pods crash with an NCCL error (log screenshot)
This issue appears to be resolved by setting the environment variable `NCCL_SOCKET_IFNAME=eth0` inside each pod.
This is slightly confusing, as according to NVIDIA's documentation the loopback interface should only be selected if no other interfaces are available.
In the example for this issue, however, the loopback interface was selected by default (see the log screenshots) even though the ethernet interface was available.
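One way to apply the workaround is to add the variable to each container's env in the job spec; alternatively it can be set from the training script before the process group is initialized. The sketch below shows the latter and is only an illustration, not the example's code; the use of setdefault is an assumption so that a value supplied by the pod spec still takes precedence.

```python
import os
import torch.distributed as dist

# Workaround: point NCCL at the pod's ethernet interface instead of the
# loopback interface it was selecting by default.
# setdefault is used so an externally supplied value still wins (assumption).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# With init_method="env://", MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE
# are read from the environment set up by the operator.
dist.init_process_group(backend="nccl", init_method="env://")
```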