This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

NCCL "Connection Refused" for Worker Pods #332

Open
twolffpiggott opened this issue Apr 26, 2021 · 1 comment

twolffpiggott commented Apr 26, 2021

Overview

When running the Distributed MNIST example, the worker pods repeatedly log "call to connect returned Connection refused" and then crash with an NCCL runtime error in the dist.init_process_group call.

Environment

I have checked that:

  1. The headless service (same name as the master pod) appears to be created correctly
  2. All relevant env vars (MASTER_ADDR, MASTER_PORT, RANK etc.) appear to be set correctly in both the master and worker pods
  3. curl -vv telnet://$MASTER_ADDR:$MASTER_PORT succeeds from worker pods
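Checks 2 and 3 above can also be run from inside a pod with a short script. This is a minimal sketch using only the standard library; the helper name master_reachable is hypothetical, and WORLD_SIZE is assumed to be among the variables covered by "etc." above. A successful TCP connection only shows that the master's rendezvous port accepts connections; it does not exercise the sockets NCCL opens itself.

```python
import os
import socket


def master_reachable(addr: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to addr:port can be opened."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Print the rendezvous variables the training script will read.
    # WORLD_SIZE is an assumption; the issue only names the first three.
    for name in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
        print(f"{name}={os.environ.get(name, '<unset>')}")

    addr = os.environ.get("MASTER_ADDR")
    port = os.environ.get("MASTER_PORT")
    if addr and port:
        print("master reachable:", master_reachable(addr, int(port)))
```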

I am not using the same base image as the MNIST example (i.e. pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime), but instead a base image with PyTorch 1.7.1 and CUDA 11.0 (cu110). I can't revert to PyTorch 1.0 because my distributed training use cases require later PyTorch versions. If there are hard compatibility issues with later versions of PyTorch and CUDA, please let me know.

Verbose Logs

  1. Master pod waits idle

Screenshot 2021-04-26 at 17 13 24

  2. Worker pods repeatedly attempt to connect

Screenshot 2021-04-26 at 17 04 40

  3. Worker pods crash with an NCCL error

Screenshot 2021-04-26 at 17 05 00


twolffpiggott commented Apr 27, 2021

This issue appears to be resolved by setting the environment variable NCCL_SOCKET_IFNAME=eth0 inside each pod.
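For anyone who would rather apply this workaround from the training script than from the pod spec, a minimal sketch follows. The helper name configure_nccl_interface is hypothetical, and eth0 is simply the interface name that worked in this cluster; other clusters may name it differently. The variable has to be set before dist.init_process_group is called, since NCCL reads it when it initializes.

```python
import os


def configure_nccl_interface(ifname: str = "eth0") -> str:
    """Set NCCL_SOCKET_IFNAME unless the environment already provides it.

    Returns the value that NCCL will see.
    """
    return os.environ.setdefault("NCCL_SOCKET_IFNAME", ifname)


# Call this at the top of the training script, before any NCCL setup.
chosen = configure_nccl_interface()

# The training script would then continue unchanged, for example:
#   torch.distributed.init_process_group(backend="nccl")
```

Using setdefault means a value set in the pod's container environment still takes precedence over the fallback in the script.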

This is slightly confusing because, according to NVIDIA's documentation, the loopback interface should only be selected if no other interfaces are available.

But in the example for this issue, the loopback interface was selected by default (see the log screenshots), even though the ethernet interface was available.
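To see which interfaces a pod actually exposes, and which of them would be candidates under the documented rule, something like the following can be run inside the container. This is only a sketch: the helper non_loopback_interfaces is hypothetical, and it mirrors the documented behaviour (skip lo and docker* interfaces) rather than NCCL's actual selection code.

```python
import socket


def non_loopback_interfaces(names):
    """Drop the loopback (lo) and docker* interfaces from a list of names."""
    return [n for n in names if n != "lo" and not n.startswith("docker")]


if __name__ == "__main__":
    # socket.if_nameindex() returns (index, name) pairs for local interfaces.
    names = [name for _, name in socket.if_nameindex()]
    print("all interfaces:", names)
    print("candidates for NCCL_SOCKET_IFNAME:", non_loopback_interfaces(names))
```

If eth0 shows up in the candidate list but the NCCL logs still report lo, that matches the behaviour described in this issue.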
