Referring to the Distributed MNIST example, I am running into an issue where the worker pods repeatedly log "call to connect returned Connection refused" before crashing with an NCCL runtime error on the dist.init_process_group call.
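For context, the failure happens during process-group initialization. Below is a minimal sketch of that setup, assuming the usual env:// rendezvous where MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are injected into each pod by the operator; this is an illustration, not the example's exact code.

```python
import os
import torch.distributed as dist

# Rendezvous info injected into each pod (assumption: standard operator setup):
#   MASTER_ADDR -> headless service pointing at the master pod
#   MASTER_PORT -> port the master listens on
#   RANK / WORLD_SIZE -> this pod's rank and the total number of replicas
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# This is the call that fails on the workers with an NCCL runtime error
# after the repeated "Connection refused" messages.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=rank,
    world_size=world_size,
)
```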
Environment
I have checked that:
- The headless service (same name as the master pod) appears to be created correctly
- All relevant env vars (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, etc.) appear to be set correctly in both the master and worker pods
- `curl -vv telnet://$MASTER_ADDR:$MASTER_PORT` succeeds from the worker pods (an in-pod check along these lines is sketched below)
- I am not using the same base image as the MNIST example (i.e. `pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime`), but instead a base image with PyTorch 1.7.1 and CUDA 11.0. I can't revert to PyTorch 1.0 because my distributed-training use cases require later PyTorch versions. If there are hard compatibility issues with later versions of PyTorch and CUDA, please let me know.
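For completeness, the connectivity check above can also be reproduced from inside a worker pod in Python; this is just a hypothetical equivalent of the curl test, not part of the example code.

```python
import os
import socket

# Hypothetical equivalent of `curl -vv telnet://$MASTER_ADDR:$MASTER_PORT`:
# open a plain TCP connection to the master's rendezvous port.
addr = os.environ["MASTER_ADDR"]
port = int(os.environ["MASTER_PORT"])

with socket.create_connection((addr, port), timeout=5):
    print(f"TCP connection to {addr}:{port} succeeded from {socket.gethostname()}")
```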
Verbose Logs
- Master pod waits idle (log screenshot)
- Worker pods repeatedly attempt to connect (log screenshot)
- Worker pods crash with an NCCL error (log screenshot)
This issue appears to be resolved by setting the environment variable `NCCL_SOCKET_IFNAME=eth0` inside each pod.
This is slightly confusing, as according to NVIDIA's documentation the loopback interface should only be selected if no other interfaces are available.
In the example for this issue, however, the loopback interface was selected by default (see the log screenshots) even though the ethernet interface was available.
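One way to apply the workaround is to add the variable to each container's env in the job spec; alternatively it can be set from the training script before the process group is initialized. The sketch below shows the latter and is only an illustration, not the example's code; the use of setdefault is an assumption so that a value supplied by the pod spec still takes precedence.

```python
import os
import torch.distributed as dist

# Workaround: point NCCL at the pod's ethernet interface instead of the
# loopback interface it was selecting by default.
# setdefault is used so an externally supplied value still wins (assumption).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# With init_method="env://", MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE
# are read from the environment set up by the operator.
dist.init_process_group(backend="nccl", init_method="env://")
```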