
NVIDIA_VISIBLE_DEVICES not working as expected #65

jonlm opened this issue Jan 7, 2025 · 4 comments

jonlm commented Jan 7, 2025

Hi, I'm trying to run multiple containers on a multi-GPU machine, with each container assigned its own GPU. I initially tried this by setting NVIDIA_VISIBLE_DEVICES to 0 and 1 in each container, respectively, but the container with the "1" setting wouldn't launch the desktop. I narrowed this down to line 91 in entrypoint.sh:

export GPU_SELECT="$(nvidia-smi --id=$(echo ${NVIDIA_VISIBLE_DEVICES} | cut -d ',' -f1) --query-gpu=uuid --format=csv,noheader | head -n1)"

Prior to launching the container, setting NVIDIA_VISIBLE_DEVICES to "1" tells the NVIDIA Docker runtime to use the second GPU, i.e. the GPU with ID 1 in the output of nvidia-smi on bare metal. However, within the container only one GPU is visible (which is expected), and it has been reassigned ID 0 in nvidia-smi (which breaks the logic). So the query nvidia-smi --id=1 ... fails inside the container.
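For what it's worth, a container-side alternative would be to take the first UUID that nvidia-smi reports inside the container rather than querying by the host-side index. This is just a sketch, not a tested patch:

    # Sketch only: inside the container the visible GPUs are renumbered from 0,
    # so select the first UUID nvidia-smi reports there instead of reusing the
    # host-side index carried in NVIDIA_VISIBLE_DEVICES.
    export GPU_SELECT="$(nvidia-smi --query-gpu=uuid --format=csv,noheader | head -n1)"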

As a workaround I can set NVIDIA_VISIBLE_DEVICES in the second container to "1,0", which exposes both GPUs while still allowing "1" to refer to the second GPU (and launching the desktop there, since it is listed first). However, for other reasons I'd prefer not to expose two GPUs to that container.
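Concretely, that workaround looks like the following (the image name is a placeholder, and it assumes the NVIDIA runtime is configured):

    # Expose both GPUs, but list "1" first so the entrypoint's "first device"
    # logic lands on the second physical GPU.
    docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1,0 <image>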

Is this an unexpected use case? I'm not sure how entrypoint.sh intends to handle NVIDIA_VISIBLE_DEVICES in the first place, if a numerical ID means something different outside the container than inside it.

ehfd (Member) commented Jan 7, 2025

Makes sense.
For now, don't use NVIDIA_VISIBLE_DEVICES; use something like docker --gpus '"device=1,2"' instead, as described in the README. Since that workaround offers the same capability, I need to think later about whether this must be addressed.
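For reference, the flag in context (image name is a placeholder):

    # The inner quotes matter when listing more than one device, e.g. '"device=1,2"';
    # a single device can also be written as --gpus device=1.
    docker run --gpus '"device=1"' <image>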

jonlm (Author) commented Jan 8, 2025

Right, I was originally going to do that, but for reasons not worth explaining here it ended up being preferable to use the environment variable (which is advertised by NVIDIA here). I will try switching back to the Docker flags, though; in my case that's actually a docker-compose configuration.

ehfd (Member) commented Jan 8, 2025

If you wait, I'll edit the scripts to accommodate this use case.

jonlm (Author) commented Jan 9, 2025

Thanks! FYI I switched to doing this in my docker compose file:

    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]

But the exact same problem happened. This must end up setting an NVIDIA_VISIBLE_DEVICES environment variable anyway, because the container still failed when calling nvidia-smi --id=1 ....
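For anyone checking the same thing, a quick way to see what the runtime actually exposed (container name is a placeholder):

    # See which NVIDIA_* variables ended up in the container's environment, and
    # how the single visible GPU is numbered inside it (it shows up as index 0).
    docker exec <container> env | grep NVIDIA
    docker exec <container> nvidia-smi -L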
