
NVIDIA_VISIBLE_DEVICES not working as expected #65

jonlm opened this issue Jan 7, 2025 · 4 comments

jonlm commented Jan 7, 2025

Hi, I'm trying to run multiple containers on a multi-GPU machine, with each container assigned its own GPU. I initially tried this by setting NVIDIA_VISIBLE_DEVICES to 0 and 1 in each container, respectively, but the container with the "1" setting wouldn't launch the desktop. I narrowed this down to line 91 in entrypoint.sh:

export GPU_SELECT="$(nvidia-smi --id=$(echo ${NVIDIA_VISIBLE_DEVICES} | cut -d ',' -f1) --query-gpu=uuid --format=csv,noheader | head -n1)"

Prior to launching the container, setting NVIDIA_VISIBLE_DEVICES to "1" tells the NVIDIA Docker runtime to use the second GPU, i.e. the GPU with ID 1 in the output of nvidia-smi on bare metal. However, within the container only one GPU is visible (which is expected), and it has been reassigned ID 0 in nvidia-smi (which breaks the logic). So the query nvidia-smi --id=1 ... fails inside the container.
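For what it's worth, a container-side alternative would be to take the first UUID that nvidia-smi reports inside the container rather than querying by the host-side index. This is just a sketch, not a tested patch:

    # Sketch only: inside the container the visible GPUs are renumbered from 0,
    # so select the first UUID nvidia-smi reports there instead of reusing the
    # host-side index carried in NVIDIA_VISIBLE_DEVICES.
    export GPU_SELECT="$(nvidia-smi --query-gpu=uuid --format=csv,noheader | head -n1)"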

As a workaround I can set NVIDIA_VISIBLE_DEVICES in the second container to "1,0", which exposes both GPUs while still allowing "1" to refer to the second GPU (and launching the desktop there, since it is listed first). However, for other reasons I'd prefer not to expose two GPUs to that container.
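Concretely, that workaround looks like the following (the image name is a placeholder, and it assumes the NVIDIA runtime is configured):

    # Expose both GPUs, but list "1" first so the entrypoint's "first device"
    # logic lands on the second physical GPU.
    docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1,0 <image>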

Is this an unexpected use case? I'm not sure how entrypoint.sh intends to handle NVIDIA_VISIBLE_DEVICES in the first place, if a numerical ID means something different outside the container than inside it.

ehfd (Member) commented Jan 7, 2025

Makes sense.
For now, don't use NVIDIA_VISIBLE_DEVICES; use something like docker --gpus '"device=1,2"' instead, as described in the README. Since that workaround offers the same capability, I need to think later about whether this must be addressed.
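For reference, the flag in context (image name is a placeholder):

    # The inner quotes matter when listing more than one device, e.g. '"device=1,2"';
    # a single device can also be written as --gpus device=1.
    docker run --gpus '"device=1"' <image>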

jonlm (Author) commented Jan 8, 2025

Right, I was originally going to do that, but for reasons not worth explaining here it ended up being preferable to use the environment variable (which is advertised by NVIDIA here). I will try switching back to the Docker flags, though; in my case that's actually a docker-compose configuration.

ehfd (Member) commented Jan 8, 2025

If you wait, I'll edit the scripts to accommodate this use case.

jonlm (Author) commented Jan 9, 2025

Thanks! FYI I switched to doing this in my docker compose file:

    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]

But the exact same problem happened. This must end up setting an NVIDIA_VISIBLE_DEVICES environment variable anyway, because the container still failed when calling nvidia-smi --id=1 ....
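For anyone checking the same thing, a quick way to see what the runtime actually exposed (container name is a placeholder):

    # See which NVIDIA_* variables ended up in the container's environment, and
    # how the single visible GPU is numbered inside it (it shows up as index 0).
    docker exec <container> env | grep NVIDIA
    docker exec <container> nvidia-smi -L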
