Use NVML to Acquire GPU UUIDs #552

Closed · wants to merge 8 commits

Conversation

@nv-braf (Contributor) commented Oct 19, 2022

Replaces the DCGM calls with ones from NVML.
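For context, NVML-based UUID enumeration might look roughly like the sketch below. The function name `get_gpu_uuids` and the injectable `nvml` parameter are illustrative, not the PR's actual code (the real change lives in `model_analyzer/device/gpu_device_factory.py`):

```python
def get_gpu_uuids(nvml=None):
    """Enumerate the UUID of every GPU NVML can see.

    `nvml` defaults to the real pynvml module; it is injectable so the
    enumeration logic can be exercised on a machine without a GPU.
    """
    if nvml is None:
        import pynvml as nvml  # deferred so the sketch imports cleanly
    nvml.nvmlInit()
    try:
        uuids = []
        for device_id in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(device_id)
            uuid = nvml.nvmlDeviceGetUUID(handle)
            if isinstance(uuid, bytes):  # older pynvml versions return bytes
                uuid = uuid.decode("utf-8")
            uuids.append(uuid)
        return uuids
    finally:
        nvml.nvmlShutdown()  # always release the NVML handle
```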

Comment on lines +233 to +243
devices = nvmlDeviceGetCount()
for device_id in range(devices):
@rmccorm4 commented Oct 19, 2022

I noticed the previous variable name was cuda_visible_gpus, so just wanted to mention this:

Might not be an issue for anyone, but NVML doesn't respect CUDA_VISIBLE_DEVICES. So if you have any tests that rely on this env var for isolation inside the container (e.g. CUDA_VISIBLE_DEVICES=1,3,5), they likely won't work as expected.

Isolating GPUs in a container through docker run --gpus '"device=1,3,5"' ... will work as expected, though.

ref: gpuopenanalytics/pynvml#28
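One way to honor CUDA_VISIBLE_DEVICES on top of NVML's full enumeration is to filter by hand. A sketch, assuming the variable holds integer indices only (`visible_device_ids` is a hypothetical helper name, not code from this PR):

```python
import os

def visible_device_ids(device_count, env=None):
    """Filter NVML's full device enumeration down to what CUDA would see.

    NVML enumerates every GPU on the machine regardless of
    CUDA_VISIBLE_DEVICES, so the filtering has to be done manually.
    Assumes integer indices only (MIG-style string IDs are not handled).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return list(range(device_count))  # unset: all devices visible
    visible = []
    for entry in raw.split(","):
        entry = entry.strip()
        # Skip malformed entries and indices beyond the device count.
        if entry.isdigit() and int(entry) < device_count:
            visible.append(int(entry))
    return visible
```

Note that an unset variable means "all devices", while an empty string leaves CUDA with no devices; the sketch preserves that distinction.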

Collaborator:

The function name is get_cuda_visible_gpus, so we should honor that. Is there a way to do so? Can we read that env variable here to filter the GPUs?

Contributor Author (@nv-braf):

Ryan, thanks for looking at this! I've added in logic to check for CUDA_VISIBLE_DEVICES and map them to the device ids.
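The mapping described here might look roughly like the following sketch (`cuda_to_nvml_index` is a hypothetical name, and only integer entries are assumed). CUDA renumbers the visible devices from zero in the order they appear in the variable, so under CUDA_VISIBLE_DEVICES=1,3 CUDA device 0 is physical GPU 1:

```python
def cuda_to_nvml_index(env_value):
    """Map CUDA runtime ordinals onto physical (NVML) device indices.

    CUDA renumbers the visible devices from zero in the order they
    appear in CUDA_VISIBLE_DEVICES, so "1,3,5" means CUDA device 0 is
    physical GPU 1, CUDA device 1 is physical GPU 3, and so on.
    Assumes integer entries only.
    """
    physical = [int(e) for e in env_value.split(",") if e.strip().isdigit()]
    return {cuda_id: phys for cuda_id, phys in enumerate(physical)}
```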

model_analyzer/device/gpu_device_factory.py (outdated; resolved)
model_analyzer/device/gpu_device_factory.py (resolved)
@@ -18,7 +18,10 @@
import model_analyzer.monitor.dcgm.dcgm_structs as structs
from model_analyzer.model_analyzer_exceptions import TritonModelAnalyzerException

from pynvml import *

import numba.cuda
Collaborator:

Can this be removed?

Contributor Author (@nv-braf):

No, numba.cuda is still being used.

Collaborator:

I'm confused. I thought the whole point of this was to remove DCGM, but it doesn't appear that DCGM was used in the function you removed? I do still see it used in init_all_devices(), though. Did we change the wrong code?

Comment on lines +238 to +240
cuda_id_map = list(
map(int,
os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))


CUDA_VISIBLE_DEVICES also supports MIG IDs (https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-gi), which are strings and not just ints.

You may be able to just keep it as a list of strings and instead check something like this:

if id in cuda_visible_devices or uuid in cuda_visible_devices:
   ...

disclaimer: I have no idea what values nvml will return for MIG devices, so you may want to just make this a FIXME or something to support MIG devices as needed based on priority.
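The string-based check suggested above could be sketched like this (`device_is_visible` is a hypothetical helper name; it assumes NVML reports a plain UUID string per device, which, per the disclaimer, may not hold for MIG devices):

```python
def device_is_visible(index, uuid, cuda_visible_devices):
    """Check one device against CUDA_VISIBLE_DEVICES entries that may be
    integer indices, full GPU UUIDs, or MIG-style string IDs.

    Keeping the entries as strings (rather than casting to int) is what
    lets UUID and MIG identifiers match at all.
    """
    if cuda_visible_devices is None:
        return True  # variable unset: every device is visible
    entries = [e.strip() for e in cuda_visible_devices.split(",") if e.strip()]
    return str(index) in entries or uuid in entries
```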

3 participants