Use NVML to Acquire GPU UUIDs #552
Conversation
devices = nvmlDeviceGetCount()
for device_id in range(devices):
I noticed the previous variable name was cuda_visible_gpus, so just wanted to mention this: it might not be an issue for anyone, but nvml doesn't respect CUDA_VISIBLE_DEVICES. So if you have any tests that exploit this env var for isolation inside the container (e.g. CUDA_VISIBLE_DEVICES=1,3,5), they likely won't work as expected. Isolating GPUs in a container through docker run --gpus '"device=1,3,5"' ... will still work as expected, though.
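A quick way to see the mismatch (just a sketch, assuming pynvml and numba are installed): NVML enumerates every physical GPU on the node, while the CUDA runtime only sees the devices listed in CUDA_VISIBLE_DEVICES:

# Run with e.g. CUDA_VISIBLE_DEVICES=1,3,5 exported before any CUDA work starts.
from pynvml import nvmlInit, nvmlShutdown, nvmlDeviceGetCount
import numba.cuda

nvmlInit()
try:
    print("NVML device count:", nvmlDeviceGetCount())            # all physical GPUs
    print("CUDA device count:", len(numba.cuda.list_devices()))  # only the visible ones
finally:
    nvmlShutdown()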
the function name is get_cuda_visible_gpus, so we should honor that. Is there a way to do so? Can we read that env variable here to filter the gpus?
Ryan, thanks for looking at this! I've added in logic to check for CUDA_VISIBLE_DEVICES and map them to the device ids.
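Roughly along these lines (a sketch of the approach, not necessarily the exact diff; the pynvml calls are the standard ones):

import os
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetUUID)

nvmlInit()
try:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    # fall back to every physical device when the variable is unset
    device_ids = [int(i) for i in visible.split(",")] if visible else range(nvmlDeviceGetCount())
    # note: older pynvml releases return the UUID as bytes rather than str
    uuids = [nvmlDeviceGetUUID(nvmlDeviceGetHandleByIndex(i)) for i in device_ids]
finally:
    nvmlShutdown()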
@@ -18,7 +18,10 @@
import model_analyzer.monitor.dcgm.dcgm_structs as structs
from model_analyzer.model_analyzer_exceptions import TritonModelAnalyzerException

from pynvml import *

import numba.cuda
can this be removed?
No, numba.cuda is still being used
I'm confused. I thought the whole point of this was to remove DCGM, but it doesn't appear that dcgm was used in the function you removed? I do still see it used in init_all_devices() though. Did we change the wrong code?
cuda_id_map = list(
    map(int,
        os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
CUDA_VISIBLE_DEVICES also supports MIG IDs (https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-gi), which are strings and not just ints.
You may be able to just keep it as a list of strings and instead check something like this:
if id in cuda_visible_devices or uuid in cuda_visible_devices:
...
disclaimer: I have no idea what values nvml will return for MIG devices, so you may want to just make this a FIXME or something to support MIG devices as needed based on priority.
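Spelled out a bit more, the filtering could look something like this (untested sketch; what NVML actually reports for MIG devices would still need to be confirmed):

import os
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetUUID)

# keep the entries as strings so MIG-style IDs survive untouched
cuda_visible_devices = [
    v for v in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if v
]

nvmlInit()
try:
    visible_uuids = []
    for device_id in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(device_id)
        uuid = nvmlDeviceGetUUID(handle)
        if isinstance(uuid, bytes):  # older pynvml returns bytes
            uuid = uuid.decode("utf-8")
        # an empty list means no restriction; otherwise match by index or UUID
        if not cuda_visible_devices or str(device_id) in cuda_visible_devices \
                or uuid in cuda_visible_devices:
            visible_uuids.append(uuid)
finally:
    nvmlShutdown()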
The branch was updated from c2ef87f to 7654fa2.
Replaces DCGM calls with ones from NVML