Use NVML to Acquire GPU UUIDs #552
Conversation
devices = nvmlDeviceGetCount()
for device_id in range(devices):
I noticed the previous variable name was cuda_visible_gpus, so just wanted to mention this: it might not be an issue for anyone, but nvml doesn't respect CUDA_VISIBLE_DEVICES. So if you have any tests that exploit this env var for isolation inside the container (e.g. CUDA_VISIBLE_DEVICES=1,3,5), they likely won't work as expected. Isolating GPUs in a container through docker run --gpus '"device=1,3,5"' ... will still work as expected, though.
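A quick way to see the mismatch (just a sketch, assuming pynvml and numba are installed): NVML enumerates every physical GPU on the node, while the CUDA runtime only sees the devices listed in CUDA_VISIBLE_DEVICES:

# Run with e.g. CUDA_VISIBLE_DEVICES=1,3,5 exported before any CUDA work starts.
from pynvml import nvmlInit, nvmlShutdown, nvmlDeviceGetCount
import numba.cuda

nvmlInit()
try:
    print("NVML device count:", nvmlDeviceGetCount())            # all physical GPUs
    print("CUDA device count:", len(numba.cuda.list_devices()))  # only the visible ones
finally:
    nvmlShutdown()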
the function name is get_cuda_visible_gpus, so we should honor that. Is there a way to do so? Can we read that env variable here to filter the gpus?
Ryan, thanks for looking at this! I've added in logic to check for CUDA_VISIBLE_DEVICES and map them to the device ids.
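Roughly along these lines (a sketch of the approach, not necessarily the exact diff; the pynvml calls are the standard ones):

import os
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetUUID)

nvmlInit()
try:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    # fall back to every physical device when the variable is unset
    device_ids = [int(i) for i in visible.split(",")] if visible else range(nvmlDeviceGetCount())
    # note: older pynvml releases return the UUID as bytes rather than str
    uuids = [nvmlDeviceGetUUID(nvmlDeviceGetHandleByIndex(i)) for i in device_ids]
finally:
    nvmlShutdown()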
@@ -18,7 +18,10 @@
import model_analyzer.monitor.dcgm.dcgm_structs as structs
from model_analyzer.model_analyzer_exceptions import TritonModelAnalyzerException

from pynvml import *

import numba.cuda
can this be removed?
No, numba.cuda is still being used
I'm confused. I thought the whole point of this was to remove DCGM, but it doesn't appear that dcgm was used in the function you removed? I do still see it used in init_all_devices() though. Did we change the wrong code?
cuda_id_map = list(
    map(int,
        os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
CUDA_VISIBLE_DEVICES also supports MIG IDs (https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-gi), which are strings and not just ints.
You may be able to just keep it as a list of strings and instead check something like this:
if id in cuda_visible_devices or uuid in cuda_visible_devices:
...
disclaimer: I have no idea what values nvml will return for MIG devices, so you may want to just make this a FIXME or something to support MIG devices as needed based on priority.
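Spelled out a bit more, the filtering could look something like this (untested sketch; what NVML actually reports for MIG devices would still need to be confirmed):

import os
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetUUID)

# keep the entries as strings so MIG-style IDs survive untouched
cuda_visible_devices = [
    v for v in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if v
]

nvmlInit()
try:
    visible_uuids = []
    for device_id in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(device_id)
        uuid = nvmlDeviceGetUUID(handle)
        if isinstance(uuid, bytes):  # older pynvml returns bytes
            uuid = uuid.decode("utf-8")
        # an empty list means no restriction; otherwise match by index or UUID
        if not cuda_visible_devices or str(device_id) in cuda_visible_devices \
                or uuid in cuda_visible_devices:
            visible_uuids.append(uuid)
finally:
    nvmlShutdown()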
The branch was updated from c2ef87f to 7654fa2.
Replaces DCGM calls with ones from NVML