fix: Assume 0 MiB of memory is available on a disabled NVIDIA GPU #537
base: mainline
Conversation
…ry-gpu=memory.total returns "[N/A]"
Signed-off-by: Daniel Cilevitz <[email protected]>
e067ad2 to e864a6f
Thanks for discovering this edge case and putting up a fix! I have two bits of feedback. Let me know what you think!
-for line in output.splitlines():
-    mem_mib = int(line.replace("MiB", ""))
-    mem_per_gpu.append(mem_mib)
+mem_per_gpu = _parse_gpu_memory(output)
-mem_per_gpu = _parse_gpu_memory(output)
+mem_per_gpu = _parse_gpu_memory(output, verbose)
I think we will want to pass on the verbose option when we call this function.
Added verbose parameter in e78b9da.
Signed-off-by: Daniel Cilevitz <[email protected]>
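For readers following the review thread, here is a rough sketch of how the verbose flag might be threaded through the call path discussed above. Only the names _get_gpu_memory and _parse_gpu_memory come from the diff; the signatures, the subprocess invocation, and the aggregation are assumptions, not the worker agent's actual code.

```python
# Illustrative sketch only, not the worker agent's implementation.
import subprocess
from typing import List


def _parse_gpu_memory(output: str, verbose: bool = False) -> List[int]:
    # Simplified here; the "[N/A]" handling is sketched under
    # "What was the solution?" below.
    return [int(line.replace("MiB", "")) for line in output.splitlines()]


def _get_gpu_memory(verbose: bool = False) -> int:
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
        text=True,
    )
    # Thread the caller's verbose flag through to the parser, per the review comment.
    mem_per_gpu = _parse_gpu_memory(output, verbose)
    return sum(mem_per_gpu)  # aggregation shown as a simple machine total for illustration
```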
What was the problem/requirement? (What/Why)
As described in #536:
On a system with a disabled NVIDIA GPU that has no memory allocated to the GPU, the worker agent exits in its initialization phase due to a critical error from _get_gpu_memory() trying to parse a string as an integer.

What was the solution? (How)
nvidia-smi --query-gpu=memory.total returns a memory value that can be parsed by int() if the GPU is active, or the string "[N/A]" if the GPU is disabled. There are no other mode flags (e.g. display_active, display_mode) that can be used to determine whether a GPU has no memory allocated to it.

Catching the ValueError raised when trying to parse the "[N/A]" string output from nvidia-smi --query-gpu=memory.total, instead of letting it propagate upward, allows the worker agent to finish initializing. A GPU that's present in a system but not active contributes zero memory towards that system's total GPU memory, and therefore _get_gpu_memory() is made to return 0 for a GPU if that GPU's memory size cannot be queried.
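As a concrete illustration of that approach, a minimal sketch of the parsing logic is shown below. The helper name comes from the diff in the review above; the docstring, log message, and exact structure are assumptions, not the literal code in this PR.

```python
import logging
from typing import List

_logger = logging.getLogger(__name__)


def _parse_gpu_memory(output: str, verbose: bool = False) -> List[int]:
    """Parse `nvidia-smi --query-gpu=memory.total` output into per-GPU MiB values.

    A disabled GPU reports "[N/A]", which cannot be converted with int(), so it
    is counted as 0 MiB instead of letting the ValueError abort initialization.
    """
    mem_per_gpu: List[int] = []
    for line in output.splitlines():
        try:
            mem_per_gpu.append(int(line.replace("MiB", "")))
        except ValueError:
            if verbose:
                _logger.warning(
                    "Non-numeric GPU memory value %r returned by nvidia-smi; assuming 0 MiB",
                    line,
                )
            mem_per_gpu.append(0)
    return mem_per_gpu
```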
What is the impact of this change?

GPUs without allocated memory will be part of the total number of GPUs available on the machine, but they contribute 0 MiB to the total amount of GPU memory available on the machine. The worker agent emits a warning in its log that a non-numeric result was received from nvidia-smi.

How was this change tested?

The change was tested on an affected machine, and the worker agent starts up successfully.
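In addition to the manual verification above, a unit test along these lines could pin down the "[N/A]" behavior. The test name and expected values are hypothetical, for illustration only.

```python
# Assumes _parse_gpu_memory (sketched above) is importable from the module under test.
def test_parse_gpu_memory_counts_disabled_gpu_as_zero() -> None:
    # One active GPU plus one disabled GPU that reports "[N/A]".
    output = "24564 MiB\n[N/A]"
    assert _parse_gpu_memory(output, verbose=True) == [24564, 0]
```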
Was this change documented?
The docstring is updated and the code block has comments.
Is this a breaking change?
No.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.