fix: Assume 0 MiB of memory is available on a disabled NVIDIA GPU #537

Open

wants to merge 5 commits into mainline from fix-disabled-gpu-memory-detection

Conversation

@cilevitz cilevitz commented Feb 1, 2025

What was the problem/requirement? (What/Why)

As described in #536:

On a system with a disabled NVIDIA GPU that has no memory allocated to it, the worker agent exits during its initialization phase due to a critical error from _get_gpu_memory() trying to parse a non-numeric string as an integer.

What was the solution? (How)

nvidia-smi --query-gpu=memory.total returns a memory value that can be parsed by int() if the GPU is active, or the string "[N/A]" if the GPU is disabled. There are no other mode flags (e.g. display_active, display_mode) that can be used to determine whether a GPU has no memory allocated to it.
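
For illustration only, on a hypothetical host with one active GPU and one disabled GPU, the query might produce output along these lines (the memory value and the --format arguments here are assumptions, not output captured from the affected machine):

```
$ nvidia-smi --query-gpu=memory.total --format=csv,noheader
24576 MiB
[N/A]
```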

Catching the ValueError raised when parsing the "[N/A]" string returned by nvidia-smi --query-gpu=memory.total, instead of letting it propagate upward, allows the worker agent to finish initializing. A GPU that is present in a system but not active contributes zero memory to that system's total GPU memory, so _get_gpu_memory() now returns 0 for any GPU whose memory size cannot be queried.
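
To make the behavior concrete, here is a minimal sketch of how such a parsing helper could look. The name _parse_gpu_memory and the verbose flag come from the review diff below; the logger setup, the --format arguments, and the exact log wording are assumptions for illustration, not the PR's actual code.

```python
import logging
import subprocess
from typing import List

_logger = logging.getLogger(__name__)  # assumed logger; the worker agent has its own logging setup


def _parse_gpu_memory(output: str, verbose: bool = False) -> List[int]:
    """Parse `nvidia-smi --query-gpu=memory.total` output into MiB values, one entry per GPU.

    A disabled GPU reports "[N/A]" instead of a number; it stays in the list
    as 0 MiB instead of aborting worker agent initialization.
    """
    mem_per_gpu: List[int] = []
    for line in output.splitlines():
        try:
            mem_per_gpu.append(int(line.replace("MiB", "").strip()))
        except ValueError:
            # e.g. "[N/A]" for a disabled GPU: keep the GPU in the count, assume 0 MiB.
            if verbose:
                _logger.warning(
                    "Non-numeric GPU memory value %r received from nvidia-smi; assuming 0 MiB.",
                    line.strip(),
                )
            mem_per_gpu.append(0)
    return mem_per_gpu


if __name__ == "__main__":
    # Hypothetical usage mirroring the call site shown in the review diff below.
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"]
    ).decode()
    print(_parse_gpu_memory(output, verbose=True))
```

Appending 0 rather than skipping the entry keeps the per-GPU list aligned with the GPU count, which matches the impact described below.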

What is the impact of this change?

GPUs without allocated memory still count toward the total number of GPUs detected on the machine, but they contribute 0 MiB to the machine's total GPU memory. The worker agent emits a warning in its log when a non-numeric result is received from nvidia-smi.

How was this change tested?

The change was tested on an affected machine; the worker agent now starts up successfully.

Was this change documented?

The docstring was updated and the modified code block includes explanatory comments.

Is this a breaking change?

No.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…ry-gpu=memory.total returns "[N/A]"

Signed-off-by: Daniel Cilevitz <[email protected]>
@erico-aws erico-aws added the response-requested label (A response from the contributor has been requested.) on Feb 6, 2025
@cilevitz cilevitz force-pushed the fix-disabled-gpu-memory-detection branch from e067ad2 to e864a6f on February 8, 2025 00:24
@cilevitz cilevitz marked this pull request as ready for review February 8, 2025 00:29
@cilevitz cilevitz requested a review from a team as a code owner February 8, 2025 00:29
Contributor

@AWS-Samuel AWS-Samuel left a comment

Thanks for discovering this edge case and putting up a fix! I have two bits of feedback. Let me know what you think!

```diff
-    for line in output.splitlines():
-        mem_mib = int(line.replace("MiB", ""))
-        mem_per_gpu.append(mem_mib)
+    mem_per_gpu = _parse_gpu_memory(output)
```

Suggested change:

```diff
-    mem_per_gpu = _parse_gpu_memory(output)
+    mem_per_gpu = _parse_gpu_memory(output, verbose)
```

I think we will want to pass on the verbose option when we call this function

Author

Added verbose parameter in e78b9da.

src/deadline_worker_agent/capabilities.py (comment thread resolved)