fix: Assume 0 MiB of memory is available on a disabled NVIDIA GPU #537

Open

wants to merge 5 commits into mainline from fix-disabled-gpu-memory-detection

Conversation

@cilevitz cilevitz commented Feb 1, 2025

What was the problem/requirement? (What/Why)

As described in #536:

On a system with a disabled NVIDIA GPU that has no memory allocated to it, the worker agent exits during its initialization phase due to a critical error from _get_gpu_memory() trying to parse a non-numeric string as an integer.

What was the solution? (How)

nvidia-smi --query-gpu=memory.total returns a memory value that can be parsed by int() if the GPU is active, or the string "[N/A]" if the GPU is disabled. There are no other mode flags (e.g. display_active, display_mode) that can be used to determine whether a GPU has no memory allocated to it.
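
For illustration only, on a hypothetical host with one active GPU and one disabled GPU, the query might produce output along these lines (the memory value and the --format arguments here are assumptions, not output captured from the affected machine):

```
$ nvidia-smi --query-gpu=memory.total --format=csv,noheader
24576 MiB
[N/A]
```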

Catching the ValueError raised when parsing the "[N/A]" string returned by nvidia-smi --query-gpu=memory.total, instead of letting it propagate upward, allows the worker agent to finish initializing. A GPU that is present in a system but not active contributes zero memory to that system's total GPU memory, so _get_gpu_memory() now returns 0 for any GPU whose memory size cannot be queried.
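
To make the behavior concrete, here is a minimal sketch of how such a parsing helper could look. The name _parse_gpu_memory and the verbose flag come from the review diff below; the logger setup, the --format arguments, and the exact log wording are assumptions for illustration, not the PR's actual code.

```python
import logging
import subprocess
from typing import List

_logger = logging.getLogger(__name__)  # assumed logger; the worker agent has its own logging setup


def _parse_gpu_memory(output: str, verbose: bool = False) -> List[int]:
    """Parse `nvidia-smi --query-gpu=memory.total` output into MiB values, one entry per GPU.

    A disabled GPU reports "[N/A]" instead of a number; it stays in the list
    as 0 MiB instead of aborting worker agent initialization.
    """
    mem_per_gpu: List[int] = []
    for line in output.splitlines():
        try:
            mem_per_gpu.append(int(line.replace("MiB", "").strip()))
        except ValueError:
            # e.g. "[N/A]" for a disabled GPU: keep the GPU in the count, assume 0 MiB.
            if verbose:
                _logger.warning(
                    "Non-numeric GPU memory value %r received from nvidia-smi; assuming 0 MiB.",
                    line.strip(),
                )
            mem_per_gpu.append(0)
    return mem_per_gpu


if __name__ == "__main__":
    # Hypothetical usage mirroring the call site shown in the review diff below.
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"]
    ).decode()
    print(_parse_gpu_memory(output, verbose=True))
```

Appending 0 rather than skipping the entry keeps the per-GPU list aligned with the GPU count, which matches the impact described below.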

What is the impact of this change?

GPUs without allocated memory still count toward the total number of GPUs detected on the machine, but they contribute 0 MiB to the machine's total GPU memory. The worker agent emits a warning in its log when a non-numeric result is received from nvidia-smi.

How was this change tested?

The change was tested on an affected machine; the worker agent now starts up successfully.

Was this change documented?

The docstring was updated and the modified code block includes explanatory comments.

Is this a breaking change?

No.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…ry-gpu=memory.total returns "[N/A]"

Signed-off-by: Daniel Cilevitz <[email protected]>
@erico-aws erico-aws added the response-requested label (A response from the contributor has been requested.) on Feb 6, 2025
@cilevitz cilevitz force-pushed the fix-disabled-gpu-memory-detection branch from e067ad2 to e864a6f on February 8, 2025 00:24
@cilevitz cilevitz marked this pull request as ready for review February 8, 2025 00:29
@cilevitz cilevitz requested a review from a team as a code owner February 8, 2025 00:29
Contributor

@AWS-Samuel AWS-Samuel left a comment

Thanks for discovering this edge case and putting up a fix! I have two bits of feedback. Let me know what you think!

```diff
-    for line in output.splitlines():
-        mem_mib = int(line.replace("MiB", ""))
-        mem_per_gpu.append(mem_mib)
+    mem_per_gpu = _parse_gpu_memory(output)
```

Suggested change:

```diff
-    mem_per_gpu = _parse_gpu_memory(output)
+    mem_per_gpu = _parse_gpu_memory(output, verbose)
```

I think we will want to pass on the verbose option when we call this function

Author

Added verbose parameter in e78b9da.

src/deadline_worker_agent/capabilities.py (comment thread resolved)