Stop managing the installation of the NVIDIA driver. #380

Merged: 2 commits merged into canonical:main from SOLENG-998-no-driver-installation on Dec 19, 2024

Conversation

@aieri (Contributor) commented Dec 18, 2024

While we initially provided automatic installation of the NVIDIA driver
as a convenience, we later ran into the complexity of dealing with users
wanting to configure PCI passthrough and/or vGPU, and possibly move
across these configurations post-deployment (see #362, #379).

After some more discussion, we agreed that deploying a GPU driver is
not the responsibility of hardware-observer, but rather of the principal
charm that needs to use the GPU (e.g. nova or kubernetes-worker).

This commit therefore drops the automatic driver installation and the
blacklist-detection logic in favour of a simpler workflow: install DCGM
only if a driver is found to be installed and loaded.

Fixes: #379
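
For illustration, here is a minimal Python sketch (not the charm's actual code) of the simplified check: only install DCGM when an NVIDIA kernel module is already loaded. The helper names and the assumption that DCGM ships as a "dcgm" snap are illustrative only.

```python
# Minimal sketch with assumed names -- not hardware-observer's actual code.
import subprocess
from pathlib import Path


def nvidia_driver_loaded() -> bool:
    """Return True if an NVIDIA kernel module appears in /proc/modules."""
    modules = Path("/proc/modules").read_text()
    return any(line.startswith("nvidia") for line in modules.splitlines())


def maybe_install_dcgm() -> None:
    """Install DCGM only if a driver is already installed and loaded."""
    if nvidia_driver_loaded():
        # Assumed packaging: DCGM available as a snap named "dcgm".
        subprocess.run(["snap", "install", "dcgm"], check=True)
    # Otherwise do nothing: installing the driver is the principal
    # charm's responsibility (e.g. nova or kubernetes-worker).
```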

@jneo8 (Contributor) left a comment
(Need clarification)
I see your approach overall.

However, I have some concerns about the user experience that I feel haven't been fully addressed in this PR:

  • If a GPU is detected but no driver is found, should the charm enter a "blocked" status? I think this needs clarification.

  • (Optional) I believe there’s still value in having hw-observer assist users with GPU driver installation. It could be provided as an optional Juju action to enable automatic driver installation.

@aieri (Contributor, Author) commented Dec 18, 2024

> (Need clarification) I see your approach overall.
>
> However, I have some concerns about the user experience that I feel haven't been fully addressed in this PR:
>
>   • If a GPU is detected but no driver is found, should the charm enter a "blocked" status? I think this needs clarification.
>   • (Optional) I believe there's still value in having hw-observer assist users with GPU driver installation. It could be provided as an optional Juju action to enable automatic driver installation.

No, I don't think we should enter a blocked state regardless of the presence or absence of a GPU. We should deploy DCGM when we see an NVIDIA GPU, and not deploy it / remove it where we don't (the removal part is not automated yet, TBD, and out of scope for this PR).

I don't quite agree with keeping the driver installation codepath in an action, because I think it would make that codepath even more out of place than it currently is. Why would a user use a hardware-observer action to deploy a driver that is needed by a different component? And what benefit would they have over ubuntu-drivers --gpgpu install?
Generally speaking, actions should not be a 1:1 remapping of CLI tools; otherwise we get all the maintenance effort and none of the gains.

@jneo8 (Contributor) commented Dec 19, 2024

> No, I don't think we should enter a blocked state regardless of the presence or absence of a GPU. We should deploy DCGM when we see an NVIDIA GPU, and not deploy it / remove it where we don't (the removal part is not automated yet, TBD, and out of scope for this PR).

The key point here is: how does the user realize they need to install the driver? Unlike other hardware tools, hw-observer aims to proactively guide users by reminding them of all the manual operations they still need to complete.

@jneo8 (Contributor) left a comment

Thanks. LGTM.

@Deezzir (Contributor) left a comment

Thanks

@aieri merged commit 2206174 into canonical:main on Dec 19, 2024
10 checks passed
@aieri deleted the SOLENG-998-no-driver-installation branch on December 19, 2024 02:59

Successfully merging this pull request may close these issues.

the NVIDIA gpu module blacklisting algorithm is not specific enough