Stop managing the installation of the NVIDIA driver. #380

Merged: 2 commits merged into canonical:main from SOLENG-998-no-driver-installation on Dec 19, 2024

Conversation

@aieri (Contributor) commented Dec 18, 2024

While we initially provided automatic installation of the NVIDIA driver
as a convenience, we later ran into the complexity of dealing with users
wanting to configure PCI passthrough and/or vGPU, and possibly move
across these configurations post-deployment (see #362, #379).

After some more discussion, we agreed that deploying a GPU driver is
not the responsibility of hardware-observer, but rather of the principal
charm that needs to use the GPU (e.g. nova or kubernetes-worker).

This commit therefore drops the automatic driver installation and the
blacklist-detection logic in favour of a simpler workflow: install DCGM
only if a driver is found to be installed and loaded.

Fixes: #379
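
For illustration, here is a minimal Python sketch (not the charm's actual code) of the simplified check: only install DCGM when an NVIDIA kernel module is already loaded. The helper names and the assumption that DCGM ships as a "dcgm" snap are illustrative only.

```python
# Minimal sketch with assumed names -- not hardware-observer's actual code.
import subprocess
from pathlib import Path


def nvidia_driver_loaded() -> bool:
    """Return True if an NVIDIA kernel module appears in /proc/modules."""
    modules = Path("/proc/modules").read_text()
    return any(line.startswith("nvidia") for line in modules.splitlines())


def maybe_install_dcgm() -> None:
    """Install DCGM only if a driver is already installed and loaded."""
    if nvidia_driver_loaded():
        # Assumed packaging: DCGM available as a snap named "dcgm".
        subprocess.run(["snap", "install", "dcgm"], check=True)
    # Otherwise do nothing: installing the driver is the principal
    # charm's responsibility (e.g. nova or kubernetes-worker).
```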

@jneo8 (Contributor) left a comment
(Need clarification)
I see your approach overall.

However, I have some concerns about the user experience that I feel haven't been fully addressed in this PR:

  • If a GPU is detected but no driver is found, should the charm enter a "blocked" status? I think this needs clarification.

  • (Optional) I believe there’s still value in having hw-observer assist users with GPU driver installation. It could be provided as an optional Juju action to enable automatic driver installation.

@aieri (Contributor, Author) commented Dec 18, 2024

> (Need clarification) I see your approach overall.
>
> However, I have some concerns about the user experience that I feel haven't been fully addressed in this PR:
>
>   • If a GPU is detected but no driver is found, should the charm enter a "blocked" status? I think this needs clarification.
>   • (Optional) I believe there's still value in having hw-observer assist users with GPU driver installation. It could be provided as an optional Juju action to enable automatic driver installation.

No, I don't think we should enter a blocked state regardless of the presence or absence of a GPU. We should deploy DCGM when we see an NVIDIA GPU, and not deploy it / remove it where we don't (the removal part is not automated yet, TBD, and out of scope for this PR).

I don't quite agree with keeping the driver installation codepath in an action, because I think it would make that codepath even more out of place than it currently is. Why would a user use a hardware-observer action to deploy a driver that is needed by a different component? And what benefit would they have over ubuntu-drivers --gpgpu install?
Generally speaking, actions should not be a 1:1 remapping of CLI tools; otherwise we get all the maintenance effort and none of the gains.

@jneo8 (Contributor) commented Dec 19, 2024

> No, I don't think we should enter a blocked state regardless of the presence or absence of a GPU. We should deploy DCGM when we see an NVIDIA GPU, and not deploy it / remove it where we don't (the removal part is not automated yet, TBD, and out of scope for this PR).

The key point here is: how does the user realize they need to install the driver? Unlike other hardware tools, hw-observer aims to proactively guide users by reminding them of all the manual operations they still need to complete.

@jneo8 (Contributor) left a comment

Thanks. LGTM.

@Deezzir (Contributor) left a comment

Thanks

@aieri merged commit 2206174 into canonical:main on Dec 19, 2024
10 checks passed
@aieri deleted the SOLENG-998-no-driver-installation branch on December 19, 2024 02:59

Successfully merging this pull request may close these issues.

the NVIDIA gpu module blacklisting algorithm is not specific enough