Stop managing the installation of the NVIDIA driver. #380
Conversation
While we initially provided automatic installation of the NVIDIA driver as a convenience, we then ran into the complexity of dealing with users who want to configure pci-passthrough and/or vgpu, and possibly move between these configurations post-deployment (see canonical#362, canonical#379).

After further discussion, we agreed that deploying a GPU driver is not the responsibility of hardware-observer, but rather of the principal charm that needs to use the GPU (e.g. nova or kubernetes-worker).

This commit therefore drops the functionality of automatically installing the driver and determining whether it has been blacklisted, in favour of a simpler workflow: install DCGM only if a driver is found to be installed and loaded.

Fixes: canonical#379
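For illustration, here is a minimal sketch of the simplified workflow described above: install DCGM only when an NVIDIA driver is already installed and loaded. The `/proc/driver/nvidia/version` check and the `snap install dcgm` call are assumptions for the sketch, not necessarily the charm's actual implementation.

```python
# Minimal sketch (not the charm's actual code): gate DCGM installation on the
# presence of a loaded NVIDIA driver.
import subprocess
from pathlib import Path


def nvidia_driver_loaded() -> bool:
    """Return True if an NVIDIA kernel driver appears to be installed and loaded."""
    # /proc/driver/nvidia/version only exists while the nvidia module is loaded.
    return Path("/proc/driver/nvidia/version").exists()


def maybe_install_dcgm() -> None:
    """Install DCGM only when a driver is already present; otherwise do nothing."""
    if not nvidia_driver_loaded():
        # No driver: skip DCGM, do not block, and do not try to install a driver.
        return
    subprocess.run(["snap", "install", "dcgm"], check=True)
```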
(Need clarification)
I understand your overall approach. However, I have some concerns about the user experience that I feel haven't been fully addressed in this PR:

- If a GPU is detected but no driver is found, should the charm enter a "blocked" status? I think this needs clarification.
- (Optional) I believe there's still value in having hw-observer assist users with GPU driver installation. It could be provided as an optional Juju action to enable automatic driver installation.
No, I don't think we should enter a blocked state regardless of the presence or absence of a GPU. We should deploy DCGM when we see an NVIDIA GPU, and not deploy it / remove it when we don't (the removal part is not automated, TBD, and out of scope for this PR). I don't quite agree with keeping the driver installation codepath in an action, because I think it would make that codepath even more out of place than it currently is. Why would a user use a hardware-observer action to deploy a driver that is needed by a different component? And what benefit would they have over …
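To make the intent concrete, a minimal sketch of the "no blocked state" behaviour in an ops-framework charm might look like the following; the `evaluate_gpu_status` helper and its parameters are hypothetical placeholders, not the charm's real functions.

```python
# Illustrative sketch only: stay in ActiveStatus whether or not a GPU/driver is
# present, and just surface a hint when DCGM could not be set up.
from ops.model import ActiveStatus


def evaluate_gpu_status(unit, has_nvidia_gpu: bool, nvidia_driver_loaded: bool) -> None:
    if has_nvidia_gpu and not nvidia_driver_loaded:
        # GPU present but no driver: hint at the missing driver instead of
        # blocking, since installing it is the principal charm's job.
        unit.status = ActiveStatus("NVIDIA GPU found but no driver loaded; DCGM not installed")
    else:
        unit.status = ActiveStatus()
```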
The key point here is: how does the user realize they need to install the driver? Unlike other hardware tools, the user experience with hw-observer aims to proactively guide users by reminding them of all the necessary manual operations they need to complete.
Thanks. LGTM.
Thanks