Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SnapExporter class and DCGM #319

Merged
merged 18 commits into from
Sep 24, 2024
Merged

Conversation

gabrielcocenza
Copy link
Member

@gabrielcocenza gabrielcocenza commented Sep 18, 2024

  • Added the SnapExporter class that can be used for handle snap packages
  • Added the DCGMExporter that installs the DCGM exporter if detected NVIDIA GPUs
  • BaseExporter is a simple abstract class that has the required methods for an exporter
  • RendarableExporter is the legacy BaseExporter that contains the logic for exporters that are not snaped
  • DCGMExporterStrategy is not used on HWToolHelper. Avoid injecting the charm config and this class will eventually be removed

@gabrielcocenza gabrielcocenza mentioned this pull request Sep 19, 2024
src/service.py Outdated Show resolved Hide resolved
Copy link
Contributor

@aieri aieri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this smells a bit like COU class hierarchies, I generally like it. Hardware observer's code feels to me really hard to follow and overly complex, probably due to the complexities of supporting those binary resources and unpackaged exporters. Going forward we will only have snapped exporters though, so we should be able to remove a ton of complexity. After all, a snap doesn't need much to be handled, probably just installing, configuring (via snap set, not config files), starting/stopping the services (without needing systemd templates), and some way to check the services are alive. From this perspective perhaps the dcgm (and smartctl) snaps can provide a much needed reset.

I have added more specific comments directly in the body of the PR.

src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, most of the change are LGTM.

I still want to keep to strategy design, yes just a wrapper, at this point until we finish the #272 .

src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/charm.py Outdated Show resolved Hide resolved
@gabrielcocenza gabrielcocenza marked this pull request as ready for review September 19, 2024 22:05
@gabrielcocenza gabrielcocenza requested a review from a team as a code owner September 19, 2024 22:05
@gabrielcocenza
Copy link
Member Author

I still want to keep to strategy design, yes just a wrapper

@jneo8 I don't see a strong reason why to add strategy snap exporters. They are working fine this way and if add the strategy, we would need to inject the config to HWToolHelper to instantiate a strategy. Than we would need to map the install from the strategy to the install from the exporter and so on...

@gabrielcocenza gabrielcocenza changed the title Add dcgm Add SnapExporter class and DCGM Sep 19, 2024
@jneo8
Copy link
Contributor

jneo8 commented Sep 20, 2024

@jneo8 I don't see a strong reason why to add strategy snap exporters. They are working fine this way and if add the strategy, we would need to inject the config to HWToolHelper to instantiate a strategy. Than we would need to map the install from the strategy to the install from the exporter and so on...

Please check the source code how SmartExporter use the strategy. You don't have to use HWToolHelper, you can register strategy on the service class directly and use it. The design idea is to separate the responsibilities that the package install/remove logic should be define in the strategy class and service class only handle the service life-cycle.

Copy link
Contributor

@aieri aieri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is looking good but I have a few comments

config.yaml Outdated Show resolved Hide resolved
src/charm.py Outdated Show resolved Hide resolved
src/charm.py Outdated Show resolved Hide resolved
src/service.py Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
tests/unit/test_service.py Outdated Show resolved Hide resolved
tests/unit/test_service.py Outdated Show resolved Hide resolved
tests/unit/test_service.py Outdated Show resolved Hide resolved
@gabrielcocenza
Copy link
Member Author

Still need to manually test to see if it's behaving as expected, but I've included the strategy that @jneo8 asked and another round of review is needed.

I don't fully agree that this is the right approach, but let's discuss that on a major refactor in another PR.

src/hw_tools.py Outdated Show resolved Hide resolved
- added hard coded port to the dcgm exporter
src/service.py Show resolved Hide resolved
src/hw_tools.py Show resolved Hide resolved
Copy link
Contributor

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, after address those comments, it's LGTM.

@gabrielcocenza gabrielcocenza merged commit 0f62edc into canonical:main Sep 24, 2024
9 checks passed
@gabrielcocenza gabrielcocenza deleted the add-dcgm branch September 24, 2024 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants