Add support for GPU monitoring #1601

Open: gjulianm wants to merge 10 commits into main from guillermo.julian/gpu-monitoring

@gjulianm (Collaborator) commented Jan 3, 2025

What does this PR do?

This PR adds operator support for the GPU monitoring feature, making the changes required to improve the customer experience.

Motivation

Simplify deployment of GPU monitoring.

Additional Notes

This is an initial implementation of the feature. It does not yet support mixed clusters (those where only some nodes have GPUs).

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: v7.60.x for the GPU monitoring feature.
  • Cluster Agent: N/A

Describe your test plan

  1. Deploy the operator in a cluster
  2. Deploy the agent resource with feature.gpu.enabled: yes.
  3. Check that the deployed agent pod has runtimeClassName: nvidia by running kubectl get pod datadog-agent-XXX -o json | jq ".spec.runtimeClassName".
  4. Ensure that DD_GPU_MONITORING_ENABLED is set to true in both the agent and system-probe containers.
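
Steps 3 and 4 of the test plan can also be checked programmatically. The following is a minimal Go sketch, not part of this PR, that parses the output of kubectl get pod datadog-agent-XXX -o json and reports what is missing. The helper names (checkGPUPod, hasEnv), the trimmed-down pod structs, and the container names agent and system-probe are assumptions made for illustration. It only inspects literal env values, not valueFrom references.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod is a trimmed-down view of a Kubernetes Pod (assumption: only the
// fields the test plan inspects are modelled here).
type pod struct {
	Spec struct {
		RuntimeClassName *string     `json:"runtimeClassName"`
		Containers       []container `json:"containers"`
	} `json:"spec"`
}

type container struct {
	Name string   `json:"name"`
	Env  []envVar `json:"env"`
}

type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// checkGPUPod (hypothetical helper) returns the problems found in the pod
// JSON; an empty slice means steps 3 and 4 of the test plan pass.
func checkGPUPod(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, err
	}
	var problems []string
	if p.Spec.RuntimeClassName == nil || *p.Spec.RuntimeClassName != "nvidia" {
		problems = append(problems, "runtimeClassName is not nvidia")
	}
	// Container names are assumed to be "agent" and "system-probe".
	for _, name := range []string{"agent", "system-probe"} {
		if !hasEnv(p.Spec.Containers, name, "DD_GPU_MONITORING_ENABLED", "true") {
			problems = append(problems,
				fmt.Sprintf("container %q lacks DD_GPU_MONITORING_ENABLED=true", name))
		}
	}
	return problems, nil
}

// hasEnv reports whether the named container sets key to value literally.
func hasEnv(cs []container, containerName, key, value string) bool {
	for _, c := range cs {
		if c.Name != containerName {
			continue
		}
		for _, e := range c.Env {
			if e.Name == key && e.Value == value {
				return true
			}
		}
	}
	return false
}

func main() {
	// Sample input: a pod with the runtime class set but no env vars.
	sample := []byte(`{"spec":{"runtimeClassName":"nvidia","containers":[{"name":"agent"},{"name":"system-probe"}]}}`)
	problems, err := checkGPUPod(sample)
	if err != nil {
		fmt.Println("invalid pod JSON:", err)
		return
	}
	for _, p := range problems {
		fmt.Println("FAIL:", p)
	}
}
```

In practice the JSON would come from kubectl rather than an inline sample; the sketch keeps it inline so it runs without a cluster.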

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label

@gjulianm self-assigned this Jan 3, 2025
@gjulianm added the enhancement label Jan 3, 2025
@gjulianm added this to the v1.12.0 milestone Jan 3, 2025
@codecov-commenter commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 84.00000% with 20 lines in your changes missing coverage. Please review.

Project coverage is 49.06%. Comparing base (db00883) to head (8d4c14e).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
...nal/controller/datadogagent/feature/gpu/feature.go 90.09% 9 Missing and 1 partial ⚠️
internal/controller/testutils/agent.go 0.00% 10 Missing ⚠️

@@            Coverage Diff             @@
##             main    #1601      +/-   ##
==========================================
+ Coverage   48.94%   49.06%   +0.11%     
==========================================
  Files         227      235       +8     
  Lines       20636    21983    +1347     
==========================================
+ Hits        10101    10785     +684     
- Misses      10010    10629     +619     
- Partials      525      569      +44     
Flag Coverage Δ
unittests 49.06% <84.00%> (+0.11%) ⬆️

Files with missing lines Coverage Δ
api/datadoghq/v2alpha1/datadogagent_types.go 100.00% <ø> (ø)
internal/controller/datadogagent/controller.go 51.85% <ø> (ø)
...ller/datadogagent/defaults/datadogagent_default.go 91.24% <100.00%> (+0.14%) ⬆️
pkg/testutils/builder.go 91.62% <100.00%> (+0.10%) ⬆️
...nal/controller/datadogagent/feature/gpu/feature.go 90.09% <90.09%> (ø)
internal/controller/testutils/agent.go 0.00% <0.00%> (ø)

... and 20 files with indirect coverage changes

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gjulianm force-pushed the guillermo.julian/gpu-monitoring branch from 6777208 to ce25955 on January 7, 2025 13:40
@gjulianm force-pushed the guillermo.julian/gpu-monitoring branch 4 times, most recently from ce25955 to 60173ad, on January 8, 2025 09:55
@gjulianm force-pushed the guillermo.julian/gpu-monitoring branch from 60173ad to dd0dd9c on January 8, 2025 10:03
@gjulianm marked this pull request as ready for review January 8, 2025 11:06
@gjulianm requested review from a team as code owners January 8, 2025 11:06
@tbavelier changed the milestone from v1.12.0 to v1.13.0 Jan 8, 2025
@buraizu (Contributor) left a comment

Approving with a minor edit requested

docs/configuration.v2alpha1.md (outdated, resolved)
@celenechang (Contributor) left a comment

Looks good overall - left some minor questions/suggestions. Thanks for the helpful comments in the feature.go code!

api/datadoghq/v2alpha1/datadogagent_types.go (outdated, resolved)
Comment on lines 208 to 210
NVIDIADevicesMountPath = "/var/run/nvidia-container-devices/all"
NVIDIADevicesVolumeName = "nvidia-devices"
DevNullPath = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices
Contributor

Since these (including line 83) only seem to be used in the feature/gpu/feature.go code, can they be placed within the feature/gpu directory? e.g. https://github.com/DataDog/datadog-operator/blob/main/internal/controller/datadogagent/feature/kubernetesstatecore/const.go

Collaborator Author

Moved back to gpu/const.go
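
For readers unfamiliar with these constants, the sketch below illustrates how they appear to fit together: a hostPath volume pointing at /dev/null is mounted at the NVIDIA devices path, and the mount itself acts as the signal to the NVIDIA runtime. This is an illustration under stated assumptions, not the code from feature.go. The types hostPathVolume and volumeMount are simplified, hypothetical stand-ins for the Kubernetes volume types, and nvidiaDevicesVolume is a made-up helper name.

```go
package main

import "fmt"

// Constants as quoted in the review thread (lower-cased, as in gpu/const.go).
const (
	nvidiaDevicesMountPath  = "/var/run/nvidia-container-devices/all"
	nvidiaDevicesVolumeName = "nvidia-devices"
	devNullPath             = "/dev/null"
)

// hostPathVolume and volumeMount are hypothetical, simplified stand-ins for
// the Kubernetes Volume (with a hostPath source) and VolumeMount types.
type hostPathVolume struct {
	Name     string
	HostPath string
}

type volumeMount struct {
	Name      string
	MountPath string
}

// nvidiaDevicesVolume (hypothetical helper) builds the volume and the
// matching mount. The mounted content is irrelevant (/dev/null); the
// presence of a mount at nvidiaDevicesMountPath is what is assumed to tell
// the NVIDIA runtime to expose the GPU devices to the container.
func nvidiaDevicesVolume() (hostPathVolume, volumeMount) {
	vol := hostPathVolume{Name: nvidiaDevicesVolumeName, HostPath: devNullPath}
	mount := volumeMount{Name: nvidiaDevicesVolumeName, MountPath: nvidiaDevicesMountPath}
	return vol, mount
}

func main() {
	vol, mount := nvidiaDevicesVolume()
	fmt.Printf("volume %s: hostPath %s\n", vol.Name, vol.HostPath)
	fmt.Printf("mount %s at %s\n", mount.Name, mount.MountPath)
}
```

The volume and the mount share the same name, which is how Kubernetes ties a mount in a container to a volume declared on the pod.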

Comment on lines 5 to 6
- newName: gcr.io/datadoghq/operator
- newTag: 1.11.1
+ newName: 601427279990.dkr.ecr.us-east-1.amazonaws.com/guillermo.julian/sandbox
+ newTag: operator
Contributor

Mind undoing these changes?

Collaborator Author

Oops, sorry. Removed!

Comment on lines 74 to 75
// GPUMonitoringType monitoring feature.
GPUMonitoringType = "gpu"
Contributor

Suggested change
- // GPUMonitoringType monitoring feature.
- GPUMonitoringType = "gpu"
+ // GPUIDType GPU feature.
+ GPUIDType = "gpu"

return &gpuMonitoringFeature{}
}

type gpuMonitoringFeature struct {
Contributor

Nit, but for consistency maybe we could rename this to gpuFeature.

Collaborator Author

Renamed

internal/controller/datadogagent/feature/gpu/feature.go (outdated, resolved)
@@ -78,6 +78,9 @@ const (
KubeServicesAndEndpointsListeners = "kube_services kube_endpoints"
EndpointsChecksConfigProvider = "endpointschecks"
ClusterAndEndpointsConfigProviders = "clusterchecks endpointschecks"

// DefaultGPUMonitoringRuntimeClass default runtime class for GPU pods
DefaultGPUMonitoringRuntimeClass = "nvidia"
Contributor

Mind moving this to feature/gpu/const.go as well?

@@ -82,6 +82,8 @@ type DatadogFeatures struct {
SBOM *SBOMFeatureConfig `json:"sbom,omitempty"`
// ServiceDiscovery
ServiceDiscovery *ServiceDiscoveryFeatureConfig `json:"serviceDiscovery,omitempty"`
// GPU monitoring
GPUMonitoring *GPUMonitoringFeatureConfig `json:"gpu,omitempty"`
Contributor

Sorry I missed this in the first pass.

Suggested change
- GPUMonitoring *GPUMonitoringFeatureConfig `json:"gpu,omitempty"`
+ GPU *GPUFeatureConfig `json:"gpu,omitempty"`

Comment on lines +503 to +504
// GPUMonitoringFeatureConfig contains the GPU monitoring configuration.
type GPUMonitoringFeatureConfig struct {
Contributor

Suggested change
- // GPUMonitoringFeatureConfig contains the GPU monitoring configuration.
- type GPUMonitoringFeatureConfig struct {
+ // GPUFeatureConfig contains the GPU monitoring configuration.
+ type GPUFeatureConfig struct {
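
A note on the two renames suggested above: because the JSON tag stays "gpu", renaming the Go field and type does not change what users write in their manifests. The sketch below illustrates this with a reduced version of the structs. The Enabled field is an assumption inferred from feature.gpu.enabled in the test plan, since the actual fields of GPUFeatureConfig are not shown in this thread, and gpuEnabled is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GPUFeatureConfig contains the GPU monitoring configuration.
// Assumption: an optional Enabled flag, inferred from the test plan.
type GPUFeatureConfig struct {
	Enabled *bool `json:"enabled,omitempty"`
}

// DatadogFeatures is reduced here to the single field under discussion.
type DatadogFeatures struct {
	// GPU monitoring. The JSON key is "gpu" regardless of the Go field name.
	GPU *GPUFeatureConfig `json:"gpu,omitempty"`
}

// gpuEnabled (hypothetical helper) reports whether the features document
// explicitly enables GPU monitoring; a missing section or flag counts as off.
func gpuEnabled(raw []byte) (bool, error) {
	var f DatadogFeatures
	if err := json.Unmarshal(raw, &f); err != nil {
		return false, err
	}
	return f.GPU != nil && f.GPU.Enabled != nil && *f.GPU.Enabled, nil
}

func main() {
	for _, raw := range []string{`{"gpu":{"enabled":true}}`, `{}`} {
		on, err := gpuEnabled([]byte(raw))
		fmt.Println(raw, on, err)
	}
}
```

Using pointers for both the section and the flag lets the code distinguish "not set" from "set to false", which matters when defaults are applied later.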

Comment on lines +8 to +10
const nvidiaDevicesMountPath = "/var/run/nvidia-container-devices/all"
const nvidiaDevicesVolumeName = "nvidia-devices"
const devNullPath = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices
Contributor

Suggested change
- const nvidiaDevicesMountPath = "/var/run/nvidia-container-devices/all"
- const nvidiaDevicesVolumeName = "nvidia-devices"
- const devNullPath = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices
+ const (
+ nvidiaDevicesMountPath = "/var/run/nvidia-container-devices/all"
+ nvidiaDevicesVolumeName = "nvidia-devices"
+ devNullPath = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices
+ )

@@ -71,4 +71,6 @@ const (
DummyIDType = "dummy"
// ServiceDiscoveryType service discovery feature.
ServiceDiscoveryType = "service_discovery"
// GPUIDType monitoring feature.
Contributor

Suggested change
- // GPUIDType monitoring feature.
+ // GPUIDType GPU monitoring feature.

Comment on lines +82 to +83
// DefaultGPUMonitoringRuntimeClass default runtime class for GPU pods
DefaultGPUMonitoringRuntimeClass = "nvidia"
Contributor

Suggested change
- // DefaultGPUMonitoringRuntimeClass default runtime class for GPU pods
- DefaultGPUMonitoringRuntimeClass = "nvidia"
+ // DefaultGPURuntimeClass default runtime class for GPU pods
+ DefaultGPURuntimeClass = "nvidia"

Labels
enhancement New feature or request
5 participants