Add support for GPU monitoring #1601

gjulianm · 2025-01-03T13:10:39Z

What does this PR do?

This PR adds support for the GPU monitoring feature, making all the changes required to improve the customer experience.

Motivation

Simplify deployment of GPU monitoring.

Additional Notes

This is an initial implementation of the feature. It does not support deployment of mixed clusters (those where not all nodes have GPUs).

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

Agent: v7.60.x for the GPU monitoring feature.
Cluster Agent: N/A

Describe your test plan

1. Deploy the operator in a cluster.
2. Deploy the agent resource with `feature.gpu.enabled: yes`.
3. Check that the deployed agent pod has `runtimeClassName: nvidia` with `kubectl get pod datadog-agent-XXX -o json | jq ".spec.runtimeClassName"`.
4. Ensure that `DD_GPU_MONITORING_ENABLED` is set to `true` in both the agent and system-probe containers.
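
The last two checks could also be scripted. The following is a minimal sketch, not part of this PR: it parses the output of `kubectl get pod ... -o json` and reports what is missing. The `podView` type, the `checkGPUPod` helper, and the assumption that the containers are literally named `agent` and `system-probe` are all illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podView is a hypothetical, trimmed-down view of `kubectl get pod -o json`
// output. It keeps only the fields the test plan inspects.
type podView struct {
	Spec struct {
		RuntimeClassName *string `json:"runtimeClassName"`
		Containers       []struct {
			Name string `json:"name"`
			Env  []struct {
				Name  string `json:"name"`
				Value string `json:"value"`
			} `json:"env"`
		} `json:"containers"`
	} `json:"spec"`
}

// checkGPUPod returns one message per failed check from the test plan.
// Container names "agent" and "system-probe" are assumptions.
func checkGPUPod(podJSON []byte) ([]string, error) {
	var pod podView
	if err := json.Unmarshal(podJSON, &pod); err != nil {
		return nil, err
	}
	var problems []string
	if pod.Spec.RuntimeClassName == nil || *pod.Spec.RuntimeClassName != "nvidia" {
		problems = append(problems, "runtimeClassName is not nvidia")
	}
	for _, want := range []string{"agent", "system-probe"} {
		found := false
		for _, c := range pod.Spec.Containers {
			if c.Name != want {
				continue
			}
			for _, e := range c.Env {
				if e.Name == "DD_GPU_MONITORING_ENABLED" && e.Value == "true" {
					found = true
				}
			}
		}
		if !found {
			problems = append(problems,
				fmt.Sprintf("container %q: DD_GPU_MONITORING_ENABLED is not true", want))
		}
	}
	return problems, nil
}

func main() {
	// Sample input standing in for real kubectl output.
	sample := []byte(`{"spec":{"runtimeClassName":"nvidia","containers":[
		{"name":"agent","env":[{"name":"DD_GPU_MONITORING_ENABLED","value":"true"}]},
		{"name":"system-probe","env":[{"name":"DD_GPU_MONITORING_ENABLED","value":"true"}]}]}}`)
	problems, err := checkGPUPod(sample)
	if err != nil {
		fmt.Println("invalid pod JSON:", err)
		return
	}
	fmt.Println("failed checks:", len(problems))
}
```

In practice the pod JSON would be piped in from `kubectl` rather than hard-coded as it is here.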

Checklist

PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
PR has a milestone or the qa/skip-qa label

codecov-commenter · 2025-01-03T13:18:34Z

Codecov Report

Attention: Patch coverage is 84.00000% with 20 lines in your changes missing coverage. Please review.

Project coverage is 49.06%. Comparing base (db00883) to head (8d4c14e).
Report is 12 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...nal/controller/datadogagent/feature/gpu/feature.go | 90.09% | 9 Missing and 1 partial ⚠️ |
| internal/controller/testutils/agent.go | 0.00% | 10 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1601      +/-   ##
==========================================
+ Coverage   48.94%   49.06%   +0.11%     
==========================================
  Files         227      235       +8     
  Lines       20636    21983    +1347     
==========================================
+ Hits        10101    10785     +684     
- Misses      10010    10629     +619     
- Partials      525      569      +44

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `49.06% <84.00%> (+0.11%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| api/datadoghq/v2alpha1/datadogagent_types.go | `100.00% <ø> (ø)` | |
| internal/controller/datadogagent/controller.go | `51.85% <ø> (ø)` | |
| ...ller/datadogagent/defaults/datadogagent_default.go | `91.24% <100.00%> (+0.14%)` | ⬆️ |
| pkg/testutils/builder.go | `91.62% <100.00%> (+0.10%)` | ⬆️ |
| ...nal/controller/datadogagent/feature/gpu/feature.go | `90.09% <90.09%> (ø)` | |
| internal/controller/testutils/agent.go | `0.00% <0.00%> (ø)` | |

... and 20 files with indirect coverage changes

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update db00883...8d4c14e.

buraizu

Approving with a minor edit requested

docs/configuration.v2alpha1.md

celenechang

Looks good overall - left some minor questions/suggestions. Thanks for the helpful comments in the feature.go code!

api/datadoghq/v2alpha1/datadogagent_types.go

celenechang · 2025-01-23T12:52:47Z

api/datadoghq/v2alpha1/const.go

+	NVIDIADevicesMountPath  = "/var/run/nvidia-container-devices/all"
+	NVIDIADevicesVolumeName = "nvidia-devices"
+	DevNullPath             = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices


Since these (including line 83) only seem to be used in the feature/gpu/feature.go code, can they be placed within the feature/gpu directory? e.g. https://github.com/DataDog/datadog-operator/blob/main/internal/controller/datadogagent/feature/kubernetesstatecore/const.go

Moved back to gpu/const.go

celenechang · 2025-01-23T12:56:40Z

config/manager/kustomization.yaml

-  newName: gcr.io/datadoghq/operator
-  newTag: 1.11.1
+  newName: 601427279990.dkr.ecr.us-east-1.amazonaws.com/guillermo.julian/sandbox
+  newTag: operator


Mind undoing these changes?

Oops, sorry. Removed!

celenechang · 2025-01-23T13:06:07Z

internal/controller/datadogagent/feature/ids.go

+	// GPUMonitoringType monitoring feature.
+	GPUMonitoringType = "gpu"


Suggested change

-	// GPUMonitoringType monitoring feature.
-	GPUMonitoringType = "gpu"
+	// GPUIDType GPU feature.
+	GPUIDType = "gpu"

celenechang · 2025-01-23T13:14:51Z

internal/controller/datadogagent/feature/gpu/feature.go

+	return &gpuMonitoringFeature{}
+}
+
+type gpuMonitoringFeature struct {


Nit, but for consistency maybe we could rename to gpuFeature

internal/controller/datadogagent/feature/gpu/feature.go

Co-authored-by: Celene <[email protected]>

celenechang · 2025-01-24T12:19:04Z

api/datadoghq/v2alpha1/const.go

@@ -78,6 +78,9 @@ const (
 	KubeServicesAndEndpointsListeners       = "kube_services kube_endpoints"
 	EndpointsChecksConfigProvider           = "endpointschecks"
 	ClusterAndEndpointsConfigProviders      = "clusterchecks endpointschecks"
+
+	// DefaultGPUMonitoringRuntimeClass default runtime class for GPU pods
+	DefaultGPUMonitoringRuntimeClass = "nvidia"


Mind moving this to feature/gpu/const.go as well?

celenechang · 2025-01-24T12:28:33Z

api/datadoghq/v2alpha1/datadogagent_types.go

@@ -82,6 +82,8 @@ type DatadogFeatures struct {
 	SBOM *SBOMFeatureConfig `json:"sbom,omitempty"`
 	// ServiceDiscovery
 	ServiceDiscovery *ServiceDiscoveryFeatureConfig `json:"serviceDiscovery,omitempty"`
+	// GPU monitoring
+	GPUMonitoring *GPUMonitoringFeatureConfig `json:"gpu,omitempty"`


Sorry I missed this in the first pass.

Suggested change

-	GPUMonitoring *GPUMonitoringFeatureConfig `json:"gpu,omitempty"`
+	GPU *GPUFeatureConfig `json:"gpu,omitempty"`

celenechang · 2025-01-24T12:28:48Z

api/datadoghq/v2alpha1/datadogagent_types.go

+// GPUMonitoringFeatureConfig contains the GPU monitoring configuration.
+type GPUMonitoringFeatureConfig struct {


Suggested change

-// GPUMonitoringFeatureConfig contains the GPU monitoring configuration.
-type GPUMonitoringFeatureConfig struct {
+// GPUFeatureConfig contains the GPU monitoring configuration.
+type GPUFeatureConfig struct {
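
To illustrate why the Go-side rename does not change the user-facing configuration, here is a minimal sketch using trimmed stand-ins for the two types. The `json:"gpu,omitempty"` tag is taken from the diff. The `Enabled` field is an assumption inferred from the `feature.gpu.enabled` setting in the test plan, and `marshalFeatures` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GPUFeatureConfig mirrors the renamed type from the review. The Enabled
// field is an assumption inferred from the test plan, not copied from the PR.
type GPUFeatureConfig struct {
	Enabled *bool `json:"enabled,omitempty"`
}

// DatadogFeatures is a trimmed stand-in that keeps only the GPU field.
// The serialized key stays "gpu" regardless of the Go field name.
type DatadogFeatures struct {
	GPU *GPUFeatureConfig `json:"gpu,omitempty"`
}

// marshalFeatures is a hypothetical helper that serializes the features block.
func marshalFeatures(f DatadogFeatures) (string, error) {
	out, err := json.Marshal(f)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	enabled := true
	for _, f := range []DatadogFeatures{
		{}, // GPU is nil, so omitempty drops the key entirely
		{GPU: &GPUFeatureConfig{Enabled: &enabled}},
	} {
		s, err := marshalFeatures(f)
		if err != nil {
			fmt.Println("marshal failed:", err)
			continue
		}
		fmt.Println(s)
	}
}
```

Because the field is a pointer with `omitempty`, leaving the feature unset produces no `gpu` key at all.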

celenechang · 2025-01-24T12:32:52Z

internal/controller/datadogagent/feature/gpu/const.go

+const nvidiaDevicesMountPath  = "/var/run/nvidia-container-devices/all"
+const nvidiaDevicesVolumeName = "nvidia-devices"
+const devNullPath             = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices


Suggested change

-const nvidiaDevicesMountPath = "/var/run/nvidia-container-devices/all"
-const nvidiaDevicesVolumeName = "nvidia-devices"
-const devNullPath = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices
+const (
+	nvidiaDevicesMountPath = "/var/run/nvidia-container-devices/all"
+	nvidiaDevicesVolumeName = "nvidia-devices"
+	devNullPath = "/dev/null" // used to mount the NVIDIADevicesHostPath to /dev/null in the container, it's just used as a "signal" to the nvidia runtime to use the nvidia devices
+)
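
As a rough sketch of how these three constants fit together, based only on the comment on `devNullPath`: a host-path volume pointing at `/dev/null` is mounted at the NVIDIA devices path, which acts as the signal to the runtime. The `hostPathVolume` and `volumeMount` types and the `nvidiaDevicesVolume` helper are simplified, hypothetical stand-ins for the Kubernetes volume types. This is not the code in feature.go.

```go
package main

import "fmt"

const (
	nvidiaDevicesMountPath  = "/var/run/nvidia-container-devices/all"
	nvidiaDevicesVolumeName = "nvidia-devices"
	devNullPath             = "/dev/null"
)

// hostPathVolume and volumeMount are simplified, hypothetical stand-ins for
// the Kubernetes corev1.Volume and corev1.VolumeMount types.
type hostPathVolume struct {
	Name     string
	HostPath string
}

type volumeMount struct {
	Name      string
	MountPath string
}

// nvidiaDevicesVolume builds the volume/mount pair: the host's /dev/null is
// mounted at the NVIDIA devices path. The mounted content is irrelevant;
// only the presence of the mount matters.
func nvidiaDevicesVolume() (hostPathVolume, volumeMount) {
	vol := hostPathVolume{Name: nvidiaDevicesVolumeName, HostPath: devNullPath}
	mount := volumeMount{Name: nvidiaDevicesVolumeName, MountPath: nvidiaDevicesMountPath}
	return vol, mount
}

func main() {
	vol, mount := nvidiaDevicesVolume()
	fmt.Printf("volume %s: host path %s mounted at %s\n", vol.Name, vol.HostPath, mount.MountPath)
}
```

The volume and the mount share the same name, which is what ties the mount to the volume in a pod spec.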

celenechang · 2025-01-24T12:33:27Z

internal/controller/datadogagent/feature/ids.go

@@ -71,4 +71,6 @@ const (
 	DummyIDType = "dummy"
 	// ServiceDiscoveryType service discovery feature.
 	ServiceDiscoveryType = "service_discovery"
+	// GPUIDType monitoring feature.


Suggested change

-	// GPUIDType monitoring feature.
+	// GPUIDType GPU monitoring feature.

celenechang · 2025-01-24T12:35:20Z

api/datadoghq/v2alpha1/const.go

+	// DefaultGPUMonitoringRuntimeClass default runtime class for GPU pods
+	DefaultGPUMonitoringRuntimeClass = "nvidia"


Suggested change

-	// DefaultGPUMonitoringRuntimeClass default runtime class for GPU pods
-	DefaultGPUMonitoringRuntimeClass = "nvidia"
+	// DefaultGPURuntimeClass default runtime class for GPU pods
+	DefaultGPURuntimeClass = "nvidia"

gjulianm self-assigned this Jan 3, 2025

gjulianm added the enhancement New feature or request label Jan 3, 2025

gjulianm added this to the v1.12.0 milestone Jan 3, 2025

gjulianm force-pushed the guillermo.julian/gpu-monitoring branch from 6777208 to ce25955 Compare January 7, 2025 13:40

gjulianm mentioned this pull request Jan 7, 2025

tagger: handle GPU tags DataDog/datadog-agent#32052

Merged

gjulianm force-pushed the guillermo.julian/gpu-monitoring branch 4 times, most recently from ce25955 to 60173ad Compare January 8, 2025 09:55

gjulianm added 2 commits January 8, 2025 10:03

Add support for GPU feature

9bc1a53

Add tests for runtime class changes

dd0dd9c

gjulianm force-pushed the guillermo.julian/gpu-monitoring branch from 60173ad to dd0dd9c Compare January 8, 2025 10:03

Documentation

7d40ddb

gjulianm marked this pull request as ready for review January 8, 2025 11:06

gjulianm requested review from a team as code owners January 8, 2025 11:06

tbavelier modified the milestones: v1.12.0, v1.13.0 Jan 8, 2025

buraizu approved these changes Jan 8, 2025

docs/configuration.v2alpha1.md (outdated)

Update docs

da4ab24

celenechang reviewed Jan 23, 2025

gjulianm and others added 6 commits January 24, 2025 11:03

Update api/datadoghq/v2alpha1/datadogagent_types.go

f5b5e32

Co-authored-by: Celene <[email protected]>

Remove debug changes

dbf0019

Move const variables to gpu package

0c78f33

GPUMonitoringType -> GPUIDType

3fcc699

Rename gpuMonitoringFeature to gpuFeature

76a3ac3

Apply suggestion

8d4c14e

celenechang reviewed Jan 24, 2025
