[EBPF] gpu: auto-enable agent check if system-probe gpu_monitoring module is enabled #32521

gjulianm · 2024-12-26T10:55:43Z

What does this PR do?

This PR adds automatic detection of the GPU monitoring feature to the 60-sysprobe-check.sh config script for containerized environments, and enables the corresponding agent-side check.

Motivation

Reduce the possibility of misconfigurations: if system-probe has the GPU monitoring module enabled but the agent gpu check is not enabled, no data will be reported.

This PR also simplifies deployments in k8s environments. As the GPU monitoring feature is deployed in mixed clusters (those where some nodes have GPUs and some don't), we need to override the features based on each node's GPU availability. Enabling/disabling the system-probe module is easy enough with environment variables, but enabling/disabling checks is not as simple. This PR removes the need to manually configure the agent-side check.

Describe how you validated your changes

Validated by manually running the script with different input values.

Possible Drawbacks / Trade-offs

Additional Notes

This PR parses the YAML file with Python's YAML module, which is present in the container image, so no extra dependencies are required. This is a more robust alternative to using regex: while for other settings (e.g. enable_oom_kill) we're looking for a single key, here we're looking for a key nested inside another, so the regex would be more complex and more prone to errors.
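To illustrate the point about nested keys, here is a minimal sketch of how a line-based regex can misfire. The sample config is hypothetical and written to a temporary file; it is not taken from the PR.

```shell
# Hypothetical sample config: gpu_monitoring is disabled, but another
# section happens to contain "enabled: true".
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
network_config:
  enabled: true
gpu_monitoring:
  enabled: false
EOF

# A line-based regex cannot tell which parent key a match belongs to,
# so it reports a false positive for this file.
if grep -Eq '^ *enabled *: *true' "$cfg"; then
    naive_result="True"
else
    naive_result="False"
fi
echo "naive regex says: $naive_result"
rm -f "$cfg"
```

A parser that understands the YAML structure avoids this class of error, since it looks up `enabled` only under `gpu_monitoring`.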

agent-platform-auto-pr · 2024-12-26T11:19:46Z

[Fast Unit Tests Report]

On pipeline 52124222 (CI Visibility). The following jobs did not run any unit tests:

Jobs:

tests_deb-arm64-py3
tests_deb-x64-py3
tests_flavor_dogstatsd_deb-x64
tests_flavor_heroku_deb-x64
tests_flavor_iot_deb-x64
tests_rpm-arm64-py3
tests_rpm-x64-py3
tests_windows-x64

If you modified Go files and expected unit tests to run in these jobs, please double check the job logs. If you think tests should have been executed, reach out to #agent-devx-help

agent-platform-auto-pr · 2024-12-26T11:33:57Z

Uncompressed package size comparison

Comparison with ancestor 1c800ccd263af6b80ff43c74954dae832b32b811

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-amd64-deb	0.00MB	✅	1197.93MB	1197.93MB	140.00MB
datadog-agent-x86_64-rpm	0.00MB	✅	1207.24MB	1207.24MB	140.00MB
datadog-agent-x86_64-suse	0.00MB	✅	1207.24MB	1207.24MB	140.00MB
datadog-agent-arm64-deb	0.00MB	✅	940.31MB	940.31MB	140.00MB
datadog-agent-aarch64-rpm	0.00MB	✅	949.60MB	949.60MB	140.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	79.00MB	79.00MB	10.00MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	79.08MB	79.08MB	10.00MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	79.08MB	79.08MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	56.11MB	56.11MB	10.00MB
datadog-heroku-agent-amd64-deb	0.00MB	✅	506.10MB	506.10MB	70.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.77MB	113.77MB	10.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	113.84MB	113.84MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	113.84MB	113.84MB	10.00MB
datadog-iot-agent-arm64-deb	0.00MB	✅	109.22MB	109.22MB	10.00MB
datadog-iot-agent-aarch64-rpm	0.00MB	✅	109.29MB	109.29MB	10.00MB

Decision

✅ Passed

cit-pr-commenter · 2024-12-26T11:53:06Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 7dacea9c-4a3e-4e24-9391-b62cd60ca528

Baseline: 1c800cc
Comparison: 46501e2
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_logs	% cpu utilization	+2.31	[-0.95, +5.57]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	+0.44	[-0.03, +0.91]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	+0.06	[-0.85, +0.96]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.03	[-0.74, +0.81]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	+0.00	[-0.77, +0.78]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.00	[-0.12, +0.12]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.01	[-0.88, +0.86]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	-0.02	[-0.85, +0.81]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	-0.05	[-0.70, +0.60]	1	Logs
➖	quality_gate_idle	memory utilization	-0.14	[-0.18, -0.11]	1	Logs bounds checks dashboard
➖	quality_gate_idle_all_features	memory utilization	-0.24	[-0.32, -0.16]	1	Logs bounds checks dashboard
➖	file_tree	memory utilization	-0.24	[-0.37, -0.12]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.50	[-1.28, +0.28]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-0.53	[-1.21, +0.15]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	-0.67	[-0.73, -0.61]	1	Logs

Bounds Checks: ❌ Failed

perf	experiment	bounds_check_name	replicates_passed	links
❌	file_to_blackhole_500ms_latency	lost_bytes	9/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.

vboulineau · 2025-01-03T15:41:24Z

Dockerfiles/agent/cont-init.d/60-sysprobe-check.sh

+
+# Match the key gpu_monitoring.enabled: true using Python's YAML parser, which is included in the base image
+# and is more robust than using regexes.
+gpu_monitoring_enabled=$(python -c "import yaml, sys; data=yaml.safe_load(sys.stdin); print(bool(data.get('gpu_monitoring', {}).get('enabled')))" < $sysprobe_cfg)


nit: Not sure if the python alias is here to stay or not (anyway, we need to get rid of these cont-init.d files at some point).

Also, it'd be nice to do it the same way for all of them then.

Changed to explicitly use python3. We can refactor the previous calls in another PR, although if the plan is to remove this file entirely in the future, I'm not sure it's worth the risk of changing it.

pgimalac

Approving because I don't think there is a much better way 😄

pgimalac · 2025-01-06T10:08:53Z

Dockerfiles/agent/cont-init.d/60-sysprobe-check.sh

-if grep -Eq '^ *enable_tcp_queue_length *: *true' /etc/datadog-agent/system-probe.yaml || [[ "$DD_SYSTEM_PROBE_CONFIG_ENABLE_TCP_QUEUE_LENGTH" == "true" ]]; then
+sysprobe_cfg="/etc/datadog-agent/system-probe.yaml"
+
+if grep -Eq '^ *enable_tcp_queue_length *: *true' $sysprobe_cfg || [[ "$DD_SYSTEM_PROBE_CONFIG_ENABLE_TCP_QUEUE_LENGTH" == "true" ]]; then


This is a very brittle way to check for a config...
E.g., all of the following values are considered true by Viper: "1", "t", "T", "true", "TRUE", "True"

I think the assumption here is that the config is set by the operator/helm chart in some reasonably sane way. In any case, if this fails it is not critical; it just means that customers have to enable the check manually.

I think it also runs when using the agent container "manually", but I definitely agree that failing to detect some config being set is not a big deal here 👍
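For reference, a rough sketch of what a more permissive environment-variable check could look like, accepting only the spellings listed in the comment above. The `is_truthy` helper is a hypothetical name and is not part of 60-sysprobe-check.sh.

```shell
# Hypothetical helper (not part of the PR): succeeds for the boolean
# spellings listed in the review comment, fails for anything else.
is_truthy() {
    case "$1" in
        1|t|T|true|TRUE|True) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use with the environment variable from the diff above;
# ${VAR:-} keeps the check safe when the variable is unset.
if is_truthy "${DD_SYSTEM_PROBE_CONFIG_ENABLE_TCP_QUEUE_LENGTH:-}"; then
    echo "tcp_queue_length would be enabled"
fi
```

This only covers the environment-variable half of the condition; the grep on the YAML file would need a similar treatment to accept the same spellings.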

pgimalac · 2025-01-06T10:11:42Z

Dockerfiles/agent/cont-init.d/60-sysprobe-check.sh

+# Match the key gpu_monitoring.enabled: true using Python's YAML parser, which is included in the base image
+# and is more robust than using regexes.
+gpu_monitoring_enabled=$(python3 -c "import yaml, sys; data=yaml.safe_load(sys.stdin); print(bool(data.get('gpu_monitoring', {}).get('enabled')))" < $sysprobe_cfg)


Why do that only for gpu_monitoring.enabled and not the other configs in this script?

Also, as mentioned in my other comment, Viper is very liberal in what it accepts as a boolean, so this won't work for every value.

I preferred to avoid changes to this just in case: it's an implicit config change that might be hard to debug or notice if the behaviour changes for some reason.
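As a side note on the one-liner in the diff above, here is a hedged sketch of a slightly more defensive variant. It assumes, as the PR does, that python3 and the yaml module are available in the image; the function name `gpu_monitoring_enabled_in` is hypothetical. It guards against two cases where the parsed value is `None`: an empty config file, and a bare `gpu_monitoring:` key with no children.

```shell
# Hypothetical wrapper (the name is illustrative, not from the PR).
# Prints "True" only when gpu_monitoring.enabled is set to a truthy value;
# prints "False" if the file is missing, empty, or cannot be parsed.
gpu_monitoring_enabled_in() {
    cfg_path="$1"
    [ -r "$cfg_path" ] || { echo "False"; return 0; }
    python3 -c '
import sys, yaml
data = yaml.safe_load(sys.stdin) or {}      # an empty file parses to None
section = data.get("gpu_monitoring") or {}  # a bare "gpu_monitoring:" is None
print(bool(section.get("enabled")) if isinstance(section, dict) else False)
' < "$cfg_path" 2>/dev/null || echo "False"
}
```

It could be called as `gpu_monitoring_enabled=$(gpu_monitoring_enabled_in "$sysprobe_cfg")`. Falling back to "False" on any failure matches the reasoning in this thread: failing to detect the setting only means the check has to be enabled manually.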

gjulianm · 2025-01-07T11:02:38Z

/merge

dd-devflow · 2025-01-07T11:02:50Z

Devflow running: `/merge`

View all feedbacks in Devflow UI.

2025-01-07 11:02:50 UTC ℹ️ MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

2025-01-07 12:45:42 UTC ℹ️ MergeQueue: merge request added to the queue

The median merge time in main is 35m.

2025-01-07 13:20:59 UTC ℹ️ MergeQueue: This merge request was merged

gjulianm self-assigned this Dec 26, 2024

github-actions bot added team/agent-shared-components short review PR is simple enough to be reviewed quickly labels Dec 26, 2024

gjulianm changed the title ~~[EBPF] gpu: auto-enable agent-size check if system-probe gpu_monitoring module is enabled~~ [EBPF] gpu: auto-enable agent check if system-probe gpu_monitoring module is enabled Jan 3, 2025

gjulianm force-pushed the guillermo.julian/auto-enable-gpu-agent-check branch from 1db0904 to ec443a5 Compare January 3, 2025 10:52

gjulianm added changelog/no-changelog qa/done QA done before merge and regressions are covered by tests labels Jan 3, 2025

Auto-enable agent check if system-probe gpu monitoring is enabled

4ffac72

gjulianm force-pushed the guillermo.julian/auto-enable-gpu-agent-check branch from ec443a5 to 4ffac72 Compare January 3, 2025 11:06

gjulianm marked this pull request as ready for review January 3, 2025 12:37

gjulianm requested review from a team as code owners January 3, 2025 12:37

gjulianm requested a review from pgimalac January 3, 2025 12:37

gjulianm added the ask-review Ask required teams to review this PR label Jan 3, 2025

vboulineau reviewed Jan 3, 2025

github-actions bot added medium review PR review might take time and removed short review PR is simple enough to be reviewed quickly labels Jan 3, 2025

Use python3

46501e2

pgimalac approved these changes Jan 6, 2025

vboulineau approved these changes Jan 7, 2025

dd-mergequeue bot merged commit 1cd59eb into main Jan 7, 2025
233 checks passed

dd-mergequeue bot deleted the guillermo.julian/auto-enable-gpu-agent-check branch January 7, 2025 13:20

github-actions bot added this to the 7.63.0 milestone Jan 7, 2025

mwdd146980 pushed a commit that referenced this pull request Jan 10, 2025

[EBPF] gpu: auto-enable agent check if system-probe gpu_monitoring mo…

8f049fc

…dule is enabled (#32521)
