Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EBPF] gpu: auto-enable agent check if system-probe gpu_monitoring module is enabled #32521

Merged
merged 2 commits into from
Jan 7, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 16 additions & 2 deletions Dockerfiles/agent/cont-init.d/60-sysprobe-check.sh
pgimalac marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -1,15 +1,29 @@
#!/bin/bash

if grep -Eq '^ *enable_tcp_queue_length *: *true' /etc/datadog-agent/system-probe.yaml || [[ "$DD_SYSTEM_PROBE_CONFIG_ENABLE_TCP_QUEUE_LENGTH" == "true" ]]; then
sysprobe_cfg="/etc/datadog-agent/system-probe.yaml"

if grep -Eq '^ *enable_tcp_queue_length *: *true' $sysprobe_cfg || [[ "$DD_SYSTEM_PROBE_CONFIG_ENABLE_TCP_QUEUE_LENGTH" == "true" ]]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very brittle way to check for a config...
Eg. all the following values are considered as true by Viper: "1", "t", "T", "true", "TRUE", "True"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the assumption here is that the config is set by the operator/helm chart in some reasonably sane way. In any case, if this fails is not critical, it just means that customers have to enable the check manually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it also runs when using the agent container "manually", but I definitely agree that failing to detect some config being set is not a big deal here 👍

if [ -f /etc/datadog-agent/conf.d/tcp_queue_length.d/conf.yaml.example ]; then
mv /etc/datadog-agent/conf.d/tcp_queue_length.d/conf.yaml.example \
/etc/datadog-agent/conf.d/tcp_queue_length.d/conf.yaml.default
fi
fi

if grep -Eq '^ *enable_oom_kill *: *true' /etc/datadog-agent/system-probe.yaml || [[ "$DD_SYSTEM_PROBE_CONFIG_ENABLE_OOM_KILL" == "true" ]]; then
if grep -Eq '^ *enable_oom_kill *: *true' $sysprobe_cfg || [[ "$DD_SYSTEM_PROBE_CONFIG_ENABLE_OOM_KILL" == "true" ]]; then
if [ -f /etc/datadog-agent/conf.d/oom_kill.d/conf.yaml.example ]; then
mv /etc/datadog-agent/conf.d/oom_kill.d/conf.yaml.example \
/etc/datadog-agent/conf.d/oom_kill.d/conf.yaml.default
fi
fi

# Match the key gpu_monitoring.enabled: true using Python's YAML parser, which is included in the base image
# and is more robust than using regexes.
gpu_monitoring_enabled=$(python3 -c "import yaml, sys; data=yaml.safe_load(sys.stdin); print(bool(data.get('gpu_monitoring', {}).get('enabled')))" < $sysprobe_cfg)
Comment on lines +19 to +21
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do that only for gpu_monitoring.enabled and not the other configs in this script ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as mentioned in my other comment Viper is very liberal in what it accepts as a boolean, so this won't work for every value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred to avoid changes to this just in case, it's an implicit config change that might be hard to debug/notice if for some reason the behaviour is changed.


# Note gpu_monitoring_enabled is a Python boolean, so casing is important
if [[ "$gpu_monitoring_enabled" == "True" ]] || [[ "$DD_GPU_MONITORING_ENABLED" == "true" ]]; then
if [ -f /etc/datadog-agent/conf.d/gpu.d/conf.yaml.example ]; then
mv /etc/datadog-agent/conf.d/gpu.d/conf.yaml.example \
/etc/datadog-agent/conf.d/gpu.d/conf.yaml.default
fi
fi
20 changes: 20 additions & 0 deletions cmd/agent/dist/conf.d/gpu.d/conf.yaml.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
init_config:

instances:

-

## @param nvml_library_path - string - optional - default: ""
## Configure an alternative path for the NVML NVIDIA library. Necessary
## if the library is in a location where the agent cannot automatically find it.
#
# nvml_library_path: ""

## @param tags - list of strings following the pattern: "key:value" - optional
## List of tags to attach to every metric, event, and service check emitted by this integration.
##
## Learn more about tagging: https://docs.datadoghq.com/tagging/
#
# tags:
# - <KEY_1>:<VALUE_1>
# - <KEY_2>:<VALUE_2>
Loading