-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wmeta: add nvml collector #32109
wmeta: add nvml collector #32109
Conversation
3511c39
to
ec6d319
Compare
8c4fcba
to
903ea80
Compare
d667588
to
aa84a91
Compare
Uncompressed package size comparisonComparison with ancestor Diff per package
Decision❌ Failed |
9fc9d6e
to
89ba025
Compare
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 7eae6b9 Optimization Goals: ✅ No significant changes detected
|
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.94 | [+0.23, +1.65] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | +0.72 | [+0.64, +0.80] | 1 | Logs bounds checks dashboard |
➖ | file_tree | memory utilization | +0.67 | [+0.53, +0.81] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | +0.52 | [+0.47, +0.56] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.18 | [-0.59, +0.95] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.03 | [-0.69, +0.74] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.02 | [-0.93, +0.98] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | +0.00 | [-0.85, +0.85] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.01 | [-0.65, +0.63] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.02 | [-0.12, +0.09] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.02 | [-0.92, +0.88] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | -0.19 | [-0.98, +0.60] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | -0.20 | [-0.26, -0.14] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | -0.32 | [-0.78, +0.14] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | -2.93 | [-6.05, +0.19] | 1 | Logs |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
89ba025
to
66a0d1c
Compare
Test changes on VMUse this command from test-infra-definitions to manually test this PR changes on a VM: inv aws.create-vm --pipeline-id=52824148 --os-family=ubuntu Note: This applies to commit 8a226b3 |
c8148b2
to
8f6cf87
Compare
// This product includes software developed at Datadog (https://www.datadoghq.com/). | ||
// Copyright 2024-present Datadog, Inc. | ||
|
||
// Package nvml implements the NVML collector for workloadmeta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this file ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's really needed, I was following the structure of the other collectors that do have this file (and the _nop.go
file too)
log.Infof("Agent did not find NVML library: %v", err) | ||
return | ||
} | ||
|
||
features[NVML] = struct{}{} | ||
log.Infof("Agent found NVML library") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those log line looks like debug information. no ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other detect*
functions log similar information with Info
level, maybe I could downgrade the one for "not finding it" as most installations won't have this library
// CheckLibraryExists checks if a library is available on the system by trying it to | ||
// open with dlopen. It returns an error if the library is not found. This is | ||
// the most direct way to check for a library's presence on Linux, as there are | ||
// multiple sources for paths for library searches, so it's better to use the | ||
// same mechanism that the loader uses. | ||
func CheckLibraryExists(libname string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really the only way ? Could we not use something like ldconfig -p
or similar ?
Loading the entire library in memory to test if it exists seems a bit overkill, no ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not a straightforward way to do this, ldconfig -p
requires parsing a lot of lines so it seems more fragile. Also, if the library is found the most likely scenario is that we will be loading it, so we are not losing that much.
Does the PR require modifying all these go.mod files? |
Apparently yes, due to bringing in the import to |
|
||
event := workloadmeta.CollectorEvent{ | ||
Source: workloadmeta.SourceRuntime, | ||
Type: workloadmeta.EventTypeSet, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious: are there any cases where an Nvidia GPU can be "removed" from the environment or nvml can't detect it anymore requiring an `workloadmeta.EventTypeUnset? In the GPU case it seems impossible to happen but this is the only difference between the other collectors. Otherwise your collector looks good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A GPU could get disconnected (e.g., disabling the PCI device), although that's not normal operation. I can add support for that just in case though.
// Use dlopen to search for the library to avoid importing the go-nvml package here, | ||
// which is 1MB in size and would increase the agent binary size, when we don't really | ||
// need it for anything else. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this comment is not accurate.
Since we build the NVML collector for linux environment, the collector includes go-nvml
library
"github.com/NVIDIA/go-nvml/pkg/nvml" |
There is not an easy way to solve today. Go do not provides ways for adding libraries at runtime base on condition. We could use build flags, but that might be more complex in the end. I think we should update the comment to avoid confusion and leave as it is 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue I found was that importing go-nvml
meant it was imported in a lot of packages, and I think it ended up in other binaries too, the resulting size increase was 1-2MB.
Could you add more description on test/QA. For example,
|
|
||
var events []workloadmeta.CollectorEvent | ||
for i := 0; i < count; i++ { | ||
dev, ret := c.nvmlLib.DeviceGetHandleByIndex(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the permission for core agent an issue? I thought one of your RFC mentioned insufficient permission to access the GPU via nvml lib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only an issue in our staging environments, in any case the permission failure would show up before when trying to start the collector. We haven't seen failures once the NVML library is initialised, as it seems to keep the handles to the devices open.
The collector is started automatically if the NVML library is detected. Added the output of |
Go Package Import DifferencesBaseline: 7eae6b9
|
Gitlab CI Configuration ChangesModified Jobs.on_gpu_or_e2e_changes .on_gpu_or_e2e_changes:
- if: $RUN_E2E_TESTS == "off"
when: never
- if: $CI_COMMIT_BRANCH =~ /^mq-working-branch-/
when: never
- if: $RUN_E2E_TESTS == "on"
when: on_success
- if: $CI_COMMIT_BRANCH == "main"
when: on_success
- if: $CI_COMMIT_BRANCH =~ /^[0-9]+\.[0-9]+\.x$/
when: on_success
- if: $CI_COMMIT_TAG =~ /^[0-9]+\.[0-9]+\.[0-9]+-rc\.[0-9]+$/
when: on_success
- changes:
compare_to: main
paths:
- .gitlab/e2e/e2e.yml
- test/new-e2e/pkg/**/*
- test/new-e2e/go.mod
- flakes.yaml
- changes:
compare_to: main
paths:
- pkg/gpu/**/*
- test/new-e2e/tests/gpu/**/*
- pkg/collector/corechecks/gpu/**/*
+ - comp/core/workloadmeta/collectors/internal/nvml/**/* new-e2e-gpu new-e2e-gpu:
after_script:
- $CI_PROJECT_DIR/tools/ci/junit_upload.sh
artifacts:
expire_in: 2 weeks
paths:
- $E2E_OUTPUT_DIR
- junit-*.tgz
reports:
annotations:
- $EXTERNAL_LINKS_PATH
when: always
before_script:
- mkdir -p $GOPATH/pkg/mod/cache && tar xJf modcache_e2e.tar.xz -C $GOPATH/pkg/mod/cache
|| exit 101
- rm -f modcache_e2e.tar.xz
- mkdir -p ~/.aws
- $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E profile >> ~/.aws/config
|| exit $?
- export AWS_PROFILE=agent-qa-ci
- $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_public_key_rsa > $E2E_AWS_PUBLIC_KEY_PATH
|| exit $?
- touch $E2E_AWS_PRIVATE_KEY_PATH && chmod 600 $E2E_AWS_PRIVATE_KEY_PATH && $CI_PROJECT_DIR/tools/ci/fetch_secret.sh
$AGENT_QA_E2E ssh_key_rsa > $E2E_AWS_PRIVATE_KEY_PATH || exit $?
- $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_public_key_rsa > $E2E_AZURE_PUBLIC_KEY_PATH
|| exit $?
- touch $E2E_AZURE_PRIVATE_KEY_PATH && chmod 600 $E2E_AZURE_PRIVATE_KEY_PATH &&
$CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_key_rsa > $E2E_AZURE_PRIVATE_KEY_PATH
|| exit $?
- $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_public_key_rsa > $E2E_GCP_PUBLIC_KEY_PATH
|| exit $?
- touch $E2E_GCP_PRIVATE_KEY_PATH && chmod 600 $E2E_GCP_PRIVATE_KEY_PATH && $CI_PROJECT_DIR/tools/ci/fetch_secret.sh
$AGENT_QA_E2E ssh_key_rsa > $E2E_GCP_PRIVATE_KEY_PATH || exit $?
- pulumi login "s3://dd-pulumi-state?region=us-east-1&awssdk=v2&profile=$AWS_PROFILE"
- ARM_CLIENT_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE client_id)
|| exit $?; export ARM_CLIENT_ID
- ARM_CLIENT_SECRET=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE token)
|| exit $?; export ARM_CLIENT_SECRET
- ARM_TENANT_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE tenant_id)
|| exit $?; export ARM_TENANT_ID
- ARM_SUBSCRIPTION_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE subscription_id)
|| exit $?; export ARM_SUBSCRIPTION_ID
- $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_GCP credentials_json > ~/gcp-credentials.json
|| exit $?
- export GOOGLE_APPLICATION_CREDENTIALS=~/gcp-credentials.json
- inv -e gitlab.generate-ci-visibility-links --output=$EXTERNAL_LINKS_PATH
image: registry.ddbuild.io/ci/test-infra-definitions/runner$TEST_INFRA_DEFINITIONS_BUILDIMAGES_SUFFIX:$TEST_INFRA_DEFINITIONS_BUILDIMAGES
needs:
- go_e2e_deps
- deploy_deb_testing-a7_x64
rules:
- if: $RUN_E2E_TESTS == "off"
when: never
- if: $CI_COMMIT_BRANCH =~ /^mq-working-branch-/
when: never
- if: $RUN_E2E_TESTS == "on"
when: on_success
- if: $CI_COMMIT_BRANCH == "main"
when: on_success
- if: $CI_COMMIT_BRANCH =~ /^[0-9]+\.[0-9]+\.x$/
when: on_success
- if: $CI_COMMIT_TAG =~ /^[0-9]+\.[0-9]+\.[0-9]+-rc\.[0-9]+$/
when: on_success
- changes:
compare_to: main
paths:
- .gitlab/e2e/e2e.yml
- test/new-e2e/pkg/**/*
- test/new-e2e/go.mod
- flakes.yaml
- changes:
compare_to: main
paths:
- pkg/gpu/**/*
- test/new-e2e/tests/gpu/**/*
- pkg/collector/corechecks/gpu/**/*
+ - comp/core/workloadmeta/collectors/internal/nvml/**/*
- if: $CI_COMMIT_BRANCH =~ /^mq-working-branch-/
when: never
- allow_failure: true
when: manual
script:
- inv -e new-e2e-tests.run --targets $TARGETS -c ddagent:imagePullRegistry=669783387624.dkr.ecr.us-east-1.amazonaws.com
-c ddagent:imagePullUsername=AWS -c ddagent:imagePullPassword=$(aws ecr get-login-password)
--junit-tar junit-${CI_JOB_ID}.tgz ${EXTRA_PARAMS} --test-washer --logs-folder=$E2E_OUTPUT_DIR/logs
--logs-post-processing --logs-post-processing-test-depth=$E2E_LOGS_PROCESSING_TEST_DEPTH
stage: e2e
tags:
- arch:amd64
variables:
E2E_AWS_PRIVATE_KEY_PATH: /tmp/agent-qa-aws-ssh-key
E2E_AWS_PUBLIC_KEY_PATH: /tmp/agent-qa-aws-ssh-key.pub
E2E_AZURE_PRIVATE_KEY_PATH: /tmp/agent-qa-azure-ssh-key
E2E_AZURE_PUBLIC_KEY_PATH: /tmp/agent-qa-azure-ssh-key.pub
E2E_COMMIT_SHA: $CI_COMMIT_SHORT_SHA
E2E_GCP_PRIVATE_KEY_PATH: /tmp/agent-qa-gcp-ssh-key
E2E_GCP_PUBLIC_KEY_PATH: /tmp/agent-qa-gcp-ssh-key.pub
E2E_KEY_PAIR_NAME: datadog-agent-ci-rsa
E2E_LOGS_PROCESSING_TEST_DEPTH: 1
E2E_OUTPUT_DIR: $CI_PROJECT_DIR/e2e-output
E2E_PIPELINE_ID: $CI_PIPELINE_ID
E2E_PULUMI_LOG_LEVEL: 10
EXTERNAL_LINKS_PATH: external_links_$CI_JOB_ID.json
KUBERNETES_CPU_REQUEST: 6
KUBERNETES_MEMORY_LIMIT: 16Gi
KUBERNETES_MEMORY_REQUEST: 12Gi
SHOULD_RUN_IN_FLAKES_FINDER: 'true'
TARGETS: ./tests/gpu
TEAM: ebpf-platform Changes Summary
ℹ️ Diff available in the job log. |
Exception approved for bypassing the package size check, follow-up work is tracked in EBPF-631. I'm going to force-merge the PR. |
What does this PR do?
This PR adds the NVML collector to workloadmeta, so that we collect data about the NVIDIA GPUs present in the system that can be used in other parts of the agent, including the tagger.
Motivation
Uniform tags for GPU-related metrics, centralizing queries of GPU information.
Describe how you validated your changes
Unit tests included for the NVML collector using a mock. E2E tests also added to ensure workloadmeta is correctly populated.
Manual validation was done by starting the agent in a GPU enabled host and verifying with
agent workload-list
that there the expected data was present in the store.Output of
workload-list
:Possible Drawbacks / Trade-offs
Additional Notes
A
CheckLibraryExists
function has been added, that callsdlopen
to check if the NVML library exists. This is done so that we avoid importing NVML inpkg/config/env
for feature detection, as that would add the NVML dependency to almost all agent packages and increase the binary size by almost 1MB.In another PR we will allow a global setting for the NVML library path, as that setting would affect not only this code but also the GPU check.