The value of the DCGM_FI_DEV_XID_ERRORS field is always printed twice #304

Merged: nvvfedorov merged 2 commits into main from dcgm_fi_dev_xid_errors_duplicates on Apr 3, 2024

nvvfedorov
Collaborator

The DCGM_FI_DEV_XID_ERRORS field value is always printed twice.

Steps to reproduce:

  1. Run DCGM-Exporter.
  2. Collect and check the metrics output:
     curl localhost:9400/metrics

Expected:

# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-aee734e0-1a3d-2715-f7e1-4d9d89dc56f2",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="2g.20gb",GPU_I_ID="3",Hostname="4u4g-0018",DCGM_FI_DRIVER_VERSION="555.23"} 0

Actual:

# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-aee734e0-1a3d-2715-f7e1-4d9d89dc56f2",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="2g.20gb",GPU_I_ID="3",Hostname="4u4g-0018",DCGM_FI_DRIVER_VERSION="555.23"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-aee734e0-1a3d-2715-f7e1-4d9d89dc56f2",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="2g.20gb",GPU_I_ID="3",Hostname="4u4g-0018",DCGM_FI_DRIVER_VERSION="555.23"} 0

nvvfedorov self-assigned this on Apr 3, 2024
Review comment on pkg/cmd/app.go (outdated, resolved)
Signed-off-by: Vadym Fedorov <[email protected]>
nvvfedorov requested a review from glowkey on April 3, 2024 at 17:10
glowkey (Collaborator) left a comment:

LGTM

nvvfedorov merged commit 57fa1c6 into main on Apr 3, 2024
1 check passed
nvvfedorov deleted the dcgm_fi_dev_xid_errors_duplicates branch on April 3, 2024 at 17:14