feat: add pci_bus_id label for metrics #326

Merged (1 commit) on Jun 10, 2024

fungaren
Contributor

This PR adds a pci_bus_id label to metrics to indicate the PCI Bus ID of the GPU.

Currently, the UUID label lets users identify a GPU card. For various reasons, you may also want to locate a GPU by its PCI Bus ID.

For example, cloud service providers supply GPU cards from bare-metal machines to virtual machines. However, the GPU UUID is visible only to the guest system, so the cloud provider cannot tell the user exactly which GPU is broken. All the cloud provider has is the PCI Bus ID.

With the pci_bus_id label added, users can filter metrics by the Bus ID sent by the cloud provider and receive alarms if their important tasks are affected.

@fungaren force-pushed the add-pci-bus-id-label branch from f8b2038 to b3def69 on May 21, 2024 08:51
@fungaren
Contributor Author

Example output:

# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-96be9fff-fdfd-9b87-88d4-fe5b9a012148",pci_bus_id="00000000:08:00.0",device="nvidia0",modelName="NVIDIA GeForce RTX 3090",Hostname="debian",err_code="0",err_msg="Unknown Error"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-96be9fff-fdfd-9b87-88d4-fe5b9a012148",pci_bus_id="00000000:08:00.0",device="nvidia0",modelName="NVIDIA GeForce RTX 3090",Hostname="debian"} 24268
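
As an illustration of filtering by the new label, here is a minimal Go sketch that scans Prometheus text-format output and keeps only the samples whose pci_bus_id matches a given Bus ID. It is not part of this PR: the helper names `parseLabels` and `filterByBusID` are hypothetical, the second sample line (gpu="1", bus 09) is made up for the example, and the regular expression only handles the simple label syntax shown above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelPattern matches one name="value" pair inside the braces of a
// Prometheus text-format sample. Backslash-escaped characters in values are tolerated.
var labelPattern = regexp.MustCompile(`([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"`)

// parseLabels extracts the label set from a single sample line such as
// `metric{a="1",b="2"} 3`. It returns nil for comment lines and for
// lines that carry no label braces.
func parseLabels(line string) map[string]string {
	if strings.HasPrefix(line, "#") {
		return nil
	}
	open := strings.Index(line, "{")
	end := strings.LastIndex(line, "}")
	if open < 0 || end < open {
		return nil
	}
	labels := make(map[string]string)
	for _, m := range labelPattern.FindAllStringSubmatch(line[open+1:end], -1) {
		labels[m[1]] = m[2]
	}
	return labels
}

// filterByBusID (hypothetical helper) keeps only the sample lines whose
// pci_bus_id label is present and equal to busID.
func filterByBusID(exposition, busID string) []string {
	var out []string
	for _, line := range strings.Split(exposition, "\n") {
		if v, ok := parseLabels(line)["pci_bus_id"]; ok && v == busID {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// The second sample is invented for illustration only.
	sample := `# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",pci_bus_id="00000000:08:00.0",device="nvidia0"} 24268
DCGM_FI_DEV_FB_FREE{gpu="1",pci_bus_id="00000000:09:00.0",device="nvidia1"} 24000`
	for _, line := range filterByBusID(sample, "00000000:08:00.0") {
		fmt.Println(line)
	}
}
```

In a real deployment the same filtering would more likely be done with a label matcher in the monitoring system's query language; the sketch only shows that the label is enough to single out one GPU.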

@glowkey
Collaborator

glowkey commented May 21, 2024

Thanks for the PR. Can you also add a test or adjust an existing test to account for these changes?

@fungaren
Contributor Author

@glowkey Done.

@glowkey
Collaborator

glowkey commented May 29, 2024

There are a few test failures with this PR when running make test-main on this branch. These will need to be fixed before it can be merged:

make test-main
go test ./... -short
go: downloading github.com/avast/retry-go/v4 v4.5.1
go: downloading github.com/stretchr/objx v0.5.0
? github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter [no test files]
? github.com/NVIDIA/dcgm-exporter/internal/mocks/pkg/os [no test files]
? github.com/NVIDIA/dcgm-exporter/internal/pkg/os [no test files]
? github.com/NVIDIA/dcgm-exporter/internal/pkg/testutils [no test files]
ok github.com/NVIDIA/dcgm-exporter/internal/pkg/logging 0.004s
ok github.com/NVIDIA/dcgm-exporter/internal/pkg/nvmlprovider 0.038s
ok github.com/NVIDIA/dcgm-exporter/pkg/cmd 1.120s
? github.com/NVIDIA/dcgm-exporter/tests/e2e/internal/framework [no test files]
time="2024-05-29T16:18:26Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:27Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:27Z" level=info msg="No Kubelet socket, ignoring"
time="2024-05-29T16:18:28Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:28Z" level=error msg="DCGM_EXP_CLOCK_EVENTS_COUNT collector is disabled"
time="2024-05-29T16:18:28Z" level=error msg="DCGM_EXP_CLOCK_EVENTS_COUNT collector is disabled"
time="2024-05-29T16:18:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:30Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:31Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:31Z" level=info msg="Initializing system entities of type: CPU"
time="2024-05-29T16:18:31Z" level=info msg="Initializing system entities of type: CPU"
--- FAIL: TestDCGMCollector (1.03s)
gpu_collector_test.go:270:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/gpu_collector_test.go:270
    	            	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/gpu_collector_test.go:72
    	Error:      	Should NOT be empty, but was
    	Test:       	TestDCGMCollector
time="2024-05-29T16:18:32Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:33Z" level=info msg="HPC job mapping is enabled and watch for the "/var/run/nvidia/slurm" directory"
time="2024-05-29T16:18:33Z" level=warning msg="HPC mapper: can not get file info for the iamerror file."
time="2024-05-29T16:18:33Z" level=info msg="HPC job mapping is enabled and watch for the "" directory"
time="2024-05-29T16:18:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:33Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:34Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:35Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:36Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:37Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:38Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:39Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:40Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:41Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-29T16:18:42Z" level=info msg="Falling back to metric file '/tmp/prefix-2683233977'"
time="2024-05-29T16:18:42Z" level=info msg="Falling back to metric file '/tmp/prefix-2573303514'"
time="2024-05-29T16:18:42Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=info msg="Falling back to metric file '/tmp/empty.3344998872.csv'"
time="2024-05-29T16:18:43Z" level=warning msg="Cannot create DCGMCollector for dcgm.FE_GPU"
time="2024-05-29T16:18:43Z" level=warning msg="Cannot create DCGMCollector for dcgm.FE_SWITCH"
time="2024-05-29T16:18:43Z" level=warning msg="Cannot create DCGMCollector for dcgm.FE_LINK"
time="2024-05-29T16:18:43Z" level=warning msg="Cannot create DCGMCollector for dcgm.FE_CPU"
time="2024-05-29T16:18:43Z" level=warning msg="Cannot create DCGMCollector for dcgm.FE_CPU_CORE"
time="2024-05-29T16:18:45Z" level=info msg="Initializing system entities of type: GPU"
--- FAIL: TestXIDCollector_Gather_Encode (2.05s)
xid_collector_test.go:224:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:224
    	Error:      	"[name:"gpu"  value:"1" name:"UUID"  value:"GPU-00000000-0000-0000-0000-000000000000" name:"pci_bus_id"  value:"<<<NULL>>>" name:"device"  value:"nvidia1" name:"modelName"  value:"<<<NULL>>>" name:"Hostname"  value:"local-test" name:"DCGM_FI_DRIVER_VERSION"  value:"0" name:"window_size_in_ms"  value:"300000000000" name:"xid"  value:"42"]" should have 8 item(s), but has 9
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:227:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:227
    	Error:      	Not equal:
    	            	expected: "device"
    	            	actual  : "pci_bus_id"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-device
    	            	+pci_bus_id
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:228:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:228
    	Error:      	Not equal:
    	            	expected: "modelName"
    	            	actual  : "device"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-modelName
    	            	+device
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:229:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:229
    	Error:      	Not equal:
    	            	expected: "Hostname"
    	            	actual  : "modelName"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-Hostname
    	            	+modelName
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:230:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:230
    	Error:      	Not equal:
    	            	expected: "DCGM_FI_DRIVER_VERSION"
    	            	actual  : "Hostname"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-DCGM_FI_DRIVER_VERSION
    	            	+Hostname
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:231:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:231
    	Error:      	Not equal:
    	            	expected: "window_size_in_ms"
    	            	actual  : "DCGM_FI_DRIVER_VERSION"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-window_size_in_ms
    	            	+DCGM_FI_DRIVER_VERSION
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:232:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:232
    	Error:      	Not equal:
    	            	expected: "xid"
    	            	actual  : "window_size_in_ms"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-xid
    	            	+window_size_in_ms
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:224:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:224
    	Error:      	"[name:"gpu"  value:"1" name:"UUID"  value:"GPU-00000000-0000-0000-0000-000000000000" name:"pci_bus_id"  value:"<<<NULL>>>" name:"device"  value:"nvidia1" name:"modelName"  value:"<<<NULL>>>" name:"Hostname"  value:"local-test" name:"DCGM_FI_DRIVER_VERSION"  value:"0" name:"window_size_in_ms"  value:"300000000000" name:"xid"  value:"46"]" should have 8 item(s), but has 9
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:227:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:227
    	Error:      	Not equal:
    	            	expected: "device"
    	            	actual  : "pci_bus_id"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-device
    	            	+pci_bus_id
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:228:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:228
    	Error:      	Not equal:
    	            	expected: "modelName"
    	            	actual  : "device"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-modelName
    	            	+device
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:229:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:229
    	Error:      	Not equal:
    	            	expected: "Hostname"
    	            	actual  : "modelName"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-Hostname
    	            	+modelName
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:230:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:230
    	Error:      	Not equal:
    	            	expected: "DCGM_FI_DRIVER_VERSION"
    	            	actual  : "Hostname"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-DCGM_FI_DRIVER_VERSION
    	            	+Hostname
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:231:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:231
    	Error:      	Not equal:
    	            	expected: "window_size_in_ms"
    	            	actual  : "DCGM_FI_DRIVER_VERSION"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-window_size_in_ms
    	            	+DCGM_FI_DRIVER_VERSION
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:232:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:232
    	Error:      	Not equal:
    	            	expected: "xid"
    	            	actual  : "window_size_in_ms"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-xid
    	            	+window_size_in_ms
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:224:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:224
    	Error:      	"[name:"gpu"  value:"1" name:"UUID"  value:"GPU-00000000-0000-0000-0000-000000000000" name:"pci_bus_id"  value:"<<<NULL>>>" name:"device"  value:"nvidia1" name:"modelName"  value:"<<<NULL>>>" name:"Hostname"  value:"local-test" name:"DCGM_FI_DRIVER_VERSION"  value:"0" name:"window_size_in_ms"  value:"300000000000" name:"xid"  value:"19"]" should have 8 item(s), but has 9
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:227:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:227
    	Error:      	Not equal:
    	            	expected: "device"
    	            	actual  : "pci_bus_id"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-device
    	            	+pci_bus_id
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:228:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:228
    	Error:      	Not equal:
    	            	expected: "modelName"
    	            	actual  : "device"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-modelName
    	            	+device
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:229:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:229
    	Error:      	Not equal:
    	            	expected: "Hostname"
    	            	actual  : "modelName"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-Hostname
    	            	+modelName
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:230:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:230
    	Error:      	Not equal:
    	            	expected: "DCGM_FI_DRIVER_VERSION"
    	            	actual  : "Hostname"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-DCGM_FI_DRIVER_VERSION
    	            	+Hostname
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:231:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:231
    	Error:      	Not equal:
    	            	expected: "window_size_in_ms"
    	            	actual  : "DCGM_FI_DRIVER_VERSION"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-window_size_in_ms
    	            	+DCGM_FI_DRIVER_VERSION
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:232:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:232
    	Error:      	Not equal:
    	            	expected: "xid"
    	            	actual  : "window_size_in_ms"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-xid
    	            	+window_size_in_ms
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:224:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:224
    	Error:      	"[name:"gpu"  value:"2" name:"UUID"  value:"GPU-00000000-0000-0000-0000-000000000000" name:"pci_bus_id"  value:"<<<NULL>>>" name:"device"  value:"nvidia2" name:"modelName"  value:"<<<NULL>>>" name:"Hostname"  value:"local-test" name:"DCGM_FI_DRIVER_VERSION"  value:"0" name:"window_size_in_ms"  value:"300000000000" name:"xid"  value:"46"]" should have 8 item(s), but has 9
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:227:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:227
    	Error:      	Not equal:
    	            	expected: "device"
    	            	actual  : "pci_bus_id"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-device
    	            	+pci_bus_id
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:228:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:228
    	Error:      	Not equal:
    	            	expected: "modelName"
    	            	actual  : "device"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-modelName
    	            	+device
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:229:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:229
    	Error:      	Not equal:
    	            	expected: "Hostname"
    	            	actual  : "modelName"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-Hostname
    	            	+modelName
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:230:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:230
    	Error:      	Not equal:
    	            	expected: "DCGM_FI_DRIVER_VERSION"
    	            	actual  : "Hostname"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-DCGM_FI_DRIVER_VERSION
    	            	+Hostname
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:231:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:231
    	Error:      	Not equal:
    	            	expected: "window_size_in_ms"
    	            	actual  : "DCGM_FI_DRIVER_VERSION"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-window_size_in_ms
    	            	+DCGM_FI_DRIVER_VERSION
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:232:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:232
    	Error:      	Not equal:
    	            	expected: "xid"
    	            	actual  : "window_size_in_ms"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-xid
    	            	+window_size_in_ms
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:224:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:224
    	Error:      	"[name:"gpu"  value:"2" name:"UUID"  value:"GPU-00000000-0000-0000-0000-000000000000" name:"pci_bus_id"  value:"<<<NULL>>>" name:"device"  value:"nvidia2" name:"modelName"  value:"<<<NULL>>>" name:"Hostname"  value:"local-test" name:"DCGM_FI_DRIVER_VERSION"  value:"0" name:"window_size_in_ms"  value:"300000000000" name:"xid"  value:"42"]" should have 8 item(s), but has 9
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:227:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:227
    	Error:      	Not equal:
    	            	expected: "device"
    	            	actual  : "pci_bus_id"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-device
    	            	+pci_bus_id
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:228:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:228
    	Error:      	Not equal:
    	            	expected: "modelName"
    	            	actual  : "device"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-modelName
    	            	+device
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:229:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:229
    	Error:      	Not equal:
    	            	expected: "Hostname"
    	            	actual  : "modelName"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-Hostname
    	            	+modelName
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:230:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:230
    	Error:      	Not equal:
    	            	expected: "DCGM_FI_DRIVER_VERSION"
    	            	actual  : "Hostname"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-DCGM_FI_DRIVER_VERSION
    	            	+Hostname
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:231:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:231
    	Error:      	Not equal:
    	            	expected: "window_size_in_ms"
    	            	actual  : "DCGM_FI_DRIVER_VERSION"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-window_size_in_ms
    	            	+DCGM_FI_DRIVER_VERSION
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:232:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:232
    	Error:      	Not equal:
    	            	expected: "xid"
    	            	actual  : "window_size_in_ms"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-xid
    	            	+window_size_in_ms
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:224:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:224
    	Error:      	"[name:"gpu"  value:"3" name:"UUID"  value:"GPU-00000000-0000-0000-0000-000000000000" name:"pci_bus_id"  value:"<<<NULL>>>" name:"device"  value:"nvidia3" name:"modelName"  value:"<<<NULL>>>" name:"Hostname"  value:"local-test" name:"DCGM_FI_DRIVER_VERSION"  value:"0" name:"window_size_in_ms"  value:"300000000000" name:"xid"  value:"42"]" should have 8 item(s), but has 9
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:227:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:227
    	Error:      	Not equal:
    	            	expected: "device"
    	            	actual  : "pci_bus_id"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-device
    	            	+pci_bus_id
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:228:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:228
    	Error:      	Not equal:
    	            	expected: "modelName"
    	            	actual  : "device"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-modelName
    	            	+device
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:229:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:229
    	Error:      	Not equal:
    	            	expected: "Hostname"
    	            	actual  : "modelName"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-Hostname
    	            	+modelName
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:230:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:230
    	Error:      	Not equal:
    	            	expected: "DCGM_FI_DRIVER_VERSION"
    	            	actual  : "Hostname"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-DCGM_FI_DRIVER_VERSION
    	            	+Hostname
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:231:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:231
    	Error:      	Not equal:
    	            	expected: "window_size_in_ms"
    	            	actual  : "DCGM_FI_DRIVER_VERSION"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-window_size_in_ms
    	            	+DCGM_FI_DRIVER_VERSION
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:232:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:232
    	Error:      	Not equal:
    	            	expected: "xid"
    	            	actual  : "window_size_in_ms"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-xid
    	            	+window_size_in_ms
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:224:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:224
    	Error:      	"[name:"gpu"  value:"3" name:"UUID"  value:"GPU-00000000-0000-0000-0000-000000000000" name:"pci_bus_id"  value:"<<<NULL>>>" name:"device"  value:"nvidia3" name:"modelName"  value:"<<<NULL>>>" name:"Hostname"  value:"local-test" name:"DCGM_FI_DRIVER_VERSION"  value:"0" name:"window_size_in_ms"  value:"300000000000" name:"xid"  value:"46"]" should have 8 item(s), but has 9
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:227:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:227
    	Error:      	Not equal:
    	            	expected: "device"
    	            	actual  : "pci_bus_id"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-device
    	            	+pci_bus_id
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:228:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:228
    	Error:      	Not equal:
    	            	expected: "modelName"
    	            	actual  : "device"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-modelName
    	            	+device
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:229:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:229
    	Error:      	Not equal:
    	            	expected: "Hostname"
    	            	actual  : "modelName"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-Hostname
    	            	+modelName
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:230:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:230
    	Error:      	Not equal:
    	            	expected: "DCGM_FI_DRIVER_VERSION"
    	            	actual  : "Hostname"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-DCGM_FI_DRIVER_VERSION
    	            	+Hostname
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:231:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:231
    	Error:      	Not equal:
    	            	expected: "window_size_in_ms"
    	            	actual  : "DCGM_FI_DRIVER_VERSION"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-window_size_in_ms
    	            	+DCGM_FI_DRIVER_VERSION
    	Test:       	TestXIDCollector_Gather_Encode
xid_collector_test.go:232:
    	Error Trace:	/opt/scratch/dcgm-exporter/pkg/dcgmexporter/xid_collector_test.go:232
    	Error:      	Not equal:
    	            	expected: "xid"
    	            	actual  : "window_size_in_ms"

    	            	Diff:
    	            	--- Expected
    	            	+++ Actual
    	            	@@ -1 +1 @@
    	            	-xid
    	            	+window_size_in_ms
    	Test:       	TestXIDCollector_Gather_Encode

time="2024-05-29T16:18:47Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-29T16:18:47Z" level=error msg="DCGM_EXP_XID_ERRORS_COUNT collector is disabled"
time="2024-05-29T16:18:47Z" level=error msg="DCGM_EXP_XID_ERRORS_COUNT collector is disabled"
FAIL
FAIL github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter 21.439s
ok github.com/NVIDIA/dcgm-exporter/pkg/stdout 0.016s
ok github.com/NVIDIA/dcgm-exporter/tests/integration 0.015s
FAIL
make: *** [Makefile:39: test-main] Error 1
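
Judging only from the log above (the test source is not shown here), the TestXIDCollector_Gather_Encode failures look like what happens when labels are asserted by position and count: the new pci_bus_id label makes the slice 9 items long and shifts every later label by one index. One way to make such checks robust to added labels is to look labels up by name. The following Go sketch illustrates the idea; the `Label` type and the `labelsByName` and `missingLabels` helpers are hypothetical stand-ins, not the exporter's actual types or the fix that was merged.

```go
package main

import "fmt"

// Label is a hypothetical stand-in for the name/value pairs printed in the
// failure output; the exporter's real type may differ.
type Label struct {
	Name  string
	Value string
}

// labelsByName indexes labels by name so that assertions do not depend on
// the order in which labels are emitted.
func labelsByName(labels []Label) map[string]string {
	m := make(map[string]string, len(labels))
	for _, l := range labels {
		m[l.Name] = l.Value
	}
	return m
}

// missingLabels returns the names in want that are absent from got.
func missingLabels(got []Label, want []string) []string {
	byName := labelsByName(got)
	var missing []string
	for _, name := range want {
		if _, ok := byName[name]; !ok {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Label set copied from one of the failure messages in the log.
	got := []Label{
		{"gpu", "1"},
		{"UUID", "GPU-00000000-0000-0000-0000-000000000000"},
		{"pci_bus_id", "<<<NULL>>>"},
		{"device", "nvidia1"},
		{"modelName", "<<<NULL>>>"},
		{"Hostname", "local-test"},
		{"DCGM_FI_DRIVER_VERSION", "0"},
		{"window_size_in_ms", "300000000000"},
		{"xid", "42"},
	}
	want := []string{"gpu", "UUID", "pci_bus_id", "device", "modelName", "Hostname", "xid"}

	fmt.Println("missing:", missingLabels(got, want))
	fmt.Println("device:", labelsByName(got)["device"])
}
```

A name-based check still passes when a label is inserted in the middle of the list, whereas an index-based check has to be updated every time the label order changes.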

@fungaren force-pushed the add-pci-bus-id-label branch from be3beaa to 00b1df6 on June 10, 2024 05:13
@fungaren
Contributor Author

@glowkey Updated. Sorry for being a few days late.

Collaborator

@glowkey left a comment

LGTM

@nvvfedorov force-pushed the add-pci-bus-id-label branch from 00b1df6 to 9ee63de on June 10, 2024 20:26
@nvvfedorov merged commit 961ee35 into NVIDIA:main on Jun 10, 2024
1 check passed