pkg/dcgmexporter/gpu_collector.go: include an err_msg label in metric DCGM_FI_DEV_XID_ERRORS #309

Merged
merged 3 commits into from
May 9, 2024


bom-d-van
Contributor

The DCGM_FI_DEV_XID_ERRORS metric reports the XID error code as its value. This commit includes an err_msg label whose value is taken from this NVIDIA doc: https://docs.nvidia.com/deploy/xid-errors/#topic_4
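As a rough illustration of the idea (not the code from this PR), the lookup could be sketched as below. The name `xidErrCodeToText` and its slice form come from the commit titles in this thread; the helper `xidErrMsg`, the fallback string, and the abbreviated table contents are assumptions made for the example, and the authoritative descriptions are in the NVIDIA doc linked above.

```go
package main

import "fmt"

// xidErrCodeToText is an abbreviated, illustrative table indexed by XID code.
// Only a few entries are filled in; see the NVIDIA XID documentation for the full list.
var xidErrCodeToText = []string{
	13: "Graphics Engine Exception",
	31: "GPU memory page fault",
	48: "Double Bit ECC Error",
	79: "GPU has fallen off the bus",
}

// unknownXidText is a hypothetical fallback for codes missing from the table.
const unknownXidText = "Unknown Error"

// xidErrMsg returns the text for an XID code, or the fallback when the code
// is negative, beyond the table, or has no entry.
func xidErrMsg(code int) string {
	if code < 0 || code >= len(xidErrCodeToText) || xidErrCodeToText[code] == "" {
		return unknownXidText
	}
	return xidErrCodeToText[code]
}

func main() {
	// Attach the text as an extra label next to the metric's existing labels.
	labels := map[string]string{"gpu": "0"}
	labels["err_msg"] = xidErrMsg(31)
	fmt.Println(labels["err_msg"])
}
```

Indexing a slice by code keeps the lookup cheap, but it needs the explicit bounds check, since an unexpected code would otherwise panic.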

@bom-d-van bom-d-van force-pushed the add-xid-error-label-hint branch 3 times, most recently from 93f36ee to ad53c4a Compare April 8, 2024 03:02
@nvvfedorov
Collaborator

@bom-d-van, thank you for your contribution. Can you describe your use case to justify the change?

@bom-d-van
Contributor Author

Hi @nvvfedorov, this is to make it easy to generate alarm messages from the metric.

For example, we could write a query like this to generate an alarm, and use err_msg in the alert template to make the error message easy to read:

max(DCGM_FI_DEV_XID_ERRORS{err_code=~"4|8|9|12|13|24|30|31|37|38|43|48|54|74|119|140|143"}) by (Hostname, DCGM_FI_DRIVER_VERSION, device, gpu, modelName, err_msg)
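To show how the labels kept by that query might end up in a readable alarm, here is a minimal sketch using Go's `text/template`. The `alert` type, the template wording, and the sample label values are all hypothetical; a real alerting system exposes labels to its templates in its own way.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alert is a hypothetical stand-in for the data an alerting system
// hands to a message template.
type alert struct {
	Labels map[string]string
}

// alarmTemplate is an illustrative message format built from the labels
// that the query groups by.
const alarmTemplate = `GPU {{ .Labels.gpu }} on {{ .Labels.Hostname }}: XID {{ .Labels.err_code }} ({{ .Labels.err_msg }})`

// renderAlarm fills the template with the alert's labels.
func renderAlarm(a alert) (string, error) {
	tmpl, err := template.New("alarm").Parse(alarmTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, a); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	msg, err := renderAlarm(alert{Labels: map[string]string{
		"Hostname": "node-1", // sample values, not real data
		"gpu":      "0",
		"err_code": "79",
		"err_msg":  "GPU has fallen off the bus",
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```

With these sample labels the program prints `GPU 0 on node-1: XID 79 (GPU has fallen off the bus)`. Without err_msg, the message could only carry the numeric code.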

@nvvfedorov nvvfedorov requested a review from glowkey April 24, 2024 15:48
Collaborator

@rohit-arora-dev rohit-arora-dev left a comment

/LGTM

@nvvfedorov
Collaborator

@bom-d-van, please sign your commits and squash them into a single one. Then I will be ready to merge the changes.

@bom-d-van bom-d-van force-pushed the add-xid-error-label-hint branch from 551817f to d689096 Compare May 2, 2024 03:11
@bom-d-van
Contributor Author

@nvvfedorov It should be done now. Could you take another look? Thanks.

@bom-d-van bom-d-van force-pushed the add-xid-error-label-hint branch from d689096 to 6c7e0c7 Compare May 2, 2024 03:16
bom-d-van and others added 3 commits May 9, 2024 12:13
…DCGM_FI_DEV_XID_ERRORS

The DCGM_FI_DEV_XID_ERRORS metric reports xid error code as its value, this commit includes an err_msg
label with value retrieved from this nvidia doc: https://docs.nvidia.com/deploy/xid-errors/#topic_4

pkg/dcgmexporter/gpu_collector.go: include err_code label in metrics for easy alert configs

pkg/dcgmexporter/gpu_collector.go: convert xidErrCodeToText to a slice and adjust the known value sanity check

Signed-off-by: Xiaofan Hu <[email protected]>
Signed-off-by: Vadym Fedorov <[email protected]>
Signed-off-by: Vadym Fedorov <[email protected]>
@nvvfedorov nvvfedorov force-pushed the add-xid-error-label-hint branch from 6c7e0c7 to fb0f407 Compare May 9, 2024 17:14
@nvvfedorov nvvfedorov merged commit 7decfd2 into NVIDIA:main May 9, 2024
1 check passed
4 participants