pkg/dcgmexporter/gpu_collector.go: include an err_msg label in metric DCGM_FI_DEV_XID_ERRORS #309
93f36ee to ad53c4a
@bom-d-van, thank you for your contribution. Can you describe your use case to justify the change?
Hi @nvvfedorov, this makes it easy to generate alarm messages from the metric. For example, we could write a query like this to generate an alarm, and use the err_msg label in the template so that the error message is easy to read.
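The query referred to above is not shown in this thread. As a separate illustration of the templating idea only, here is a minimal Go sketch that renders a readable alarm message from a metric's labels with `text/template`. The template text, the `renderAlert` helper, and the `gpu` label are assumptions made for this example; `err_code` and `err_msg` are the labels named in this PR.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertMsg is a hypothetical message template; it is not taken from the PR.
const alertMsg = `GPU {{ .gpu }} reported XID {{ .err_code }}: {{ .err_msg }}`

// renderAlert is a hypothetical helper that fills the template from a
// metric's label set. Map keys are looked up by name (.gpu, .err_msg, ...).
func renderAlert(labels map[string]string) (string, error) {
	tmpl, err := template.New("alert").Parse(alertMsg)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, labels); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	msg, err := renderAlert(map[string]string{
		"gpu":      "0", // assumption: a label identifying the GPU
		"err_code": "79",
		"err_msg":  "GPU has fallen off the bus",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```

With the label in place, the alarm text can carry the human-readable description directly instead of requiring the reader to look up the numeric XID code.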
/LGTM
@bom-d-van, please sign your commits and squash them into a single one. Then I will be ready to merge the changes.
551817f to d689096
@nvvfedorov This should be done now. Could you take another look? Thanks.
d689096 to 6c7e0c7
…DCGM_FI_DEV_XID_ERRORS

The DCGM_FI_DEV_XID_ERRORS metric reports the XID error code as its value. This commit includes an err_msg label whose value is retrieved from this NVIDIA doc: https://docs.nvidia.com/deploy/xid-errors/#topic_4

pkg/dcgmexporter/gpu_collector.go: include err_code label in metrics for easy alert configs

pkg/dcgmexporter/gpu_collector.go: convert xidErrCodeToText to a slice and adjust the known value sanity check

Signed-off-by: Xiaofan Hu <[email protected]>
Signed-off-by: Vadym Fedorov <[email protected]>
6c7e0c7 to fb0f407
The DCGM_FI_DEV_XID_ERRORS metric reports the XID error code as its value. This commit includes an err_msg label whose value is retrieved from this NVIDIA doc: https://docs.nvidia.com/deploy/xid-errors/#topic_4
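The mapping described above can be sketched roughly as follows. This is not the code from the PR: the name `xidErrCodeToText` and the `err_code` and `err_msg` labels come from the commit messages, while the `xidLabels` helper, the "Unknown Error" fallback, the "No Error" entry for 0, and the small subset of table entries are assumptions made for illustration.

```go
package main

import "fmt"

// xidErrCodeToText is a slice indexed by XID code. Only a few entries are
// filled in here for illustration; unlisted indexes hold the empty string.
var xidErrCodeToText = []string{
	0:  "No Error", // assumption: a value of 0 means no XID was reported
	13: "Graphics Engine Exception",
	31: "GPU memory page fault",
	48: "Double Bit ECC Error",
	79: "GPU has fallen off the bus",
}

// xidLabels is a hypothetical helper that turns the metric value into the
// two extra labels. Codes outside the table, or gaps within it, fall back
// to a generic message (the fallback text is an assumption).
func xidLabels(code int) map[string]string {
	msg := "Unknown Error"
	if code >= 0 && code < len(xidErrCodeToText) && xidErrCodeToText[code] != "" {
		msg = xidErrCodeToText[code]
	}
	return map[string]string{
		"err_code": fmt.Sprint(code),
		"err_msg":  msg,
	}
}

func main() {
	for _, code := range []int{0, 79, 200} {
		l := xidLabels(code)
		fmt.Printf("err_code=%q err_msg=%q\n", l["err_code"], l["err_msg"])
	}
}
```

Using a slice keeps the lookup a simple bounds check plus an index, and the empty-string test covers codes that fall inside the slice but have no documented text.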