feat: add pci_bus_id label for metrics #326
Example output:
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-96be9fff-fdfd-9b87-88d4-fe5b9a012148",pci_bus_id="00000000:08:00.0",device="nvidia0",modelName="NVIDIA GeForce RTX 3090",Hostname="debian",err_code="0",err_msg="Unknown Error"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-96be9fff-fdfd-9b87-88d4-fe5b9a012148",pci_bus_id="00000000:08:00.0",device="nvidia0",modelName="NVIDIA GeForce RTX 3090",Hostname="debian"} 24268
Thanks for the PR. Can you also add a test or adjust an existing test to account for these changes?
@glowkey Done.
There are a few test failures with this PR when running make test-main:
time="2024-05-29T16:18:47Z" level=info msg="Initializing system entities of type: GPU"
@glowkey Updated. Sorry for being a few days late.
LGTM
Signed-off-by: Garen Fang <[email protected]>
This PR adds a `pci_bus_id` label to metrics to indicate the PCI bus ID of the GPU. Currently, the UUID label is what users rely on to identify a GPU card. In many situations, you may also want to locate a GPU by its PCI bus ID.
For example, cloud service providers pass GPU cards through from bare-metal machines to virtual machines. However, the GPU UUID is known only to the guest system, so the cloud provider cannot tell the user exactly which GPU is broken. All the cloud provider has is the PCI bus ID.
Once the `pci_bus_id` label is added, users can filter metrics by the bus ID sent by the cloud provider and receive alarms if their important tasks are affected.
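To illustrate the filtering use case, here is a minimal Python sketch that selects samples by `pci_bus_id` from exporter output in the Prometheus text format. It is not part of this PR: the helper names `parse_sample` and `filter_by_pci_bus_id` are hypothetical, the sample text is copied from the example output in this conversation, and the parser assumes label values contain no escaped quotes.

```python
import re

# One sample line: metric name, {label set}, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One key="value" pair. Assumption: no escaped quotes inside label values.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (name, labels, value), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def filter_by_pci_bus_id(text, bus_id):
    """Keep only the samples whose pci_bus_id label equals bus_id."""
    samples = (parse_sample(line) for line in text.splitlines())
    return [s for s in samples if s is not None and s[1].get("pci_bus_id") == bus_id]


# Sample text taken from the example output posted in this PR.
EXAMPLE = '''
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-96be9fff-fdfd-9b87-88d4-fe5b9a012148",pci_bus_id="00000000:08:00.0",device="nvidia0",modelName="NVIDIA GeForce RTX 3090",Hostname="debian",err_code="0",err_msg="Unknown Error"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-96be9fff-fdfd-9b87-88d4-fe5b9a012148",pci_bus_id="00000000:08:00.0",device="nvidia0",modelName="NVIDIA GeForce RTX 3090",Hostname="debian"} 24268
'''

if __name__ == "__main__":
    for name, labels, value in filter_by_pci_bus_id(EXAMPLE, "00000000:08:00.0"):
        print(name, labels["pci_bus_id"], value)
```

In practice the same selection would more likely be done with a label matcher in the monitoring system's query language; the sketch only shows that the bus ID reported by the provider is enough to pick out the affected GPU's samples.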