Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system #416

Open
saichanumolu9 opened this issue Nov 13, 2024 · 0 comments
Labels
question Further information is requested

Comments


saichanumolu9 commented Nov 13, 2024

Ask your question

We are using dcgm-exporter to push metrics to Prometheus in a GKE Standard cluster. The DaemonSet (DS) pods are up, but we see the logs below during startup and don't see any metrics in Prometheus.

2024/11/13 22:25:51 maxprocs: Leaving GOMAXPROCS=96: CPU quota undefined
time="2024-11-13T22:25:51Z" level=info msg="Starting dcgm-exporter"
time="2024-11-13T22:25:51Z" level=info msg="Initialized base logger [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmApi.cpp:5247] [{anonymous}::StartEmbeddedV2]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Not changing to a home directory - 'DCGM_HOME_DIR' is not defined in the environment. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmApi.cpp:5260] [{anonymous}::StartEmbeddedV2]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="version:3.3.8;arch:x86_64;buildtype:Release;buildid:43;builddate:2024-09-03;commit:be8d66b4318e1d5d6e31b67759dc924d1bc18681;branch:rel_dcgm_3_3;buildplatform:Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64;;crc:c32a73e1865ecdfa6990a80f79a6dea9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmApi.cpp:5264] [{anonymous}::StartEmbeddedV2]" dcgm_level=INFO
time="2024-11-13T22:25:51Z" level=info msg="Signal 12 is already handled. Nothing to do. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/common/DcgmThread/DcgmThread.cpp:394] [DcgmThread::InstallSignalHandler]" dcgm_level=INFO
time="2024-11-13T22:25:51Z" level=info msg="Cannot load NVML; DCGM will proceed without managing GPUs. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1472] [DcgmHostEngineHandler::LoadNvml]" dcgm_level=ERROR
time="2024-11-13T22:25:51Z" level=info msg="__DCGM_XID_KMSG__ unset. Not loading [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmKmsgReader.cpp:40] [ReadEnvXidAndUpdate]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="__DCGM_TEST_KMSG_FILENAME__ unset. Not loading [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmKmsgReader.cpp:149] [ReadEnvKmsgFilenameAndUpdate]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Set m_forceProfMetricsThroughGpm to 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:942] [DcgmCacheManager::DcgmCacheManager]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Not attaching to GPUs because NVML is not loaded. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:1289] [DcgmCacheManager::AttachGpus]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Got 0 entities from GetAllEntitiesOfEntityGroup() of eg 1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:160] [DcgmGroupManager::AddAllEntitiesToGroup]" dcgm_level=WARN
time="2024-11-13T22:25:51Z" level=info msg="Added GroupId 0 name DCGM_ALL_SUPPORTED_GPUS for connectionId 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:272] [DcgmGroupManager::AddNewGroup]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Entering dcgmModuleIdToName(dcgmModuleId_t id, char const **name) (1, 0x7ffc7efb71f0) [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:921] [dcgmModuleIdToName]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Returning 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:921] [dcgmModuleIdToName]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Initialized logging for module 1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:90] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Constructing NvSwitch Module [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:29] [DcgmNs::DcgmModuleNvSwitch::DcgmModuleNvSwitch]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Could not load NSCQ. dlwrap_attach ret: Can not access a needed shared library (-79): If this system has NvSwitches, please ensure that the package libnvidia-nscq is installed on your system and that the service user has permissions to access it. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:798] [DcgmNs::DcgmNvSwitchManager::AttachToNscq]" dcgm_level=ERROR
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] AttachToNscq() returned -25 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:632] [DcgmNs::DcgmNvSwitchManager::Init]" dcgm_level=ERROR
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Could not initialize switch manager. Ret: DCGM library could not be found [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:34] [DcgmNs::DcgmModuleNvSwitch::DcgmModuleNvSwitch]" dcgm_level=ERROR
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Created thread named \"\" ID 58713664 DcgmThread ptr 0x0x2e21e28 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/common/DcgmThread/DcgmThread.cpp:116] [DcgmThread::Start]" dcgm_level=INFO
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Thread handle 58713664 running [/workspaces/dcgm-rel_dcgm_3_3-postmerge/common/DcgmThread/DcgmThread.cpp:305] [DcgmThread::RunInternal]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Loaded module 1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1876] [DcgmHostEngineHandler::LoadModule]" dcgm_level=INFO
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Rescanning switch states [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:389] [DcgmNs::DcgmModuleNvSwitch::RunOnce]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Reading switch status for all switches [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:970] [DcgmNs::DcgmNvSwitchManager::ReadNvSwitchStatusAllSwitches]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] Not attached to NvSwitches. Aborting [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:975] [DcgmNs::DcgmNvSwitchManager::ReadNvSwitchStatusAllSwitches]" dcgm_level=ERROR
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:393] [DcgmNs::DcgmModuleNvSwitch::RunOnce]" dcgm_level=WARN
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] No fields to update [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:554] [DcgmNs::DcgmNvSwitchManager::UpdateFields]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[NvSwitch]] No fields to update [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:554] [DcgmNs::DcgmNvSwitchManager::UpdateFields]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Got 0 entities from GetAllEntitiesOfEntityGroup() of eg 3 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:160] [DcgmGroupManager::AddAllEntitiesToGroup]" dcgm_level=WARN
time="2024-11-13T22:25:51Z" level=info msg="Added GroupId 1 name DCGM_ALL_SUPPORTED_NVSWITCHES for connectionId 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:272] [DcgmGroupManager::AddNewGroup]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Not watching host engine fields because NVML is not loaded. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1198] [DcgmHostEngineHandler::WatchHostEngineFields]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Created thread named \"cache_mgr_main\" ID 50320960 DcgmThread ptr 0x0x2e0ad10 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/common/DcgmThread/DcgmThread.cpp:116] [DcgmThread::Start]" dcgm_level=INFO
time="2024-11-13T22:25:51Z" level=info msg="Thread handle 50320960 running [/workspaces/dcgm-rel_dcgm_3_3-postmerge/common/DcgmThread/DcgmThread.cpp:305] [DcgmThread::RunInternal]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Cache manager update thread starting [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6123] [DcgmCacheManager::run]" dcgm_level=INFO
time="2024-11-13T22:25:51Z" level=info msg="Waited 100 usec for the cache manager thread to start. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2340] [DcgmCacheManager::Start]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="DoOneUpdateAllFields returned 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2461] [DcgmCacheManager::UpdateAllFields]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="DCGM successfully initialized!"
time="2024-11-13T22:25:51Z" level=info msg="dcgmStartEmbedded(): Embedded host engine started [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmApi.cpp:5288] [{anonymous}::StartEmbeddedV2]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Entering dcgmProfGetSupportedMetricGroups(dcgmHandle_t pDcgmHandle, dcgmProfGetMetricGroups_t *metricGroups) (2147483647, 0xc000048b00) [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:873] [dcgmProfGetSupportedMetricGroups]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="GPM cannot be used: NVML is not loaded [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3174] [DcgmCacheManager::EntitySupportsGpm]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="gpuId 0 was not a GPM GPU [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/core/DcgmModuleCore.cpp:1860] [DcgmModuleCore::ProcessProfGetMetricGroups]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Entering dcgmModuleIdToName(dcgmModuleId_t id, char const **name) (8, 0x7ffc7efb7950) [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:921] [dcgmModuleIdToName]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="Returning 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:921] [dcgmModuleIdToName]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[Profiling]] Initialized logging for module 8 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:90] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]" dcgm_level=DEBUG
time="2024-11-13T22:25:51Z" level=info msg="[[Profiling]] __DCGM_PROF_NO_SKU_CHECK was NOT set. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:582] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReadEnvironmentalVariables]" dcgm_level=DEBUG
time="2024-11-13T22:25:52Z" level=info msg="[[Profiling]] NVPW_InitializeTarget() was successful. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1365] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]" dcgm_level=DEBUG
time="2024-11-13T22:25:52Z" level=info msg="[[Profiling]] NVPW_DCGM_LoadDriver returned1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1366] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]" dcgm_level=ERROR
time="2024-11-13T22:25:52Z" level=info msg="[[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:502] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]" dcgm_level=ERROR
time="2024-11-13T22:25:52Z" level=info msg="[[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]" dcgm_level=ERROR
time="2024-11-13T22:25:52Z" level=info msg="Failed to load module 8 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1871] [DcgmHostEngineHandler::LoadModule]" dcgm_level=ERROR
time="2024-11-13T22:25:52Z" level=info msg="Core module subcommand 51 returned: This request is serviced by a module of DCGM that is not currently loaded [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/core/DcgmModuleCore.cpp:253] [DcgmModuleCore::ProcessMessage]" dcgm_level=ERROR
time="2024-11-13T22:25:52Z" level=info msg="Returning -33 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:873] [dcgmProfGetSupportedMetricGroups]" dcgm_level=DEBUG
time="2024-11-13T22:25:52Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-11-13T22:25:52Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-11-13T22:25:52Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-11-13T22:25:52Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-11-13T22:25:52Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-11-13T22:25:52Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-11-13T22:25:52Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-11-13T22:25:52Z" level=info msg="Initializing system entities of type: GPU"
time="2024-11-13T22:25:52Z" level=info msg="Entering dcgmGetAllDevices(dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int *count) (2147483647 0xc000504400 0xc000514290) [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:81] [dcgmGetAllDevices]" dcgm_level=DEBUG
time="2024-11-13T22:25:52Z" level=info msg="Cannot get GPU ids: NVML is not loaded [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:11563] [DcgmCacheManager::GetGpuIds]" dcgm_level=DEBUG
time="2024-11-13T22:25:52Z" level=info msg="Returning -56 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/entry_point.h:81] [dcgmGetAllDevices]" dcgm_level=DEBUG
time="2024-11-13T22:25:52Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-11-13T22:25:52Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-11-13T22:25:52Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-11-13T22:25:52Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-11-13T22:25:52Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-11-13T22:25:52Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-11-13T22:25:52Z" level=info msg="Pipeline starting"
time="2024-11-13T22:25:52Z" level=info msg="Starting webserver"
time="2024-11-13T22:25:52Z" level=info msg="Listening on" address="[::]:9400"
time="2024-11-13T22:25:52Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
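Most of the startup log above is DEBUG noise, so it may help to isolate the entries that DCGM itself tagged `dcgm_level=ERROR`. The following is only a sketch, not part of dcgm-exporter: the helper names `split_entries` and `dcgm_errors` are hypothetical, and it assumes every entry begins with a `time="..."` field as in the log posted here.

```python
import re

# Assumption: each entry starts with time="<ISO timestamp>", as in the log above.
ENTRY_START = re.compile(r'(?=time="\d{4}-\d{2}-\d{2}T)')
DCGM_LEVEL = re.compile(r'dcgm_level=(\w+)')
# msg="..." may contain backslash-escaped quotes (e.g. \"cache_mgr_main\").
MSG = re.compile(r'msg="((?:[^"\\]|\\.)*)"')


def split_entries(raw):
    """Split a log dump (fused or one-entry-per-line) into single entries."""
    flat = " ".join(raw.split())  # collapse line breaks and repeated spaces
    parts = (p.strip() for p in ENTRY_START.split(flat))
    # Drop anything before the first time="..." field (e.g. the maxprocs line).
    return [p for p in parts if p.startswith('time="')]


def dcgm_errors(raw):
    """Return the msg text of every entry tagged dcgm_level=ERROR."""
    errors = []
    for entry in split_entries(raw):
        level = DCGM_LEVEL.search(entry)
        msg = MSG.search(entry)
        if level and msg and level.group(1) == "ERROR":
            errors.append(msg.group(1))
    return errors


if __name__ == "__main__":
    # Three entries copied from the log above, fused onto one line.
    sample = (
        'time="2024-11-13T22:25:51Z" level=info msg="Starting dcgm-exporter" '
        'time="2024-11-13T22:25:51Z" level=info msg="Cannot load NVML; DCGM will '
        'proceed without managing GPUs. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/'
        'dcgmlib/src/DcgmHostEngineHandler.cpp:1472] '
        '[DcgmHostEngineHandler::LoadNvml]" dcgm_level=ERROR '
        'time="2024-11-13T22:25:51Z" level=info msg="Loaded module 1 '
        '[/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/'
        'DcgmHostEngineHandler.cpp:1876] [DcgmHostEngineHandler::LoadModule]" '
        'dcgm_level=INFO'
    )
    for message in dcgm_errors(sample):
        print(message)
```

Run against this log, the first entry such a filter would surface is the `Cannot load NVML` line. The later NvSwitch and Profiling errors may simply follow from that first failure, but the log alone does not confirm this.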

GPU Type:

NVIDIA A100

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             65W /  400W |   78441MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        37      C   /usr/bin/python3                                0MiB |
+-----------------------------------------------------------------------------------------+

NVIDIA L4

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:80:02.0 Off |                    0 |
| N/A   70C    P0             33W /   72W |   20605MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        77      C   /usr/bin/python3                                0MiB |
+-----------------------------------------------------------------------------------------+

@saichanumolu9 saichanumolu9 added the question Further information is requested label Nov 13, 2024