
GPU scheduler cannot start in windows due to use of nvml diagnostics #5628

Closed
lmeyerov opened this issue Dec 24, 2021 · 4 comments
Labels
needs info Needs further information from the user

Comments


lmeyerov commented Dec 24, 2021

Edit 1: Maybe no longer an issue? There is an attempted workaround for WSL in #5568

Edit 2: I filed a related upstream pynvml issue: gpuopenanalytics/pynvml#42

--

The full issue is in rapidsai/cudf#9955


The parts relevant to dask.distributed:

gpu_metrics = nvml.real_time()

=>

def real_time():
    h = _pynvml_handles()
    return {
        "utilization": _get_utilization(h),
        "memory-used": _get_memory_used(h),
    }

=>

def _get_utilization(h):
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(h).gpu
    except pynvml.NVMLError_NotSupported:
        return None

_get_memory_used() uses a working nvml call, but _get_utilization() fails: pynvml.nvmlDeviceGetUtilizationRates(h) throws pynvml.nvml.NVMLError_Unknown instead of pynvml.NVMLError_NotSupported. Raising an exception here is reasonable (see the original issue), but dask.distributed does not handle the exception that pynvml throws.


A separate question is why pynvml receives and propagates an unknown error code on WSL2, and whether it can switch to NVMLError_NotSupported. In the meantime, dask.distributed should probably just warn and continue in this case. I'll file an upstream pynvml issue, but I suspect it may have an arbitrarily long ETA, so it's worth working around here.
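
A rough sketch of what that "warn and continue" behavior could look like (illustrative only; the broad except pynvml.NVMLError clause and the warning text are assumptions, not current distributed behavior):

import warnings

import pynvml

def _get_utilization(h):
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(h).gpu
    except pynvml.NVMLError_NotSupported:
        return None
    except pynvml.NVMLError as e:
        # WSL2 can surface NVMLError_Unknown here instead of
        # NVMLError_NotSupported; warn and degrade gracefully rather
        # than letting the exception break scheduler/worker startup.
        warnings.warn(f"NVML utilization query failed: {e}")
        return None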

@jrbourbeau (Member)

Thanks for raising an issue @lmeyerov

Edit 1: Maybe no longer an issue? There is an attempted workaround for WSL in #5568

Yeah, I'm not sure if this has already been resolved or not since we've disabled NVML monitoring on WSL (xref #5568). Could you try with the main branch of distributed to see if the issue you're having is still present?
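
For reference, one way to try the development version in a pip-based environment (adjust for conda as needed):

python -m pip install git+https://github.com/dask/distributed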

cc @pentschev @charlesbluca for visibility

@jrbourbeau added the "needs info" label (Needs further information from the user) on Jan 5, 2022
charlesbluca (Member) commented Jan 5, 2022

This should be resolved with #5568 as that blocks the calls to nvmlDeviceGetUtilizationRates altogether, but I am interested in the fact that you got NVMLError_Unknown instead of NVMLError_NotSupported - do you still get an unknown error when running a minimal reproducer?

from pynvml import *

nvmlInit()

h = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetUtilizationRates(h)

If so, then we might be able to isolate this issue to PyNVML specifically; I get an NVMLError_NotSupported from the above with PyNVML 11.0.0 and driver version 510.06.
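
If it helps compare environments, a small diagnostic sketch along these lines (purely illustrative, not part of the original reproducer) prints the driver and NVML versions alongside the utilization call:

from pynvml import (
    NVMLError,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
    nvmlInit,
    nvmlSystemGetDriverVersion,
    nvmlSystemGetNVMLVersion,
)

nvmlInit()
print("driver version:", nvmlSystemGetDriverVersion())
print("NVML version:", nvmlSystemGetNVMLVersion())

h = nvmlDeviceGetHandleByIndex(0)
try:
    print("GPU utilization:", nvmlDeviceGetUtilizationRates(h).gpu)
except NVMLError as e:
    # This is where NotSupported vs. Unknown shows up on WSL2
    print("utilization query failed:", type(e).__name__, e)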

EDIT:

I see from rapidsai/cudf#9955 that the minimal reproducer also gives you an unknown error; following up there.

lmeyerov (Author) commented Jan 8, 2022

For some reason I'm now seeing the expected NotSupported, so I think we can close for now and watch for whether my system or anyone else starts seeing Unknown again.

@jrbourbeau (Member)

Thanks @lmeyerov @charlesbluca -- closing in favor of rapidsai/cudf#9955
