GPU scheduler cannot start in windows due to use of nvml diagnostics #5628
Comments
Thanks for raising an issue @lmeyerov
Yeah, I'm not sure if this has already been resolved or not since we've disabled NVML monitoring on WSL (xref #5568). Could you try with the
cc @pentschev @charlesbluca for visibility
This should be resolved with #5568 as that blocks the calls to
    from pynvml import *
    nvmlInit()
    h = nvmlDeviceGetHandleByIndex(0)
    nvmlDeviceGetUtilizationRates(h)
If so, then we might be able to isolate this issue to PyNVML specifically; I get a
EDIT: I see from rapidsai/cudf#9955 that the minimal reproducer also gives you an unknown error, following up there
Some reason I'm now seeing the expected
Thanks @lmeyerov @charlesbluca -- closing in favor of rapidsai/cudf#9955
Edit 1: Maybe no longer an issue? There is an attempt to work around WSL in #5568
Edit 2: Related upstream NVML issue: I filed gpuopenanalytics/pynvml#42
--
The full issue is in rapidsai/cudf#9955
The interesting parts wrt dask.distributed:
distributed/distributed/system_monitor.py
Line 132 in 96ee7f7
=>
distributed/distributed/diagnostics/nvml.py
Lines 128 to 133 in 96ee7f7
=>
distributed/distributed/diagnostics/nvml.py
Lines 100 to 104 in 96ee7f7
_get_memory_used() uses a working nvml call, but _get_utilization() fails on pynvml.nvmlDeviceGetUtilizationRates(h), throwing pynvml.nvml.NVMLError_Unknown instead of pynvml.NVMLError_NotSupported. An exception here is reasonable (see original issue), but dask.distributed is not handling the exception that pynvml throws.

A separate issue is why pynvml receives and propagates an unknown error code in WSL2 and whether it can switch to NVMLError_NotSupported, but in the meantime, dask.distributed should probably just warn and continue in this case. I'll file an upstream pynvml issue, but suspect it may have an arbitrarily long ETA, so we should work around it here.
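
For illustration, here is a rough sketch of what "warn and continue" could look like. This is not the actual dask.distributed code or patch, just a minimal pattern under stated assumptions. `safe_metric`, `FakeNVMLError`, `fake_utilization` and `fake_memory_used` are hypothetical names made up for this example; the fakes stand in for the pynvml calls so the snippet runs without a GPU. With real pynvml, one would presumably pass the NVML query function and `errors=(pynvml.NVMLError,)`, the base class of both `NVMLError_Unknown` and `NVMLError_NotSupported`.

```python
import warnings


def safe_metric(query, handle, errors=(Exception,), default=None):
    """Call query(handle); if it raises one of `errors`, warn and return `default`."""
    try:
        return query(handle)
    except errors as exc:
        # Warn instead of propagating, so the caller (e.g. a monitor loop) keeps running.
        warnings.warn(
            f"GPU metric {getattr(query, '__name__', query)!r} unavailable: {exc!r}",
            RuntimeWarning,
        )
        return default


# Hypothetical stand-ins for the NVML calls discussed above (assumptions, not pynvml APIs).
class FakeNVMLError(Exception):
    pass


def fake_utilization(handle):
    # Mimics the utilization query failing with an unknown error on WSL2.
    raise FakeNVMLError("Unknown Error")


def fake_memory_used(handle):
    # Mimics a query that works.
    return 1024


if __name__ == "__main__":
    print(safe_metric(fake_memory_used, 0, errors=(FakeNVMLError,)))
    print(safe_metric(fake_utilization, 0, errors=(FakeNVMLError,)))
```

Catching the base error class rather than only `NVMLError_NotSupported` is the point here: the monitor would keep working even while the upstream library reports the "wrong" error code.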