[BUG] dask_cudf scheduler cannot start in wsl2 likely due to using unsupported nvml diagnostics #9955
Comments
I'm interested in the fact that the minimal PyNVML commands also give you an … Could you share your …
OK, now this is bizarre: I just did another run-through to duplicate my commands, and … Same versions of nvidia/pynvml/containers: …
As expected, dask.distributed is now failing on the appropriate exception, making it their problem (and recently fixed, I believe).
Weirdly, I saw …
Closing for now until we receive more repro information.
Edit 1: Linked issues with `dask.distributed`: dask/distributed#5628 & gpuopenanalytics/pynvml#42
Edit 2: This may be getting worked around via dask starting to skip nvml under WSL: dask/distributed#5568
Edit 3: This seems to be a finer drill-down into the issue also reported as rapidsai/dask-cuda#761 (which has a temporary workaround)
Describe the bug
Initializing a dask_cudf scheduler throws an exception under WSL2, likely due to dask using unsupported NVML diagnostic calls.
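For context, the failure surfaces as soon as a GPU cluster is brought up; a minimal sketch of that startup (assuming `dask_cuda.LocalCUDACluster` here, not the original compose-based setup) would be:

```python
# Minimal sketch (assumed repro path, not the original docker-compose setup):
# bringing up a dask-cuda cluster makes distributed probe the GPU via pynvml,
# which is where the WSL2 failure appears.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # starts a scheduler plus one worker per GPU
    client = Client(cluster)
    print(client)
```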
Steps/Code to reproduce bug
`docker-compose.yml`:

Expected behavior
Scheduler to start
Environment overview (please complete the following information)
Additional context
WSL2 NVML might not support nvmlDeviceGetUtilizationRates
"""
NVML (nvidia-smi) does not support all the queries yet.
"""
"""
GPU utilization, active compute process are some queries that are not yet supported. Modifiable state features (ECC, Compute mode, Persistence mode) will not be supported.
"""
Several basic `nvml` methods do work in my WSL 2 setup.

Latest Dask GPU-mode distributed mishandles nvml metrics calls
https://github.com/dask/distributed/blob/96ee7f7b2cdaac5a23f4e5221083f0bdcff8b862/distributed/system_monitor.py#L132
=>
https://github.com/dask/distributed/blob/96ee7f7b2cdaac5a23f4e5221083f0bdcff8b862/distributed/diagnostics/nvml.py#L128
`_get_memory_used()` uses a working `nvml` call, but `_get_utilization()` fails on `pynvml.nvmlDeviceGetUtilizationRates(h)`.
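Roughly, the per-sample GPU read amounts to the following (a sketch of the call chain only, not distributed's actual code):

```python
# Sketch of the call chain only (not distributed's actual code): the memory
# query succeeds under WSL 2, while the utilization query raises.
import pynvml

def gpu_sample(handle):
    used = pynvml.nvmlDeviceGetMemoryInfo(handle).used       # works under WSL 2
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # raises under WSL 2
    return {"gpu_memory_used": used, "gpu_utilization": util}
```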
Interestingly, they try to tolerate exceptions by catching `pynvml.NVMLError_NotSupported`... but the latest `pynvml` is throwing `pynvml.nvml.NVMLError_Unknown: Unknown Error`, which `distributed` just escalates and thus fails.

Latest pynvml throws the wrong exception
I don't know why the return type handler is giving `Unknown` instead of `NotSupported`: https://github.com/gpuopenanalytics/pynvml/blob/41e1657948b18008d302f5cb8af06539adc7c792/pynvml/nvml.py#L686
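A more tolerant guard on the caller's side would treat both error codes as "metric unavailable" rather than failing startup; a sketch (not distributed's actual fix):

```python
# Sketch (assumed workaround shape, not distributed's actual fix): treat both
# NotSupported and Unknown as "utilization unavailable" instead of crashing.
import pynvml

def safe_utilization(handle):
    """Return GPU utilization in percent, or None if NVML cannot report it."""
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    except (pynvml.NVMLError_NotSupported, pynvml.NVMLError_Unknown):
        # WSL 2 drivers may reject this query with an "unknown" error code
        # rather than "not supported", so catch both.
        return None
```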