
[BUG] dask_cudf scheduler cannot start in wsl2 likely due to using unsupported nvml diagnostics #9955

Closed
lmeyerov opened this issue Dec 24, 2021 · 5 comments
Labels: bug (Something isn't working)

Comments

@lmeyerov

lmeyerov commented Dec 24, 2021

Edit 1: Linked issues with dask.distributed and pynvml: dask/distributed#5628 & gpuopenanalytics/pynvml#42

Edit 2: This may be getting worked around by dask starting to skip NVML under WSL: dask/distributed#5568

Edit 3: This seems to be a finer-grained drill-down into the issue also reported as rapidsai/dask-cuda#761 (which has a temporary workaround).
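
As a point of reference, a minimal sketch of that temporary-workaround direction, assuming your distributed version exposes the distributed.diagnostics.nvml config key (check your installed version before relying on it):

import dask

# Hedged sketch: turn off NVML-based GPU diagnostics entirely so the scheduler
# never issues the NVML queries that WSL2 does not support. Whether this config
# key exists is an assumption about the installed distributed version.
dask.config.set({"distributed.diagnostics.nvml": False})

The same setting can likely be supplied via the DASK_DISTRIBUTED__DIAGNOSTICS__NVML environment variable on the docker-compose service, though that mapping is also an assumption about dask's config/env handling.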

Describe the bug

Initializing a dask_cudf scheduler throws an exception under WSL2, likely because dask uses NVML diagnostic calls that WSL2 does not support:

    command: [
      "dask-scheduler",
        "--port", "8786",
        "--interface", "eth0",
        "--no-show",
        "--dashboard-address", "8787"
    ]

=>

dask-scheduler_1       | PWD: /opt/graphistry/apps/forge/etl-server-python
dask-scheduler_1       | distributed.scheduler - INFO - -----------------------------------------------                                                                                                    
dask-scheduler_1       | Traceback (most recent call last):                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/bin/dask-scheduler", line 11, in <module>                                                                                                          
dask-scheduler_1       |     sys.exit(go())                                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/cli/dask_scheduler.py", line 217, in go                                                                    
dask-scheduler_1       |     main()                                                                                                                                                                        
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 829, in __call__                                                                                  
dask-scheduler_1       |     return self.main(*args, **kwargs)                                                                                                                                             
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 782, in main                                                                                      
dask-scheduler_1       |     rv = self.invoke(ctx)                                                                                                                                                         
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 1066, in invoke                                                                                   
dask-scheduler_1       |     return ctx.invoke(self.callback, **ctx.params)                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 610, in invoke                                                                                    
dask-scheduler_1       |     return callback(*args, **kwargs)                                                                                                                                              
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/cli/dask_scheduler.py", line 197, in main                                                                  
dask-scheduler_1       |     **kwargs,                                                                                                                                                                     
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 3814, in __init__                                                                      
dask-scheduler_1       |     **kwargs,                                                                                                                                                                     
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 1980, in __init__                                                                      
dask-scheduler_1       |     super().__init__(**kwargs)                                                                                                                                                    
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 160, in __init__                                                                            
dask-scheduler_1       |     self.monitor = SystemMonitor()                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/system_monitor.py", line 67, in __init__                                                                   
dask-scheduler_1       |     self.update()                                                                                                                                                                 
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/system_monitor.py", line 132, in update                                                                    
dask-scheduler_1       |     gpu_metrics = nvml.real_time()                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/diagnostics/nvml.py", line 87, in real_time
dask-scheduler_1       |     "utilization": pynvml.nvmlDeviceGetUtilizationRates(h).gpu,
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 2058, in nvmlDeviceGetUtilizationRates
dask-scheduler_1       |     _nvmlCheckReturn(ret)
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn                                                                         
dask-scheduler_1       |     raise NVMLError(ret)                                                                                                                                                          
dask-scheduler_1       | pynvml.nvml.NVMLError_Unknown: Unknown Error                                                          

Steps/Code to reproduce bug

docker-compose.yml:

version: "3"
services:
  dask-scheduler:
    image: rapidsai/rapidsai-core:21.12-cuda11.4-base-ubuntu20.04-py3.8
    environment:
    command: [
      "dask-scheduler",
        "--port", "8786",
        "--interface", "eth0",
        "--no-show",
        "--dashboard-address", "8787"
    ]
docker-compose up

Expected behavior

Scheduler to start

Environment overview (please complete the following information)

  • RTX 3070, CUDA 11.4
  • Windows 11 with WSL2
  • Docker 20.10.12, Compose 1.29.2

Additional context

WSL2 NVML might not support nvmlDeviceGetUtilizationRates

  1. https://docs.nvidia.com/cuda/wsl-user-guide/index.html#features-not-yet-supported

"""
NVML (nvidia-smi) does not support all the queries yet.
"""

"""
GPU utilization, active compute process are some queries that are not yet supported. Modifiable state features (ECC, Compute mode, Persistence mode) will not be supported.
"""

  2. I confirmed that only some NVML methods work in my WSL2 setup:
>>> pynvml.nvmlInit()

>>> pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).used
1079263232

>>> pynvml.nvmlDeviceGetUtilizationRates(pynvml.nvmlDeviceGetHandleByIndex(0))                                                                                                                             
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 2137, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error
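
A minimal probe script in the spirit of the session above, useful for enumerating which NVML queries succeed under WSL2 (the probe helper is mine, not part of pynvml):

import pynvml

def probe(label, fn):
    # Run one NVML query and report success or the NVMLError subclass raised.
    try:
        print(f"{label}: OK -> {fn()}")
    except pynvml.NVMLError as err:
        print(f"{label}: FAILED -> {type(err).__name__}")

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
probe("memory used", lambda: pynvml.nvmlDeviceGetMemoryInfo(h).used)
probe("utilization", lambda: pynvml.nvmlDeviceGetUtilizationRates(h).gpu)
probe("compute processes", lambda: len(pynvml.nvmlDeviceGetComputeRunningProcesses(h)))
pynvml.nvmlShutdown()

On this WSL2 setup the memory query succeeds while the utilization query raises an NVMLError, matching the session above.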

Latest Dask GPU-mode distributed mishandles NVML metrics calls

https://github.com/dask/distributed/blob/96ee7f7b2cdaac5a23f4e5221083f0bdcff8b862/distributed/system_monitor.py#L132

=>

https://github.com/dask/distributed/blob/96ee7f7b2cdaac5a23f4e5221083f0bdcff8b862/distributed/diagnostics/nvml.py#L128

_get_memory_used() uses a working NVML call, but _get_utilization() fails on pynvml.nvmlDeviceGetUtilizationRates(h).

Interestingly, they do try to tolerate failures by catching pynvml.NVMLError_NotSupported exceptions...

... but the latest pynvml throws pynvml.nvml.NVMLError_Unknown: Unknown Error, which distributed does not catch, so the scheduler fails.
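
The defensive pattern distributed would need here is to tolerate the WSL2 failure regardless of which NVMLError subclass comes back; a rough sketch (not the actual distributed code, which only catches NVMLError_NotSupported):

import pynvml

def get_utilization(handle):
    # Sketch: return None when the query is unavailable, whether NVML reports
    # NotSupported (as documented for WSL2) or Unknown (as observed here).
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    except (pynvml.NVMLError_NotSupported, pynvml.NVMLError_Unknown):
        return None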

Latest pynvml throws the wrong exception

I don't know why the return-code handler maps this to Unknown instead of NotSupported: https://github.com/gpuopenanalytics/pynvml/blob/41e1657948b18008d302f5cb8af06539adc7c792/pynvml/nvml.py#L686
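
One way to see which raw status code NVML is actually handing back (and hence which exception class the mapping should have picked) is to catch the base class and inspect the code it carries; err.value holding the NVML return code is an assumption about current pynvml behavior:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    pynvml.nvmlDeviceGetUtilizationRates(handle)
except pynvml.NVMLError as err:
    # In the NVML C API, NVML_ERROR_NOT_SUPPORTED is 3 and NVML_ERROR_UNKNOWN is 999;
    # printing the raw code shows which one the WSL2 driver returned.
    print("NVML return code:", err.value)
pynvml.nvmlShutdown()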

@charlesbluca
Member

I'm interested in the fact that the minimal PyNVML commands also give you an NVMLError_Unknown, as I am unable to reproduce this on my own WSL2 setup with

→ nvidia-smi
Wed Jan  5 15:47:00 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.06       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:15:00.0 Off |                  Off |
| 34%   32C    P8    19W / 260W |    444MiB / 49152MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:2D:00.0  On |                  Off |
| 34%   61C    P0    72W / 260W |   2083MiB / 49152MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
→ python -c "import pynvml; print(pynvml.__version__)"
11.0.0

Could you share your nvidia-smi output and PyNVML version?

@lmeyerov
Author

lmeyerov commented Jan 6, 2022

OK, now this is bizarre: I just did another run-through to duplicate my commands, and pynvml.nvml.NVMLError_Unknown: Unknown Error is now the expected pynvml.nvml.NVMLError_NotSupported. We're about to update to 2021-12/2022-01, so I will confirm either way.

Same versions of nvidia/pynvml/containers:

  • wsl2, windows 11
  • host nvidia-smi 495.53/497.29/11.5
  • host pynvml 11.0.0+11.4.1 -> NVMLError_NotSupported as expected <- unsure whether it was previously throwing NVMLError_Unknown
  • container: same nvidia-smi + pynvml versions, except using rapids base container at 11.0 w/ rapids 2021-10 via mamba
  • container 11.0.0+11.4.1 -> NVMLError_NotSupported as expected <- was throwing NVMLError_Unknown

As expected, dask.distributed is now failing on the appropriate exception, making it their problem (and recently fixed I believe):

dask-scheduler_1       | 2022-01-06T08:32:49.455966951Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/system_monitor.py", line 132, in update         
dask-scheduler_1       | 2022-01-06T08:32:49.455967482Z     gpu_metrics = nvml.real_time()                                                                                     
dask-scheduler_1       | 2022-01-06T08:32:49.455967752Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/diagnostics/nvml.py", line 87, in real_time     
dask-scheduler_1       | 2022-01-06T08:32:49.455968073Z     "utilization": pynvml.nvmlDeviceGetUtilizationRates(h).gpu,                                                        
dask-scheduler_1       | 2022-01-06T08:32:49.455968373Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 2058, in nvmlDeviceGetUtilizationRates
dask-scheduler_1       | 2022-01-06T08:32:49.455968684Z     _nvmlCheckReturn(ret)                                                                                              
dask-scheduler_1       | 2022-01-06T08:32:49.455968935Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn              
dask-scheduler_1       | 2022-01-06T08:32:49.455969245Z     raise NVMLError(ret)                                                                                               
dask-scheduler_1       | 2022-01-06T08:32:49.455970287Z pynvml.nvml.NVMLError_NotSupported: Not Supported

Weirdly, I saw NVMLError_Unknown enough times to do the initial digging and copy-paste reporting in these tickets. Not sure what changed -- the host, container, and Python versions didn't, but reboots did happen.

@github-actions

github-actions bot commented Feb 5, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented May 6, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball
Contributor

Closing for now until we receive more repro information

@bdice removed the Needs Triage label Mar 4, 2024