
[BUG] dask_cudf scheduler cannot start in wsl2 likely due to using unsupported nvml diagnostics #9955

Closed
lmeyerov opened this issue Dec 24, 2021 · 5 comments
Labels: bug (Something isn't working)

Comments

@lmeyerov

lmeyerov commented Dec 24, 2021

Edit 1: Linked issues with dask.distributed and pynvml: dask/distributed#5628 & gpuopenanalytics/pynvml#42

Edit 2: This may be getting worked around by dask starting to skip NVML under WSL: dask/distributed#5568

Edit 3: This seems to be a finer-grained drill-down into the issue also reported as rapidsai/dask-cuda#761 (which has a temporary workaround).
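
As a point of reference, a minimal sketch of that temporary-workaround direction, assuming your distributed version exposes the distributed.diagnostics.nvml config key (check your installed version before relying on it):

import dask

# Hedged sketch: turn off NVML-based GPU diagnostics entirely so the scheduler
# never issues the NVML queries that WSL2 does not support. Whether this config
# key exists is an assumption about the installed distributed version.
dask.config.set({"distributed.diagnostics.nvml": False})

The same setting can likely be supplied via the DASK_DISTRIBUTED__DIAGNOSTICS__NVML environment variable on the docker-compose service, though that mapping is also an assumption about dask's config/env handling.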

Describe the bug

Initializing a dask_cudf scheduler throws an exception under WSL2, likely because dask uses NVML diagnostic calls that WSL2 does not support:

    command: [
      "dask-scheduler",
        "--port", "8786",
        "--interface", "eth0",
        "--no-show",
        "--dashboard-address", "8787"
    ]

=>

dask-scheduler_1       | PWD: /opt/graphistry/apps/forge/etl-server-python
dask-scheduler_1       | distributed.scheduler - INFO - -----------------------------------------------                                                                                                    
dask-scheduler_1       | Traceback (most recent call last):                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/bin/dask-scheduler", line 11, in <module>                                                                                                          
dask-scheduler_1       |     sys.exit(go())                                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/cli/dask_scheduler.py", line 217, in go                                                                    
dask-scheduler_1       |     main()                                                                                                                                                                        
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 829, in __call__                                                                                  
dask-scheduler_1       |     return self.main(*args, **kwargs)                                                                                                                                             
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 782, in main                                                                                      
dask-scheduler_1       |     rv = self.invoke(ctx)                                                                                                                                                         
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 1066, in invoke                                                                                   
dask-scheduler_1       |     return ctx.invoke(self.callback, **ctx.params)                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 610, in invoke                                                                                    
dask-scheduler_1       |     return callback(*args, **kwargs)                                                                                                                                              
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/cli/dask_scheduler.py", line 197, in main                                                                  
dask-scheduler_1       |     **kwargs,                                                                                                                                                                     
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 3814, in __init__                                                                      
dask-scheduler_1       |     **kwargs,                                                                                                                                                                     
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/scheduler.py", line 1980, in __init__                                                                      
dask-scheduler_1       |     super().__init__(**kwargs)                                                                                                                                                    
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 160, in __init__                                                                            
dask-scheduler_1       |     self.monitor = SystemMonitor()                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/system_monitor.py", line 67, in __init__                                                                   
dask-scheduler_1       |     self.update()                                                                                                                                                                 
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/system_monitor.py", line 132, in update                                                                    
dask-scheduler_1       |     gpu_metrics = nvml.real_time()                                                                                                                                                
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/diagnostics/nvml.py", line 87, in real_time
dask-scheduler_1       |     "utilization": pynvml.nvmlDeviceGetUtilizationRates(h).gpu,
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 2058, in nvmlDeviceGetUtilizationRates
dask-scheduler_1       |     _nvmlCheckReturn(ret)
dask-scheduler_1       |   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn                                                                         
dask-scheduler_1       |     raise NVMLError(ret)                                                                                                                                                          
dask-scheduler_1       | pynvml.nvml.NVMLError_Unknown: Unknown Error                                                          

Steps/Code to reproduce bug

docker-compose.yml:

version: "3"
services:
  dask-scheduler:
    image: rapidsai/rapidsai-core:21.12-cuda11.4-base-ubuntu20.04-py3.8
    environment:
    command: [
      "dask-scheduler",
        "--port", "8786",
        "--interface", "eth0",
        "--no-show",
        "--dashboard-address", "8787"
    ]
docker-compose up

Expected behavior

Scheduler to start

Environment overview (please complete the following information)

  • RTX 3070, CUDA 11.4
  • Windows 11 with WSL2
  • Docker 20.10.12, Compose 1.29.2

Additional context

WSL2 NVML might not support nvmlDeviceGetUtilizationRates

  1. https://docs.nvidia.com/cuda/wsl-user-guide/index.html#features-not-yet-supported

"""
NVML (nvidia-smi) does not support all the queries yet.
"""

"""
GPU utilization, active compute process are some queries that are not yet supported. Modifiable state features (ECC, Compute mode, Persistence mode) will not be supported.
"""

  2. I confirmed that only some NVML methods work in my WSL2 setup:
>>> pynvml.nvmlInit()

>>> pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).used
1079263232

>>> pynvml.nvmlDeviceGetUtilizationRates(pynvml.nvmlDeviceGetHandleByIndex(0))                                                                                                                             
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 2137, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error
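
A minimal probe script in the spirit of the session above, useful for enumerating which NVML queries succeed under WSL2 (the probe helper is mine, not part of pynvml):

import pynvml

def probe(label, fn):
    # Run one NVML query and report success or the NVMLError subclass raised.
    try:
        print(f"{label}: OK -> {fn()}")
    except pynvml.NVMLError as err:
        print(f"{label}: FAILED -> {type(err).__name__}")

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
probe("memory used", lambda: pynvml.nvmlDeviceGetMemoryInfo(h).used)
probe("utilization", lambda: pynvml.nvmlDeviceGetUtilizationRates(h).gpu)
probe("compute processes", lambda: len(pynvml.nvmlDeviceGetComputeRunningProcesses(h)))
pynvml.nvmlShutdown()

On this WSL2 setup the memory query succeeds while the utilization query raises an NVMLError, matching the session above.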

Latest Dask GPU-mode distributed mishandles NVML metrics calls

https://github.com/dask/distributed/blob/96ee7f7b2cdaac5a23f4e5221083f0bdcff8b862/distributed/system_monitor.py#L132

=>

https://github.com/dask/distributed/blob/96ee7f7b2cdaac5a23f4e5221083f0bdcff8b862/distributed/diagnostics/nvml.py#L128

_get_memory_used() uses a working NVML call, but _get_utilization() fails on pynvml.nvmlDeviceGetUtilizationRates(h).

Interestingly, they do try to tolerate failures by catching pynvml.NVMLError_NotSupported exceptions...

... but the latest pynvml throws pynvml.nvml.NVMLError_Unknown: Unknown Error, which distributed does not catch, so the scheduler fails.
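
The defensive pattern distributed would need here is to tolerate the WSL2 failure regardless of which NVMLError subclass comes back; a rough sketch (not the actual distributed code, which only catches NVMLError_NotSupported):

import pynvml

def get_utilization(handle):
    # Sketch: return None when the query is unavailable, whether NVML reports
    # NotSupported (as documented for WSL2) or Unknown (as observed here).
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    except (pynvml.NVMLError_NotSupported, pynvml.NVMLError_Unknown):
        return None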

Latest pynvml throws the wrong exception

I don't know why the return-code handler maps this to Unknown instead of NotSupported: https://github.com/gpuopenanalytics/pynvml/blob/41e1657948b18008d302f5cb8af06539adc7c792/pynvml/nvml.py#L686
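
One way to see which raw status code NVML is actually handing back (and hence which exception class the mapping should have picked) is to catch the base class and inspect the code it carries; err.value holding the NVML return code is an assumption about current pynvml behavior:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    pynvml.nvmlDeviceGetUtilizationRates(handle)
except pynvml.NVMLError as err:
    # In the NVML C API, NVML_ERROR_NOT_SUPPORTED is 3 and NVML_ERROR_UNKNOWN is 999;
    # printing the raw code shows which one the WSL2 driver returned.
    print("NVML return code:", err.value)
pynvml.nvmlShutdown()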

@charlesbluca
Member

I'm interested in the fact that the minimal PyNVML commands also give you an NVMLError_Unknown, as I am unable to reproduce this on my own WSL2 setup with

→ nvidia-smi
Wed Jan  5 15:47:00 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.06       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:15:00.0 Off |                  Off |
| 34%   32C    P8    19W / 260W |    444MiB / 49152MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:2D:00.0  On |                  Off |
| 34%   61C    P0    72W / 260W |   2083MiB / 49152MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
→ python -c "import pynvml; print(pynvml.__version__)"
11.0.0

Could you share your nvidia-smi output and PyNVML version?

@lmeyerov
Author

lmeyerov commented Jan 6, 2022

OK, now this is bizarre: I just did another run-through to duplicate my commands, and pynvml.nvml.NVMLError_Unknown: Unknown Error is now the expected pynvml.nvml.NVMLError_NotSupported. We're about to update to 2021-12/2022-01, so I will confirm either way.

Same versions of nvidia/pynvml/containers:

  • wsl2, windows 11
  • host nvidia-smi 495.53/497.29/11.5
  • host pynvml 11.0.0+11.4.1 -> NVMLError_NotSupported as expected <- unsure whether it was previously throwing NVMLError_Unknown
  • container: same nvidia-smi + pynvml versions, except using rapids base container at 11.0 w/ rapids 2021-10 via mamba
  • container 11.0.0+11.4.1 -> NVMLError_NotSupported as expected <- was throwing NVMLError_Unknown

As expected, dask.distributed is now failing on the appropriate exception, making it their problem (and recently fixed I believe):

dask-scheduler_1       | 2022-01-06T08:32:49.455966951Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/system_monitor.py", line 132, in update         
dask-scheduler_1       | 2022-01-06T08:32:49.455967482Z     gpu_metrics = nvml.real_time()                                                                                     
dask-scheduler_1       | 2022-01-06T08:32:49.455967752Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/diagnostics/nvml.py", line 87, in real_time     
dask-scheduler_1       | 2022-01-06T08:32:49.455968073Z     "utilization": pynvml.nvmlDeviceGetUtilizationRates(h).gpu,                                                        
dask-scheduler_1       | 2022-01-06T08:32:49.455968373Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 2058, in nvmlDeviceGetUtilizationRates
dask-scheduler_1       | 2022-01-06T08:32:49.455968684Z     _nvmlCheckReturn(ret)                                                                                              
dask-scheduler_1       | 2022-01-06T08:32:49.455968935Z   File "/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn              
dask-scheduler_1       | 2022-01-06T08:32:49.455969245Z     raise NVMLError(ret)                                                                                               
dask-scheduler_1       | 2022-01-06T08:32:49.455970287Z pynvml.nvml.NVMLError_NotSupported: Not Supported

Weirdly, I saw NVMLError_Unknown enough times to do the initial digging and copy-paste reporting in these tickets. Not sure what changed -- the host, container, and Python versions didn't, but reboots did happen.

@github-actions

github-actions bot commented Feb 5, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented May 6, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball
Contributor

Closing for now until we receive more repro information

@bdice removed the Needs Triage label Mar 4, 2024