
GPU scheduler cannot start in windows due to use of nvml diagnostics #5628

Closed
lmeyerov opened this issue Dec 24, 2021 · 4 comments
Labels
needs info Needs further information from the user

Comments


lmeyerov commented Dec 24, 2021

Edit 1: Maybe no longer an issue? There is an attempted workaround for WSL in #5568

Edit 2: I filed a related upstream pynvml issue: gpuopenanalytics/pynvml#42

--

The full issue is in rapidsai/cudf#9955


The parts relevant to dask.distributed:

gpu_metrics = nvml.real_time()

=>

def real_time():
    h = _pynvml_handles()
    return {
        "utilization": _get_utilization(h),
        "memory-used": _get_memory_used(h),
    }

=>

def _get_utilization(h):
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(h).gpu
    except pynvml.NVMLError_NotSupported:
        return None

_get_memory_used() uses a working nvml call, but _get_utilization() fails: pynvml.nvmlDeviceGetUtilizationRates(h) throws pynvml.nvml.NVMLError_Unknown instead of pynvml.NVMLError_NotSupported. Raising an exception here is reasonable (see the original issue), but dask.distributed does not handle the exception that pynvml throws.


A separate question is why pynvml receives and propagates an unknown error code on WSL2, and whether it can switch to NVMLError_NotSupported. In the meantime, dask.distributed should probably just warn and continue in this case. I'll file an upstream pynvml issue, but I suspect it may have an arbitrarily long ETA, so it's worth working around here.
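
A rough sketch of what that "warn and continue" behavior could look like (illustrative only; the broad except pynvml.NVMLError clause and the warning text are assumptions, not current distributed behavior):

import warnings

import pynvml

def _get_utilization(h):
    try:
        return pynvml.nvmlDeviceGetUtilizationRates(h).gpu
    except pynvml.NVMLError_NotSupported:
        return None
    except pynvml.NVMLError as e:
        # WSL2 can surface NVMLError_Unknown here instead of
        # NVMLError_NotSupported; warn and degrade gracefully rather
        # than letting the exception break scheduler/worker startup.
        warnings.warn(f"NVML utilization query failed: {e}")
        return None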

@jrbourbeau (Member)

Thanks for raising an issue @lmeyerov

Edit 1: Maybe no longer an issue? There is an attempted workaround for WSL in #5568

Yeah, I'm not sure if this has already been resolved or not since we've disabled NVML monitoring on WSL (xref #5568). Could you try with the main branch of distributed to see if the issue you're having is still present?
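
For reference, one way to try the development version in a pip-based environment (adjust for conda as needed):

python -m pip install git+https://github.com/dask/distributed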

cc @pentschev @charlesbluca for visibility

@jrbourbeau added the "needs info" label (Needs further information from the user) on Jan 5, 2022
charlesbluca (Member) commented Jan 5, 2022

This should be resolved with #5568 as that blocks the calls to nvmlDeviceGetUtilizationRates altogether, but I am interested in the fact that you got NVMLError_Unknown instead of NVMLError_NotSupported - do you still get an unknown error when running a minimal reproducer?

from pynvml import *

nvmlInit()

h = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetUtilizationRates(h)

If so, then we might be able to isolate this issue to PyNVML specifically; I get an NVMLError_NotSupported from the above with PyNVML 11.0.0 and driver version 510.06.
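
If it helps compare environments, a small diagnostic sketch along these lines (purely illustrative, not part of the original reproducer) prints the driver and NVML versions alongside the utilization call:

from pynvml import (
    NVMLError,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
    nvmlInit,
    nvmlSystemGetDriverVersion,
    nvmlSystemGetNVMLVersion,
)

nvmlInit()
print("driver version:", nvmlSystemGetDriverVersion())
print("NVML version:", nvmlSystemGetNVMLVersion())

h = nvmlDeviceGetHandleByIndex(0)
try:
    print("GPU utilization:", nvmlDeviceGetUtilizationRates(h).gpu)
except NVMLError as e:
    # This is where NotSupported vs. Unknown shows up on WSL2
    print("utilization query failed:", type(e).__name__, e)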

EDIT:

I see from rapidsai/cudf#9955 that the minimal reproducer also gives you an unknown error; following up there.

lmeyerov (Author) commented Jan 8, 2022

For some reason I'm now seeing the expected NotSupported, so I think we can close for now and watch for whether my system or anyone else starts seeing Unknown again.

@jrbourbeau (Member)

Thanks @lmeyerov @charlesbluca -- closing in favor of rapidsai/cudf#9955
