[dashboard] dashboard error because of gpustat version 1.1 #34196

Closed
yuanwu2017 opened this issue Apr 9, 2023 · 11 comments
Labels
bug: Something that is supposed to be working; but isn't
dashboard: Issues specific to the Ray Dashboard
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
P1: Issue that should be fixed within a few weeks

Comments

@yuanwu2017

What happened + What you expected to happen

2023-04-09 08:36:28,479 ERROR services.py:1195 -- Failed to start the dashboard: Failed to start the dashboard, return code 1
The last 10 lines of /tmp/ray/session_2023-04-09_08-36-26_413307_64/logs/dashboard.log:
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 52, in <module>
    import gpustat.core as gpustat
  File "/usr/local/lib/python3.8/dist-packages/gpustat/__init__.py", line 16, in <module>
    from .core import GPUStat, GPUStatCollection
  File "/usr/local/lib/python3.8/dist-packages/gpustat/core.py", line 24, in <module>
    from gpustat.nvml import pynvml as N
  File "/usr/local/lib/python3.8/dist-packages/gpustat/nvml.py", line 57, in <module>
    _original_nvmlGetFunctionPointer = pynvml._nvmlGetFunctionPointer
AttributeError: module 'pynvml' has no attribute '_nvmlGetFunctionPointer'

There is no error when using gpustat 1.0.0.
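The traceback above fails on a single attribute lookup, so a quick way to tell whether an environment is affected is to test for that attribute directly. The following is a minimal sketch, not part of Ray or gpustat; the helper name `has_nvml_function_pointer` is hypothetical.

```python
import importlib


def has_nvml_function_pointer(module_name: str = "pynvml") -> bool:
    """Return True if the module imports and exposes _nvmlGetFunctionPointer.

    This is the attribute that gpustat/nvml.py reads in the traceback above.
    The helper itself is a hypothetical illustration, not a gpustat API.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # No importable module under that name at all.
        return False
    return hasattr(module, "_nvmlGetFunctionPointer")


if __name__ == "__main__":
    print("pynvml exposes _nvmlGetFunctionPointer:", has_nvml_function_pointer())
```

If this prints `False` while `import pynvml` itself succeeds, the installed `pynvml` module is probably not the one gpustat 1.1 expects.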

Versions / Dependencies

Ray 2.2.0

Reproduction script

ray start --node-ip-address=${head_address} --head --dashboard-host='0.0.0.0' --dashboard-port=8265

Issue Severity

None

@yuanwu2017 yuanwu2017 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 9, 2023
@yuanwu2017 yuanwu2017 changed the title [<Ray component: dashboard>] dashboard error because of gpustat version 1.1 [Ray component: dashboard] dashboard error because of gpustat version 1.1 Apr 9, 2023
@yuanwu2017 yuanwu2017 changed the title [Ray component: dashboard] dashboard error because of gpustat version 1.1 [dashboard] dashboard error because of gpustat version 1.1 Apr 9, 2023
@scottsun94 scottsun94 added dashboard Issues specific to the Ray Dashboard observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Apr 10, 2023
@scottsun94
Contributor

cc: @alanwguo @rkooo567

@rkooo567
Contributor

Let me try to repro this when I have bandwidth.

@charlesbvll

charlesbvll commented Apr 11, 2023

Same here, this error breaks our whole simulation engine. Setting gpustat==1.0.0 manually also fixes it for me.

@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 18, 2023
@mattip
Contributor

mattip commented May 16, 2023

Can you verify which version of nvml you are using (via either pip list or conda list)?

@mattip
Contributor

mattip commented May 16, 2023

Actually, from reading the linked issue, pip list may not help. Apparently this can be caused by a legacy nvidia-ml-py3 package leaving old files around, and pip install --force-reinstall --ignore-installed nvidia-ml-py (without the 3 at the end) may be needed to clean out the old package.

@wookayin

wookayin commented May 16, 2023

gpustat author here. As noted in wookayin/gpustat#153, this is highly likely to be due to conflicting dependencies on pynvml (nvidia-ml-py3 is a problem and should never have been installed). It would be appreciated if you could provide the output of the following

\ls -al $(python -c 'import pynvml; print(pynvml.__file__)')
\sha1sum $(python -c 'import pynvml; print(pynvml.__file__)')

to confirm this is a broken pynvml issue. Or please share the file. You can also check pip list | grep nvidia to see if both nvidia-ml-py3 and pynvml are present. To fix, please try pip install --force-reinstall --ignore-installed --upgrade nvidia-ml-py.
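The same check can be done from Python, which also catches a package whose name does not contain "nvidia" (for example a distribution named `pynvml`). This is a minimal sketch using the standard library's `importlib.metadata`; the function name `installed_nvml_providers` and the candidate list are assumptions for illustration, taken from the package names mentioned in this thread.

```python
from importlib import metadata

# Distribution names mentioned in this thread as providing a `pynvml` module.
CANDIDATES = ("nvidia-ml-py", "nvidia-ml-py3", "pynvml")


def installed_nvml_providers(names=CANDIDATES):
    """Map each installed candidate distribution to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            continue
    return found


if __name__ == "__main__":
    providers = installed_nvml_providers()
    print(providers)
    if len(providers) > 1:
        print("More than one pynvml provider is installed; they may conflict.")
```

Note that this only reports what the package metadata says is installed. As mentioned earlier in the thread, leftover files from a removed package may not show up here.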

@yuanwu2017
Author

yuanwu2017 commented May 17, 2023

# ls -al $(python -c 'import pynvml; print(pynvml.__file__)')
-rw-r--r-- 1 root staff 113 May 17 01:31 /usr/local/lib/python3.8/dist-packages/pynvml/__init__.py
# sha1sum $(python -c 'import pynvml; print(pynvml.__file__)')
cce9f10ef0eeed0de26d200ceea3632692da884d  /usr/local/lib/python3.8/dist-packages/pynvml/__init__.py
# pip list | grep nvidia
nvidia-ml-py                11.525.112

@wookayin

The pynvml.py file is only 113 bytes, which is weird. Can you paste the content of the file?

@yuanwu2017
Author

# cat /usr/local/lib/python3.8/dist-packages/pynvml/__init__.py
from .nvml import *

from ._version import get_versions
__version__ = get_versions()['version']
del get_versions

@yuanwu2017
Author

yuanwu2017 commented May 17, 2023

According to @wookayin's analysis (wookayin/gpustat#153), the wrong pynvml appears to be in use in my environment. If most people don't have this issue, I think it can be closed. I will keep looking into which package introduced this wrong pynvml. Thanks.
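One way to track down which installed package introduced a given `pynvml` is to scan the file lists recorded in each distribution's metadata. The sketch below assumes the distributions were installed by a tool that records their files (pip normally does); `distributions_providing` is a hypothetical helper name.

```python
from importlib import metadata


def distributions_providing(top_level: str):
    """Return names of installed distributions that ship `top_level`.

    A distribution matches if it records either a package directory
    (`top_level/...`) or a single-file module (`top_level.py`).
    """
    owners = set()
    for dist in metadata.distributions():
        # dist.files is None when the installer recorded no file list.
        for path in dist.files or []:
            parts = path.parts
            if parts and parts[0] in (top_level, top_level + ".py"):
                name = dist.metadata["Name"]
                if name:
                    owners.add(name)
                break
    return sorted(owners)


if __name__ == "__main__":
    print(distributions_providing("pynvml"))
```

If more than one name is printed, several distributions are writing to the same `pynvml` import name, which matches the conflict described in this thread.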

@mattip
Contributor

mattip commented May 17, 2023

Just to make it explicit here: there are two packages that provide pynvml: the more popular nvidia-ml-py, which works properly with gpustat, and pynvml, which does not. If you run into this problem and get here via Google, you must currently uninstall pynvml and install nvidia-ml-py:

pip install --force-reinstall --ignore-installed --upgrade nvidia-ml-py

Thanks for the report.
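For environments that are set up by a script, the fix can be expressed as two pip invocations. This is a sketch only: the uninstall step reflects one reading of the advice in this thread (remove `pynvml` and the legacy `nvidia-ml-py3`), and the helper names `build_fix_commands` and `apply_fix` are hypothetical. By default nothing is executed; the commands are only printed.

```python
import subprocess
import sys


def build_fix_commands(python=sys.executable):
    """Return the pip invocations as argv lists, without running them."""
    pip = [python, "-m", "pip"]
    return [
        # Remove the distributions reported in this thread as conflicting.
        pip + ["uninstall", "-y", "pynvml", "nvidia-ml-py3"],
        # Reinstall the distribution that works with gpustat.
        pip + ["install", "--force-reinstall", "--ignore-installed",
               "--upgrade", "nvidia-ml-py"],
    ]


def apply_fix(dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in build_fix_commands():
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    apply_fix(dry_run=True)
```

Calling `apply_fix(dry_run=False)` would modify the current Python environment, so it is worth reviewing the printed commands first.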

@mattip mattip closed this as completed May 17, 2023