
Misreported used memory with the driver 535.129.03 #168

Closed
bryant1410 opened this issue Jan 11, 2024 · 3 comments


bryant1410 commented Jan 11, 2024

Describe the bug

There's a mismatch between the used memory reported by nvidia-smi and gpustat. For example, right now, the former reports 11448 MiB while the latter reports 11961 MB (a difference of 513).

Also, I see the following warning being displayed:

<REDACTED_PATH>/lib/python3.10/site-packages/gpustat/nvml.py:145: UserWarning: Your NVIDIA driver requires a compatible version of pynvml (>= 11.510.69) installed to display the correct memory usage information (See #141 for more details). Please try `pip install --upgrade nvidia-ml-py`.

This happens even though I'm using the latest versions of gpustat (1.1.1) and nvidia-ml-py (12.535.133).

Screenshots or Program Output

$ gpustat --debug
<REDACTED_PATH>/lib/python3.10/site-packages/gpustat/nvml.py:145: UserWarning: Your NVIDIA driver requires a compatible version of pynvml (>= 11.510.69) installed to display the correct memory usage information (See #141 for more details). Please try `pip install --upgrade nvidia-ml-py`.
  warnings.warn(

<REDACTED_HOSTNAME>  Thu Jan 11 13:41:16 2024  535.129.03
[0] NVIDIA A10G | 25°C,   0 % | 11961 / 23028 MB | <REDACTED_USERNAME>(11420M)
$ nvidia-smi
Thu Jan 11 13:44:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P0              57W / 300W |  11448MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1584434      C               <REDACTED_PATH>/bin/python    11420MiB |
+---------------------------------------------------------------------------------------+

Environment information:

  • OS: Ubuntu 22.04.3 LTS
  • NVIDIA Driver version: 535.129.03
  • The name(s) of GPU card: NVIDIA A10G
  • gpustat version: 1.1.1
  • pynvml version: 12.535.133
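
For reference, the gap can be reproduced directly against NVML. Here is a minimal sketch (an illustration, not gpustat code), assuming the pynvml module is provided by nvidia-ml-py >= 11.510.69 and an R510+ driver: the legacy (v1) query folds driver-reserved memory into `used`, while the v2 query reports it separately, which is what nvidia-smi shows.

```python
# Minimal sketch: compare the v1 and v2 NVML memory queries on GPU 0.
# Assumes the pynvml module comes from nvidia-ml-py >= 11.510.69.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# v1 query: on R510+ drivers, "used" includes driver-reserved memory.
v1 = pynvml.nvmlDeviceGetMemoryInfo(handle)
# v2 query: reports the reserved amount separately (matches nvidia-smi).
v2 = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)

MiB = 1024 ** 2
print(f"v1 used: {v1.used // MiB} MiB (includes reserved)")
print(f"v2 used: {v2.used // MiB} MiB, reserved: {v2.reserved // MiB} MiB")

pynvml.nvmlShutdown()
```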
@bryant1410 bryant1410 added the bug label Jan 11, 2024
wookayin (Owner) commented Jan 11, 2024

Thanks for reporting! I think this is a pynvml installation problem. Do you happen to have the wrong pynvml package installed alongside nvidia-ml-py? (The two conflict because both provide the same pynvml module.) Please confirm with:

pip list | grep nvml

ls -al $(python -c 'import pynvml; print(pynvml.__file__)')

Try the following and see if it fixes the issue:

pip uninstall nvidia-ml-py3 pynvml
pip install --force-reinstall --ignore-installed 'nvidia-ml-py'
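
As an extra sanity check, a quick sketch (not a gpustat command) that shows which file provides the pynvml module and whether it accepts the `version` argument the v2 memory API requires:

```python
# Illustrative check: where does the pynvml module come from, and does
# nvmlDeviceGetMemoryInfo support the v2 `version` keyword?
import inspect
import pynvml

print("module file:", pynvml.__file__)
print("supports v2 query:",
      "version" in inspect.signature(pynvml.nvmlDeviceGetMemoryInfo).parameters)
```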

Maybe I should revert #153, because this is such a common error that users might run into it without realizing...

bryant1410 (Author) commented
Yes, that fixes the issue. pynvml 11.4.1 was installed. Thanks, Jongwook!!

So what's the general recommendation? To keep a particular version of pynvml installed, or to uninstall it?

Feel free to close this issue if you want.

wookayin (Owner) commented Jan 11, 2024

It's fine (and recommended) to use the latest version of pynvml (the Python module name) / nvidia-ml-py (the PyPI package name to install), assuming NVIDIA doesn't break backward compatibility. In general, the latest nvidia-ml-py whose version shares the driver's version prefix (e.g., driver 535.129.03 matches nvidia-ml-py 12.535.133) should work.
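
A small sketch of that prefix rule (the helper logic here is illustrative, not a gpustat API):

```python
# Sketch: check that the driver's major version ("535") appears in the
# installed nvidia-ml-py version (e.g. "12.535.133").
from importlib.metadata import version
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
pynvml.nvmlShutdown()
if isinstance(driver, bytes):  # older bindings return bytes
    driver = driver.decode()

driver_major = driver.split(".")[0]                 # e.g. "535"
binding_parts = version("nvidia-ml-py").split(".")  # e.g. ["12", "535", "133"]
print("prefixes match:", driver_major in binding_parts)
```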

To emphasize again:

Caution

NEVER use pip install pynvml, nor add pynvml as a dependency of your Python project.
Instead: pip install nvidia-ml-py is correct.

IMO the pynvml package should be removed from PyPI. I will add this to the README. Thanks again for reporting!
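
For projects that depend on the NVML bindings, a defensive guard along these lines (a hypothetical helper, not part of gpustat) can fail fast when the conflicting distribution is present:

```python
# Hypothetical guard: refuse to run if the forked "pynvml" distribution
# shadows the official nvidia-ml-py bindings.
from importlib.metadata import PackageNotFoundError, version

def ensure_official_nvml_bindings() -> None:
    try:
        version("nvidia-ml-py")
    except PackageNotFoundError:
        raise RuntimeError(
            "nvidia-ml-py is not installed; run: pip install nvidia-ml-py"
        )
    try:
        # If the forked distribution is also installed, the pynvml module
        # on disk is ambiguous -- both distributions claim the same name.
        version("pynvml")
        raise RuntimeError(
            "Conflicting 'pynvml' package found; run: pip uninstall pynvml"
        )
    except PackageNotFoundError:
        pass  # good: only nvidia-ml-py provides the pynvml module
```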

wookayin added a commit that referenced this issue Jan 12, 2024
Revert "Make gpustat.nvml compatible with a third-party fork of pynvml"

This reverts commit 7c09a0f.

gpustat v1.1.1 allowed the problematic 'pynvml' package to be used
as a workaround, but this still causes many problems (e.g., #168).
Only the official nvidia-ml-py can be used with gpustat.

See #153, #168
@wookayin wookayin removed the bug label Jan 12, 2024