
Misreported used memory with the driver 535.129.03 #168

Closed
bryant1410 opened this issue Jan 11, 2024 · 3 comments


bryant1410 commented Jan 11, 2024

Describe the bug

There's a mismatch between the used memory reported by nvidia-smi and gpustat. For example, right now, the former reports 11448 MiB while the latter reports 11961 MB (a difference of 513).

Also, I see the following warning being displayed:

<REDACTED_PATH>/lib/python3.10/site-packages/gpustat/nvml.py:145: UserWarning: Your NVIDIA driver requires a compatible version of pynvml (>= 11.510.69) installed to display the correct memory usage information (See #141 for more details). Please try `pip install --upgrade nvidia-ml-py`.

This happens even though I'm using the latest versions of gpustat (1.1.1) and nvidia-ml-py (12.535.133).

Screenshots or Program Output

$ gpustat --debug
<REDACTED_PATH>/lib/python3.10/site-packages/gpustat/nvml.py:145: UserWarning: Your NVIDIA driver requires a compatible version of pynvml (>= 11.510.69) installed to display the correct memory usage information (See #141 for more details). Please try `pip install --upgrade nvidia-ml-py`.
  warnings.warn(

<REDACTED_HOSTNAME>  Thu Jan 11 13:41:16 2024  535.129.03
[0] NVIDIA A10G | 25°C,   0 % | 11961 / 23028 MB | <REDACTED_USERNAME>(11420M)
$ nvidia-smi
Thu Jan 11 13:44:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P0              57W / 300W |  11448MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1584434      C               <REDACTED_PATH>/bin/python    11420MiB |
+---------------------------------------------------------------------------------------+

Environment information:

  • OS: Ubuntu 22.04.3 LTS
  • NVIDIA Driver version: 535.129.03
  • The name(s) of GPU card: NVIDIA A10G
  • gpustat version: 1.1.1
  • pynvml version: 12.535.133
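
For reference, the gap can be reproduced directly against NVML. Here is a minimal sketch (an illustration, not gpustat code), assuming the pynvml module is provided by nvidia-ml-py >= 11.510.69 and an R510+ driver: the legacy (v1) query folds driver-reserved memory into `used`, while the v2 query reports it separately, which is what nvidia-smi shows.

```python
# Minimal sketch: compare the v1 and v2 NVML memory queries on GPU 0.
# Assumes the pynvml module comes from nvidia-ml-py >= 11.510.69.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# v1 query: on R510+ drivers, "used" includes driver-reserved memory.
v1 = pynvml.nvmlDeviceGetMemoryInfo(handle)
# v2 query: reports the reserved amount separately (matches nvidia-smi).
v2 = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)

MiB = 1024 ** 2
print(f"v1 used: {v1.used // MiB} MiB (includes reserved)")
print(f"v2 used: {v2.used // MiB} MiB, reserved: {v2.reserved // MiB} MiB")

pynvml.nvmlShutdown()
```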
@bryant1410 bryant1410 added the bug label Jan 11, 2024
wookayin (Owner) commented Jan 11, 2024

Thanks for reporting! I think this is a pynvml installation problem. Do you happen to have the wrong pynvml package installed alongside nvidia-ml-py? (The two conflict because both provide the same pynvml module.) Please confirm with:

pip list | grep nvml

ls -al $(python -c 'import pynvml; print(pynvml.__file__)')

Try the following and see if it fixes the issue:

pip uninstall nvidia-ml-py3 pynvml
pip install --force-reinstall --ignore-installed 'nvidia-ml-py'
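
As an extra sanity check, a quick sketch (not a gpustat command) that shows which file provides the pynvml module and whether it accepts the `version` argument the v2 memory API requires:

```python
# Illustrative check: where does the pynvml module come from, and does
# nvmlDeviceGetMemoryInfo support the v2 `version` keyword?
import inspect
import pynvml

print("module file:", pynvml.__file__)
print("supports v2 query:",
      "version" in inspect.signature(pynvml.nvmlDeviceGetMemoryInfo).parameters)
```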

Maybe I should revert #153, because this is such a common error that users might run into it without realizing...

bryant1410 (Author) commented
Yes, that fixes the issue. pynvml 11.4.1 was installed. Thanks, Jongwook!!

So what's the general recommendation? To keep a particular version of pynvml installed, or to uninstall it?

Feel free to close this issue if you want.

wookayin (Owner) commented Jan 11, 2024

It's fine (and recommended) to use the latest version of pynvml (the Python module name) / nvidia-ml-py (the PyPI package name to install), assuming NVIDIA doesn't break backward compatibility. In general, the latest nvidia-ml-py whose version shares the driver's version prefix (e.g., driver 535.129.03 matches nvidia-ml-py 12.535.133) should work.
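
A small sketch of that prefix rule (the helper logic here is illustrative, not a gpustat API):

```python
# Sketch: check that the driver's major version ("535") appears in the
# installed nvidia-ml-py version (e.g. "12.535.133").
from importlib.metadata import version
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
pynvml.nvmlShutdown()
if isinstance(driver, bytes):  # older bindings return bytes
    driver = driver.decode()

driver_major = driver.split(".")[0]                 # e.g. "535"
binding_parts = version("nvidia-ml-py").split(".")  # e.g. ["12", "535", "133"]
print("prefixes match:", driver_major in binding_parts)
```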

To emphasize again:

Caution

NEVER use pip install pynvml, nor add pynvml as a dependency of your Python project.
Instead: pip install nvidia-ml-py is correct.

IMO the pynvml package should be removed from PyPI. I will add this to the README. Thanks again for reporting!
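
For projects that depend on the NVML bindings, a defensive guard along these lines (a hypothetical helper, not part of gpustat) can fail fast when the conflicting distribution is present:

```python
# Hypothetical guard: refuse to run if the forked "pynvml" distribution
# shadows the official nvidia-ml-py bindings.
from importlib.metadata import PackageNotFoundError, version

def ensure_official_nvml_bindings() -> None:
    try:
        version("nvidia-ml-py")
    except PackageNotFoundError:
        raise RuntimeError(
            "nvidia-ml-py is not installed; run: pip install nvidia-ml-py"
        )
    try:
        # If the forked distribution is also installed, the pynvml module
        # on disk is ambiguous -- both distributions claim the same name.
        version("pynvml")
        raise RuntimeError(
            "Conflicting 'pynvml' package found; run: pip uninstall pynvml"
        )
    except PackageNotFoundError:
        pass  # good: only nvidia-ml-py provides the pynvml module
```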

wookayin added a commit that referenced this issue Jan 12, 2024
Revert "Make gpustat.nvml compatible with a third-party fork of pynvml"

This reverts commit 7c09a0f.

gpustat v1.1.1 allowed the problematic 'pynvml' package to be used
as a workaround, but this still causes many problems (e.g., #168).
Only the official nvidia-ml-py can be used with gpustat.

See #153, #168
@wookayin wookayin removed the bug label Jan 12, 2024