
add runtimeclass nvidia as a default option for nimcache #177

Open
jxdn opened this issue Oct 5, 2024 · 6 comments
Labels
bug Something isn't working

Comments

jxdn commented Oct 5, 2024

Hi,

Can you help add a runtimeClass option to NIMCache and all the other CRDs?

I got this error:

```
Traceback (most recent call last):
  File "/usr/local/bin/download-to-cache", line 5, in <module>
    from vllm_nvext.hub.pre_download import download_to_cache
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/pre_download.py", line 20, in <module>
    from vllm_nvext.hub.ngc_injector import get_optimal_manifest_config
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", line 22, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs
  File "/usr/local/lib/python3.10/dist-packages/vllm/__init__.py", line 3, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 6, in <module>
    from vllm.config import (CacheConfig, DecodingConfig, DeviceConfig,
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 12, in <module>
    from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/__init__.py", line 3, in <module>
    from vllm.model_executor.layers.quantization.aqlm import AQLMConfig
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/aqlm.py", line 11, in <module>
    from vllm._C import ops
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
```
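The failing import bottoms out in `libcuda.so.1`, which is provided by the NVIDIA driver and is only mounted into the container when the NVIDIA container runtime is used. A quick way to check from inside a container whether that library is visible (a minimal sketch, not part of the NIM tooling; the helper name is illustrative):

```python
import ctypes


def libcuda_available() -> bool:
    """Return True if libcuda.so.1 can be dlopen'ed, i.e. the NVIDIA
    driver library was mounted into this container (as the nvidia
    runtime class would do); False otherwise."""
    try:
        ctypes.CDLL("libcuda.so.1")
        return True
    except OSError:
        return False


print(libcuda_available())
```

On a pod running without `runtimeClassName: nvidia` (and without a default NVIDIA runtime on the node), this returns False, which matches the ImportError above.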

jxdn commented Oct 6, 2024

(screenshot: the added line in the nim-operator source)

I added this line, rebuilt the nim-operator, and it works.

@jxdn jxdn changed the title add runtimeclass option for nimcache add runtimeclass nvidia as a default option for nimcache Oct 6, 2024
jxdn commented Oct 6, 2024

This also happens with NIMService; I had to patch it with:

```
kubectl patch deployment meta-llama3-8b-instruct --type='merge' -p='{"spec": {"template": {"spec": {"runtimeClassName": "nvidia"}}}}' -n nim
```
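For reference, the patch above is equivalent to setting `runtimeClassName` at the pod-template level of the Deployment. A minimal sketch of the resulting spec (container name and image are illustrative placeholders, not from the operator):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: meta-llama3-8b-instruct   # deployment name from the patch above
  namespace: nim
spec:
  template:
    spec:
      runtimeClassName: nvidia    # selects the NVIDIA container runtime for these pods
      containers:
        - name: nim               # illustrative; actual container is created by the operator
          image: <NIM image>
```

Note that the operator may reconcile this field away if it is not part of the CRD spec, which is why the issue asks for it as a first-class option.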

mkhaas (Collaborator) commented Oct 6, 2024

Thanks for the suggestion. We'll add it to our backlog. In the meantime, we recommend using a mutating webhook to add the runtime class.
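One way to implement the webhook workaround is a mutating admission policy. An illustrative sketch using Kyverno (this assumes Kyverno is installed in the cluster; the policy name and namespace are placeholders, not something the nim-operator ships):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-nvidia-runtimeclass
spec:
  rules:
    - name: set-runtime-class
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - nim             # only mutate pods in the NIM namespace
      mutate:
        patchStrategicMerge:
          spec:
            runtimeClassName: nvidia   # injected before the pod is scheduled
```

Because the mutation happens at admission time, it also covers pods created by the operator's Deployments and Jobs without patching each one.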

kirson-git commented

Can I use the patch command for NIMCache?

shivamerla (Collaborator) commented
@jxdn thanks for catching this. We didn't hit this error because nvidia was configured as the default runtime by the gpu-operator. We'll include this in the next patch.

@shivamerla shivamerla self-assigned this Oct 14, 2024
@shivamerla shivamerla added the bug Something isn't working label Oct 14, 2024
shivamerla (Collaborator) commented

This PR should fix NIM Service deployments. For caching, we should not need to specify the "nvidia" runtime class, since that Job can run on a non-GPU node. For the reported issue, the fix should be in the download-to-cache NIM tool, which loads CUDA libraries unnecessarily. I'm going to request a fix there instead.
