
podman for nvidia may need --group-add keep-groups #655

Closed · khumarahn opened this issue Jan 29, 2025 · 16 comments

@khumarahn (Collaborator) commented Jan 29, 2025

Hi! RamaLama is a very nice project.

I'm having trouble getting it to use an NVIDIA GPU: on my system, /dev/dri/card* is restricted to the video group, and podman drops that supplementary group.

$ ls -alh /dev/dri
total 0
drwxr-xr-x   3 root root        140 Jan 28 11:15 .
drwxr-xr-x  21 root root       4.5K Jan 28 13:16 ..
drwxr-xr-x   2 root root        120 Jan 28 11:15 by-path
crw-rw----+  1 root video  226,   0 Jan 28 11:15 card0
crw-rw----+  1 root video  226,   1 Jan 28 11:15 card1
crw-rw-rw-   1 root render 226, 128 Jan 28 11:15 renderD128
crw-rw-rw-   1 root render 226, 129 Jan 28 11:15 renderD129

This is similar to #376 and related to containers/podman#10166

@ericcurtin (Collaborator)

Let's just add it like @bmahabirbu's #376, since this is the second reported case.

Please open a PR and ensure it's applied in all cases: kubelet, quadlet, run, serve, etc.

@khumarahn (Collaborator, Author) commented Jan 29, 2025

I managed to make it work with just the change below. My GPU is old and doesn't have much RAM, so I also had to pass --gpu-layers to llama.cpp, which I did in the least universal way possible:

diff --git a/ramalama/model.py b/ramalama/model.py
index 710a27a..3231410 100644
--- a/ramalama/model.py
+++ b/ramalama/model.py
@@ -185,6 +185,7 @@ class Model:
             # Special case for Cuda
             if k == "CUDA_VISIBLE_DEVICES":
                 conman_args += ["--device", "nvidia.com/gpu=all"]
+                conman_args += ["--group-add", "keep-groups"]
             conman_args += ["-e", f"{k}={v}"]
         return conman_args
 
@@ -382,7 +383,7 @@ class Model:
             gpu_args = self.gpu_args(force=args.gpu)
             if gpu_args is not None:
                 exec_args.extend(gpu_args)
-            exec_args.extend(["--host", args.host])
+            exec_args.extend(["--host", args.host, "--gpu-layers", "16"])
         return exec_args
 
     def generate_container_config(self, model_path, args, exec_args):

@khumarahn (Collaborator, Author)

I think there needs to be a way to pass --gpu-layers to llama.cpp too. Ollama seems to compute it automatically somehow.

@ericcurtin (Collaborator)

--gpu-layers is -ngl in the llama.cpp CLI.
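
For context, -ngl, --gpu-layers and --n-gpu-layers are alternate spellings of the same llama.cpp option. A rough sketch of the kind of argument list being discussed; the binary name, host, and model path here are placeholder assumptions, not ramalama's actual values:

# Illustrative sketch only: how the layer-offload flag might end up on a
# llama.cpp command line. Binary name, host and model path are placeholders.
exec_args = [
    "llama-server",
    "--model", "/path/to/model.gguf",
    "--host", "0.0.0.0",
    "-ngl", "16",  # same option as --gpu-layers / --n-gpu-layers
]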

@khumarahn (Collaborator, Author)

Ah, right! This also works:

diff --git a/ramalama/model.py b/ramalama/model.py
index 710a27a..ad87b85 100644
--- a/ramalama/model.py
+++ b/ramalama/model.py
@@ -185,6 +185,7 @@ class Model:
             # Special case for Cuda
             if k == "CUDA_VISIBLE_DEVICES":
                 conman_args += ["--device", "nvidia.com/gpu=all"]
+                conman_args += ["--group-add", "keep-groups"]
             conman_args += ["-e", f"{k}={v}"]
         return conman_args
 
@@ -206,7 +207,7 @@ class Model:
             else:
                 gpu_args += ["-ngl"]  # single dash
 
-            gpu_args += ["999"]
+            gpu_args += ["16"]
 
         return gpu_args
 

@khumarahn reopened this Jan 29, 2025
@rhatdan (Member) commented Jan 29, 2025

The keep-groups flag makes sense for rootless containers, but it would not work with the Docker backend, so we need to be careful.

The gpu_args change I will leave up to @ericcurtin.
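
One way to handle this concern, as a minimal sketch (the engine check below is hypothetical, not ramalama's actual logic), is to add the flag only when the container engine is podman:

# Minimal sketch, assuming the engine name is known: only podman understands
# the special "keep-groups" value for --group-add, so guard it accordingly.
def cuda_device_args(engine: str) -> list[str]:
    args = ["--device", "nvidia.com/gpu=all"]
    if engine == "podman":  # docker would reject "keep-groups"
        args += ["--group-add", "keep-groups"]
    return args

print(cuda_device_args("podman"))  # ['--device', 'nvidia.com/gpu=all', '--group-add', 'keep-groups']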

@ericcurtin (Collaborator) commented Jan 29, 2025

I think we might need:

"--gpu-layers", "16"

but I'm not sure about changing 999 for llama.cpp; most of the time that is what you want: offload the maximum number of layers available.

We have to be careful about merging defaults that "work best on my GPU". CLI options to override the defaults are fine, though.

@bmahabirbu (Collaborator) commented Jan 29, 2025

@khumarahn did you follow the NVIDIA CUDA setup guide at https://github.com/containers/ramalama/blob/main/docs/ramalama-cuda.7.md? After setting up CUDA with containers this way, I no longer needed to pass --group-add keep-groups.

It could also just be the environment you're using.

@khumarahn (Collaborator, Author)

I followed the guide; I still need the groups.

"--gpu-layers", "16" should not be hardcoded; that is just what worked for my GPU with a particular model. Ideally it should be configurable as a ramalama command-line option. The reason I wanted to change it is that with 999 layers ramalama would crash, failing to allocate enough GPU memory.

@ericcurtin (Collaborator)

Feel free to add a --ngl option to RamaLama, @khumarahn.
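
A minimal sketch of what such an option could look like, assuming an argparse-based CLI; the wiring below is illustrative, not the project's actual code:

import argparse

# Illustrative only: expose the number of offloaded layers as a CLI option
# instead of hardcoding it; 999 keeps llama.cpp's "offload everything" default.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--ngl",
    type=int,
    default=999,
    help="number of model layers to offload to the GPU",
)
args = parser.parse_args(["--ngl", "16"])
gpu_args = ["-ngl", str(args.ngl)]
print(gpu_args)  # ['-ngl', '16']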

@bmahabirbu (Collaborator)

For the --ngl bit, try using n_gpu_layers = -1 and see if that fixes the issue. Supposedly that automatically offloads the correct number of GPU layers.

@khumarahn (Collaborator, Author)

> For the --ngl bit, try using n_gpu_layers = -1 and see if that fixes the issue. Supposedly that automatically offloads the correct number of GPU layers.

I didn't find this in the llama.cpp docs, and in my test -1 GPU layers only used 1 GB of GPU RAM.

@bmahabirbu (Collaborator)

Good to know. I saw it mentioned in some posts and thought it was still relevant.

@ericcurtin (Collaborator)

> For the --ngl bit, try using n_gpu_layers = -1 and see if that fixes the issue. Supposedly that automatically offloads the correct number of GPU layers.

> I didn't find this in the llama.cpp docs, and in my test -1 GPU layers only used 1 GB of GPU RAM.

-1 doesn't work in llama.cpp; in llama.cpp, 999 means use the maximum number of layers.

It may work in vLLM, I don't know.

@khumarahn (Collaborator, Author)

> -1 doesn't work in llama.cpp; in llama.cpp, 999 means use the maximum number of layers.

But my llama.cpp crashes with 999 layers with:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 18508.35 MiB on device 0: cudaMalloc failed: out of memory

@khumarahn (Collaborator, Author)

I created a PR. It seemed pretty straightforward, but please double-check it...
