Add --ngl to specify the number of GPU layers, and --keep-groups so Podman has access to the GPU #659

Merged: 1 commit, Feb 2, 2025
7 changes: 7 additions & 0 deletions docs/ramalama.1.md
@@ -109,6 +109,13 @@ The default can be overridden in the ramalama.conf file or via the
RAMALAMA_IMAGE environment variable. `export RAMALAMA_IMAGE=quay.io/ramalama/aiimage:latest` tells
RamaLama to use the `quay.io/ramalama/aiimage:latest` image.

#### **--keep-groups**
**Member:**
I would prefer to have RamaLama figure this out, or always set it when Podman is used in rootless mode.
Users are not going to understand when and when not to use this.

**Collaborator Author:**
I think this should be a decision of maintainers...

I think that "keep-groups" should only be enabled when necessary, and I can't think of bulletproof logic to determine when it is necessary. So keeping it as an option, off by default, seemed sensible.

**Member:**

The downside of using it all of the time is leaking GIDs into the container when not necessary. Add an option to allow users to disable it for security purposes, but I think we should enable it by default for rootless containers so that users will not stumble over it.

pass --group-add keep-groups to podman (default: False)
Needed to access the gpu on some systems, but has an impact on security, use with caution.
Comment on lines +112 to +114
**Contributor:**

🚨 suggestion (security): Elaborate on the security implications of --keep-groups.

What are the specific security implications of using this flag?

Suggested change

Original:
#### **--keep-groups**
pass --group-add keep-groups to podman (default: False)
Needed to access the gpu on some systems, but has an impact on security, use with caution.

Suggested:
#### **--keep-groups**
pass --group-add keep-groups to podman (default: False)
Needed to access the gpu on some systems, but has significant security implications:
- Preserves the host user's supplementary group memberships inside the container
- Could grant the container unnecessary elevated privileges through group memberships (e.g., disk, docker, sudo)
- May allow container processes to access host system resources that share the same group permissions
- Breaks container isolation principles by sharing the host's security context
Only use this flag if GPU access cannot be achieved through more secure methods such as specific device mapping.

**Collaborator:**

I think these are great suggestions for documentation, at least; dunno if you saw this bot comment @rhatdan @khumarahn
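
For readers of the two threads above, here is a minimal sketch (not part of this PR, and not RamaLama code; the device paths and the helper name are illustrative assumptions) of one possible heuristic for deciding when rootless Podman would need `--group-add keep-groups`: check whether the GPU device nodes are only reachable through one of the invoking user's supplementary groups.

```python
import os
import stat

def needs_keep_groups(devices=("/dev/kfd", "/dev/dri/renderD128")):
    """Hypothetical helper: return True if a GPU device node is only
    accessible via a supplementary group, which rootless Podman drops
    unless --group-add keep-groups is passed."""
    supplementary = set(os.getgroups()) - {os.getgid()}
    for dev in devices:
        try:
            st = os.stat(dev)
        except FileNotFoundError:
            continue
        group_rw = st.st_mode & (stat.S_IRGRP | stat.S_IWGRP)
        if group_rw and st.st_gid in supplementary:
            return True
    return False
```

As the thread notes, no such check is bulletproof (device paths and group setups vary by vendor and distribution), which is why the PR keeps the flag explicit and off by default.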


#### **--ngl**
number of gpu layers (default: 999)

#### **--nocontainer**
do not run RamaLama in the default container (default: False)
The default can be overridden in the ramalama.conf file.
9 changes: 9 additions & 0 deletions docs/ramalama.conf
@@ -44,6 +44,15 @@
#
#host = "0.0.0.0"

# Pass `--group-add keep-groups` to podman, when using podman.
# In some cases this is needed to access the gpu from a rootless container
#
#keep_groups = false

# Default number of layers offloaded to the gpu
#
#ngl = 999

# Specify default port for services to listen on
#
#port = "8080"
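
For illustration, a minimal sketch (not RamaLama's actual loader; the file path and the optional `[ramalama]` table are assumptions not shown in this diff) of how these keys could be read with the documented defaults, mirroring the `config.get(...)` fallbacks used in cli.py below:

```python
import tomllib  # Python 3.11+; purely illustrative, assuming the conf parses as TOML

def load_config(path="/usr/share/ramalama/ramalama.conf"):
    """Read the conf file, falling back to the documented defaults
    when the file or a key is missing (hypothetical helper)."""
    data = {}
    try:
        with open(path, "rb") as f:
            data = tomllib.load(f)
    except FileNotFoundError:
        pass
    cfg = data.get("ramalama", data)  # keys may sit under a [ramalama] table
    return {
        "keep_groups": cfg.get("keep_groups", False),
        "ngl": cfg.get("ngl", 999),
        "port": cfg.get("port", "8080"),
    }
```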
9 changes: 9 additions & 0 deletions docs/ramalama.conf.5.md
@@ -86,6 +86,15 @@ IP address for llama.cpp to listen on.
OCI container image to run with the specified AI model
RAMALAMA_IMAGE environment variable overrides this field.

**keep_groups**=false

Pass `--group-add keep-groups` to podman, when using podman.
In some cases this is needed to access the gpu from a rootless container

**ngl**=999

Default number of layers to offload to the gpu

**port**="8080"

Specify default port for services to listen on
15 changes: 15 additions & 0 deletions ramalama/cli.py
@@ -189,6 +189,21 @@ def configure_arguments(parser):
action="store_true",
help="offload the workload to the GPU",
)
parser.add_argument(
"--ngl",
dest="ngl",
type=int,
default=config.get("ngl", 999),
help="Number of layers to offload to the gpu, if available"
)
parser.add_argument(
"--keep-groups",
dest="podman_keep_groups",
default=config.get("keep_groups", False),
action="store_true",
help="""pass `--group-add keep-groups` to podman, if using podman.
Needed to access gpu on some systems, but has security implications.""",
)
parser.add_argument(
"--image",
default=config.get("image"),
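One detail worth noting about the pattern above, relevant to the review comment asking for a way to disable the feature: with `action="store_true"` and a config-driven default, the command-line flag can only switch the value on. A minimal, self-contained reproduction with plain argparse (not RamaLama code; the hard-coded default stands in for the config lookup):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--keep-groups",
    dest="podman_keep_groups",
    default=False,           # stand-in for config.get("keep_groups", False)
    action="store_true",
)

print(parser.parse_args([]))                  # Namespace(podman_keep_groups=False)
print(parser.parse_args(["--keep-groups"]))   # Namespace(podman_keep_groups=True)
# If the config default were True, there is no "--no-keep-groups" here to
# turn it back off; disabling would have to happen in ramalama.conf.
```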
16 changes: 9 additions & 7 deletions ramalama/model.py
Expand Up @@ -160,6 +160,8 @@ def setup_container(self, args):

if os.path.basename(args.engine) == "podman":
conman_args += ["--pull=newer"]
if args.podman_keep_groups:
conman_args += ["--group-add", "keep-groups"]
elif os.path.basename(args.engine) == "docker":
try:
run_cmd([args.engine, "pull", "-q", args.image], ignore_all=True)
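
To make the effect of the new `--group-add keep-groups` branch concrete, here is a sketch of the kind of command line it contributes to (illustrative only: the image name and the surrounding `run` flags are assumptions, not what setup_container actually emits):

```python
# Illustrative only, not RamaLama's code path.
engine = "podman"
podman_keep_groups = True  # i.e. args.podman_keep_groups

conman_args = [engine, "run", "--rm", "-it", "--pull=newer"]
if podman_keep_groups:
    # With rootless Podman, "keep-groups" is a special value telling the
    # OCI runtime to keep the invoking user's supplementary groups
    # (e.g. video/render) instead of dropping them in the container.
    conman_args += ["--group-add", "keep-groups"]
conman_args += ["quay.io/ramalama/ramalama:latest"]

print(" ".join(conman_args))
# podman run --rm -it --pull=newer --group-add keep-groups quay.io/ramalama/ramalama:latest
```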
@@ -188,10 +190,10 @@ def setup_container(self, args):
conman_args += ["-e", f"{k}={v}"]
return conman_args

def gpu_args(self, force=False, runner=False):
def gpu_args(self, args, runner=False):
gpu_args = []
if (
force
args.gpu
**Collaborator:**
I don't think this PR is correct: on systems with integrated graphics we don't want to use GPU acceleration; it's typically much slower than CPU inferencing, practically unusable.

With the default being a static 999, this will always be the case now, which is not good. We should look at this again before we spin up a new release.

**Collaborator:**
We can have this option, but the default being 999 is not good.

**Collaborator Author:**
I don't think I changed any existing logic here; I just added two options that allow RamaLama to work on my system. The default of 999 layers is of course questionable.

**Collaborator:**
It's also the lack of VRAM on integrated graphics; most of the time one will crash on one of these systems because of running out of VRAM.

**Collaborator Author:**
Sure, but somehow RamaLama works on my laptop with no GPU without any configuration. This PR did not change the default behaviour of RamaLama, unless I'm missing something.

**Collaborator (@ericcurtin, Feb 2, 2025):**
We should try to figure out why 999 was or wasn't included on your system.

or os.getenv("HIP_VISIBLE_DEVICES")
or os.getenv("ASAHI_VISIBLE_DEVICES")
or os.getenv("CUDA_VISIBLE_DEVICES")
@@ -206,7 +208,7 @@ def gpu_args(self, force=False, runner=False):
else:
gpu_args += ["-ngl"] # single dash

gpu_args += ["999"]
gpu_args += [f'{args.ngl}']

return gpu_args
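
For readers following the thread above, a minimal sketch of why `-ngl` is only emitted when `--gpu` is passed or one of the vendor visibility variables is set (an approximation only: the full condition is partly elided in this diff):

```python
import os

def wants_gpu_offload(gpu_flag: bool) -> bool:
    """Approximation of the guard around gpu_args above; the real check
    in model.py has more clauses than this diff shows."""
    return bool(
        gpu_flag
        or os.getenv("HIP_VISIBLE_DEVICES")
        or os.getenv("ASAHI_VISIBLE_DEVICES")
        or os.getenv("CUDA_VISIBLE_DEVICES")
    )

# On a machine with no GPU and none of these variables set, this returns
# False and no "-ngl 999" is ever passed, consistent with the author's
# observation that the default behaviour did not change.
print(wants_gpu_offload(gpu_flag=False))
```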

@@ -256,7 +258,7 @@ def build_exec_args_perplexity(self, args, model_path):
exec_args = ["llama-perplexity"]

get_gpu()
gpu_args = self.gpu_args(force=args.gpu)
gpu_args = self.gpu_args(args=args)
if gpu_args is not None:
exec_args.extend(gpu_args)

@@ -295,7 +297,7 @@ def build_exec_args_bench(self, args, model_path):
exec_args = ["llama-bench"]

get_gpu()
gpu_args = self.gpu_args(force=args.gpu)
gpu_args = self.gpu_args(args=args)
if gpu_args is not None:
exec_args.extend(gpu_args)

Expand All @@ -314,7 +316,7 @@ def build_exec_args_run(self, args, model_path, prompt):
exec_args += ["-v"]

get_gpu()
gpu_args = self.gpu_args(force=args.gpu, runner=True)
gpu_args = self.gpu_args(args=args, runner=True)
if gpu_args is not None:
exec_args.extend(gpu_args)

@@ -379,7 +381,7 @@ def handle_runtime(self, args, exec_args, exec_model_path):
exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "2048"]
else:
get_gpu()
gpu_args = self.gpu_args(force=args.gpu)
gpu_args = self.gpu_args(args=args)
if gpu_args is not None:
exec_args.extend(gpu_args)
exec_args.extend(["--host", args.host])