add --ngl to specify the number of gpu layers, and --keep-groups so podman has access to gpu #659
Changes from all commits
```diff
@@ -109,6 +109,13 @@ The default can be overridden in the ramalama.conf file or via the the
 RAMALAMA_IMAGE environment variable. `export RAMALAMA_TRANSPORT=quay.io/ramalama/aiimage:latest` tells
 RamaLama to use the `quay.io/ramalama/aiimage:latest` image.
 
+#### **--keep-groups**
+pass --group-add keep-groups to podman (default: False)
+Needed to access the gpu on some systems, but has an impact on security, use with caution.
+
+#### **--ngl**
+number of gpu layers (default: 999)
+
 #### **--nocontainer**
 do not run RamaLama in the default container (default: False)
 The default can be overridden in the ramalama.conf file.
```

Comment on lines +112 to +114

Review bot: 🚨 suggestion (security): Elaborate on the security implications of --keep-groups. What are the specific security implications of using this flag?

Reply: I think these are great suggestions for documentation at least, dunno if you saw this bot comment @rhatdan @khumarahn
```diff
@@ -160,6 +160,8 @@ def setup_container(self, args):
 
         if os.path.basename(args.engine) == "podman":
             conman_args += ["--pull=newer"]
+            if args.podman_keep_groups:
+                conman_args += ["--group-add", "keep-groups"]
         elif os.path.basename(args.engine) == "docker":
            try:
                run_cmd([args.engine, "pull", "-q", args.image], ignore_all=True)
```
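These hunks only show where the new attributes are consumed; the CLI wiring that produces `args.podman_keep_groups` and `args.ngl` is not part of the excerpt. A minimal sketch of how the two options might be registered, taking the destination names from the code above and the defaults from the man-page text (everything else is assumed):

```python
import argparse

parser = argparse.ArgumentParser(prog="ramalama")

# Hypothetical wiring; the real registration lives in RamaLama's CLI code.
parser.add_argument(
    "--keep-groups",
    dest="podman_keep_groups",   # name consumed by setup_container() above
    action="store_true",
    default=False,               # man-page default
    help="pass --group-add keep-groups to podman",
)
parser.add_argument(
    "--ngl",
    type=int,
    default=999,                 # man-page default: effectively offload all layers
    help="number of gpu layers",
)

args = parser.parse_args(["--keep-groups", "--ngl", "35"])
print(args.podman_keep_groups, args.ngl)  # True 35
```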
```diff
@@ -188,10 +190,10 @@ def setup_container(self, args):
             conman_args += ["-e", f"{k}={v}"]
         return conman_args
 
-    def gpu_args(self, force=False, runner=False):
+    def gpu_args(self, args, runner=False):
         gpu_args = []
         if (
-            force
+            args.gpu
             or os.getenv("HIP_VISIBLE_DEVICES")
             or os.getenv("ASAHI_VISIBLE_DEVICES")
             or os.getenv("CUDA_VISIBLE_DEVICES")
```

Reviewer: I don't think this PR is correct. On systems with integrated graphics we don't want to use GPU acceleration; it's typically much slower than CPU inferencing, practically unusable. With the default being a static 999 this will always be the case now, which is not good. We should look at this again before we spin up a new release.

Reviewer: We can have this option, but the default being 999 is not good.

PR author: I don't think I changed any existing logic here; I just added two options that allow ramalama to work on my system. The default of 999 layers is of course questionable.

Reviewer: It's the lack of VRAM on integrated graphics too; most of the time one will crash on these systems because of running out of VRAM.

PR author: Sure, but somehow ramalama works on my laptop with no gpu without any configuration. This PR did not change the default behaviour of ramalama, unless I'm missing something.

Reviewer: We should try and figure out why 999 was or wasn't included on your system.
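Context for the discussion above: `-ngl` is llama.cpp's count of model layers to offload to the GPU, and a very large value such as the hard-coded 999 behaves like "offload everything", since no more layers than the model actually has can be offloaded. A toy illustration of that clamping (not llama.cpp code; the layer counts are made up):

```python
def effective_offload(requested_ngl: int, model_layers: int) -> int:
    """Layers that actually land on the GPU: the request is capped at the
    model's layer count, so 999 amounts to 'offload all layers'."""
    return min(requested_ngl, model_layers)

print(effective_offload(999, 32))  # 32 -> the old hard-coded default offloads everything
print(effective_offload(24, 32))   # 24 -> what `--ngl 24` from this PR would request
print(effective_offload(0, 32))    # 0  -> keep inference entirely on the CPU
```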
```diff
@@ -206,7 +208,7 @@ def gpu_args(self, force=False, runner=False):
             else:
                 gpu_args += ["-ngl"]  # single dash
 
-            gpu_args += ["999"]
+            gpu_args += [ f'{args.ngl}' ]
 
         return gpu_args
```
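Taken together, the two gpu_args hunks leave the method looking roughly like the sketch below. The runner branch (the double-dash spelling implied by the "# single dash" comment) and the class wrapper are assumptions filled in from the context lines, not part of the diff:

```python
import os
from argparse import Namespace

class LlamaCppModel:  # hypothetical stand-in for the real model class
    def gpu_args(self, args, runner=False):
        gpu_args = []
        if (
            args.gpu
            or os.getenv("HIP_VISIBLE_DEVICES")
            or os.getenv("ASAHI_VISIBLE_DEVICES")
            or os.getenv("CUDA_VISIBLE_DEVICES")
        ):
            if runner:
                gpu_args += ["--ngl"]  # assumed double-dash form for the runner path
            else:
                gpu_args += ["-ngl"]   # single dash
            gpu_args += [f"{args.ngl}"]  # replaces the hard-coded "999"
        return gpu_args

# With the GPU forced on and --ngl 35, the non-runner paths would get:
print(LlamaCppModel().gpu_args(Namespace(gpu=True, ngl=35)))  # ['-ngl', '35']
```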
```diff
@@ -256,7 +258,7 @@ def build_exec_args_perplexity(self, args, model_path):
         exec_args = ["llama-perplexity"]
 
         get_gpu()
-        gpu_args = self.gpu_args(force=args.gpu)
+        gpu_args = self.gpu_args(args=args)
         if gpu_args is not None:
             exec_args.extend(gpu_args)
```
```diff
@@ -295,7 +297,7 @@ def build_exec_args_bench(self, args, model_path):
         exec_args = ["llama-bench"]
 
         get_gpu()
-        gpu_args = self.gpu_args(force=args.gpu)
+        gpu_args = self.gpu_args(args=args)
         if gpu_args is not None:
             exec_args.extend(gpu_args)
```
```diff
@@ -314,7 +316,7 @@ def build_exec_args_run(self, args, model_path, prompt):
             exec_args += ["-v"]
 
         get_gpu()
-        gpu_args = self.gpu_args(force=args.gpu, runner=True)
+        gpu_args = self.gpu_args(args=args, runner=True)
         if gpu_args is not None:
             exec_args.extend(gpu_args)
```
```diff
@@ -379,7 +381,7 @@ def handle_runtime(self, args, exec_args, exec_model_path):
             exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "2048"]
         else:
             get_gpu()
-            gpu_args = self.gpu_args(force=args.gpu)
+            gpu_args = self.gpu_args(args=args)
             if gpu_args is not None:
                 exec_args.extend(gpu_args)
             exec_args.extend(["--host", args.host])
```
Reviewer: I would prefer to have RamaLama figure this out, or always set it when Podman is used in rootless mode. Users are not going to understand when and when not to use this.
PR author: I think this should be a decision for the maintainers. I think that "keep-groups" should only be enabled when necessary, and I can't think of bulletproof logic to determine when it is necessary. So keeping it as an option, off by default, seemed sensible.
Reviewer: The downside of using it all of the time is a leak of GIDs into the container when not necessary. Add an option that lets users disable it for security purposes, but I think we should enable it by default for rootless containers so that users will not stumble over it.
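A sketch of how the "default to keep-groups for rootless Podman, with an opt-out" idea could look. None of this is in the PR; the rootless check (effective UID plus `podman info`) is just one possible heuristic, and the option name is a placeholder:

```python
import json
import os
import subprocess

def podman_is_rootless(engine="podman"):
    """Best-effort rootless detection via the effective UID and `podman info`."""
    if os.geteuid() == 0:
        return False
    try:
        out = subprocess.run(
            [engine, "info", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        return bool(json.loads(out).get("host", {}).get("security", {}).get("rootless"))
    except (OSError, subprocess.CalledProcessError, json.JSONDecodeError):
        return True  # non-root euid but no usable `podman info`: assume rootless

def group_add_args(keep_groups_opt=None):
    """keep_groups_opt is True/False when the user passed an explicit flag,
    None when unset, in which case keep-groups is enabled only for rootless podman."""
    enable = keep_groups_opt if keep_groups_opt is not None else podman_is_rootless()
    return ["--group-add", "keep-groups"] if enable else []
```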