Enable containers on macOS to use the GPU #397

Merged 3 commits on Oct 31, 2024
2 changes: 1 addition & 1 deletion container-images/ramalama/Containerfile
@@ -6,7 +6,7 @@ ARG HUGGINGFACE_HUB_VERSION=0.26.2
ARG OMLMD_VERSION=0.1.6
# renovate: datasource=github-releases depName=tqdm/tqdm extractVersion=^v(?<version>.*)
ARG TQDM_VERSION=4.66.6
-ARG LLAMA_CPP_SHA=3f1ae2e32cde00c39b96be6d01c2997c29bae555
+ARG LLAMA_CPP_SHA=1329c0a75e6a7defc5c380eaf80d8e0f66d7da78
# renovate: datasource=git-refs depName=ggerganov/whisper.cpp packageName=https://github.com/ggerganov/whisper.cpp gitRef=master versioning=loose type=digest
ARG WHISPER_CPP_SHA=19dca2bb1464326587cbeb7af00f93c4a59b01fd

3 changes: 3 additions & 0 deletions docs/ramalama.1.md
@@ -89,6 +89,9 @@ show container runtime command without executing it (default: False)
run RamaLama using the specified container engine. Default is `podman` if installed otherwise docker.
The default can be overridden in the ramalama.conf file or via the RAMALAMA_CONTAINER_ENGINE environment variable.

#### **--gpu**
offload the workload to the GPU (default: False)

#### **--help**, **-h**
show this help message and exit

7 changes: 7 additions & 0 deletions ramalama/cli.py
@@ -196,6 +196,13 @@ def configure_arguments(parser):
        help="""do not run RamaLama in the default container.
The RAMALAMA_IN_CONTAINER environment variable modifies default behaviour.""",
    )
    parser.add_argument(
        "--gpu",
        dest="gpu",
        default=False,
        action="store_true",
        help="offload the workload to the GPU",
    )
    parser.add_argument(
        "--runtime",
        default=config.get("runtime"),
26 changes: 23 additions & 3 deletions ramalama/model.py
@@ -37,8 +37,6 @@ class Model:

    def __init__(self, model):
        self.model = model
-        if sys.platform == "darwin" or os.getenv("HIP_VISIBLE_DEVICES") or os.getenv("CUDA_VISIBLE_DEVICES"):
-            self.common_params += ["-ngl", "99"]

    def login(self, args):
        raise NotImplementedError(f"ramalama login for {self.type} not implemented")
@@ -146,7 +144,7 @@ def run_container(self, args, shortnames):
        if hasattr(args, "port"):
            conman_args += ["-p", f"{args.port}:{args.port}"]

-        if os.path.exists("/dev/dri"):
+        if sys.platform == "darwin" or os.path.exists("/dev/dri"):
            conman_args += ["--device", "/dev/dri"]

        if os.path.exists("/dev/kfd"):
@@ -180,6 +178,20 @@ def cleanup():
        run_cmd(conman_args, stdout=None, debug=args.debug)
        return True

    def gpu_args(self):
        gpu_args = []
        if sys.platform == "darwin":
            # llama.cpp will default to the Metal backend on macOS, so we don't need
            # any additional arguments.
            pass
        elif sys.platform == "linux" and (os.path.exists("/dev/dri") or
                                          os.getenv("HIP_VISIBLE_DEVICES") or os.getenv("CUDA_VISIBLE_DEVICES")):
            gpu_args = ["-ngl", "99"]
        else:
            print("GPU offload was requested but is not available on this system")
Review comment (Member):
Should this be raised as an exception?

Reply (Collaborator, author):
I don't think there's a "right" answer to that question; it's mostly a UX decision. Stopping the application might be less confusing, since users can't miss the message. On the other hand, llama.cpp's behavior is to ignore -ngl 99 if there is any problem offloading the model to the GPU, so there would still be plenty of cases where the fallback happens silently anyway.

        return gpu_args
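
For reference, a minimal sketch of the exception-raising alternative discussed in the thread above. It is hypothetical and not part of this PR: the exception name and message are made up, while the detection logic mirrors the merged gpu_args() above.

```python
import os
import sys


class GPUNotAvailableError(RuntimeError):
    """Hypothetical exception type; not part of the PR."""


def gpu_args(self):
    """Alternative Model.gpu_args() that raises instead of printing a warning."""
    if sys.platform == "darwin":
        # llama.cpp defaults to the Metal backend on macOS; no extra arguments needed.
        return []
    if sys.platform == "linux" and (
        os.path.exists("/dev/dri")
        or os.getenv("HIP_VISIBLE_DEVICES")
        or os.getenv("CUDA_VISIBLE_DEVICES")
    ):
        return ["-ngl", "99"]
    # Fail loudly so the user cannot miss the message.
    raise GPUNotAvailableError("GPU offload was requested but is not available on this system")
```

Even then, as the reply notes, llama.cpp itself ignores -ngl 99 when offloading fails, so raising here would not cover every silent CPU fallback; the merged change keeps the warning-only behavior shown above.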

    def run(self, args):
        prompt = "You are a helpful assistant"
        if args.ARGS:
@@ -205,6 +217,9 @@ def run(self, args):
        if not args.ARGS and sys.stdin.isatty():
            exec_args.append("-cnv")

        if args.gpu:
            exec_args.extend(self.gpu_args())

        try:
            exec_cmd(exec_args, args.debug, debug=args.debug)
        except FileNotFoundError as e:
@@ -217,6 +232,11 @@ def serve(self, args):
        exec_args = ["llama-server", "--port", args.port, "-m", model_path]
        if args.runtime == "vllm":
            exec_args = ["vllm", "serve", "--port", args.port, model_path]
        else:
            if args.gpu:
                exec_args.extend(self.gpu_args())
            if in_container():
                exec_args.extend(["--host", "0.0.0.0"])

        if args.generate == "quadlet":
            return self.quadlet(model_path, args, exec_args)
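
To make the serve() changes above concrete, here is a rough sketch (not output from this PR) of what exec_args would contain for the default llama.cpp runtime when --gpu is set on a Linux host with a usable GPU and the server runs inside a container; the port and model path are placeholders.

```python
# Hypothetical illustration only; port and model path are placeholders.
exec_args = [
    "llama-server",
    "--port", "8080",             # placeholder port
    "-m", "/path/to/model.gguf",  # placeholder model path
    "-ngl", "99",                 # appended by Model.gpu_args() because --gpu was set
    "--host", "0.0.0.0",          # appended because in_container() is true
]
```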