Enable containers on macOS to use the GPU #397

Merged
rhatdan merged 3 commits into containers:main on Oct 31, 2024

Conversation

@slp (Collaborator) commented Oct 31, 2024

Three changes:

  • Bump llama.cpp to the latest upstream, which enables the Kompute backend to offload Q4_K_M models.
  • Add a --gpu flag to request that the model be offloaded to the GPU (see the sketch after this list).
  • When running in a container, bind the server to 0.0.0.0 so the port can be reached from outside the container.
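
As a rough, illustrative sketch (not the PR's actual code): the --gpu flag maps to llama.cpp's -ngl option. The helper name gpu_offload_args and the environment-variable check below simply mirror the diff fragment quoted later in this conversation; the full condition (including the macOS/Metal path) is not reproduced here.

    import os

    def gpu_offload_args(gpu_requested: bool) -> list:
        """Illustrative helper, not the PR's code: map a --gpu request to llama.cpp args."""
        if not gpu_requested:
            return []
        # "-ngl 99" asks llama.cpp to offload up to 99 layers to the GPU; llama.cpp
        # silently ignores the option if offloading turns out not to be possible.
        if os.getenv("HIP_VISIBLE_DEVICES") or os.getenv("CUDA_VISIBLE_DEVICES"):
            return ["-ngl", "99"]
        print("GPU offload was requested but is not available on this system")
        return []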

slp added 3 commits October 31, 2024 14:18
As of commit 1329c0a7, llama.cpp's Kompute backend gained the ability to
offload Q4_K_M models to a Vulkan-capable GPU.

Signed-off-by: Sergio Lopez <[email protected]>
Add a "--gpu" that allows users to request the workload to be
offloaded to the GPU. This works natively on macOS using Metal and
in containers using Vulkan with llama.cpp's Kompute backend.

Signed-off-by: Sergio Lopez <[email protected]>
To be able to properly expose the port outside the container, we
need to pass "--host 0.0.0.0" to llama.cpp.

Signed-off-by: Sergio Lopez <[email protected]>
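
A minimal sketch of the host-binding idea from this commit message. The helper name and the container-detection files (/run/.containerenv for Podman, /.dockerenv for Docker) are assumptions for illustration, not the PR's code; only the "--host 0.0.0.0" argument comes from the commit itself.

    import os

    def host_args() -> list:
        """Illustrative helper, not the PR's code: pick the address llama.cpp binds to."""
        # Common (but not universal) markers that we are running inside a container.
        in_container = os.path.exists("/run/.containerenv") or os.path.exists("/.dockerenv")
        if in_container:
            # Bind to all interfaces so the container's published port is reachable
            # from outside the container.
            return ["--host", "0.0.0.0"]
        # Outside a container, keep llama.cpp's default loopback binding.
        return []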
slp marked this pull request as ready for review on October 31, 2024 at 13:27
@slp (Collaborator, Author) commented Oct 31, 2024

This one supersedes #235

os.getenv("HIP_VISIBLE_DEVICES") or os.getenv("CUDA_VISIBLE_DEVICES")):
gpu_args = ["-ngl", "99"]
else:
print("GPU offload was requested but is not available on this system")
A Member commented on the diff:

Should this be raised as an exception?

slp (Collaborator, Author) replied:

I don't think there's a "right" answer to that question; it's mostly a UX decision. Stopping the application might be less confusing, since users couldn't miss the message. On the other hand, llama.cpp's own behavior is to ignore -ngl 99 if anything goes wrong while offloading the model to the GPU, so there will still be plenty of cases where this can happen.
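
For illustration only, the two behaviours being weighed might look like this; the function and its strict parameter are hypothetical, not code from the PR:

    def report_missing_gpu(strict: bool = False) -> None:
        """Illustrative only: the two ways of reporting an unavailable GPU."""
        msg = "GPU offload was requested but is not available on this system"
        if strict:
            # The reviewer's suggestion: stop the application so the message can't be missed.
            raise RuntimeError(msg)
        # The PR's behaviour: print a warning and continue; llama.cpp itself also
        # ignores "-ngl 99" when offloading fails, so the run falls back to the CPU.
        print(msg)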

@rhatdan (Member) commented Oct 31, 2024

LGTM

rhatdan merged commit 85e9803 into containers:main on Oct 31, 2024
13 checks passed