Enable containers on macOS to use the GPU #397

Merged
rhatdan merged 3 commits into containers:main on Oct 31, 2024

Conversation

@slp (Collaborator) commented Oct 31, 2024

Three changes:

  • Bump llama.cpp to the latest upstream, which enables the Kompute backend to offload Q4_K_M models.
  • Add a --gpu flag to request that the model be offloaded to the GPU (see the sketch after this list).
  • When running in a container, bind the server to 0.0.0.0 so the port can be reached from outside the container.
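
As a rough, illustrative sketch (not the PR's actual code): the --gpu flag maps to llama.cpp's -ngl option. The helper name gpu_offload_args and the environment-variable check below simply mirror the diff fragment quoted later in this conversation; the full condition (including the macOS/Metal path) is not reproduced here.

    import os

    def gpu_offload_args(gpu_requested: bool) -> list:
        """Illustrative helper, not the PR's code: map a --gpu request to llama.cpp args."""
        if not gpu_requested:
            return []
        # "-ngl 99" asks llama.cpp to offload up to 99 layers to the GPU; llama.cpp
        # silently ignores the option if offloading turns out not to be possible.
        if os.getenv("HIP_VISIBLE_DEVICES") or os.getenv("CUDA_VISIBLE_DEVICES"):
            return ["-ngl", "99"]
        print("GPU offload was requested but is not available on this system")
        return []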

slp added 3 commits October 31, 2024 14:18
As of commit 1329c0a7, llama.cpp's Kompute backend gained the ability to
offload Q4_K_M models to a Vulkan-capable GPU.

Signed-off-by: Sergio Lopez <[email protected]>
Add a "--gpu" that allows users to request the workload to be
offloaded to the GPU. This works natively on macOS using Metal and
in containers using Vulkan with llama.cpp's Kompute backend.

Signed-off-by: Sergio Lopez <[email protected]>
To be able to properly expose the port outside the container, we
need to pass "--host 0.0.0.0" to llama.cpp.

Signed-off-by: Sergio Lopez <[email protected]>
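
A minimal sketch of the host-binding idea from this commit message. The helper name and the container-detection files (/run/.containerenv for Podman, /.dockerenv for Docker) are assumptions for illustration, not the PR's code; only the "--host 0.0.0.0" argument comes from the commit itself.

    import os

    def host_args() -> list:
        """Illustrative helper, not the PR's code: pick the address llama.cpp binds to."""
        # Common (but not universal) markers that we are running inside a container.
        in_container = os.path.exists("/run/.containerenv") or os.path.exists("/.dockerenv")
        if in_container:
            # Bind to all interfaces so the container's published port is reachable
            # from outside the container.
            return ["--host", "0.0.0.0"]
        # Outside a container, keep llama.cpp's default loopback binding.
        return []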
slp marked this pull request as ready for review on October 31, 2024 at 13:27
@slp (Collaborator, Author) commented Oct 31, 2024

This one supersedes #235

os.getenv("HIP_VISIBLE_DEVICES") or os.getenv("CUDA_VISIBLE_DEVICES")):
gpu_args = ["-ngl", "99"]
else:
print("GPU offload was requested but is not available on this system")
A Member commented on the diff:

Should this be raised as an exception?

slp (Collaborator, Author) replied:

I don't think there's a "right" answer to that question; it's mostly a UX decision. Stopping the application might be less confusing, since users couldn't miss the message. On the other hand, llama.cpp's own behavior is to ignore -ngl 99 if anything goes wrong while offloading the model to the GPU, so there will still be plenty of cases where this can happen.
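
For illustration only, the two behaviours being weighed might look like this; the function and its strict parameter are hypothetical, not code from the PR:

    def report_missing_gpu(strict: bool = False) -> None:
        """Illustrative only: the two ways of reporting an unavailable GPU."""
        msg = "GPU offload was requested but is not available on this system"
        if strict:
            # The reviewer's suggestion: stop the application so the message can't be missed.
            raise RuntimeError(msg)
        # The PR's behaviour: print a warning and continue; llama.cpp itself also
        # ignores "-ngl 99" when offloading fails, so the run falls back to the CPU.
        print(msg)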

@rhatdan (Member) commented Oct 31, 2024

LGTM

rhatdan merged commit 85e9803 into containers:main on Oct 31, 2024
13 checks passed