
[Backend] Add Llamacpp backend #2975

Open · wants to merge 53 commits into main
Conversation

angt (Collaborator) commented Jan 31, 2025:

This PR adds support for the llamacpp backend.
The Dockerfile_llamacpp enables native CPU execution as well as CUDA acceleration for GPUs.

For setup and usage details, you can check the doc.


mfuntowicz changed the title from “Add Llamacpp backend” to “[Backend] Add Llamacpp backend” on Feb 3, 2025
angt and others added 26 commits February 4, 2025 13:32
angt added 17 commits February 4, 2025 17:53
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

angt requested review from mfuntowicz and Hugoch on February 6, 2025 10:40
angt marked this pull request as ready for review on February 6, 2025 10:43
angt added 4 commits February 6, 2025 13:17
mfuntowicz previously approved these changes Feb 6, 2025

mfuntowicz (Member) commented:
I would remove this, especially if we end up deploying on Grace Hopper while building the containers under qemu. Would it impact anything?

angt (Collaborator, Author) replied:

Initially, I focused on ARM CPUs, where a “native” build is mandatory for now. Removing this requirement would still cause the Docker container to fail on a different host. Some work is needed here (and, I believe, in llama.cpp as well) to achieve full performance without this constraint.

Dockerfile_llamacpp (thread resolved)
Comment on lines +27 to +28
#[clap(long, env)]
model_gguf: String, // TODO Option() with hf->gguf & quantize
mfuntowicz (Member) commented:

I'm a bit reluctant to have two parameters (model_id, model_gguf) here, as it would introduce a de facto discrepancy between what users currently deploy (the TGI stack) and this new backend.

Would it be hard to rely solely on model-id, to keep the advantage of backward compatibility and transparent loading across backends? Or is it too much effort for a v1?

angt (Collaborator, Author) replied:

Totally agree, but it was planned for a v2 if there is momentum on this backend :)
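To make the v2 idea concrete: a model-id-only flow could resolve the GGUF through the hf-hub crate before handing it to llama.cpp. A minimal sketch, assuming hf-hub's sync API and a repo that ships a file literally named model.gguf (real repos vary); this is not what the PR implements:

```rust
use hf_hub::api::sync::Api;
use std::path::PathBuf;

// Hypothetical v2 helper: turn a plain model-id into a local GGUF path.
fn resolve_gguf(model_id: &str) -> Result<PathBuf, Box<dyn std::error::Error>> {
    let api = Api::new()?;
    let repo = api.model(model_id.to_string());
    // Downloads the file on first use, then serves it from the local cache.
    Ok(repo.get("model.gguf")?)
}
```

Picking the right GGUF file (and quantization) inside an arbitrary repo is the part that actually needs design work, which is presumably why it is deferred to a v2.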

use_mlock: bool,

/// Enable offloading of KQV operations to the GPU.
#[clap(default_value = "false", long, env)]
mfuntowicz (Member) commented:

Should we set this to true for the time being, since we mostly deploy this on GPU in this specific release?

Also, what happens if we set it to true but no GPU is present? Is it ignored? If that's the case, I would definitely argue for true by default.

angt (Collaborator, Author) replied:

It works on the CPU when enabled, but I'm worried that it might not be as safe on some model × GPU combinations.
Let's try enabling it by default and we'll see 👍
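Concretely, flipping the default is a one-attribute change in clap. A sketch of the proposal (the field name offload_kqv is assumed from the doc comment, not taken from the diff):

```rust
/// Enable offloading of KQV operations to the GPU.
/// Proposed default: true, since this release mostly targets GPU
/// deployments; llama.cpp still runs its CPU path when no GPU is present.
#[clap(default_value = "true", long, env)]
offload_kqv: bool,
```

Whether a true default can still be overridden back to false from the command line depends on how clap parses the bool flag, so that is worth checking before flipping it.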

type_v: LlamacppGGMLType,

/// Number of tokenizer workers used for payload validation and truncation.
#[clap(default_value = "2", long, env)]
mfuntowicz (Member) commented:

2 seems arbitrary here. Should we default to 1 and let users override it if they need more?

angt (Collaborator, Author) replied:

I took the default used in the launcher 😅
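For context, this kind of worker pool is typically a fixed set of threads draining a shared channel, so the number is a throughput knob rather than a correctness one. A minimal sketch of the pattern (validate_and_truncate and all names are illustrative, not this backend's actual code):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Illustrative stand-in for the real payload validation/truncation step.
fn validate_and_truncate(payload: &str) -> String {
    payload.chars().take(4096).collect()
}

fn spawn_tokenizer_workers(n: usize) -> mpsc::Sender<String> {
    let (tx, rx) = mpsc::channel::<String>();
    // mpsc receivers are single-consumer, so the workers share one via a mutex.
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..n {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            let payload = match rx.lock().unwrap().recv() {
                Ok(p) => p,
                Err(_) => break, // channel closed: worker exits
            };
            let _ = validate_and_truncate(&payload);
        });
    }
    tx
}
```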

Comment on lines 131 to 140
#[clap(long, env)]
ngrok: bool,

/// Ngrok authentication token.
#[clap(long, env)]
ngrok_authtoken: Option<String>,

/// Ngrok edge to use for tunneling.
#[clap(long, env)]
ngrok_edge: Option<String>,
mfuntowicz (Member) commented:

I think you can remove these

angt (Collaborator, Author) replied:

🫡

}

impl LlamacppSampler {
fn new(req: &LlamacppRequest) -> Option<Self> {
mfuntowicz (Member) commented:

Since almost all the lines are being marked unsafe, should the whole function be unsafe? Or should we introduce an unsafe scope inside the safe function?

angt (Collaborator, Author) replied:

Yes, it's totally ugly... but I have some ideas to hide them. 👍
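For the record, the two options being weighed look roughly like this; a sketch against a hypothetical llama.cpp FFI declaration, not this PR's actual bindings:

```rust
use std::ffi::c_void;

// Hypothetical FFI declaration standing in for a llama.cpp binding.
extern "C" {
    fn llama_sampler_init() -> *mut c_void;
}

// Option 1: mark the whole function unsafe and push the burden onto callers.
unsafe fn sampler_new_unchecked() -> Option<*mut c_void> {
    let ptr = llama_sampler_init();
    if ptr.is_null() { None } else { Some(ptr) }
}

// Option 2 (usually preferred): keep the function safe and confine the FFI
// call to a narrow unsafe block with its invariant documented on the spot.
fn sampler_new() -> Option<*mut c_void> {
    // SAFETY: the call takes no arguments and returns either a valid
    // pointer or null; null is handled immediately below.
    let ptr = unsafe { llama_sampler_init() };
    if ptr.is_null() { None } else { Some(ptr) }
}
```

Option 2 keeps the unsafety auditable at the call site, which matches the reviewer's suggestion of an unsafe scope inside a safe function.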

angt added 2 commits February 6, 2025 18:33