[Backend] Add Llamacpp backend #2975
base: main
Conversation
Signed-off-by: Adrien Gallouët <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
backends/llamacpp/.cargo/config.toml
I would remove this, especially if we end up deploying on Grace Hopper but building containers under QEMU. Would it impact anything?
Initially, I focused on ARM CPUs, where a "native" build is mandatory for now. Removing it would still leave the Docker container failing on a different host. Some work is necessary here, and I believe in llama.cpp as well, to achieve full performance without this constraint.
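For context, a Cargo config that forces a native build typically looks like the sketch below; the thread does not show this file's actual contents, so this is an assumption about what is being debated:

```toml
# Hypothetical contents of backends/llamacpp/.cargo/config.toml: compile for
# the host CPU, which is what makes the resulting container non-portable.
[build]
rustflags = ["-C", "target-cpu=native"]
```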
```rust
#[clap(long, env)]
model_gguf: String, // TODO Option() with hf->gguf & quantize
```
I'm a bit reluctant to have two parameters (`model_id`, `model_gguf`) here, as it would de facto introduce a discrepancy between what users currently deploy (stock TGI) and this new backend.
Would it be hard to rely solely on `model-id`, to keep the advantage of backward compat and transparent loading across backends? Or is it too much effort for a v1?
Totally agree, but it was planned for a v2 if there is momentum on this backend :)
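A minimal sketch of the trade-off under discussion (an assumption, not the PR's exact code): both flags coexist for v1, and a v2 could make `model_gguf` optional by deriving a GGUF file from `model_id` (hf download + convert/quantize) when it is absent.

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Hub model id, kept for parity with the other TGI backends.
    #[clap(long, env)]
    model_id: Option<String>,

    /// Path to a local GGUF file; a v2 could default this to a file
    /// converted from `model_id` instead of requiring it.
    #[clap(long, env)]
    model_gguf: Option<String>,
}
```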
backends/llamacpp/src/main.rs
```rust
use_mlock: bool,

/// Enable offloading of KQV operations to the GPU.
#[clap(default_value = "false", long, env)]
```
Should we set this to `true` for the time being, as we mostly deploy this on GPU in this specific release?
Also, what happens if we set it to `true` but no GPU is present? Is it ignored? If that's the case, I would definitely argue for `true` by default.
It works on the CPU when enabled, but I'm worried that it might not be as safe on some model/GPU combinations.
Let's try enabling it by default; we will see 👍
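The agreed change would roughly look like this (the field name `offload_kqv` is assumed; the excerpt only shows the doc comment and attribute):

```rust
// Within the backend's clap `Args` struct; field name is an assumption.
/// Enable offloading of KQV operations to the GPU.
#[clap(default_value = "true", long, env)]
offload_kqv: bool,
```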
```rust
type_v: LlamacppGGMLType,

/// Number of tokenizer workers used for payload validation and truncation.
#[clap(default_value = "2", long, env)]
```
`2` seems arbitrary here. Should we default to `1` and let users override for more if needed?
I took the default used in the launcher 😅
backends/llamacpp/src/main.rs
```rust
#[clap(long, env)]
ngrok: bool,

/// Ngrok authentication token.
#[clap(long, env)]
ngrok_authtoken: Option<String>,

/// Ngrok edge to use for tunneling.
#[clap(long, env)]
ngrok_edge: Option<String>,
```
I think you can remove these
🫡
```rust
}

impl LlamacppSampler {
    fn new(req: &LlamacppRequest) -> Option<Self> {
```
Since almost all of the lines are marked unsafe, should the function itself be unsafe? Or should we introduce an unsafe scope inside the safe function?
Yes, it's totally ugly... but I have some ideas to hide them. 👍
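A minimal, self-contained sketch of the pattern being discussed (not the PR's actual sampler code): keep the public constructor safe and confine the FFI work to one `unsafe` block with a SAFETY comment, rather than marking the whole function `unsafe`. `ffi_make_sampler` is a stand-in for the real llama.cpp calls.

```rust
struct Sampler(*mut u8);

unsafe fn ffi_make_sampler() -> *mut u8 {
    // Stand-in for an FFI allocation; a real backend would call into llama.cpp.
    Box::into_raw(Box::new(0u8))
}

impl Sampler {
    fn new() -> Option<Self> {
        // SAFETY: `ffi_make_sampler` returns either null or a pointer we own.
        let ptr = unsafe { ffi_make_sampler() };
        if ptr.is_null() {
            None
        } else {
            Some(Sampler(ptr))
        }
    }
}

impl Drop for Sampler {
    fn drop(&mut self) {
        // SAFETY: the pointer was created by Box::into_raw in ffi_make_sampler.
        unsafe { drop(Box::from_raw(self.0)) };
    }
}
```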
This PR adds support for the llamacpp backend.

The `Dockerfile_llamacpp` enables native CPU execution as well as CUDA acceleration for GPUs. For setup and usage details, you can check the doc.
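A usage sketch under stated assumptions; the image tag, port mapping, and model path below are illustrative, and only `--model-gguf` follows the clap definitions shown above. See the PR's documentation for the actual commands.

```shell
# Hypothetical build-and-run example for the new backend.
docker build -t tgi-llamacpp -f Dockerfile_llamacpp .
docker run --gpus all -v $PWD/models:/models -p 8080:80 \
    tgi-llamacpp --model-gguf /models/model.gguf
```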