Add batched inference #771

Open
1 of 3 tasks
Tracked by #487
abetlen opened this issue Sep 30, 2023 · 37 comments · May be fixed by #951
Labels
enhancement New feature or request high-priority

Comments

@abetlen
Owner

abetlen commented Sep 30, 2023

  • Use llama_decode instead of deprecated llama_eval in Llama class
  • Implement batched inference support for generate and create_completion methods in Llama class
  • Add support for streaming / infinite completion
@abetlen abetlen added the enhancement New feature or request label Sep 30, 2023
@abetlen abetlen pinned this issue Sep 30, 2023
@abetlen abetlen mentioned this issue Sep 29, 2023
9 tasks
@JackKCWong

Silly question: does that also add support for parallel decoding in llama.cpp?

@steveoon

steveoon commented Oct 12, 2023

Does the newest version support "batched decoding" of llama.cpp?

https://github.com/ggerganov/llama.cpp/pull/3228

@abetlen

@LoopControl

LoopControl commented Oct 30, 2023

This would be a huge improvement for production use.

I tested locally with 4 parallel requests to the built-in ./server binary in llama.cpp and am able to hit some insanely good tokens/sec -- multiple times faster than what we get with a single request via the non-batched inference.

@hockeybro12

> This would be a huge improvement for production use.
>
> I tested locally with 4 parallel requests to the built-in ./server binary in llama.cpp and am able to hit some insanely good tokens/sec -- multiple times faster than what we get with a single request via the non-batched inference.

@LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example, I'm trying to do the same. Thanks!

@LoopControl

> @LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example, I'm trying to do the same. Thanks!

There are 2 new flags in llama.cpp to add to your normal command: -cb -np 4 (cb = continuous batching, np = parallel request count).
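Once the server is running with those flags, the client side only has to send several requests at the same time. A minimal sketch, assuming the llama.cpp server listens on its default `http://localhost:8080` and exposes the `/completion` endpoint; the helper names `post_completion` and `run_parallel` are made up for illustration:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Assumption: server started locally on the default port with -cb -np 4.
SERVER_URL = "http://localhost:8080/completion"


def post_completion(prompt, n_predict=64):
    """POST one prompt to the server and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]


def run_parallel(prompts, send=post_completion, max_workers=4):
    """Send all prompts concurrently; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


# Usage (requires a running server):
# answers = run_parallel(["Hello", "Bonjour", "Hola", "Ciao"])
```

Keeping `max_workers` equal to the `-np` value means every request can get its own slot; extra requests would presumably just wait on the server side.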

@hockeybro12

> > @LoopControl How did you do 4 parallel requests to the ./server binary? Can you please provide an example, I'm trying to do the same. Thanks!
>
> There's 2 new flags in llama.cpp to add to your normal command -cb -np 4 (cb = continuous batching, np = parallel request count).

Thanks, that works for me with llama.cpp, but not llama-cpp-python, which I think is expected. Unfortunately, the server API in llama.cpp here doesn't seem to be as good as the server in llama-cpp-python, at least for my task. Using the same llama model, I get better results with llama-cpp-python. So, I hope this can be added soon!

@zpzheng

zpzheng commented Nov 20, 2023

When will this feature be available? I hope someone can help solve this problem.

@ggerganov

Let me know if there are any roadblocks - I might be able to provide some insight

@abetlen
Owner Author

abetlen commented Nov 23, 2023

Hey @ggerganov I missed this earlier.

Thank you, yeah I just need some quick clarifications around the kv cache behaviour.

The following is my understanding of the kv_cache implementation

  • The kv cache starts with a number of free cells initially equal to n_ctx
  • If the number of free cells gets down to 0 the kv cache / available context is full and some cells must be cleared to process any more tokens
  • When calling llama_decode, batch.n_tokens can only be as large as the largest free slot; if n_tokens is too large (llama_decode returns a positive value), you reduce the batch size and retry
  • The number of occupied cells increases by batch.n_tokens on every call to llama_decode
  • The number of free cells increases when an occupied cell no longer belongs to any sequences or is shifted to pos < 0
  • Calling llama_kv_cache_seq_cp does not cause any additional free cells to be occupied; the copy is "shallow" and only adds the new sequence id to the set
  • Calling llama_kv_cache_shift works by modifying the kv cells that belong to a given sequence; however, this also shifts the cell in all of the other sequences it belongs to

Is this correct?
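The bookkeeping described in the list above can be sketched as a toy model. This is an illustration only, not llama.cpp's implementation: `ToyKVCache`, its methods, and `decode_with_retry` are hypothetical names, cells hold only a set of sequence ids, and position shifting is left out.

```python
class ToyKVCache:
    """Bookkeeping-only model of a KV cache: one set of sequence ids per cell."""

    def __init__(self, n_ctx):
        # An empty set means the cell is free; initially all n_ctx cells are free.
        self.cells = [set() for _ in range(n_ctx)]

    def n_free(self):
        return sum(1 for cell in self.cells if not cell)

    def _find_slot(self, n_tokens):
        """Return the start of a contiguous run of n_tokens free cells, or None."""
        run = 0
        for i, cell in enumerate(self.cells):
            run = run + 1 if not cell else 0
            if run == n_tokens:
                return i - n_tokens + 1
        return None

    def decode(self, seq_id, n_tokens):
        """Occupy n_tokens cells for seq_id. Returns 0 on success, 1 if no slot fits."""
        start = self._find_slot(n_tokens)
        if start is None:
            return 1
        for cell in self.cells[start:start + n_tokens]:
            cell.add(seq_id)
        return 0

    def seq_cp(self, src, dst):
        """Shallow copy: tag src's cells with dst as well; no new cells are used."""
        for cell in self.cells:
            if src in cell:
                cell.add(dst)

    def seq_rm(self, seq_id):
        """Drop seq_id; a cell becomes free only when no sequence references it."""
        for cell in self.cells:
            cell.discard(seq_id)


def decode_with_retry(cache, seq_id, n_tokens):
    """Feed n_tokens in chunks, halving the chunk size whenever no slot fits."""
    chunk, done = n_tokens, 0
    while done < n_tokens:
        size = min(chunk, n_tokens - done)
        if cache.decode(seq_id, size) == 0:
            done += size
        elif chunk == 1:
            return False  # cache is full; the caller must free cells first
        else:
            chunk = max(1, chunk // 2)
    return True
```

In this model a sequence copied with `seq_cp` keeps the shared cells alive after the source sequence is removed, which is the "shallow copy" behaviour asked about.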

@ggerganov

ggerganov commented Nov 23, 2023

Yes, all of this is correct.

> Calling llama_kv_cache_shift works by modifying the kv cells that belong to a given sequence however this also shifts this cell in all of the other sequences it belongs to

This call also sets a flag so that, upon the next llama_decode, the computation first shifts the KV cache data before proceeding as usual.

Will soon add a couple of functions to the API that can be useful for monitoring the KV cache state:

ggerganov/llama.cpp#4170

One of the main applications of llama_kv_cache_seq_cp is to "share" a common prompt (i.e. same tokens at the same positions) across multiple sequences. The most trivial example is a system prompt that sits at the start of all generated sequences. By sharing it, the KV cache is reused and less memory is consumed, instead of keeping a copy for each sequence.
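To put rough numbers on the saving, here is a small worked example. The function name `cells_needed` and the token counts are made up for illustration; it only counts occupied cells and ignores everything else that affects memory use.

```python
def cells_needed(n_prompt, n_gen, n_seqs, share_prompt):
    """KV cells occupied after n_seqs sequences each generate n_gen tokens."""
    # A shared prompt is stored once; otherwise every sequence holds its own copy.
    prompt_cells = n_prompt if share_prompt else n_prompt * n_seqs
    return prompt_cells + n_gen * n_seqs


# Hypothetical workload: 500-token system prompt, 4 sequences, 100 new tokens each.
shared = cells_needed(500, 100, 4, share_prompt=True)      # 500 + 4 * 100 = 900
separate = cells_needed(500, 100, 4, share_prompt=False)   # 4 * 500 + 4 * 100 = 2400
print(f"shared prompt: {shared} cells, separate copies: {separate} cells")
```

The longer the common prefix and the more sequences that reuse it, the larger the gap becomes.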

@abetlen abetlen linked a pull request Nov 28, 2023 that will close this issue
3 tasks
@zpzheng

zpzheng commented Dec 8, 2023

I updated to the latest version and saw the batch configuration. But when I ran it, batching didn't take effect. When I send multiple requests, it still handles them one by one. My startup configuration is as follows:

python3 -m llama_cpp.server --model ./models/WizardLM-13B-V1.2/ggml-model-f16-Q5.gguf --n_gpu_layers 2 --n_ctx 8000 --n_batch 512 --n_threads 10 --n_threads_batch 10 --interrupt_requests False

Is there something wrong with my configuration? @abetlen

@LoopControl

LoopControl commented Dec 8, 2023

@zpzheng It’s a draft PR so it’s not complete - you can see “Add support for parallel requests” is in the todo list

@Zahgrom34

@abetlen Is there any progress on this?

@K-Mistele
Contributor

+1, would be really great to have this

@everyfin-in

+1, would be so great to have this!

@sadaisystems

+1

@ArtyomZemlyak

+1

@chenwr727

+1

1 similar comment
@Connor2573

+1

@shoaibmalek21

Guys, is there any other solution for this?

@jasongst

+1

1 similar comment
@ganliqiang

+1

@stanier

stanier commented Apr 3, 2024

+1

I use llama-cpp-python as (presently) the sole first-class-support backend in a project I've been developing. I'm not sure how much it would benefit from batching, as I've yet to do performance testing against other backends, but I feel like it could be a significant boon.

What's the current status of this and #951? I might be interested in taking a look at this, but I'm not certain I'd bring much to the table, I'll have to review the related code more.

@K-Mistele
Contributor

> I use llama-cpp-python as (presently) the sole first-class-support backend in a project I've been developing.

I would not do this. Batching is super important, and I had to move to llama.cpp's server (easy to deploy w/ docker or python, or even just the exe) because of the lack of features in llama-cpp-python. If you're doing CPU inference, llama.cpp is a great option; otherwise I would use something like vLLM, BentoML's OpenLLM, or Predibase's LoRAX.

@stanier

stanier commented Apr 4, 2024

> I would not do this. batching is super important and I had to move to llama.cpp's server

This is something I was considering, appreciate the advice. I'll likely end up doing that. I had to do the same with Ollama, but I wasn't on Ollama long and by no means felt it was the right fit for the job, support for it merely started from a peer showing interest and my compulsion to explore all viable options where possible.

I'm doing GPU inference and sadly that means Nvidia's antics have hindered me from getting things running in a container just the way I'd like them to up until now... but that's another story. I haven't tried vLLM, OpenLLM or LoRAx, llama.cpp and llama-cpp-python have generally been all I've needed up till now (and for longer, I hope-- I really appreciate the work done by all contributors to both projects, exciting that we're at least where we are today). Are those libraries any good if you're looking to do something with the perplexity of say q6_k on a (VRAM) budget? I'd prefer to be able to run it on my 1080Ti, even when I have access to more VRAM in another environment.

@yourbuddyconner

yourbuddyconner commented Jun 11, 2024

I am dealing with this right now -- and unfortunately llama-cpp-server has the only completions endpoint that I can find that supports logprobs properly (random hard requirement I have). llama.cpp doesn't natively support it in its API, which is why I use llama-cpp-python.

vllm is great for big GPUs, however, it doesn't support gguf-quantized models or running at full-precision for smaller GPUs (I use T4s).

Whereas ideally I could run llama-cpp-python on an instance with 4 T4s and batch across them (like vllm can do), I am now creating instances with 1 gpu and scaling horizontally.

If anyone knows of a better openai-compatible API endpoint that wraps llama.cpp, I am listening, but I haven't found one.

@dimaioksha

> I am dealing with this right now -- and unfortunately llama-cpp-server has the only completions endpoint that I can find that supports logprobs properly (random hard requirement I have). llama.cpp doesn't natively support it in its API, which is why I use llama-cpp-python.
>
> vllm is great for big GPUs, however, it doesn't support gguf-quantized models or running at full-precision for smaller GPUs (I use T4s).
>
> Whereas ideally I could run llama-cpp-python on an instance with 4 T4s and batch across them (like vllm can do), I am now creating instances with 1 gpu and scaling horizontally.
>
> If anyone knows of a better openai-compatible API endpoint that wraps llama.cpp, I am listening, but I haven't found one.

Have you tried https://github.com/ollama/ollama?

@yourbuddyconner

ollama doesn't support batched inference, what a silly suggestion.

ollama/ollama#1396

@NickCrews

NickCrews commented Jun 24, 2024

In case this is useful to others, as a workaround until this is implemented, I wrote a tiny Python library that

  • downloads and installs the raw llama.cpp server binary
  • downloads some model weights from huggingface hub
  • provides a simple Server class to control starting/stopping the binary

This was needed because the raw server binary supports batched inference. All the heavy logic is already in the upstream C++ server, so all I needed to write was the CLI and subprocess logic.
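For readers who want the general shape of such a wrapper, here is a minimal sketch. It is not the API of the library mentioned above: the class name `LlamaServer`, its parameters, and the binary and model paths are all hypothetical. The flags `-m`, `--port`, `-cb` and `-np` are the llama.cpp server options, with `-cb` and `-np` as described earlier in this thread.

```python
import subprocess


class LlamaServer:
    """Hypothetical wrapper that starts and stops a llama.cpp server binary."""

    def __init__(self, binary, model_path, port=8080, n_parallel=4):
        self.binary = binary
        self.model_path = model_path
        self.port = port
        self.n_parallel = n_parallel
        self._proc = None

    def command(self):
        """Build the argument list, enabling continuous batching and parallel slots."""
        return [
            self.binary,
            "-m", self.model_path,
            "--port", str(self.port),
            "-cb",
            "-np", str(self.n_parallel),
        ]

    def start(self):
        if self._proc is None:
            self._proc = subprocess.Popen(self.command())

    def stop(self, timeout=10.0):
        if self._proc is None:
            return
        self._proc.terminate()
        try:
            self._proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Fall back to a hard kill if the server ignores the polite request.
            self._proc.kill()
            self._proc.wait()
        self._proc = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc):
        self.stop()


# Usage (paths are placeholders):
# with LlamaServer("./server", "./models/model.gguf") as srv:
#     ...  # send HTTP requests to http://localhost:8080
```

Using the class as a context manager means the child process is stopped even if the client code raises.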

@dabs9

dabs9 commented Jul 9, 2024

Does this mean that continuous batching is not supported in llama-cpp-python? I assume this is the type of batching under consideration in this issue.

@KohakuBlueleaf

KohakuBlueleaf commented Jul 23, 2024

> Does this mean that continuous batching is not supported in llama-cpp-python? I assume this is the type of batching under consideration in this issue.

Yes, but, no
Yes: continuous batching is not "utilized" in llama-cpp-python.
No: you can't even do the simplest batching, which encodes multiple prompts at the same time and decodes multiple sequences at the same time. Continuous batching is something "beyond" this.
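The difference between the three modes can be illustrated with a toy step counter. This is a simplified sketch, not how llama.cpp schedules work: the function names are made up, every sequence is assumed to need at least one token, and each batched decode step is assumed to cost the same regardless of how many sequences are in it.

```python
from collections import deque


def steps_sequential(lengths):
    """One sequence at a time: every token costs its own decode step."""
    return sum(lengths)


def steps_static_batch(lengths, n_parallel):
    """Fixed groups of n_parallel; each group runs until its longest member ends."""
    return sum(
        max(lengths[i:i + n_parallel]) for i in range(0, len(lengths), n_parallel)
    )


def steps_continuous_batch(lengths, n_parallel):
    """A finished sequence's slot is refilled from the queue before the next step."""
    queue = deque(lengths)
    slots = []  # tokens still to generate for each active sequence
    steps = 0
    while queue or slots:
        while queue and len(slots) < n_parallel:
            slots.append(queue.popleft())
        steps += 1  # one batched step emits one token per active sequence
        slots = [remaining - 1 for remaining in slots if remaining > 1]
    return steps


# Hypothetical workload: four requests needing 3, 10, 2 and 8 tokens, 2 slots.
lengths = [3, 10, 2, 8]
print(steps_sequential(lengths))           # 23
print(steps_static_batch(lengths, 2))      # 18
print(steps_continuous_batch(lengths, 2))  # 13
```

Simple batching already removes the one-token-per-step limit; continuous batching additionally avoids idle slots while the longest sequence in a group is still running.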

@yjkim3

yjkim3 commented Jul 25, 2024

So, is the n_batch parameter currently useless? I am wondering what the function of n_batch is.

@B0-B

B0-B commented Aug 22, 2024

> So, Is n_batch parameter is currently useless ?? I am wondering what the function of 'n_batch' is

Yes it is. I played around a lot with n_batch, n_threads, etc., but it's all useless.

Further, I tried using the futures thread pool as well as the threading module to simulate parallelism. All attempts failed with a crashing kernel. As long as a single thread is running, everything is fine, which suggests that the thread itself is not the problem. However, once a second thread is started and tries to propagate, it crashes. I assume the weights are locked and cannot be accessed by two separate asynchronous processes.
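One way to keep several threads from entering the model at once is to serialize calls behind a lock. The sketch below is an assumption on my part, not something confirmed in this thread: it gives no batching speed-up and may or may not avoid the crash described above. `SerializedModel`, `run_concurrently` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real model call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SerializedModel:
    """Wraps a non-thread-safe callable so only one thread runs it at a time."""

    def __init__(self, generate):
        self._generate = generate
        self._lock = threading.Lock()

    def __call__(self, prompt):
        # Other threads block here until the current call has finished.
        with self._lock:
            return self._generate(prompt)


def run_concurrently(model, prompts, max_workers=4):
    """Submit prompts from several threads; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(model, prompts))


def fake_generate(prompt):
    """Stand-in for a real model call (assumption: the real one is not thread-safe)."""
    return prompt[::-1]


print(run_concurrently(SerializedModel(fake_generate), ["abc", "hello"]))
```

Requests are still processed one by one, so throughput stays the same as with a single thread; the lock only makes concurrent callers queue up.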

@ExtReMLapin
Contributor

Any update on this?

@Backendmagier

Any updates? @abetlen I think this is highly anticipated by many...

@LukeMoody01

Any updates? This would be such an awesome feature to have. Looking forward to it

@helloHKTK

helloHKTK commented Nov 27, 2024 via email
