[WIP] Hot swap for LoRA #8056
Conversation
llama.cpp (Outdated)
    size_t nbytes     = ggml_nbytes(tensor);
    size_t nbytes_pad = ggml_nbytes_pad(tensor);
    file.seek(offset, SEEK_SET);
    tensor->data = result->data.data() + data_offset;
This is not correct with ggml-backend, you have to read the data to a temporary buffer and use ggml_backend_tensor_set to load the data into the tensor.
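For illustration, a minimal sketch of that pattern, assuming the same llama_file helper used in the snippet above (the exact buffer handling here is an assumption, not the PR's code):

    // read the tensor bytes into a temporary host buffer, then copy them into the
    // (possibly GPU-resident) tensor through the backend interface
    std::vector<uint8_t> buf(ggml_nbytes(tensor));
    file.seek(offset, SEEK_SET);
    file.read_raw(buf.data(), buf.size());
    ggml_backend_tensor_set(tensor, buf.data(), 0, buf.size());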
@slaren Fixed, thanks, and thanks for the references! The code now runs without errors, but it's incredibly slow (slower on Metal than CPU with the lora applied). Can you give a couple of tips on where to look for this performance issue?

I feel the issue is that I am adding new attributes to llama_context to store the lora weights, basically lora_data (containing a new ggml context, backend and buffer) and the lora_weights_map below. But these new attributes mess up the way llama_context is understood in the backend.
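To make that state concrete, a hypothetical sketch of what such attributes could look like (names taken from the comment above; the actual PR layout may differ):

    #include <map>
    #include <string>
    #include <utility>
    #include "ggml.h"
    #include "ggml-backend.h"

    // hypothetical layout of the extra lora state hung off llama_context
    struct lora_data {
        ggml_context          * ctx     = nullptr; // ggml context holding the lora tensors
        ggml_backend_t          backend = nullptr; // backend the lora weights live on
        ggml_backend_buffer_t   buffer  = nullptr; // allocation backing those tensors
    };

    // map: base layer (tensor) name -> lora A-B tensor pair
    using lora_weights_map_t =
        std::map<std::string, std::pair<ggml_tensor *, ggml_tensor *>>;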
Also, I'm not sure what the impact of the lora layers is on the KV caching; I'll try to understand that.
FYI, I tried the same idea but it turns out the performance is very bad, so I ended up removing it (and just merging the calculated lora into the model weights). Here is my version: ngxson/llama.cpp@master...ngxson:llama.cpp:4e28ad40a099c7f618abf8ae113c4e56ee7705e8

This can be applied to any weights and any architecture, but the main downside is that the number of nodes increases a lot (bad performance).
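For contrast, a rough sketch of that merge-at-load-time approach (shapes, variable names and the scaling factor are assumptions; the linked branch may differ in detail):

    // build a small one-off graph that folds the adapter into the base weight,
    // so the inference graph itself gains no extra nodes
    ggml_tensor * BA = ggml_mul_mat(ctx, loraA, loraB);     // materializes the full-size delta B*A
    BA               = ggml_scale(ctx, BA, scaling);        // scaling = lora_alpha / lora_rank (assumed)
    ggml_tensor * Wm = ggml_add_inplace(ctx, W, BA);        // W <- W + scaling * B*A
    // evaluate this graph once per adapter swap; afterwards inference runs on the merged weight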
llama.cpp (Outdated)
    ggml_tensor * t_lora = ggml_mul_mat(ctx0,
        ggml_mul_mat(ctx0, loraA, loraB),
        cur
    );
This matrix multiplication is why it is extremely slow. It materializes a matrix of the same dimension as the weight, and then multiplies that matrix with the hidden state. It is essentially doubling the number of matrix multiplications. The lora_mul_mat that I gave you does something very different:

    ggml_tensor * t_lora =
        ggml_mul_mat(ctx0,
            loraB,
            ggml_mul_mat(ctx0,
                loraA,
                cur
            )
        );

These matrix multiplications produce only much smaller vectors as the result. This would be a lot faster, but to make this work loraA will need to be transposed. For optimal efficiency, new kernels optimized for the sizes of the lora matrices (very low number of columns) will also need to be implemented. Properly implemented, this will only add a small overhead to the computation (I estimate 3-10%, depending on the rank of the lora).
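As a rough back-of-the-envelope check (assuming hidden size n, lora rank r and a single token): materializing loraA*loraB costs on the order of n*n*r multiply-adds, and multiplying that matrix with the hidden state costs another n*n, i.e. at least as much again as the base W*x. The factored form loraB*(loraA*x) costs only about n*r + r*n = 2*n*r, which for r << n (say r = 4..16 against n = 4096) is a tiny fraction of the n*n base cost.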
@slaren Oops! Thanks! Fixed, and it's faster (to the point that I do not perceive a difference from not using the lora).

Is the closed PR #996 something that could work to optimise the performance of the lora matrices?

Anyway, I wanted to first set up the current PR so that one can spin up a server and hot-swap adapters. Do you think I should focus on optimising thin-matrix multiplication first instead?
@ltoniazzi Great to hear that it works on your side.

I think this PR can be a very good starting point for demoing that it works. However, I'll implement a proper API for it in #8332.

In the meantime, it would be useful to have a look at how to convert an HF "safetensors" adapter to gguf. Potentially we can introduce a --lora param to convert_hf_to_gguf.py, which already has all the logic for tensor names. An example of a safetensors adapter can be found here: https://huggingface.co/grimjim/Llama-3-Instruct-abliteration-LoRA-8B/tree/main
@ngxson Thanks for the comments 😊 !! Btw, I think in your version the large BA matrix is coming up here.

Yes, let me know if you need help. I can start to have a look at converting loras from safetensors to gguf. (This conversion was discussed here as having a script convert-lora-to-ggml.py before it was removed in #7204, as it might have to be maintained for each model architecture.)

Btw, how do you render the model's graph?
> Btw, how do you render the model's graph?

You can do ggml_graph_dump_dot(gf, NULL, "/tmp/graph.dot"), then use dot -Tsvg /tmp/graph.dot -o /tmp/graph.svg to convert it to svg (the default command converts it to png, but I find a big graph quite difficult to open that way). You can also use an online service: https://dreampuf.github.io/GraphvizOnline/

A small trick is to change rankdir to TB; it fits better on screen. And finally, this won't work if the graph is too big. I can only print the graph of the tinyllama stories15M.gguf model.
You can also limit the loop in build_llama to include only one layer in the graph.
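A hypothetical sketch of that, combined with the dump call mentioned above (variable names follow the usual llama.cpp build code but are assumptions here):

    // clamp the layer loop in build_llama while debugging, so the dumped graph stays readable
    for (int il = 0; il < 1 /* instead of n_layer */; ++il) {
        // ... build a single transformer layer ...
    }
    // then dump the finished graph and render it with dot as described above
    ggml_graph_dump_dot(gf, NULL, "/tmp/graph.dot");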
Closing as the feature has been implemented in #8332.
Draft on hot-swapping LoRA adapters

Attempt to swap lora adapters at runtime.

Main logic:
- bin files from ./finetune (only one allowed now) are loaded by load_lora
- the lora weights are stored in the llama_context in lora_data, and a map base_layer_name -> lora A-B pair (lora_weights) is created (see build_lora_weights_map)
- ggml_mul_mat (see ggml_mul_mat_lora) checks if any lora tensor exists for the current base tensor W, so as to perform the lora operation as W(x) + B(A(x)). A sketch follows this list.
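A minimal sketch of what such a wrapper could look like, using the factored form suggested in the review above (illustrative only, not the PR's exact ggml_mul_mat_lora; the adapter lookup and any alpha/rank scaling are omitted):

    // lora-aware mul_mat: W(x) + B(A(x)) with two thin matmuls,
    // instead of materializing the full B*A delta
    static ggml_tensor * mul_mat_lora_sketch(
            ggml_context * ctx0,
            ggml_tensor  * W,       // base weight
            ggml_tensor  * cur,     // hidden state
            ggml_tensor  * loraA,   // lora A, laid out so its first dim matches cur's (hence the transpose)
            ggml_tensor  * loraB) { // lora B
        ggml_tensor * t_base = ggml_mul_mat(ctx0, W, cur);
        if (loraA == nullptr || loraB == nullptr) {
            return t_base;          // no adapter registered for this weight
        }
        ggml_tensor * t_lora = ggml_mul_mat(ctx0, loraB, ggml_mul_mat(ctx0, loraA, cur));
        return ggml_add(ctx0, t_base, t_lora);
    }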
Performance
open_llama on M2 (and only adding the adapter to the mul_mat's for Qcur, Kcur and Vcur, and to llm_build_ffn).

- +10% ms per token: base fp16 (79.86 ms/tok) + lora Q8_0 rank=4 (87.00 ms/tok)
- +23.9% ms per token: base Q4_K (37.88 ms/tok) + lora Q4_K rank=4 (46.94 ms/tok)
- +27.3% ms per token: base Q4_K (37.88 ms/tok) + lora Q4_K rank=16 (48.31 ms/tok)

ToDos:
Current status
Setup
0. Create a lora adapter bin file
[Already in the branch]
- mkdir data && touch data/hot-lora.txt and write a couple of words in it.
- mkdir models/open-llama and download Open-llama (all files) in the folder ./models/open-llama
- Run:
1. Run main with adapter

- With adapter (eval time = 87.00 ms per token on M2): run main with the base model and the lora adapter to hot-swap
- Only base model (eval time = 79.86 ms per token on M2): do not pass the flag --hot-lora and the adapter is ignored:

I have read the contributing guidelines
Self-reported review complexity: