[WIP] Hot swap for LoRA #8056
Conversation
llama.cpp (Outdated)
    size_t nbytes     = ggml_nbytes(tensor);
    size_t nbytes_pad = ggml_nbytes_pad(tensor);
    file.seek(offset, SEEK_SET);
    tensor->data = result->data.data() + data_offset;
This is not correct with ggml-backend, you have to read the data to a temporary buffer and use ggml_backend_tensor_set to load the data into the tensor.
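For illustration, a minimal sketch of that pattern, assuming the same llama_file helper used in the snippet above (the exact buffer handling here is an assumption, not the PR's code):

    // read the tensor bytes into a temporary host buffer, then copy them into the
    // (possibly GPU-resident) tensor through the backend interface
    std::vector<uint8_t> buf(ggml_nbytes(tensor));
    file.seek(offset, SEEK_SET);
    file.read_raw(buf.data(), buf.size());
    ggml_backend_tensor_set(tensor, buf.data(), 0, buf.size());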
@slaren Fixed, thanks, and thanks for the references! The code now runs without errors, but it's incredibly slow (slower on Metal than CPU with the lora applied). Can you give a couple of tips on where to look for this performance issue?

I feel the issue is that I am adding new attributes to llama_context to store the lora weights, basically lora_data (containing a new ggml context, backend and buffer) and the lora_weights_map below. But these new attributes mess up the way llama_context is understood in the backend.
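To make that state concrete, a hypothetical sketch of what such attributes could look like (names taken from the comment above; the actual PR layout may differ):

    #include <map>
    #include <string>
    #include <utility>
    #include "ggml.h"
    #include "ggml-backend.h"

    // hypothetical layout of the extra lora state hung off llama_context
    struct lora_data {
        ggml_context          * ctx     = nullptr; // ggml context holding the lora tensors
        ggml_backend_t          backend = nullptr; // backend the lora weights live on
        ggml_backend_buffer_t   buffer  = nullptr; // allocation backing those tensors
    };

    // map: base layer (tensor) name -> lora A-B tensor pair
    using lora_weights_map_t =
        std::map<std::string, std::pair<ggml_tensor *, ggml_tensor *>>;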
Also, I'm not sure what the impact of the lora layers is on the KV caching; I'll try to understand that.
FYI, I tried the same idea but it turns out the performance is very bad, so I ended up removing it (and just merging the calculated lora into the model weights). Here is my version: ngxson/llama.cpp@master...ngxson:llama.cpp:4e28ad40a099c7f618abf8ae113c4e56ee7705e8

This can be applied to any weights and any architecture, but the main downside is that the number of nodes increases a lot (bad performance).
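For contrast, a rough sketch of that merge-at-load-time approach (shapes, variable names and the scaling factor are assumptions; the linked branch may differ in detail):

    // build a small one-off graph that folds the adapter into the base weight,
    // so the inference graph itself gains no extra nodes
    ggml_tensor * BA = ggml_mul_mat(ctx, loraA, loraB);     // materializes the full-size delta B*A
    BA               = ggml_scale(ctx, BA, scaling);        // scaling = lora_alpha / lora_rank (assumed)
    ggml_tensor * Wm = ggml_add_inplace(ctx, W, BA);        // W <- W + scaling * B*A
    // evaluate this graph once per adapter swap; afterwards inference runs on the merged weight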
llama.cpp (Outdated)
    ggml_tensor * t_lora = ggml_mul_mat(ctx0,
        ggml_mul_mat(ctx0, loraA, loraB),
        cur
    );
This matrix multiplication is why it is extremely slow. It materializes a matrix of the same dimension as the weight, and then multiplies that matrix with the hidden state. It is essentially doubling the number of matrix multiplications. The lora_mul_mat that I gave you does something very different:

    ggml_tensor * t_lora =
        ggml_mul_mat(ctx0,
            loraB,
            ggml_mul_mat(ctx0,
                loraA,
                cur
            )
        );

These matrix multiplications produce only much smaller vectors as the result. This would be a lot faster, but to make this work loraA will need to be transposed. For optimal efficiency, new kernels optimized for the sizes of the lora matrices (very low number of columns) will also need to be implemented. Properly implemented, this will only add a small overhead to the computation (I estimate 3-10%, depending on the rank of the lora).
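As a rough back-of-the-envelope check (assuming hidden size n, lora rank r and a single token): materializing loraA*loraB costs on the order of n*n*r multiply-adds, and multiplying that matrix with the hidden state costs another n*n, i.e. at least as much again as the base W*x. The factored form loraB*(loraA*x) costs only about n*r + r*n = 2*n*r, which for r << n (say r = 4..16 against n = 4096) is a tiny fraction of the n*n base cost.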
@slaren Oops! Thanks! Fixed, and it's faster (to the point that I do not perceive a difference from not using the lora).

Is the closed PR #996 something that could work to optimise the performance of the lora matrices?

Anyway, I wanted to first set up the current PR so that one can spin up a server and hot-swap adapters. Do you think I should focus on optimising thin-matrix multiplication first instead?
@ltoniazzi Great to hear that it works on your side.

I think this PR can be a very good starting point for demoing that it works. However, I'll implement a proper API for it in #8332.

In the meantime, it would be useful to have a look at how to convert an HF "safetensors" adapter to gguf. Potentially we can introduce a --lora param to convert_hf_to_gguf.py, which already has all the logic for tensor names. An example of a safetensors adapter can be found here: https://huggingface.co/grimjim/Llama-3-Instruct-abliteration-LoRA-8B/tree/main
@ngxson Thanks for the comments 😊 !! Btw, I think in your version the large BA matrix is coming up here.

Yes, let me know if you need help. I can start to have a look at converting loras from safetensors to gguf. (This conversion was discussed here as having a script convert-lora-to-ggml.py before it was removed in #7204, as it might have to be maintained for each model architecture.)

Btw, how do you render the model's graph?
> Btw, how do you render the model's graph?

You can do ggml_graph_dump_dot(gf, NULL, "/tmp/graph.dot"), then use dot -Tsvg /tmp/graph.dot -o /tmp/graph.svg to convert it to svg (the default command converts it to png, but I find a big graph quite difficult to open that way). You can also use an online service: https://dreampuf.github.io/GraphvizOnline/

A small trick is to change rankdir to TB; it fits better on screen. And finally, this won't work if the graph is too big. I can only print the graph of the tinyllama stories15M.gguf model.
You can also limit the loop in build_llama to include only one layer in the graph.
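A hypothetical sketch of that, combined with the dump call mentioned above (variable names follow the usual llama.cpp build code but are assumptions here):

    // clamp the layer loop in build_llama while debugging, so the dumped graph stays readable
    for (int il = 0; il < 1 /* instead of n_layer */; ++il) {
        // ... build a single transformer layer ...
    }
    // then dump the finished graph and render it with dot as described above
    ggml_graph_dump_dot(gf, NULL, "/tmp/graph.dot");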
Closing as the feature has been implemented in #8332.
Draft on hot-swapping LoRA adapters

Attempt to swap lora adapters at runtime.

Main logic:
- bin files from ./finetune (only one allowed now) are loaded by load_lora
- the lora weights are stored in the llama_context in lora_data, and a map base_layer_name -> lora A-B pair (lora_weights) is created (see build_lora_weights_map)
- ggml_mul_mat (see ggml_mul_mat_lora) checks if any lora tensor exists for the current base tensor W, so as to perform the lora operation as W(x) + B(A(x)). A sketch follows this list.
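A minimal sketch of what such a wrapper could look like, using the factored form suggested in the review above (illustrative only, not the PR's exact ggml_mul_mat_lora; the adapter lookup and any alpha/rank scaling are omitted):

    // lora-aware mul_mat: W(x) + B(A(x)) with two thin matmuls,
    // instead of materializing the full B*A delta
    static ggml_tensor * mul_mat_lora_sketch(
            ggml_context * ctx0,
            ggml_tensor  * W,       // base weight
            ggml_tensor  * cur,     // hidden state
            ggml_tensor  * loraA,   // lora A, laid out so its first dim matches cur's (hence the transpose)
            ggml_tensor  * loraB) { // lora B
        ggml_tensor * t_base = ggml_mul_mat(ctx0, W, cur);
        if (loraA == nullptr || loraB == nullptr) {
            return t_base;          // no adapter registered for this weight
        }
        ggml_tensor * t_lora = ggml_mul_mat(ctx0, loraB, ggml_mul_mat(ctx0, loraA, cur));
        return ggml_add(ctx0, t_base, t_lora);
    }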
Performance
open_llama on M2 (and only adding the adapter to the mul_mat's for Qcur, Kcur and Vcur, and to llm_build_ffn).

- +10% ms per token: base fp16 (79.86 ms/tok) + lora Q8_0 rank=4 (87.00 ms/tok)
- +23.9% ms per token: base Q4_K (37.88 ms/tok) + lora Q4_K rank=4 (46.94 ms/tok)
- +27.3% ms per token: base Q4_K (37.88 ms/tok) + lora Q4_K rank=16 (48.31 ms/tok)

ToDos:
Current status
Setup
0. Create a lora adapter bin file
[Already in the branch]
- mkdir data && touch data/hot-lora.txt and write a couple of words in it.
- mkdir models/open-llama and download Open-llama (all files) in the folder ./models/open-llama
- Run:
1. Run main with adapter

- With adapter (eval time = 87.00 ms per token on M2): run main with the base model and the lora adapter to hot-swap
- Only base model (eval time = 79.86 ms per token on M2): do not pass the flag --hot-lora and the adapter is ignored:

I have read the contributing guidelines
Self-reported review complexity: