
[WIP] Hot swap for LoRA #8056

Closed · wants to merge 14 commits
Conversation

@ltoniazzi (Contributor) commented Jun 21, 2024

Draft on hot-swapping LoRA adapters

Attempt to swap LoRA adapters at runtime.
Main logic:

  1. LoRA bin files from ./finetune (only one is allowed for now) are loaded by load_lora.
  2. Each LoRA tensor is added to llama_context in lora_data, and a map from base_layer_name to its LoRA A/B pair (lora_weights) is built (see build_lora_weights_map).
  3. A LoRA-aware version of ggml_mul_mat (see ggml_mul_mat_lora) checks whether a LoRA tensor pair exists for the current base tensor W, so as to perform the LoRA operation as W(x) + B(A(x)) (see the sketch below).
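
A minimal sketch of the idea behind such a LoRA-aware mul_mat (the function name, map type and field names below are illustrative assumptions, not the PR's exact code):

    #include <map>
    #include <string>

    #include "ggml.h"

    // Hypothetical pair of adapter tensors for one base weight.
    struct lora_weights {
        ggml_tensor * loraA;
        ggml_tensor * loraB;
    };

    // Sketch only: wraps ggml_mul_mat and, if a LoRA A/B pair is registered
    // for the base weight W, adds B(A(x)) on top of the regular W(x).
    static ggml_tensor * mul_mat_with_lora(
            ggml_context * ctx,
            const std::map<std::string, lora_weights> & lora_map, // base tensor name -> {A, B}
            ggml_tensor * W,      // base weight
            ggml_tensor * cur) {  // hidden state x
        ggml_tensor * out = ggml_mul_mat(ctx, W, cur);            // W(x)
        auto it = lora_map.find(W->name);
        if (it != lora_map.end()) {
            ggml_tensor * A = it->second.loraA; // assumed already laid out for ggml_mul_mat
            ggml_tensor * B = it->second.loraB;
            // B(A(x)): two skinny mat-muls, no W-sized intermediate is materialized
            ggml_tensor * t = ggml_mul_mat(ctx, B, ggml_mul_mat(ctx, A, cur));
            out = ggml_add(ctx, out, t);
        }
        return out;
    }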

Performance

open_llama on M2 (with the adapter applied only to the mul_mat calls for Qcur, Kcur, Vcur and to llm_build_ffn).

  • +10% ms per token: base fp16 (79.86ms/tok) + lora Q8_0 rank=4 (87.00ms/tok)
  • +23.9% ms per token: base Q4_K (37.88ms/tok) + lora Q4_K rank=4 (46.94ms/tok)
  • +27.3% ms per token: base Q4_K (37.88ms/tok) + lora Q4_K rank=16 (48.31ms/tok)

ToDos:

  • Apply the LoRA mul_mat to the remaining base layers.
  • Re-run the performance checks.
  • Set up hot swapping for llama-server.
  • Transpose loraA at load time instead of at inference.
  • Test inference performance with/without the adapter and with hot swapping.

Current status

  • Runs on CPU and GPU (tested on Metal).
  • Only runs for the llama arch.
  • Only one LoRA adapter can be passed.
  • Not all LoRA tensors are applied.
  • Performance still needs to be tested.

Setup

0. Create a lora adapter bin file

  1. [Already in the branch] mkdir data && touch data/hot-lora.txt and write a couple of words in it.

  2. mkdir models/open-llama and download Open-llama (all files) in the folder ./models/open-llama

  3. Run:

    # Convert base model to gguf
    python3 convert-hf-to-gguf.py models/open-llama/ && \
    # Quantize base model
    ./quantize ./models/open-llama/ggml-model-f16.gguf ./models/open-llama/ggml-model-q4.gguf Q4_K && \
    # Obtain Lora adapter
    ./finetune  --model-base models/open-llama/ggml-model-q4.gguf \
    --checkpoint-in models/open-llama/chk-lora-ggml-model-q4-hot-lora-LATEST.gguf \
    --checkpoint-out models/open-llama/chk-lora-ggml-model-q4-hot-lora-ITERATION.gguf \
    --lora-out models/open-llama/lora-ggml-model-q4-hot-lora-ITERATION.bin \
    --train-data "data/hot-lora.txt" \
    --save-every 1 \
    --threads 1 \
    --adam-iter 1 \
    --batch 1 \
    --ctx 16 \
    --use-checkpointing

1. Run main with adapter

  • With adapter (eval time = 87.00 ms per token on M2): run main with the base model and the LoRA adapter to hot-swap

    ./main -m ./models/open-llama/ggml-model-q4.gguf \
    --hot-lora models/open-llama/lora-ggml-model-q4-hot-lora-LATEST.bin \
    -ngl 99 \
    -n 128
  • Only base model (eval time = 79.86 ms per token on M2): do not pass the --hot-lora flag and the adapter is ignored:

    ./main -m ./models/open-llama/ggml-model-q4.gguf \
    -ngl 99 \
    -n 128
  • I have read the contributing guidelines

  • Self-reported review complexity:

    • Low
    • Medium
    • High

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 21, 2024
@mofosyne added the Review Complexity : Medium label (generally require more time to grok but manageable by beginner to medium expertise level) on Jun 21, 2024
llama.cpp Outdated
size_t nbytes = ggml_nbytes(tensor);
size_t nbytes_pad = ggml_nbytes_pad(tensor);
file.seek(offset, SEEK_SET);
tensor->data = result->data.data() + data_offset;
Collaborator:

This is not correct with ggml-backend; you have to read the data into a temporary buffer and use ggml_backend_tensor_set to load the data into the tensor.
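
For reference, a minimal sketch of that pattern, assuming the llama_file helper used elsewhere in llama.cpp and the same surrounding variables as the snippet above (requires <vector> and <cstdint>):

    // Stage the tensor bytes in host memory, then copy them into the tensor
    // through the backend API so it works regardless of where the tensor was
    // allocated (CPU, Metal, ...).
    std::vector<uint8_t> read_buf(ggml_nbytes(tensor));
    file.seek(offset, SEEK_SET);
    file.read_raw(read_buf.data(), read_buf.size());
    ggml_backend_tensor_set(tensor, read_buf.data(), 0, read_buf.size());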

Contributor Author (ltoniazzi):

@slaren Fixed, thanks, and thanks for the references! The code now runs without errors, but it's incredibly slow (slower on Metal than on CPU with the LoRA applied).

Can you give a couple of tips on where to look for this performance issue?

I suspect the issue is that I am adding new attributes to llama_context to store the LoRA weights, basically lora_data (containing a new ggml context, backend and buffer) and lora_weights_map below, and that these new attributes mess up the way llama_context is handled by the backend.

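A rough approximation of the additions being described (member names are guesses based on this comment, not the exact code in the branch):

    #include <map>
    #include <string>

    #include "ggml.h"
    #include "ggml-backend.h"

    // Sketch of the adapter state added to llama_context (names assumed).
    struct lora_data {
        ggml_context *        ctx     = nullptr; // holds the adapter tensors
        ggml_backend_t        backend = nullptr; // backend the adapter was loaded onto
        ggml_backend_buffer_t buffer  = nullptr; // device buffer backing the tensors
    };

    // New members of llama_context (sketch):
    //   lora_data lora;                                       // adapter storage
    //   std::map<std::string, lora_weights> lora_weights_map; // base tensor name -> {A, B}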

Contributor Author (ltoniazzi):

Also, I'm not sure what the impact of the LoRA layers is on the KV cache; I'll try to understand that.

@ngxson mentioned this pull request on Jul 6, 2024
@ngxson (Collaborator) commented Jul 6, 2024

FYI, I tried the same idea, but it turned out the performance was very bad, so I ended up removing it (and just merging the computed LoRA into the model weights).

Here is my version: ngxson/llama.cpp@master...ngxson:llama.cpp:4e28ad40a099c7f618abf8ae113c4e56ee7705e8

A llama_lora_patch_tensors function is added to inject a graph into each weight of the model. For example, attn_k.weight is replaced with a graph that calculates attn_k.weight.merged = w + A*B.


This can be applied to any weight and any architecture, but the main downside is that the number of nodes increases a lot (bad performance).
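
A sketch of that merged-weight graph in ggml terms (illustrative only, not the exact code in the linked branch; any alpha/rank scaling and dequantization of quantized base weights is omitted):

    #include "ggml.h"

    // Build a node that merges the adapter into the base weight once,
    // instead of applying it on every token.
    static ggml_tensor * build_merged_weight(
            ggml_context * ctx,
            ggml_tensor  * w,      // base weight, e.g. attn_k.weight
            ggml_tensor  * loraA,
            ggml_tensor  * loraB) {
        ggml_tensor * delta  = ggml_mul_mat(ctx, loraA, loraB); // materializes a w-sized delta
        ggml_tensor * merged = ggml_add(ctx, w, delta);
        ggml_format_name(merged, "%s.merged", w->name);
        return merged;
    }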


llama.cpp Outdated
Comment on lines 9733 to 9736
ggml_tensor * t_lora = ggml_mul_mat(ctx0,
ggml_mul_mat(ctx0, loraA, loraB),
cur
);
Collaborator:

This matrix multiplication is why it is extremely slow. It materializes a matrix of the same dimension as the weight, and then multiplies that matrix with the hidden state, essentially doubling the number of matrix multiplications. The lora_mul_mat that I gave you does something very different:

    ggml_tensor * t_lora =
        ggml_mul_mat(ctx0,
            loraB,
            ggml_mul_mat(ctx0,
                loraA,
                cur
            )
        );

These matrix multiplications only produce much smaller vectors as intermediate results. This would be a lot faster, but to make it work loraA will need to be transposed. For optimal efficiency, new kernels optimized for the sizes of the LoRA matrices (very low number of columns) will also need to be implemented. Properly implemented, this will only add a small overhead to the computation (I estimate 3-10%, depending on the rank of the LoRA).
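
For a concrete sense of the difference (illustrative numbers, assuming a square 4096x4096 weight, rank 4 and a single token): materializing A·B costs roughly 4096·4096·4 ≈ 67M multiply-adds, plus another 4096·4096 ≈ 17M to apply the resulting matrix to the hidden state, whereas the re-associated B(A(x)) form costs only 4096·4 + 4·4096 ≈ 33K multiply-adds, which is negligible next to the ~17M of the base W(x) multiplication itself.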

Contributor Author (ltoniazzi):

@slaren Oops! Thanks! Fixed, and it's faster (to the point that I do not perceive a difference from not using the LoRA).

Could the closed PR #996 be something that would work to optimise the performance of the LoRA matrices?

Anyway, I wanted to first set up the current PR so that one can spin up a server and hot-swap adapters. Do you think I should focus on optimising thin-matrix multiplication first instead?

Collaborator:

@ltoniazzi Great to hear that it works on your side.

I think this PR can be a very good starting point for demoing that it works. However, I'll implement a proper API for it in #8332.

In the meantime, it would be useful to have a look at how to convert an HF "safetensors" adapter to gguf. Potentially we can introduce a --lora param to convert_hf_to_gguf.py, which already has all the logic for tensor naming. An example of a safetensors adapter can be found here: https://huggingface.co/grimjim/Llama-3-Instruct-abliteration-LoRA-8B/tree/main

Contributor Author (ltoniazzi):

@ngxson Thanks for the comments 😊!! Btw, I think in your version the large BA matrix is coming up here.

Yes, let me know if you need help. I can start having a look at converting LoRAs from safetensors to gguf. (This conversion was discussed here: there used to be a convert-lora-to-ggml.py script before it was removed in #7204, as it might have had to be maintained for each model architecture.)

Btw, how do you render the model's graph?

Collaborator:

> Btw, how do you render the model's graph?

You can do ggml_graph_dump_dot(gf, NULL, "/tmp/graph.dot").

Then use dot -Tsvg /tmp/graph.dot -o /tmp/graph.svg to convert it to SVG (the default command converts it to PNG, but I find a big PNG quite difficult to open). You can also use an online service: https://dreampuf.github.io/GraphvizOnline/

A small trick is to change rankdir to TB; it fits better on screen. Finally, this won't work if the graph is too big; I can only print the graph of the tinyllama stories15M.gguf model.

Collaborator:

You can also limit the loop in build_llama to include only one layer in the graph.
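
For example (assuming the per-layer loop in build_llama has roughly this shape; the bound shown is only a debugging hack):

    // for (int il = 0; il < n_layer; ++il) {  // original per-layer loop (assumed)
    for (int il = 0; il < 1; ++il) {           // debug: build the graph for layer 0 only
        // ... attention / FFN graph construction for layer il ...
    }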

@ltoniazzi (Contributor Author):
Closing as the feature has been implemented in #8332.

@ltoniazzi closed this on Jul 21, 2024