
LORA Adapter Hot Swap Implementation Problem #10374

Closed
michaellin99999 opened this issue Nov 18, 2024 · 5 comments

@michaellin99999

I have been following the discussions in the following threads:

Pull Request #8332
Pull Request #8857
I believe that the ideal implementation of "hot swap" should address the following scenario:

When processing a request, llama.cpp should be able to dynamically determine and apply the correct LoRA adapter based on the specific requirements of the request. While I understand that the current implementation involves a scaling mechanism, this approach introduces significant issues.

For example, when llama.cpp is running as a server handling multiple simultaneous requests with different LoRA adapters, the scaling method creates a problematic dependency. If Request 1 comes in requiring LoRA Adapter 1, the scaling is adjusted to prioritize Adapter 1. However, if Request 2 arrives shortly afterward, requiring LoRA Adapter 2, the scaling is adjusted again, effectively disabling Adapter 1 in favor of Adapter 2. This adjustment disrupts Request 1 if it is still in the middle of processing.
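
To make the failure mode concrete, here is a rough client-side sketch of that race (assuming the /lora-adapters and /completion endpoints added in #8332; the port, payload shapes, and adapter IDs are placeholders, not guaranteed API):

```python
# Rough sketch of the race described above (not actual llama.cpp code).
# Assumes a local llama-server on port 8080 with two adapters loaded at startup,
# and POST /lora-adapters plus POST /completion endpoints; exact payload shapes
# are assumptions.
import threading

import requests

SERVER = "http://localhost:8080"

def handle_request(prompt: str, adapter_id: int, other_id: int) -> None:
    # Step 1: "activate" this request's adapter by rescaling globally.
    requests.post(f"{SERVER}/lora-adapters", json=[
        {"id": adapter_id, "scale": 1.0},
        {"id": other_id, "scale": 0.0},
    ])
    # Step 2: start generating. If the other request rescales in the meantime,
    # this generation silently continues with the wrong adapter.
    requests.post(f"{SERVER}/completion", json={"prompt": prompt})

# Two concurrent requests that want different adapters step on each other's scales.
t1 = threading.Thread(target=handle_request, args=("As character A: ...", 0, 1))
t2 = threading.Thread(target=handle_request, args=("As character B: ...", 1, 0))
t1.start(); t2.start()
t1.join(); t2.join()
```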

This issue becomes even more pronounced in streaming scenarios where a high volume of concurrent requests is being processed, as is often the case in production-level systems.

Why must LoRA adapters rely on scaling adjustments? Why can’t they be separated and applied independently per request? In both threads (#8332 and #8857), I see other users emphasizing that the entire purpose of hot-swap functionality is to enable per-request adapter switching, yet the authors repeatedly suggest merging beforehand, citing computational expense, and effectively shut down the users who ask for this change.

However, the whole point of hot swap is precisely to avoid merging, as this is impractical in many real-world applications. Whether for runtime environments, pre-deployment preparations, or edge devices, merging is often not feasible—especially when considering dynamic content updates or systems with continuously expanding features.

For example, in a system where NPCs need to roleplay various characters that can be expanded or updated, hot swapping LoRA adapters on a per-request basis is essential.

I also note that this hot swap functionality is already implemented in frameworks like ollama and vLLM. Why, then, has it not been properly implemented in llama.cpp? (Or perhaps I’ve missed something and this feature already exists—if so, I’d appreciate guidance on how to use it). At the moment, however, I do not see this capability.
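
For comparison, the per-request behavior I mean is roughly what vLLM's offline API exposes. A hedged sketch (the exact API differs across vLLM versions, and the model name and adapter paths below are placeholders):

```python
# Sketch of per-request LoRA selection in vLLM's offline API (version-dependent;
# model name and adapter paths are placeholders).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(max_tokens=128)

# Each call selects its own adapter; no global scale is mutated between requests.
out_a = llm.generate(["As character A: ..."], params,
                     lora_request=LoRARequest("npc_a", 1, "/path/to/npc_a_adapter"))
out_b = llm.generate(["As character B: ..."], params,
                     lora_request=LoRARequest("npc_b", 2, "/path/to/npc_b_adapter"))
```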

@ngxson (Collaborator) commented Nov 18, 2024

Please refer to this discussion: #8849 (reply in thread)

TL;DR:

  • LoRA changes only apply when server is not processing any requests
  • LoRA cannot be changed mid-way (i.e. per request) because multiple requests can be processed in one batch

@michaellin99999 (Author)


Hi, thank you for your posts. Regarding "LoRA cannot be changed mid-way (i.e. per request) because multiple requests can be processed in one batch": what concerns me is the case where request A is assigned LoRA #1 and the model is in the middle of generating its response. If, at that moment, the server receives a request for LoRA #2 and the scales are changed, request A's response is affected. How do I avoid this? Is there no way to make it possible for LoRA #1 and LoRA #2 to coexist?

Thanks

@ngxson (Collaborator) commented Nov 18, 2024

Is there no way to make it possible for LoRA #1 and LoRA #2 to coexist?

It's not possible, at least for now, because both requests are processed in the same batch (and thus use the same model weights).

On the other hand, it seems what you observed is indeed a bug. The server does not currently wait for all requests to finish; it can apply LoRA changes while it is still generating tokens. This is not desirable and needs to be fixed.

Another thing worth noting: you can use the /slots endpoint to detect when the server is idle, then apply the LoRA change. However, if your server handles a high volume of requests, there is a lower chance of finding a window in which to safely change LoRA adapters.
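
For example, a client-side sketch of this polling approach might look like the following (the /slots response field checked here and the /lora-adapters payload are assumptions; adjust to whatever your server version actually returns):

```python
# Client-side sketch: wait for the server to look idle, then update LoRA scales.
# The /slots field name ("is_processing") and the /lora-adapters payload shape
# are assumptions.
import time

import requests

SERVER = "http://localhost:8080"

def wait_until_idle(poll_interval: float = 0.2, timeout: float = 30.0) -> bool:
    """Poll /slots until no slot is busy, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        slots = requests.get(f"{SERVER}/slots").json()
        if not any(slot.get("is_processing") for slot in slots):
            return True
        time.sleep(poll_interval)
    return False

def set_lora_scales(scales: list) -> None:
    """scales: list of {"id": adapter_index, "scale": float} objects."""
    requests.post(f"{SERVER}/lora-adapters", json=scales)

# Example: enable adapter 1 and disable adapter 0 once the server looks idle.
if wait_until_idle():
    set_lora_scales([{"id": 0, "scale": 0.0}, {"id": 1, "scale": 1.0}])
else:
    print("server never went idle; lora scales left unchanged")
```

Note that even with polling there is still a window between the idle check and the scale update in which a new request can arrive, which is the same race discussed above.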

@michaellin99999 (Author)

It is weird to apply a LoRA swap only when the server is idle; the swap is only meaningful when actual users request it, i.e. "summarize this for me", "calculate this for me", and so on. I think the overall design of LoRA support in llama.cpp is counter-intuitive: it should prioritize how the feature will actually be used rather than minimizing compute. This issue is even more evident in edge use cases, especially with small models that have to fit multiple use cases.

github-actions bot added the stale label Dec 19, 2024

github-actions bot commented Jan 3, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Jan 3, 2025