LoRA Adapter Hot Swap Implementation Problem #10374
Please refer to this discussion: #8849 (reply in thread). TL;DR:

> Hi, thank you for your posts. Regarding "LoRA cannot be changed mid-way (i.e. per request) because multiple requests can be processed in one batch": what I am concerned about is this. Say request A is assigned LoRA #1, and the model is in the middle of generating the response for request A. If at that instant the server receives a request for LoRA #2 and the scales are changed, this affects the response for request A. How do I avoid this? Is there no way to make it possible for LoRA #1 and LoRA #2 to coexist? Thanks

> It's not possible, at least for now, because both requests are processed in the same batch (and thus use the same model weights). On the other hand, it seems like what you observed is indeed a bug. The server does not currently wait for all requests to finish; it can apply LoRA changes while it is generating tokens. This is not desirable and needs to be fixed. Another thing worth noting: you can use …

> It is weird to apply a LoRA swap only when the server is idle; the swap is only meaningful when actual users request it to happen, i.e. "summarize this for me", "calculate this for me", etc. I think the overall design for llama.cpp LoRA is just counter-intuitive. I really think it should prioritize how this will be used rather than minimizing compute. This issue is more evident for edge use cases, especially small models that have to fit multiple use cases.

> This issue was closed because it has been inactive for 14 days since being marked as stale.
I have been following the discussions in the following threads:
- Pull Request #8332
- Pull Request #8857
I believe that the ideal implementation of "hot swap" should address the following scenario:
When processing a request, llama.cpp should be able to dynamically determine and apply the correct LoRA adapter based on the specific requirements of the request. While I understand that the current implementation involves a scaling mechanism, this approach introduces significant issues.
For example, when llama.cpp is running as a server handling multiple simultaneous requests with different LoRA adapters, the scaling method creates a problematic dependency. If Request 1 comes in requiring LoRA Adapter 1, the scaling is adjusted to prioritize Adapter 1. However, if Request 2 arrives shortly afterward, requiring LoRA Adapter 2, the scaling is adjusted again, effectively disabling Adapter 1 in favor of Adapter 2. This adjustment disrupts Request 1 if it is still in the middle of processing.
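To make the race concrete, here is a minimal client-side sketch (not code from the repository) of how the scale-based mechanism has to be driven today. It assumes a locally running llama-server loaded with two adapters and exposing the `/completion` and `/lora-adapters` endpoints described in the server README; URLs, adapter paths, and field names may differ across versions. Because the scales are a single server-wide setting, the second request's change leaks into the first request's remaining tokens.

```python
# Illustrative sketch of the race described above.
# Assumes llama-server was started with two adapters, e.g.:
#   llama-server -m base.gguf --lora-scaled lora1.gguf 0.0 --lora-scaled lora2.gguf 0.0
import threading
import requests

SERVER = "http://localhost:8080"

def set_scales(scales):
    # Global, server-wide setting: there is no per-request association.
    requests.post(f"{SERVER}/lora-adapters",
                  json=[{"id": i, "scale": s} for i, s in enumerate(scales)])

def handle_request(prompt, scales):
    set_scales(scales)  # adjust scales for "this" request...
    return requests.post(f"{SERVER}/completion",
                         json={"prompt": prompt, "n_predict": 128}).json()

# Request 1 wants only adapter 0; request 2 wants only adapter 1.
t1 = threading.Thread(target=handle_request, args=("Summarize X.", [1.0, 0.0]))
t2 = threading.Thread(target=handle_request, args=("Roleplay as Y.", [0.0, 1.0]))
t1.start()
# If this runs while request 1 is still decoding, request 1's remaining
# tokens are generated with request 2's scales.
t2.start()
t1.join()
t2.join()
```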
This issue becomes even more pronounced in streaming scenarios where a high volume of concurrent requests is being processed, as is often the case in production systems.
Why must LoRA adapters rely on scaling adjustments? Why can't they be separated and applied independently per request? In both threads (#8332 and #8857), I see other users emphasizing that the entire purpose of hot swap functionality is to enable per-request adapter switching. Yet the authors repeatedly suggest that merging should happen beforehand, citing computational expense, and I see them practically shutting down the users who propose this change.
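To be explicit about what I am asking for, here is a hypothetical request shape. The `lora` field below is not part of the current server API; it is only meant to illustrate per-request adapter selection, where each request names the adapter(s) it needs and the server resolves them per sequence instead of rescaling a global setting.

```python
# Hypothetical request shape: the "lora" field is an illustration of the
# per-request selection being asked for, not an existing llama.cpp feature.
import requests

resp = requests.post("http://localhost:8080/completion", json={
    "prompt": "Summarize this document ...",
    "n_predict": 128,
    # Hypothetical: adapters scoped to this request only.
    "lora": [{"id": 0, "scale": 1.0}],
})
print(resp.json())
```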
However, the whole point of hot swap is precisely to avoid merging, as this is impractical in many real-world applications. Whether for runtime environments, pre-deployment preparations, or edge devices, merging is often not feasible—especially when considering dynamic content updates or systems with continuously expanding features.
For example, in a system where NPCs need to roleplay various characters that can be expanded or updated, hot swapping LoRA adapters on a per-request basis is essential.
I also note that this hot swap functionality is already implemented in frameworks like ollama and vLLM. Why, then, has it not been properly implemented in llama.cpp? (Or perhaps I’ve missed something and this feature already exists—if so, I’d appreciate guidance on how to use it). At the moment, however, I do not see this capability.
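For comparison, this is how per-request adapter selection looks in vLLM's OpenAI-compatible server, as far as I can tell: adapters are registered at startup with `--lora-modules`, and each request simply names the adapter it wants in the `model` field. The model, adapter name, and path below are placeholders; check the vLLM documentation for exact flags.

```python
# Server side (placeholder names/paths):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --enable-lora --lora-modules npc-pirate=/adapters/pirate
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    # Selecting the LoRA adapter is just a per-request field.
    model="npc-pirate",
    messages=[{"role": "user", "content": "Introduce yourself."}],
)
print(resp.choices[0].message.content)
```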