
LORA Adapter Hot Swap Implementation Problem #10374

Closed
michaellin99999 opened this issue Nov 18, 2024 · 5 comments

@michaellin99999

I have been following the discussions in the following threads:

Pull Request #8332
Pull Request #8857
I believe that the ideal implementation of "hot swap" should address the following scenario:

When processing a request, llama.cpp should be able to dynamically determine and apply the correct LoRA adapter based on the specific requirements of the request. While I understand that the current implementation involves a scaling mechanism, this approach introduces significant issues.

For example, when llama.cpp is running as a server handling multiple simultaneous requests with different LoRA adapters, the scaling method creates a problematic dependency. If Request 1 comes in requiring LoRA Adapter 1, the scaling is adjusted to prioritize Adapter 1. However, if Request 2 arrives shortly afterward, requiring LoRA Adapter 2, the scaling is adjusted again, effectively disabling Adapter 1 in favor of Adapter 2. This adjustment disrupts Request 1 if it is still in the middle of processing.
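
To make the failure mode concrete, here is a rough client-side sketch of that race (assuming the /lora-adapters and /completion endpoints added in #8332; the port, payload shapes, and adapter IDs are placeholders, not guaranteed API):

```python
# Rough sketch of the race described above (not actual llama.cpp code).
# Assumes a local llama-server on port 8080 with two adapters loaded at startup,
# and POST /lora-adapters plus POST /completion endpoints; exact payload shapes
# are assumptions.
import threading

import requests

SERVER = "http://localhost:8080"

def handle_request(prompt: str, adapter_id: int, other_id: int) -> None:
    # Step 1: "activate" this request's adapter by rescaling globally.
    requests.post(f"{SERVER}/lora-adapters", json=[
        {"id": adapter_id, "scale": 1.0},
        {"id": other_id, "scale": 0.0},
    ])
    # Step 2: start generating. If the other request rescales in the meantime,
    # this generation silently continues with the wrong adapter.
    requests.post(f"{SERVER}/completion", json={"prompt": prompt})

# Two concurrent requests that want different adapters step on each other's scales.
t1 = threading.Thread(target=handle_request, args=("As character A: ...", 0, 1))
t2 = threading.Thread(target=handle_request, args=("As character B: ...", 1, 0))
t1.start(); t2.start()
t1.join(); t2.join()
```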

This issue becomes even more pronounced in streaming scenarios where a high volume of concurrent requests is being processed, as is often the case in production-level systems.

Why must LoRA adapters rely on scaling adjustments? Why can’t they be separated and applied independently per request? In both threads (#8332 and #8857), I see other users emphasizing that the entire purpose of hot-swap functionality is to enable per-request adapter switching, yet the authors repeatedly suggest merging beforehand, citing computational expense, and effectively shut down the users who ask for this change.

However, the whole point of hot swap is precisely to avoid merging, as this is impractical in many real-world applications. Whether for runtime environments, pre-deployment preparations, or edge devices, merging is often not feasible—especially when considering dynamic content updates or systems with continuously expanding features.

For example, in a system where NPCs need to roleplay various characters that can be expanded or updated, hot swapping LoRA adapters on a per-request basis is essential.

I also note that this hot swap functionality is already implemented in frameworks like ollama and vLLM. Why, then, has it not been properly implemented in llama.cpp? (Or perhaps I’ve missed something and this feature already exists—if so, I’d appreciate guidance on how to use it). At the moment, however, I do not see this capability.
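
For comparison, the per-request behavior I mean is roughly what vLLM's offline API exposes. A hedged sketch (the exact API differs across vLLM versions, and the model name and adapter paths below are placeholders):

```python
# Sketch of per-request LoRA selection in vLLM's offline API (version-dependent;
# model name and adapter paths are placeholders).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(max_tokens=128)

# Each call selects its own adapter; no global scale is mutated between requests.
out_a = llm.generate(["As character A: ..."], params,
                     lora_request=LoRARequest("npc_a", 1, "/path/to/npc_a_adapter"))
out_b = llm.generate(["As character B: ..."], params,
                     lora_request=LoRARequest("npc_b", 2, "/path/to/npc_b_adapter"))
```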

@ngxson (Collaborator) commented Nov 18, 2024

Please refer to this discussion: #8849 (reply in thread)

TL;DR:

  • LoRA changes only apply when server is not processing any requests
  • LoRA cannot be changed mid-way (i.e. per request) because multiple requests can be processed in one batch

@michaellin99999 (Author)


Hi, thank you for your posts. Regarding "LoRA cannot be changed mid-way (i.e. per request) because multiple requests can be processed in one batch": what concerns me is the case where request A is assigned LoRA #1 and the model is in the middle of generating its response. If, at that moment, the server receives a request for LoRA #2 and the scales are changed, request A's response is affected. How do I avoid this? Is there no way to make it possible for LoRA #1 and LoRA #2 to coexist?

Thanks

@ngxson (Collaborator) commented Nov 18, 2024

Is there no way to make it possible for LoRA #1 and LoRA #2 to coexist?

It's not possible, at least for now, because both requests are processed in the same batch (and thus use the same model weights).

On the other hand, it seems what you observed is indeed a bug. The server does not currently wait for all requests to finish; it can apply LoRA changes while it is still generating tokens. This is not desirable and needs to be fixed.

Another thing worth noting: you can use the /slots endpoint to detect when the server is idle, then apply the LoRA change. However, if your server handles a high volume of requests, there is a lower chance of finding a window in which to safely change LoRA adapters.
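
For example, a client-side sketch of this polling approach might look like the following (the /slots response field checked here and the /lora-adapters payload are assumptions; adjust to whatever your server version actually returns):

```python
# Client-side sketch: wait for the server to look idle, then update LoRA scales.
# The /slots field name ("is_processing") and the /lora-adapters payload shape
# are assumptions.
import time

import requests

SERVER = "http://localhost:8080"

def wait_until_idle(poll_interval: float = 0.2, timeout: float = 30.0) -> bool:
    """Poll /slots until no slot is busy, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        slots = requests.get(f"{SERVER}/slots").json()
        if not any(slot.get("is_processing") for slot in slots):
            return True
        time.sleep(poll_interval)
    return False

def set_lora_scales(scales: list) -> None:
    """scales: list of {"id": adapter_index, "scale": float} objects."""
    requests.post(f"{SERVER}/lora-adapters", json=scales)

# Example: enable adapter 1 and disable adapter 0 once the server looks idle.
if wait_until_idle():
    set_lora_scales([{"id": 0, "scale": 0.0}, {"id": 1, "scale": 1.0}])
else:
    print("server never went idle; lora scales left unchanged")
```

Note that even with polling there is still a window between the idle check and the scale update in which a new request can arrive, which is the same race discussed above.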

@michaellin99999 (Author)

It is weird to apply a LoRA swap only when the server is idle; the swap is only meaningful when actual users request it, i.e. "summarize this for me", "calculate this for me", and so on. I think the overall design of LoRA support in llama.cpp is counter-intuitive: it should prioritize how the feature will actually be used rather than minimizing compute. This issue is even more evident in edge use cases, especially with small models that have to fit multiple use cases.

github-actions bot added the stale label Dec 19, 2024

github-actions bot commented Jan 3, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Jan 3, 2025