LoRA support #51
Hello, thanks for raising this issue. Looking through the MLX LM implementation, it seems quite reasonable for us to add support for this.
Hi, thanks for your prompt response. Absolutely! Ideally, though, the web server would accept such an [optional] argument so that it can be passed as a parameter in the JSON payload, allowing requests like this:
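(A sketch of such a request, assuming a hypothetical adapter field in an OpenAI-style chat completions payload; this field is what is being requested here, not an existing LM Studio option, and the model name, port, and adapter path are placeholders.)

```python
import requests

# Hypothetical request shape: the "adapter" field is the optional parameter
# being requested in this issue, not part of LM Studio's current API.
payload = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",  # placeholder model name
    "adapter": "path/to/my_lora_adapters",                   # requested optional field
    "messages": [
        {"role": "user", "content": "Summarize this report."}
    ],
    "temperature": 0.7,
}

# LM Studio's local server exposes an OpenAI-compatible endpoint (default port shown).
response = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```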
Please note the adapter parameter in the payload.
Would creating and using a fused model be a suitable alternative? There is more documentation here: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md#fuse. I think it's non-trivial to enhance the API to make adapter selection a good user experience; the LM Studio team would need to design a process to register one or more adapters for each model so that the API request can stay as simple as passing an adapter name.
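(For reference, a rough sketch of the fused-model route using the mlx_lm.fuse entry point described in the linked LORA.md; the model name and paths are placeholders.)

```python
import subprocess

# Fuse LoRA adapter weights into the base model so the result can be served
# like any ordinary model. Model name and paths below are placeholders.
subprocess.run(
    [
        "python", "-m", "mlx_lm.fuse",
        "--model", "mlx-community/Mistral-7B-Instruct-v0.3-4bit",  # base model
        "--adapter-path", "adapters",                               # trained LoRA weights
        "--save-path", "fused_model",                               # output directory
    ],
    check=True,
)
```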
I have used the fused-model approach and it works, but it requires storing and managing a large number of near-identical models, whereas with MLX only the adapter needs to be loaded, especially if the base model is already in memory. Fused models don't scale either: imagine receiving multiple simultaneous requests, each pointing to a different fused model. Each one is treated as a completely different model, so the entire LLM has to be reloaded into memory even though another instance of it (differing only by the LoRA weights) is already loaded. With the MLX approach, the base model is loaded once and it's just a matter of loading a different LoRA adapter for each request, which makes a huge difference.
I will raise this feature request with the team so we can place it on our roadmap. In the meantime, using fused models seems to be a workable (though suboptimal and inflexible) way to use adapters. I will keep this issue open to track the feature request, since we can definitely improve the UX on this front.
Hi,
I use MLX to create LoRA adapters in .safetensors format.
MLX LM has a web server that can take the name/path of a LoRA adapter at inference time: when using the chat completions endpoint, the adapter is specified via the adapter parameter in the JSON payload.
It would be great to see the same supported in LM Studio.
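(For context, a minimal sketch of applying such an adapter with the mlx_lm Python API; the model name and adapter path are placeholders. The standalone server can likewise be launched with an adapter via its --adapter-path argument.)

```python
from mlx_lm import load, generate

# Load the base model and apply LoRA adapter weights saved as .safetensors.
# "adapters" is the directory produced by mlx_lm LoRA training (placeholder path).
model, tokenizer = load(
    "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    adapter_path="adapters",
)

print(generate(model, tokenizer, prompt="Hello!", max_tokens=64))
```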