
Choosing the model via the POST request when making an API call #7

Open
MervinPraison opened this issue Jul 16, 2024 · 10 comments
Labels: enhancement (New feature or request)

@MervinPraison commented Jul 16, 2024

Currently, providing a model is a required argument:

python vision.py
usage: vision.py [-h] -m MODEL [-b BACKEND] [-f FORMAT] [-d DEVICE] [--device-map DEVICE_MAP]
                 [--max-memory MAX_MEMORY] [--no-trust-remote-code] [-4] [-8] [-F] [-T MAX_TILES]
                 [-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [-P PORT] [-H HOST] [--preload]
vision.py: error: the following arguments are required: -m/--model

Aim: add the ability to choose the model when calling the API. This would be a great option.
It gives additional flexibility.

@matatonic (Owner)

If I understand you correctly, you would want the model loaded as specified by the client side?

So, something like:

response = client.chat.completions.create(model="OpenGVLab/InternVL2-Llama3-76B", messages=messages, **params)

This is a bit complex because you can't specify any options like --load-in-4bit, flash-attn, etc. It would probably need a model-specific default config which would also be loaded with the request. I'm working on a system for this with the openedai-image server, but I'm not really happy yet with how complex it is.
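
To make the idea concrete, a minimal sketch of what such a model-specific default config could look like, written as a plain Python mapping (the option names here just mirror the CLI flags above and are assumptions, not the project's actual config format):

# Hypothetical per-model defaults keyed by the model id a client would request.
# These stand in for load-time options that can't be passed through the chat API.
MODEL_DEFAULTS = {
    "OpenGVLab/InternVL2-Llama3-76B": {"load_in_4bit": True, "use_flash_attn": True},
    "vikhyatk/moondream2": {"load_in_4bit": False, "use_flash_attn": True},
}

def load_options_for(model_id: str) -> dict:
    # Fall back to conservative defaults for models without an explicit entry.
    return MODEL_DEFAULTS.get(model_id, {"load_in_4bit": True, "use_flash_attn": False})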

@MervinPraison (Author)

Yes, correct.

So to make it simple, we could set some default values such as --load-in-4bit, flash-attn, etc. for all models to start with.

Based on the request it receives, it downloads the model and gets it ready to be served (which means the first API call will take some time to get a response back).
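
A minimal sketch of that lazy-loading idea, assuming a hypothetical load_model() helper that stands in for the server's real loading code (the first request for a model blocks while it downloads; later requests reuse the cached instance):

import threading

_loaded: dict = {}            # model_id -> loaded model instance
_lock = threading.Lock()

def get_model(model_id: str):
    # Download/load the model on first use, then keep it around for later requests.
    with _lock:
        if model_id not in _loaded:
            # load_model is hypothetical; defaults like 4-bit and flash-attn would be applied here.
            _loaded[model_id] = load_model(model_id, load_in_4bit=True, use_flash_attn=True)
        return _loaded[model_id]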

@matatonic (Owner)

Just FYI, I think the openai client times out after about 30 or 60 seconds, so it's likely this will not work well unless the model is very small.
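
If a slow first request is acceptable, one client-side workaround is to raise the request timeout; the timeout argument is part of the official openai Python client, though the base URL below is just an assumed local address:

from openai import OpenAI

# Allow up to 10 minutes for the first (model-downloading) request.
client = OpenAI(base_url="http://localhost:5006/v1", api_key="sk-none", timeout=600.0)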

What about a web UI instead? I just don't think the API is well suited for model management but I do admit it's a nice feature.

@MervinPraison (Author)

Sure. I get you. That's fine.

Just another thought.
For the first request, it could just respond with "Model is being downloaded, please try again in a few minutes" (considering the openai client timeout is about 60 seconds).
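
As a rough sketch only (a FastAPI-style handler with hypothetical is_loaded / start_background_load / run_inference helpers, not the server's actual code), that could look like:

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/v1/chat/completions")
def chat_completions(request: dict):
    model_id = request.get("model", "")
    if not is_loaded(model_id):            # hypothetical helper
        start_background_load(model_id)    # hypothetical helper, returns immediately
        raise HTTPException(status_code=503,
                            detail="Model is being downloaded, please try again in a few minutes")
    return run_inference(model_id, request)  # hypothetical helper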

If it's not well suited and is too complicated, then we don't need to do this. These are just ideas.

Yes, Web UI is fine instead. Thanks :)

@saket424

Here is how it is done by llama.cpp:

https://github.com/Jaimboh/Llama.cpp-Local-OpenAI-server/blob/main/README.md

Multiple Model Load with Config
python -m llama_cpp.server --config_file config.json

cat config.json

{
    "host": "0.0.0.0",
    "port": 8000,
    "models": [
      {
        "model": "models/mistral-7b-instruct-v0.1.Q4_0.gguf",
        "model_alias": "mistral",
        "chat_format": "chatml",
        "n_gpu_layers": -1,
        "offload_kqv": true,
        "n_threads": 12,
        "n_batch": 512,
        "n_ctx": 2048
      },
      {
        "model": "models/mixtral-8x7b-instruct-v0.1.Q2_K.gguf",
        "model_alias": "mixtral",
        "chat_format": "chatml",
        "n_gpu_layers": -1,
        "offload_kqv": true,
        "n_threads": 12,
        "n_batch": 512,
        "n_ctx": 2048
      },
      {
        "model": "models/mistral-7b-instruct-v0.1.Q4_0.gguf",
        "model_alias": "mistral-function-calling",
        "chat_format": "functionary",
        "n_gpu_layers": -1,
        "offload_kqv": true,
        "n_threads": 12,
        "n_batch": 512,
        "n_ctx": 2048
      }
    ]
  }

You can preload an array of models as specified in config.json, and it is smart enough to swap in the right model as specified in the client request.
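
On the client side, selecting one of those preloaded models is then just a matter of passing its alias in the request, e.g.:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-none")

# "mistral", "mixtral" and "mistral-function-calling" are the model_alias values
# from config.json above; the server swaps in the matching model.
response = client.chat.completions.create(
    model="mixtral",
    messages=[{"role": "user", "content": "Hello!"}],
)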

@saket424

Here is a blog with a simple Streamlit GUI interface too:

https://medium.com/@odhitom09/running-openais-server-locally-with-llama-cpp-5f29e0d955b7

@saket424 commented Jul 16, 2024

I would love some kind of GUI for interacting with these multimodal models, especially initially, before I know how to automate it.

@matatonic (Owner)

I would love some kind of GUI for interacting with these multimodal models, especially initially, before I know how to automate it.

For manually interacting with the models, I can highly recommend either open-webui (via Docker, which also works with openedai-speech, whisper, images, etc. - I use this) or web.chatbox.app (can be used fully in the browser, without any installation). In either of those you can configure an OpenAI API provider (with the API base URL) for the gpt-4-vision-preview model and use the GUI to upload and chat with images in context.

For testing, I prefer the 'raw' text output from the included console app chat_with_image.py.
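
For scripting the same thing, the standard openai client can also be pointed at the server's base URL directly (the URL/port here is just an assumed local address):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="sk-none")

# Send a local image as a base64 data URL alongside the question.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ]}],
)
print(response.choices[0].message.content)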

@saket424

@matatonic
I am very interested in the ability of openedai-vision to load multiple models simultaneously. For example, florence2 is very good at bounding boxes, moondream2 is very fast for a summary inference, and the minicpm2.5 model is the newest kid on the block and takes more firepower but performs better and can compare two input images, etc. The point is that different models excel at different things, and the pipeline may need to invoke all three of them efficiently. It would be excellent if we could support all three models, preload them, and have the right model be invoked by the POST request when the API call is made. How feasible is this to do?

Thanks again for providing the newest models so swiftly.
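
A rough sketch of the setup being asked for here, preloading a few models and routing each request by its model field (load_model and the .chat() call are hypothetical placeholders, and the model ids are only examples):

# Preload several models at startup, then route each request by its "model" field.
PRELOAD = ["microsoft/Florence-2-large", "vikhyatk/moondream2", "openbmb/MiniCPM-Llama3-V-2_5"]

models = {name: load_model(name) for name in PRELOAD}   # load_model is hypothetical

def handle_chat_request(request: dict):
    model = models.get(request["model"])
    if model is None:
        raise ValueError(f"unknown model: {request['model']}")
    return model.chat(request["messages"])               # hypothetical inference call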

@matatonic (Owner)

It's doable, and I will probably do this along with model switching/selecting via API in an upcoming release.

It's a more significant change, and I'll need to update my testing as well, so it might take a bit longer.

PS. I'm currently out of country and have limited access to the internet.

@matatonic matatonic self-assigned this Sep 11, 2024
@matatonic matatonic added the enhancement New feature or request label Sep 11, 2024