Albert models

Deploy a full OpenAI API with vLLM that supports all embedding models

🔥 News:

  • [2024-07-11] Pass the .env file to Docker Compose so additional vLLM environment variables can be set
  • [2024-07-10] Add an API swagger to test all endpoints
  • [2024-07-10] Support an API_KEY to protect your model API
  • [2024-07-10] Use the /v1 base URL directly for Langchain integration

Coming soon:

  • Use pytest for unit tests (tests.py file)

vLLM is one of the state-of-the-art libraries for deploying a Large Language Model (LLM) behind a high-performance generation API. Although it can deploy an API that follows OpenAI conventions, vLLM does not currently support all embedding models on the /v1/embeddings endpoint (see this discussion).

This repository makes it easy to add the /v1/embeddings endpoint by deploying an embedding model with HuggingFace Text Embeddings Inference (TEI) and serving everything on a single port. The aim of this repository is a complete API that is lightweight and easy to use and maintain!

The API offers the following OpenAI endpoints (a usage sketch follows the list):

  • /v1/models
  • /v1/completions
  • /v1/chat/completions
  • /v1/embeddings
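
As a quick illustration, here is a minimal sketch that calls the chat and embeddings endpoints with the official openai Python client. The base URL, API key, and model names are assumptions: adjust them to your deployment.

from openai import OpenAI

# Assumed deployment values: adjust to your setup.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="mysecretkey")

# Chat completion, served by vLLM (the model name is a placeholder).
chat = client.chat.completions.create(
    model="< language model >",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

# Embeddings, served by TEI behind the same port (placeholder model name).
emb = client.embeddings.create(model="< embeddings model >", input="Hello!")
print(len(emb.data[0].embedding))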

⚙️ How it works?

🍿 The swagger

The API exposes a swagger to test all endpoints.

📦 Models

Currently, this architecture supports almost all LLM and embedding models. The response of the /v1/models endpoint adds a new "type" key, which takes the value "text-generation" or "text-embeddings-inference" depending on the nature of the model (language or embeddings). These values correspond to the labels given to models on HuggingFace. Example:

{
    "object": "list", 
    "data": [
        {
            "model": < language model >,
            "type": "text-generation",
            ...
        },
        {
            "model": < embeddings model >,
            "type": "text-embeddings-inference",
            ...
        }
    ]
}
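
A client can use this "type" key to pick the right model for each endpoint. A minimal sketch with the requests library, assuming the API runs on http://localhost:8080 with the API key mysecretkey sent as a standard Bearer token:

import requests

# Assumed deployment values: adjust to your setup.
response = requests.get(
    "http://localhost:8080/v1/models",
    headers={"Authorization": "Bearer mysecretkey"},
)
models = response.json()["data"]

# Split the models by the "type" key added by this API.
llms = [m for m in models if m["type"] == "text-generation"]
embedders = [m for m in models if m["type"] == "text-embeddings-inference"]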

🚀 Quickstart

  • First, configure a .env file or modify the .env.example file in this repository. For more information about the configuration, please refer to the configuration section.

  • Then, run the containers with Docker Compose:

    docker compose --env-file .env.example up --detach

🔧 Configuration

To pass additional vLLM environment variables (see https://docs.vllm.ai/en/stable/serving/env_vars.html), add them to the .env file.

The following variables are available:

  • EMBEDDINGS_HF_REPO_ID: HuggingFace repository ID of the embeddings model. Please refer to the HuggingFace Text Embeddings Inference documentation to find supported models.
  • LLM_HF_REPO_ID: HuggingFace repository ID of the LLM. Please refer to the vLLM documentation to find supported models.
  • TEI_ARGS: arguments for Text Embeddings Inference (format: --arg1 --arg2). Please refer to the HuggingFace Text Embeddings Inference documentation for more information.
  • VLLM_ARGS: arguments for vLLM (format: --arg1 --arg2). Please refer to the vLLM documentation for more information.
  • HF_TOKEN: HuggingFace API token for private models on the HuggingFace Hub (optional).
  • API_KEY: API key to protect your model API (optional).
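
For illustration, here is a minimal .env sketch; every value below is a placeholder, not a recommendation:

# All values are illustrative placeholders: replace them with your own.
EMBEDDINGS_HF_REPO_ID=< embeddings model repo id >
LLM_HF_REPO_ID=< language model repo id >
TEI_ARGS=
VLLM_ARGS=
HF_TOKEN=< optional HuggingFace token >
API_KEY=mysecretkey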

🦜 Langchain integration

You can use the deployed API with Langchain to create embedding vectors for your vector store. For example:

from langchain_community.embeddings import HuggingFaceHubEmbeddings

# Point the Langchain embeddings wrapper at the deployed /v1 base URL.
embeddings = HuggingFaceHubEmbeddings(model="http://localhost:8080/v1")
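
The object can then be used like any other Langchain embeddings backend; a short usage sketch with arbitrary example texts:

# Embed documents for a vector store and a query for retrieval.
vectors = embeddings.embed_documents(["Albert is deployed!"])
query_vector = embeddings.embed_query("Who is Albert?")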

🔦 Tests

To test if your deployment is up, you can use the following command:

python tests.py --base-url http://localhost:8080 --api-key mysecretkey --debug

⚠️ The vllm container may take several minutes to start, particularly if it has to download the model.
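
If you prefer to wait programmatically instead of retrying by hand, here is a small polling sketch. The base URL and API key are the same assumptions as above, with the key sent as a standard Bearer token:

import time

import requests

# Assumed deployment values: adjust to your setup.
BASE_URL = "http://localhost:8080"
API_KEY = "mysecretkey"

# Poll /v1/models until the API answers, since vLLM can take minutes to start.
while True:
    try:
        response = requests.get(
            f"{BASE_URL}/v1/models",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=5,
        )
        if response.status_code == 200:
            print("API is up.")
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(10)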
