🔥 News:
- [2024-07-11] Add a `.env` file in Docker Compose to pass additional vLLM environment variables
- [2024-07-10] Add an API Swagger to test all endpoints
- [2024-07-10] Support `API_KEY` to protect your model API
- [2024-07-10] Use the base URL `/v1` directly for LangChain integration
Incoming:
- Use pytest for unit tests (`tests.py` file)
vLLM is one of the state-of-the-art libraries for deploying a Large Language Model (LLM) and its API with high generation performance. However, vLLM does not currently support all embedding models on the `/v1/embeddings` endpoint, although it can be used to deploy an API following OpenAI conventions (see this discussion).

This repository makes it easy to add the `/v1/embeddings` endpoint by deploying an embedding model with HuggingFace Text Embeddings Inference (TEI) and serving everything on a single port. The aim of this repository is to provide a complete API that is very light and easy to use and maintain!
The API offers the following OpenAI endpoints (a quick usage sketch follows the list):
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`
- `/v1/embeddings`
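For example, here is a minimal sketch of calling the chat completions endpoint with the official `openai` Python client. The port, API key, and model name are assumptions taken from the examples further down in this README; substitute the values of your own deployment:

```python
from openai import OpenAI

# Base URL and API key are assumptions — use your own deployment values.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="mysecretkey")

# The model name is only an example: use the LLM you set in LLM_HF_REPO_ID.
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```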
Currently, this architecture supports almost all LLM and embedding models. The response of the `/v1/models` endpoint adds a new `"type"` key, which takes the value `"text-generation"` or `"text-embeddings-inference"` depending on the nature of the model (language or embeddings). These values correspond to the labels given to models on HuggingFace. Example:
```json
{
    "object": "list",
    "data": [
        {
            "model": <language model>,
            "type": "text-generation",
            ...
        },
        {
            "model": <embeddings model>,
            "type": "text-embeddings-inference",
            ...
        }
    ]
}
```
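A small sketch of reading that `"type"` key programmatically, assuming the API key is sent as a standard OpenAI-style Bearer token:

```python
import requests

# Assumes the API is reachable on localhost:8080; adjust URL and key
# to your own deployment.
resp = requests.get(
    "http://localhost:8080/v1/models",
    headers={"Authorization": "Bearer mysecretkey"},
)
resp.raise_for_status()

# Each entry carries the extra "type" key described above.
for model in resp.json()["data"]:
    print(model["model"], "->", model["type"])
```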
- First, configure a `.env` file or modify the `.env.example` file in this repository. For more information about the configuration, please refer to the configuration section.
- Then, run the containers with Docker Compose:

```bash
docker compose --env-file .env.example up --detach
```
For additional vLLM environment variables (see https://docs.vllm.ai/en/stable/serving/env_vars.html), add them to the `.env` file (see the example after the table below).
| Variable | Description |
|---|---|
| `EMBEDDINGS_HF_REPO_ID` | HuggingFace repository ID of the embeddings model. Please refer to the HuggingFace Text Embeddings Inference documentation to find supported models. |
| `LLM_HF_REPO_ID` | HuggingFace repository ID of the LLM. Please refer to the vLLM documentation to find supported models. |
| `TEI_ARGS` | Arguments for Text Embeddings Inference (format: `--arg1 --arg2`). Please refer to the HuggingFace Text Embeddings Inference documentation for more information. |
| `VLLM_ARGS` | Arguments for vLLM (format: `--arg1 --arg2`). Please refer to the vLLM documentation for more information. |
| `HF_TOKEN` | HuggingFace API token for private models on the HuggingFace Hub (optional). |
| `API_KEY` | API key to protect your model (optional). |
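For illustration, a minimal `.env` could look like the sketch below. The model IDs and arguments are examples only, not defaults of this repository:

```env
# Example embeddings model served by TEI
EMBEDDINGS_HF_REPO_ID=BAAI/bge-large-en-v1.5
# Example LLM served by vLLM
LLM_HF_REPO_ID=mistralai/Mistral-7B-Instruct-v0.2
# Space-separated extra arguments (examples)
TEI_ARGS=--max-client-batch-size 32
VLLM_ARGS=--max-model-len 8192
# Optional: HuggingFace token and API key
HF_TOKEN=<your HuggingFace token>
API_KEY=mysecretkey
```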
You can use the deployed API with LangChain to create embedding vectors for your vector store. For example:
```python
from langchain_community.embeddings import HuggingFaceHubEmbeddings

embeddings = HuggingFaceHubEmbeddings(model="http://localhost:8080/v1")
```
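The resulting object then behaves like any other LangChain embeddings class, e.g. (a sketch with arbitrary texts):

```python
# Embed a single query string
query_vector = embeddings.embed_query("What is vLLM?")

# Embed several documents at once, e.g. before indexing a vector store
doc_vectors = embeddings.embed_documents(
    ["vLLM serves the LLM.", "TEI serves the embeddings model."]
)
print(len(query_vector), len(doc_vectors))
```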
To test if your deployment is up, you can use the following command:

```bash
python tests.py --base-url http://localhost:8080 --api-key mysecretkey --debug
```
⚠️ The `vllm` container may take several minutes to start, particularly if it has to download the model.