For a step-by-step guide and more context, see my blog post: https://blog.timleers.com/a-stupidly-minimal-llm-api-starterkit-deploy-llm-endpoints-in-a-minute-with-langchain-and-fastapi
This repository is the most minimal implementation of an LLM API possible, focusing on making this technology accessible to coders new to LLMs & APIs. The core libraries are langchain and fastapi.
More comprehensive examples that adhere to best practices are on the way:
- Serverless LLM application deployment examples are currently being developed at https://github.com/tleers/serverless-llm-app-factory, extending the llm-api-starterkit to web deployment & alternative compute options
There are three steps to running the demo or starting development with this template:
- Installation of general Python package requirements/dependencies
- Selection of LLM model & dependencies
- Running the FastAPI application
We use the most common way of installing dependencies: pip install with a requirements.txt file. This tutorial was created using Python 3.10.
pip install -r requirements.txt
It is advised to install these requirements in a virtual environment. To create one and install the requirements into it, run:
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
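To confirm the dependencies landed in the active environment, a quick import check (not part of the starterkit) is enough:

```python
# Optional sanity check: confirm the two core libraries import cleanly
# from the environment you just installed into.
import fastapi
import langchain

print("fastapi", fastapi.__version__)
print("langchain", langchain.__version__)
```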
Ideally, we would use dependency management with poetry for a smoother experience (see https://github.com/tleers/minimal-serverless-llm-deployment for an example), but we skip that additional complexity in this example.
- Change the filename of .env.example to .env
- Add your OpenAI API key to .env
Done.
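For reference, the key only has to reach the process environment: uvicorn's `--env-file` flag (used in the run commands further down) loads `.env` at startup, and LangChain's OpenAI integration reads the key from there. A minimal check, assuming the standard `OPENAI_API_KEY` variable name:

```python
import os

# LangChain's OpenAI wrapper picks the key up from the environment, so the
# app itself needs no extra configuration code once .env has been loaded.
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("OPENAI_API_KEY is missing - add it to your .env file")
```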
Note that you need sufficiently powerful hardware to run a local model. If you're just experimenting, the OpenAI API is easier to start with: creating an account gives you free credits, which are usually more than you need.
We use the LlamaCpp integration (a short usage sketch follows the steps below): https://python.langchain.com/en/latest/modules/models/llms/integrations/llamacpp.html
- Download model weights that are compatible with the llamacpp implementation. I use the quantized Vicuna 1.1 weights from https://huggingface.co/vicuna/ggml-vicuna-7b-1.1/blob/main/ggml-vic7b-uncensored-q4_0.bin, as recommended on https://old.reddit.com/r/LocalLLaMA/wiki/models
- Make sure the model weights are in the current directory and you know the filename. In this tutorial, the filename is ggml-vic7b-uncensored-q4_0.bin
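Under the hood, weights like these are loaded through LangChain's LlamaCpp wrapper. The snippet below is a minimal sketch of that pattern, assuming the llama-cpp-python package is installed and the weights sit in the working directory; it is not the repository's exact code.

```python
# Minimal sketch: load the quantized Vicuna weights with LangChain's LlamaCpp
# wrapper and run a single completion. Parameter values are illustrative.
from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path="./ggml-vic7b-uncensored-q4_0.bin")
print(llm("Summarize in one sentence: FastAPI is a web framework for Python."))
```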
LangChain support for LlamaCpp is currently iffy on Apple Silicon, so we use the GPT4All integration instead (a usage sketch follows these steps). Download the model file as follows:
- Download model weights from https://gpt4all.io/index.html. There are many different models available; take a look at what best fits your use case. I use "ggml-gpt4all-j-v1.3-groovy.bin"
- Make sure the model weights are in the current directory and you know the filename. In this tutorial, the filename is ggml-gpt4all-j-v1.3-groovy.bin
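The same pattern applies here, just with LangChain's GPT4All wrapper. A minimal sketch, assuming the gpt4all Python bindings are installed and the file sits in the working directory (not the repository's exact code):

```python
# Minimal sketch: load the GPT4All-J weights with LangChain's GPT4All wrapper
# and run a single completion. Parameter values are illustrative.
from langchain.llms import GPT4All

llm = GPT4All(model="./ggml-gpt4all-j-v1.3-groovy.bin")
print(llm("Summarize in one sentence: LangChain helps you chain LLM calls."))
```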
You should be ready to run the most basic example.
With OpenAI API
uvicorn app.main_openai:app --port 80 --env-file .env
With a local LLM using Vicuna (LlamaCpp), compatible with the x86_64 architecture
uvicorn app.main_local_lamacpp:app --port 80
With a local LLM using GPT4All, compatible with x86_64 as well as aarch64 (Apple M1, M2) architectures.
uvicorn app.main_local_gpt_4_all:app --port 80
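If you want a feel for what such a minimal app looks like before opening the source, here is a rough sketch in the spirit of the OpenAI variant; the route name, prompt, and parameters are illustrative, not the repository's exact code:

```python
# Rough sketch of a minimal FastAPI + LangChain summarization app.
# The /summarize route and the prompt wording are illustrative only.
from fastapi import FastAPI
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

app = FastAPI()

prompt = PromptTemplate(
    input_variables=["text"],
    template="Summarize the following text:\n\n{text}",
)
# OpenAI() reads OPENAI_API_KEY from the environment loaded via --env-file.
chain = LLMChain(llm=OpenAI(), prompt=prompt)


@app.post("/summarize")
def summarize(text: str):
    # Run the chain on the incoming text and return the model output as JSON.
    return {"summary": chain.run(text)}
```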
Go to http://localhost:80/docs to see the automatically generated API documentation.
You can also try out the summarization endpoint by clicking Try it out!
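You can also call the API outside the browser. The snippet below assumes the hypothetical /summarize route from the sketch above; adjust the path and parameters to whatever your /docs page actually shows:

```python
# Hypothetical client call against the locally running app.
import requests

response = requests.post(
    "http://localhost:80/summarize",
    params={"text": "FastAPI makes it easy to expose LLM chains as endpoints."},
)
print(response.json())
```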