This project demonstrates how to integrate FastAPI with Ollama, a tool for running and managing AI models. It showcases three main functionalities:
- Streaming Responses: Receive and display raw streaming responses from the Ollama API.
- Formatted Responses: Aggregate and format streaming responses into a cohesive output.
- Complete JSON Responses: Handle and display complete JSON responses from the Ollama API.
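As a rough illustration of the "formatted" mode, the sketch below joins streamed chunks into one string. It assumes each streamed line is a JSON object carrying a `response` text fragment and a `done` flag, which is how Ollama's generate API streams output. The function name `aggregate_stream` is hypothetical and is not taken from this project's code.

```python
import json
from typing import Iterable


def aggregate_stream(lines: Iterable[str]) -> str:
    """Join the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done; stop reading
    return "".join(parts)


# Hand-written sample chunks (not real model output), for illustration only.
sample_chunks = [
    '{"response": "Hello", "done": false}',
    '{"response": " world", "done": true}',
]
combined = aggregate_stream(sample_chunks)
```

With the sample above, `combined` holds the two fragments joined in arrival order.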
- Python: Ensure you have Python 3.7 or later installed on your system. You can download Python from the official Python website.
- FastAPI: A modern, fast (high-performance) web framework for building APIs with Python.
- Requests: A simple HTTP library for Python.
- Ollama: A tool for running AI models locally.
- FastAPI and Requests: Install both with pip:
pip install fastapi requests
- Ollama: Follow the instructions on the Ollama GitHub repository to install Ollama, and make sure the llama3.1 model is available locally. You can usually download it through Ollama’s CLI (for example, ollama pull llama3.1) or the web interface. The Ollama Python client library can be installed from the command line:
pip install ollama
Note that this command installs only the Python client library; it does not install the Ollama runtime itself.
- app.py: Defines a FastAPI application with endpoints for generating raw and formatted responses from the Ollama API.
- send_request.py: A command-line script to send requests to the FastAPI server and print responses. It supports both raw and formatted responses.
- demo_script.py: Demonstrates how to use the send_request function to retrieve streaming, formatted, and complete JSON responses.
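To make the role of send_request.py more concrete, here is a minimal sketch of how such a script might turn its positional arguments (`<model> <prompt> [stream] [formatted]`) into a target URL and a JSON payload. The helper names `parse_flag` and `parse_args` are hypothetical, and choosing the endpoint from the `formatted` flag is an assumption based on the `/generate` and `/generate_formatted` endpoints used elsewhere in this README. The real script may be organised differently.

```python
from typing import Dict, List, Tuple

BASE_URL = "http://localhost:8000"  # FastAPI server address used in this README


def parse_flag(value: str) -> bool:
    """Interpret a command-line string such as "True" or "false" as a boolean."""
    return value.strip().lower() in ("true", "1", "yes")


def parse_args(argv: List[str]) -> Tuple[str, Dict[str, object]]:
    """Map [model, prompt, stream?, formatted?] to (url, payload)."""
    if len(argv) < 2:
        raise SystemExit(
            "usage: send_request.py <model> <prompt> [stream] [formatted]"
        )
    model, prompt = argv[0], argv[1]
    # Both optional flags default to False when omitted.
    stream = parse_flag(argv[2]) if len(argv) > 2 else False
    formatted = parse_flag(argv[3]) if len(argv) > 3 else False
    # Assumption: the formatted flag selects the endpoint.
    endpoint = "/generate_formatted" if formatted else "/generate"
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return BASE_URL + endpoint, payload
```

In a script, this would typically be called as `parse_args(sys.argv[1:])`, with the resulting URL and payload passed to an HTTP client such as Requests.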
git clone https://github.com/darcyg32/fastapi-ollama-demo.git
cd fastapi-ollama-demo
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Start the FastAPI server:
uvicorn app:app --reload
The server will be available at http://localhost:8000.
- Using the Command-Line Script: You can use send_request.py to interact with the FastAPI server. Here’s how to use it:
python send_request.py <model> <prompt> [stream] [formatted]
  - <model>: The name of the model to use (e.g., llama3.1).
  - <prompt>: The prompt to send to the model.
  - [stream]: Optional flag to enable streaming (default is False).
  - [formatted]: Optional flag to get a formatted response (default is False).
Example:
python send_request.py llama3.1 "Write a haiku." True True
- Using the Demo Script: Run demo_script.py to see the demo in action:
python demo_script.py
This script will show examples of streaming, formatted, and complete JSON responses.
- Using cURL:
- Get Raw Streaming Response Example:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{ "model": "llama3.1", "prompt": "Write a haiku.", "stream": true }'
- Get Formatted Response Example:
curl -X POST "http://localhost:8000/generate_formatted" -H "Content-Type: application/json" -d '{ "model": "llama3.1", "prompt": "Write a haiku.", "stream": false }'
- Get Complete JSON Response Example:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{ "model": "llama3.1", "prompt": "Write a haiku.", "stream": false }'
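The same requests can be issued from Python. The sketch below mirrors the cURL calls above using only the standard library, so it runs without extra dependencies (the project itself uses Requests). The function names `build_request` and `send` are hypothetical, and the default base URL is the server address given in this README.

```python
import json
import urllib.request


def build_request(path: str, model: str, prompt: str, stream: bool,
                  base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request equivalent to the cURL examples."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": stream}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the whole response body as text.

    Requires the FastAPI server to be running; this reads the full body
    rather than processing streamed chunks incrementally.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `send(build_request("/generate", "llama3.1", "Write a haiku.", False))` corresponds to the complete JSON response call.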
- Ensure that Ollama is properly configured and running locally on http://localhost:11434. Update the URL in app.py if your Ollama instance is hosted elsewhere.
- The FastAPI server and Ollama must be running simultaneously to process requests successfully.
- For more details on FastAPI and Requests, refer to their respective official documentation.
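One way to avoid editing app.py when Ollama is hosted elsewhere is to read the base URL from an environment variable. The sketch below assumes a hypothetical variable named `OLLAMA_URL` (this project does not define it) and falls back to the default local address. It also includes a simple reachability check, which assumes the Ollama server answers a plain GET on its base URL with HTTP 200.

```python
import os
import urllib.error
import urllib.request
from typing import Mapping, Optional

DEFAULT_OLLAMA_URL = "http://localhost:11434"


def ollama_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Ollama base URL, preferring the hypothetical OLLAMA_URL variable."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).rstrip("/")


def generate_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Full URL of Ollama's generate endpoint."""
    return ollama_base_url(env) + "/api/generate"


def ollama_is_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on the base URL succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, DNS failure, HTTP error
```

Calling `ollama_is_reachable(ollama_base_url())` before starting the FastAPI server gives an early warning if Ollama is not running.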
For reference, this project was developed and tested on the following hardware:
- Processor: AMD Ryzen 5 5600X 6-Core
- GPU: NVIDIA GeForce RTX 3060 Ti
- RAM: 32 GB
- Operating System: Ubuntu/WSL on Windows 11
- Storage: 2 TB SSD
These specifications were sufficient for running the FastAPI server and Ollama integration demo. If your hardware differs or you encounter performance issues, you may need to adjust your setup accordingly.
This project is licensed under the MIT License - see the LICENSE file for details.