Deploy a llama.cpp server on fly.io.
Uses minimal dependencies to keep the image small. Downloads model files on first boot and caches them in a volume for fast subsequent cold starts.
```sh
fly launch --no-deploy
fly vol create models -s 10 --vm-gpu-kind a10 --region ord
fly secrets set API_KEY=<your-api-key>
fly deploy
```
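The boot-time download and caching works roughly like the sketch below: check the volume for the model file, fetch it only if missing, then start the server. This is a hypothetical sketch of the pattern rather than the repo's actual entrypoint; the binary name and paths are assumptions, while the flags mirror llama.cpp's server options.

```sh
#!/bin/sh
set -e
# Download the model only on first boot; later boots reuse the
# copy cached in the mounted volume.
if [ ! -f "/models/$MODEL_FILE" ]; then
  wget -O "/models/$MODEL_FILE" "$MODEL_URL"
fi
# Start the llama.cpp server against the cached model.
exec ./llama-server --host 0.0.0.0 --port 8080 \
  --model "/models/$MODEL_FILE" --api-key "$API_KEY"
```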
The provided Dockerfile is configured to use the `a10` GPU kind. To use a different GPU:

- Update the `CUDA_DOCKER_ARCH` variable in the build step to an appropriate value for the desired GPU. A list of arch values can be found here. e.g. set `CUDA_DOCKER_ARCH=compute_86` for compute capability 8.6.
- Update the `--vm-gpu-kind` flag in the `fly vol create` command to the desired GPU kind. e.g. pass `--vm-gpu-kind a100` for an A100 GPU.
- Update the `vm.gpu_kind` value in the fly.toml file to the desired GPU kind. e.g. set `gpu_kind = "a100"` for an A100 GPU (see the sketch after this list).
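Taken together, switching from the default a10 to an A100 touches three places. The snippets below are a sketch, assuming the `ARG` form in the Dockerfile and the `[[vm]]` section form in fly.toml; the exact surrounding lines in this repo may differ. (A100s have compute capability 8.0, hence `compute_80`.)

```dockerfile
# Dockerfile build step: target the A100's compute capability 8.0
ARG CUDA_DOCKER_ARCH=compute_80
```

```toml
# fly.toml: request a matching GPU kind for the machine
[[vm]]
  gpu_kind = "a100"
```

The volume creation command changes the same way: `fly vol create models -s 10 --vm-gpu-kind a100 --region ord`.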
This example uses the `phi-3-mini-4k-instruct` model by default. To use a different model:

- Update the `MODEL_URL` and `MODEL_FILE` env variables in the fly.toml file to your desired model. The file will be downloaded as `/models/$MODEL_FILE` on the next deploy (see the sketch after this list).
- To delete any existing model files, use `fly ssh console` to connect to your machine and run `rm /models/<model_file>`.
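The `[env]` section might look like the following. The URL and filename are placeholders rather than a tested download link; substitute the actual GGUF build you want.

```toml
[env]
  # Placeholders; point these at the GGUF file you actually want.
  MODEL_URL = "https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
  MODEL_FILE = "<model>.gguf"
```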
This example sets the `--api-key` flag on the server start command to guard against unauthorized access. To set the API key:

```sh
fly secrets set API_KEY=<your-api-key>
```

The app will use the new API key on the next deploy.
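Clients then authenticate with that key. llama.cpp's server accepts it as a bearer token on its OpenAI-compatible endpoints, so a request looks roughly like this (the app hostname is a placeholder for your deployed app):

```sh
curl "https://<your-app>.fly.dev/v1/chat/completions" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```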