# fly-llama.cpp

Deploy a llama.cpp server on fly.io.

Uses minimal dependencies to keep the image small. Model files are downloaded on first boot and cached in a volume so that subsequent cold starts are fast.
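The boot-time caching amounts to a check against the mounted volume before starting the server. A minimal sketch of that flow, assuming the server binary lives at `/server`, that `wget` is available in the image, and that `MODEL_URL`, `MODEL_FILE`, and `API_KEY` come from the environment (the actual entrypoint script in this repo may differ):

```sh
#!/bin/sh
set -e

# Download the model only if it is not already cached on the /models volume.
if [ ! -f "/models/$MODEL_FILE" ]; then
    echo "Model not cached; downloading from $MODEL_URL..."
    wget -O "/models/$MODEL_FILE" "$MODEL_URL"
fi

# Start the llama.cpp server against the cached model file.
exec /server -m "/models/$MODEL_FILE" --host 0.0.0.0 --port 8080 --api-key "$API_KEY"
```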

## Usage

```sh
fly launch --no-deploy
fly vol create models -s 10 --vm-gpu-kind a10 --region ord
fly secrets set API_KEY=<your-api-key>
fly deploy
```
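After the deploy you can watch the first boot (including the initial model download) and probe the server once it is up. A quick check, assuming the app serves HTTPS on its default `.fly.dev` hostname; llama.cpp exposes a `/health` endpoint, though whether it requires the API key depends on the llama.cpp version:

```sh
# Tail logs to watch the machine boot and the model download on first start
fly logs

# Probe the server (replace <your-app> with the name `fly launch` chose)
curl https://<your-app>.fly.dev/health
```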

## Configuration

### GPU

The provided Dockerfile is configured for the a10 GPU kind. To use a different GPU (see the sketch after this list):

1. Update the `CUDA_DOCKER_ARCH` variable in the build step to a value appropriate for the desired GPU; NVIDIA lists compute capabilities per GPU at https://developer.nvidia.com/cuda-gpus. E.g. set `CUDA_DOCKER_ARCH=compute_86` for compute capability 8.6.
2. Update the `--vm-gpu-kind` flag in the `fly vol create` command to the desired GPU kind, e.g. `--vm-gpu-kind a100` for an A100 GPU.
3. Update `vm.gpu_kind` in the `fly.toml` file to the desired GPU kind, e.g. `gpu_kind = "a100"` for an A100 GPU.
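As a concrete sketch, switching everything to an A100 (compute capability 8.0, hence `compute_80`) would mean edits along these lines, using the same names the steps above reference:

```
# Dockerfile build step: A100 is compute capability 8.0
CUDA_DOCKER_ARCH=compute_80

# Volume creation
fly vol create models -s 10 --vm-gpu-kind a100 --region ord

# fly.toml
[vm]
  gpu_kind = "a100"
```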

### Model

This example uses the phi-3-mini-4k-instruct model by default. To use a different model:

1. Update the `MODEL_URL` and `MODEL_FILE` environment variables in the `fly.toml` file to point at your desired model (see the example below). The file will be downloaded to `/models/$MODEL_FILE` on the next deploy.
2. To delete any existing model files, use `fly ssh console` to connect to your machine and run `rm /models/<model_file>`.
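For example, to serve a Mistral 7B Instruct GGUF instead (the URL below is illustrative; point it at whichever GGUF file you want), the `env` section of `fly.toml` would look something like:

```toml
[env]
  MODEL_URL = "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
  MODEL_FILE = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
```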

### API Key

This example sets the `--api-key` flag on the server start command to guard against unauthorized access. To set the API key:

```sh
fly secrets set API_KEY=<your-api-key>
```

The app will use the new API key on the next deploy.
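llama.cpp's server checks the key as a bearer token, so clients pass it in an `Authorization` header. For example, against the OpenAI-compatible chat endpoint (replace the app name and key with your own):

```sh
curl https://<your-app>.fly.dev/v1/chat/completions \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```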
