# worker-aphrodite-engine

A RunPod worker for the Aphrodite Engine, enabling efficient text generation and processing.

## Setting up on RunPod

This worker runs the Aphrodite Engine on RunPod Serverless. To set up the worker on RunPod, follow these steps:
- Go to the RunPod dashboard and create a new serverless template.
- Input your container image, or use the pre-built image `joachimchauvet/worker-aphrodite-engine:latest` from Docker Hub.
- Select your desired GPU and other hardware specifications.
- Set the environment variables as needed (see below).
- Deploy a serverless endpoint using the template.
## Environment Variables

The following environment variables can be set to configure the Aphrodite Engine:

- `DOWNLOAD_DIR`: Directory to download the model to (recommended: `/runpod-volume`, see below)
- `MODEL` or `MODEL_NAME` (required): Name or path of the Hugging Face model to use
- `REVISION`: Specific model version to use (branch, tag, or commit ID)
- `DATATYPE`: Data type to use (`auto`, `float16`, `bfloat16`, `float32`)
- `KVCACHE`: KV cache data type
- `MAX_MODEL_LEN` or `CONTEXT_LENGTH`: Model context size
- `NUM_GPUS`: Number of GPUs for tensor parallelism
- `GPU_MEMORY_UTILIZATION`: GPU memory utilization factor
- `QUANTIZATION`: Quantization method
- `ENFORCE_EAGER`: If set, disables CUDA graphs
- `KOBOLD_API`: If set, launches the Kobold API
- `CMD_ADDITIONAL_ARGUMENTS`: Any additional command-line arguments
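As a sketch of how these variables map onto engine arguments, the helper below is hypothetical (the worker's actual parsing, argument names, and defaults may differ):

```python
import os


def build_engine_args(env=os.environ):
    """Map worker environment variables to Aphrodite engine arguments.

    Hypothetical helper for illustration only; the real worker's
    parsing may differ.
    """
    args = {}

    # MODEL or MODEL_NAME is the one required variable.
    model = env.get("MODEL") or env.get("MODEL_NAME")
    if not model:
        raise ValueError("MODEL or MODEL_NAME is required")
    args["model"] = model

    # Optional string-valued settings pass through unchanged.
    if "DOWNLOAD_DIR" in env:
        args["download_dir"] = env["DOWNLOAD_DIR"]
    if "REVISION" in env:
        args["revision"] = env["REVISION"]
    if "DATATYPE" in env:
        args["dtype"] = env["DATATYPE"]
    if "KVCACHE" in env:
        args["kv_cache_dtype"] = env["KVCACHE"]
    if "QUANTIZATION" in env:
        args["quantization"] = env["QUANTIZATION"]

    # Numeric settings are converted from their string form.
    context = env.get("MAX_MODEL_LEN") or env.get("CONTEXT_LENGTH")
    if context:
        args["max_model_len"] = int(context)
    if "NUM_GPUS" in env:
        args["tensor_parallel_size"] = int(env["NUM_GPUS"])
    if "GPU_MEMORY_UTILIZATION" in env:
        args["gpu_memory_utilization"] = float(env["GPU_MEMORY_UTILIZATION"])

    # ENFORCE_EAGER is a flag: its presence disables CUDA graphs.
    if "ENFORCE_EAGER" in env:
        args["enforce_eager"] = True

    return args
```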
## Model Storage

It's recommended to use a network volume for model storage. To do this:
- Create a network volume in your RunPod account.
- When deploying the pod, attach the network volume.
- Set the `DOWNLOAD_DIR` environment variable to `/runpod-volume`.
This ensures that your models are persistently stored and can be reused across deployments.
## Usage

Send requests to the endpoint with a `prompt` for plain text completion:

```json
{
  "input": {
    "prompt": "Once upon a time",
    "sampling_params": {
      "max_tokens": 400,
      "temperature": 0.7
    }
  }
}
```
Or with a `messages` list for chat-style completion:

```json
{
  "input": {
    "messages": [{ "role": "user", "content": "Hello" }],
    "sampling_params": {
      "max_tokens": 100,
      "temperature": 0.7
    }
  }
}
```
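A deployed endpoint can be called over RunPod's standard serverless HTTP API. A minimal sketch using only the Python standard library (the endpoint ID and API key are placeholders you supply from your RunPod account):

```python
import json
import urllib.request


def build_payload(prompt, max_tokens=400, temperature=0.7):
    """Wrap a prompt in the worker's input format shown above."""
    return {
        "input": {
            "prompt": prompt,
            "sampling_params": {
                "max_tokens": max_tokens,
                "temperature": temperature,
            },
        }
    }


def run_sync(endpoint_id, api_key, payload):
    """POST a payload to RunPod's synchronous /runsync route and
    return the parsed JSON response."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a real endpoint ID and API key):
# result = run_sync("YOUR_ENDPOINT_ID", "YOUR_API_KEY",
#                   build_payload("Once upon a time"))
```

A chat-style request works the same way, replacing `prompt` with a `messages` list in the payload as in the second example above.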