diff --git a/0.4/development/index.html b/0.4/development/index.html index 119d345..6706655 100644 --- a/0.4/development/index.html +++ b/0.4/development/index.html @@ -1564,7 +1564,7 @@
Install python 3.10+
.
Install Python (version 3.10 to 3.12).
make install
This page describes the software and networking requirements for the nodes where GPUStack will be installed.
GPUStack requires Python version 3.10 or higher.
+GPUStack requires Python version 3.10 to 3.12.
GPUStack is supported on the following operating systems:
Install python3.10 or above with pip.
+Install Python version 3.10 to 3.12.
Run the following to install GPUStack:
# You can add extra dependencies, options are "vllm", "audio" and "all".
diff --git a/0.4/search/search_index.json b/0.4/search/search_index.json
index f364e91..97a9dbf 100644
--- a/0.4/search/search_index.json
+++ b/0.4/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"api-reference/","title":"API Reference","text":"GPUStack provides a built-in Swagger UI. You can access it by navigating to <gpustack-server-url>/docs
in your browser to view and interact with the APIs.
"},{"location":"architecture/","title":"Architecture","text":"The following diagram shows the architecture of GPUStack:
"},{"location":"architecture/#server","title":"Server","text":"The GPUStack server consists of the following components:
- API Server: Provides a RESTful interface for clients to interact with the system. It handles authentication and authorization.
- Scheduler: Responsible for assigning model instances to workers.
- Model Controller: Manages the rollout and scaling of model instances to match the desired model replicas.
- HTTP Proxy: Routes completion API requests to backend inference servers.
"},{"location":"architecture/#worker","title":"Worker","text":"GPUStack workers are responsible for:
- Running inference servers for model instances assigned to the worker.
- Reporting status to the server.
"},{"location":"architecture/#sql-database","title":"SQL Database","text":"The GPUStack server connects to a SQL database as the datastore. GPUStack uses SQLite by default, but you can configure it to use an external PostgreSQL as well.
"},{"location":"architecture/#inference-server","title":"Inference Server","text":"Inference servers are the backends that performs the inference tasks. GPUStack supports llama-box, vLLM and vox-box as the inference server.
"},{"location":"architecture/#rpc-server","title":"RPC Server","text":"The RPC server enables running llama-box backend on a remote host. The Inference Server communicates with one or several instances of RPC server, offloading computations to these remote hosts. This setup allows for distributed LLM inference across multiple workers, enabling the system to load larger models even when individual resources are limited.
"},{"location":"code-of-conduct/","title":"Contributor Code of Conduct","text":""},{"location":"code-of-conduct/#our-pledge","title":"Our Pledge","text":"We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
"},{"location":"code-of-conduct/#our-standards","title":"Our Standards","text":"Examples of behavior that contributes to a positive environment for our community include:
- Demonstrating empathy and kindness toward other people
- Being respectful of differing opinions, viewpoints, and experiences
- Giving and gracefully accepting constructive feedback
- Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
- Focusing on what is best not just for us as individuals, but for the overall community
Examples of unacceptable behavior include:
- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting or derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
"},{"location":"code-of-conduct/#enforcement-responsibilities","title":"Enforcement Responsibilities","text":"Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
"},{"location":"code-of-conduct/#scope","title":"Scope","text":"This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
"},{"location":"code-of-conduct/#enforcement","title":"Enforcement","text":"Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at contact@gpustack.ai. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the reporter of any incident.
"},{"location":"code-of-conduct/#enforcement-guidelines","title":"Enforcement Guidelines","text":"Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
"},{"location":"code-of-conduct/#1-correction","title":"1. Correction","text":"Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
"},{"location":"code-of-conduct/#2-warning","title":"2. Warning","text":"Community Impact: A violation through a single incident or series of actions.
Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
"},{"location":"code-of-conduct/#3-temporary-ban","title":"3. Temporary Ban","text":"Community Impact: A serious violation of community standards, including sustained inappropriate behavior.
Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
"},{"location":"code-of-conduct/#4-permanent-ban","title":"4. Permanent Ban","text":"Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
Consequence: A permanent ban from any sort of public interaction within the community.
"},{"location":"code-of-conduct/#attribution","title":"Attribution","text":"This Code of Conduct is adapted from the Contributor Covenant, version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.
"},{"location":"contributing/","title":"Contributing to GPUStack","text":"Thanks for taking the time to contribute to GPUStack!
Please review and follow the Code of Conduct.
"},{"location":"contributing/#filing-issues","title":"Filing Issues","text":"If you find any bugs or are having any trouble, please search the reported issue as someone may have experienced the same issue, or we are actively working on a solution.
If you can't find anything related to your issue, contact us by filing an issue. To help us diagnose and resolve, please include as much information as possible, including:
- Software: GPUStack version, installation method, operating system info, etc.
- Hardware: Node info, GPU info, etc.
- Steps to reproduce: Provide as much detail as possible on how you got into the reported situation.
- Logs: Please include any relevant logs, such as server logs, worker logs, etc.
"},{"location":"contributing/#contributing-code","title":"Contributing Code","text":"For setting up development environment, please refer to Development Guide.
If you're fixing a small issue, you can simply submit a PR. However, if you're planning to submit a bigger PR to implement a new feature or fix a relatively complex bug, please open an issue that explains the change and the motivation for it. If you're addressing a bug, please explain how to reproduce it.
"},{"location":"contributing/#updating-documentation","title":"Updating Documentation","text":"If you have any updates to our documentation, feel free to file an issue with the documentation
label or make a pull request.
"},{"location":"development/","title":"Development Guide","text":""},{"location":"development/#prerequisites","title":"Prerequisites","text":"Install python 3.10+
.
"},{"location":"development/#set-up-environment","title":"Set Up Environment","text":"make install\n
"},{"location":"development/#run","title":"Run","text":"poetry run gpustack\n
"},{"location":"development/#build","title":"Build","text":"make build\n
Then check the artifacts in dist
.
"},{"location":"development/#test","title":"Test","text":"make test\n
"},{"location":"development/#update-dependencies","title":"Update Dependencies","text":"poetry add <something>\n
Or
poetry add --group dev <something>\n
For dev/testing dependencies.
"},{"location":"overview/","title":"GPUStack","text":"GPUStack is an open-source GPU cluster manager for running AI models.
"},{"location":"overview/#key-features","title":"Key Features","text":" - Broad Hardware Compatibility: Run with different brands of GPUs in Apple MacBooks, Windows PCs, and Linux servers.
- Broad Model Support: From LLMs to diffusion models, audio, embedding, and reranker models.
- Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
- Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
- Multiple Inference Backends: Supports llama-box (llama.cpp & stable-diffusion.cpp), vox-box, and vLLM as inference backends.
- Lightweight Python Package: Minimal dependencies and operational overhead.
- OpenAI-compatible APIs: Serve APIs that are compatible with OpenAI standards.
- User and API key management: Simplified management of users and API keys.
- GPU metrics monitoring: Monitor GPU performance and utilization in real-time.
- Token usage and rate metrics: Track token usage and manage rate limits effectively.
"},{"location":"overview/#supported-platforms","title":"Supported Platforms","text":" - macOS
- Windows
- Linux
"},{"location":"overview/#supported-accelerators","title":"Supported Accelerators","text":" - Apple Metal (M-series chips)
- NVIDIA CUDA (Compute Capability 6.0 and above)
- Ascend CANN
- Moore Threads MUSA
We plan to support the following accelerators in future releases.
- AMD ROCm
- Intel oneAPI
- Qualcomm AI Engine
"},{"location":"overview/#supported-models","title":"Supported Models","text":"GPUStack uses llama-box (bundled llama.cpp and stable-diffusion.cpp server), vLLM and vox-box as the backends and supports a wide range of models. Models from the following sources are supported:
-
Hugging Face
-
ModelScope
-
Ollama Library
-
Local File Path
"},{"location":"overview/#example-models","title":"Example Models:","text":"Category Models Large Language Models(LLMs) Qwen, LLaMA, Mistral, Deepseek, Phi, Yi Vision Language Models(VLMs) Llama3.2-Vision, Pixtral , Qwen2-VL, LLaVA, InternVL2 Diffusion Models Stable Diffusion, FLUX Rerankers GTE, BCE, BGE, Jina Audio Models Whisper (speech-to-text), CosyVoice (text-to-speech) For full list of supported models, please refer to the supported models section in the inference backends documentation.
"},{"location":"overview/#openai-compatible-apis","title":"OpenAI-Compatible APIs","text":"GPUStack serves OpenAI compatible APIs. For details, please refer to OpenAI Compatible APIs
"},{"location":"quickstart/","title":"Quickstart","text":""},{"location":"quickstart/#installation","title":"Installation","text":""},{"location":"quickstart/#linux-or-macos","title":"Linux or macOS","text":"GPUStack provides a script to install it as a service on systemd or launchd based systems. To install GPUStack using this method, just run:
curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"quickstart/#windows","title":"Windows","text":"Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:
Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
"},{"location":"quickstart/#other-installation-methods","title":"Other Installation Methods","text":"For manual installation, docker installation or detailed configuration options, please refer to the Installation Documentation.
"},{"location":"quickstart/#getting-started","title":"Getting Started","text":" - Run and chat with the llama3.2 model:
gpustack chat llama3.2 \"tell me a joke.\"\n
- Run and generate an image with the stable-diffusion-v3-5-large-turbo model:
Tip
This command downloads the model (~12GB) from Hugging Face. The download time depends on your network speed. Ensure you have enough disk space and VRAM (12GB) to run the model. If you encounter issues, you can skip this step and move to the next one.
gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf \\\n\"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors.\" \\\n--sample-steps 5 --show\n
Once the command completes, the generated image will appear in the default viewer. You can experiment with the prompt and CLI options to customize the output.
- Open
http://myserver
in the browser to access the GPUStack UI. Log in to GPUStack with username admin
and the default password. You can run the following command to get the password for the default setup:
Linux or macOS
cat /var/lib/gpustack/initial_admin_password\n
Windows
Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
- Click
Playground
in the navigation menu. Now you can chat with the LLM in the UI playground.
-
Click API Keys
in the navigation menu, then click the New API Key
button.
-
Fill in the Name
and click the Save
button.
-
Copy the generated API key and save it somewhere safe. Please note that you can only see it once on creation.
-
Now you can use the API key to access the OpenAI-compatible API. For example, use curl as follows:
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"llama3.2\",\n \"messages\": [\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant.\"\n },\n {\n \"role\": \"user\",\n \"content\": \"Hello!\"\n }\n ],\n \"stream\": true\n }'\n
"},{"location":"quickstart/#cleanup","title":"Cleanup","text":"After you complete using the deployed models, you can go to the Models
page in the GPUStack UI and delete the models to free up resources.
"},{"location":"scheduler/","title":"Scheduler","text":""},{"location":"scheduler/#summary","title":"Summary","text":"The scheduler's primary responsibility is to calculate the resources required by models instance and to evaluate and select the optimal workers/GPUs for model instances through a series of strategies. This ensures that model instances can run efficiently. This document provides a detailed overview of the policies and processes used by the scheduler.
"},{"location":"scheduler/#scheduling-process","title":"Scheduling Process","text":""},{"location":"scheduler/#filtering-phase","title":"Filtering Phase","text":"The filtering phase aims to narrow down the available workers or GPUs to those that meet specific criteria. The main policies involved are:
- Label Matching Policy
- Status Policy
- Resource Fit Policy
"},{"location":"scheduler/#label-matching-policy","title":"Label Matching Policy","text":"This policy filters workers based on the label selectors configured for the model. If no label selectors are defined for the model, all workers are considered. Otherwise, the system checks whether the labels of each worker node match the model's label selectors, retaining only those workers that match.
"},{"location":"scheduler/#status-policy","title":"Status Policy","text":"This policy filters workers based on their status, retaining only those that are in a READY state.
"},{"location":"scheduler/#resource-fit-policy","title":"Resource Fit Policy","text":"The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. The goal of this policy is to ensure that model instances can run on the selected nodes without exceeding resource limits. The Resource Fit Policy prioritizes candidates in the following order:
- Single Worker Node, Single GPU Full Offload: Identifies candidates where a single GPU on a single worker can fully offload the model, which usually offers the best performance.
- Single Worker Node, Multiple GPU Full Offload: Identifies candidates where multiple GPUs on a single worker can fully offload the model.
- Single Worker Node Partial Offload: Identifies candidates on a single worker that can handle a partial offload, used only when partial offloading is allowed.
- Distributed Inference Across Multiple Workers: Identifies candidates where a combination of GPUs across multiple workers can handle full or partial offloading, used only when distributed inference across nodes is permitted.
- Single Worker Node, CPU: When no GPUs are available, the system will use the CPU for inference, identifying candidates where memory resources on a single worker are sufficient.
"},{"location":"scheduler/#scoring-phase","title":"Scoring Phase","text":"The scoring phase evaluates the filtered candidates, scoring them to select the optimal deployment location. The primary strategy involved is:
- Placement Strategy Policy
"},{"location":"scheduler/#placement-strategy-policy","title":"Placement Strategy Policy","text":" - Binpack
This strategy aims to \"pack\" as many model instances as possible into the fewest number of \"bins\" (e.g., Workers/GPUs) to optimize resource utilization. The goal is to minimize the number of bins used while maximizing resource efficiency, ensuring each bin is filled as efficiently as possible without exceeding its capacity. Model instances are placed in the bin with the least remaining space to minimize leftover capacity in each bin.
- Spread
This strategy seeks to distribute multiple model instances across different worker nodes as evenly as possible, improving system fault tolerance and load balancing.
"},{"location":"troubleshooting/","title":"Troubleshooting","text":""},{"location":"troubleshooting/#view-gpustack-logs","title":"View GPUStack Logs","text":"If you installed GPUStack using the installation script, you can view GPUStack logs at the following path:
"},{"location":"troubleshooting/#linux-or-macos","title":"Linux or macOS","text":"/var/log/gpustack.log\n
"},{"location":"troubleshooting/#windows","title":"Windows","text":"\"$env:APPDATA\\gpustack\\log\\gpustack.log\"\n
"},{"location":"troubleshooting/#configure-log-level","title":"Configure Log Level","text":"You can enable the DEBUG log level on gpustack start
by setting the --debug
parameter.
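For example, to start GPUStack with debug logging enabled:
gpustack start --debug\n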
You can configure the log level of the GPUStack server at runtime by running the following command on the server node:
curl -X PUT http://localhost/debug/log_level -d \"debug\"\n
"},{"location":"troubleshooting/#reset-admin-password","title":"Reset Admin Password","text":"In case you forgot the admin password, you can reset it by running the following command on the server node:
gpustack reset-admin-password\n
"},{"location":"upgrade/","title":"Upgrade","text":"You can upgrade GPUStack using the installation script or by manually installing the desired version of the GPUStack Python package.
Note
When upgrading, upgrade the GPUStack server first, then upgrade the workers.
"},{"location":"upgrade/#upgrade-gpustack-using-the-installation-script","title":"Upgrade GPUStack Using the Installation Script","text":"To upgrade GPUStack from an older version, re-run the installation script using the same configuration options you originally used.
Running the installation script will:
- Install the latest version of the GPUStack Python package.
- Update the system service (systemd, launchd, or Windows) init script to reflect the arguments passed to the installation script.
- Restart the GPUStack service.
"},{"location":"upgrade/#linux-and-macos","title":"Linux and macOS","text":"For example, to upgrade GPUStack to the latest version on a Linux system and macOS:
curl -sfL https://get.gpustack.ai | <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n
To upgrade to a specific version, specify the INSTALL_PACKAGE_SPEC
environment variable similar to the pip install
command:
curl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=gpustack==x.y.z <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n
"},{"location":"upgrade/#windows","title":"Windows","text":"To upgrade GPUStack to the latest version on a Windows system:
$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
To upgrade to a specific version:
$env:INSTALL_PACKAGE_SPEC = gpustack==x.y.z\n$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } <EXISTING_GPUSTACK_ARGS>\"\n
"},{"location":"upgrade/#docker-upgrade","title":"Docker Upgrade","text":"If you installed GPUStack using Docker, upgrade to the a new version by pulling the Docker image with the desired version tag.
For example:
docker pull gpustack/gpustack:vX.Y.Z\n
Then restart the GPUStack service with the new image.
"},{"location":"upgrade/#manual-upgrade","title":"Manual Upgrade","text":"If you install GPUStack manually, upgrade using the common pip
workflow.
For example, to upgrade GPUStack to the latest version:
pip install --upgrade gpustack\n
Then restart the GPUStack service according to your setup.
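For example, if GPUStack runs as the systemd service described in the Manual Installation guide:
sudo systemctl restart gpustack\n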
"},{"location":"cli-reference/chat/","title":"gpustack chat","text":"Chat with a large language model.
gpustack chat model [prompt]\n
"},{"location":"cli-reference/chat/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for chat. prompt The prompt to send to the model. [Optional]"},{"location":"cli-reference/chat/#one-time-chat-with-a-prompt","title":"One-time Chat with a Prompt","text":"If a prompt is provided, it performs a one-time inference. For example:
gpustack chat llama3 \"tell me a joke.\"\n
Example output:
Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!\n
"},{"location":"cli-reference/chat/#interactive-chat","title":"Interactive Chat","text":"If the prompt
argument is not provided, you can chat with the large language model interactively. For example:
gpustack chat llama3\n
Example output:
>tell me a joke.\nHere's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you smile!\n>Do you have a better one?\nHere's another one:\n\nWhy did the scarecrow win an award?\n\n(think about it for a sec...)\n\nBecause he was outstanding in his field!\n\nHope that one stuck with you!\n\nDo you want to hear another one?\n>\\quit\n
"},{"location":"cli-reference/chat/#interactive-commands","title":"Interactive Commands","text":"Followings are available commands in interactive chat:
Commands:\n \\q or \\quit - Quit the chat\n \\c or \\clear - Clear chat context in prompt\n \\? or \\h or \\help - Print this help message\n
"},{"location":"cli-reference/chat/#connect-to-external-gpustack-server","title":"Connect to External GPUStack Server","text":"If you are not running gpustack chat
on the server node, or if you are serving on a custom host or port, you should provide the following environment variables:
Name Description GPUSTACK_SERVER_URL URL of the GPUStack server, e.g., http://myserver
. GPUSTACK_API_KEY GPUStack API key."},{"location":"cli-reference/download-tools/","title":"gpustack download-tools","text":"Download dependency tools, including llama-box, gguf-parser, and fastfetch.
gpustack download-tools [OPTIONS]\n
"},{"location":"cli-reference/download-tools/#configurations","title":"Configurations","text":"Flag Default Description ----tools-download-base-url
value (empty) Base URL to download dependency tools. --save-archive
value (empty) Path to save downloaded tools as a tar archive. --load-archive
value (empty) Path to load downloaded tools from a tar archive, instead of downloading. --system
value Default is the current OS. Operating system to download tools for. Options: linux
, windows
, macos
. --arch
value Default is the current architecture. Architecture to download tools for. Options: amd64
, arm64
. --device
value Default is the current device. Device to download tools for. Options: cuda
, mps
, npu
, musa
, cpu
."},{"location":"cli-reference/draw/","title":"gpustack draw","text":"Generate an image with a diffusion model.
gpustack draw [model] [prompt]\n
"},{"location":"cli-reference/draw/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for image generation. prompt Text prompt to use for image generation. The model
can be either of the following:
- Name of a GPUStack model. You need to create a model in GPUStack before using it here.
- Reference to a Hugging Face GGUF diffusion model in Ollama style. When using this option, the model will be deployed if it is not already available. When no tag is specified, the default
Q4_0
tag is used. Examples:
hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF
hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:FP16
hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf
"},{"location":"cli-reference/draw/#configurations","title":"Configurations","text":"Flag Default Description --size
value 512x512
Size of the image to generate, specified as widthxheight
. --sampler
value euler
Sampling method. Options include: euler_a, euler, heun, dpm2, dpm++2s_a, dpm++2m, lcm, etc. --sample-steps
value (Empty) Number of sampling steps. --cfg-scale
value (Empty) Classifier-free guidance scale for balancing prompt adherence and creativity. --seed
value (Empty) Seed for random number generation. Useful for reproducibility. --negative-prompt
value (Empty) Text prompt for what to avoid in the image. --output
value (Empty) Path to save the generated image. --show
False
If True, opens the generated image in the default image viewer. -d
, --debug
False
Enable debug mode."},{"location":"cli-reference/start/","title":"gpustack start","text":"Run GPUStack server or worker.
gpustack start [OPTIONS]\n
"},{"location":"cli-reference/start/#configurations","title":"Configurations","text":""},{"location":"cli-reference/start/#common-options","title":"Common Options","text":"Flag Default Description --config-file
value (empty) Path to the YAML config file. -d
value, --debug
value False
To enable debug mode, the short flag -d is not supported in Windows because this flag is reserved by PowerShell for CommonParameters. --data-dir
value (empty) Directory to store data. Default is OS specific. --cache-dir
value (empty) Directory to store cache (e.g., model files). Defaults to /cache. -t
value, --token
value Auto-generated. Shared secret used to add a worker. --huggingface-token
value (empty) User Access Token to authenticate to the Hugging Face Hub. Can also be configured via the HF_TOKEN
environment variable."},{"location":"cli-reference/start/#server-options","title":"Server Options","text":"Flag Default Description --host
value 0.0.0.0
Host to bind the server to. --port
value 80
Port to bind the server to. --disable-worker
False
Disable embedded worker. --bootstrap-password
value Auto-generated. Initial password for the default admin user. --database-url
value sqlite:///<data-dir>/database.db
URL of the database. Example: postgresql://user:password@hostname:port/db_name --ssl-keyfile
value (empty) Path to the SSL key file. --ssl-certfile
value (empty) Path to the SSL certificate file. --force-auth-localhost
False
Force authentication for requests originating from localhost (127.0.0.1). When set to True, all requests from localhost will require authentication. --ollama-library-base-url
https://registry.ollama.ai
Base URL for the Ollama library. --disable-update-check
False
Disable update check."},{"location":"cli-reference/start/#worker-options","title":"Worker Options","text":"Flag Default Description -s
value, --server-url
value (empty) Server to connect to. --worker-ip
value (empty) IP address of the worker node. Auto-detected by default. --disable-metrics
False
Disable metrics. --disable-rpc-servers
False
Disable RPC servers. --metrics-port
value 10151
Port to expose metrics. --worker-port
value 10150
Port to bind the worker to. Use a consistent value for all workers. --log-dir
value (empty) Directory to store logs. --system-reserved
value \"{\\\"ram\\\": 2, \\\"vram\\\": 0}\"
The system reserves resources for the worker during scheduling, measured in GiB. By default, 2 GiB of RAM is reserved, Note: '{\\\"memory\\\": 2, \\\"gpu_memory\\\": 0}' is also supported, but it is deprecated and will be removed in future releases. --tools-download-base-url
value Base URL for downloading dependency tools."},{"location":"cli-reference/start/#available-environment-variables","title":"Available Environment Variables","text":"Most of the options can be set via environment variables. The environment variables are prefixed with GPUSTACK_
and are in uppercase. For example, --data-dir
can be set via the GPUSTACK_DATA_DIR
environment variable.
Below are additional environment variables that can be set:
Flag Description HF_ENDPOINT
Hugging Face Hub endpoint. e.g., https://hf-mirror.com
"},{"location":"cli-reference/start/#config-file","title":"Config File","text":"You can configure start options using a YAML-format config file when starting GPUStack server or worker. Here is a complete example:
# Common Options\ndebug: false\ndata_dir: /path/to/data_dir\ncache_dir: /path/to/cache_dir\ntoken: mytoken\n\n# Server Options\nhost: 0.0.0.0\nport: 80\ndisable_worker: false\ndatabase_url: postgresql://user:password@hostname:port/db_name\nssl_keyfile: /path/to/keyfile\nssl_certfile: /path/to/certfile\nforce_auth_localhost: false\nbootstrap_password: myadminpassword\nollama_library_base_url: https://registry.mycompany.com\ndisable_update_check: false\n\n# Worker Options\nserver_url: http://myserver\nworker_ip: 192.168.1.101\ndisable_metrics: false\ndisable_rpc_servers: false\nmetrics_port: 10151\nworker_port: 10150\nlog_dir: /path/to/log_dir\nsystem_reserved:\n ram: 2\n vram: 0\ntools_download_base_url: https://mirror.mycompany.com\n
"},{"location":"installation/air-gapped-installation/","title":"Air-Gapped Installation","text":"You can install GPUStack in an air-gapped environment. An air-gapped environment refers to a setup where GPUStack will be installed offline, behind a firewall, or behind a proxy.
The following methods are available for installing GPUStack in an air-gapped environment:
- Docker Installation
- Manual Installation
"},{"location":"installation/air-gapped-installation/#docker-installation","title":"Docker Installation","text":"When running GPUStack with Docker, it works out of the box in an air-gapped environment as long as the Docker images are available. To do this, follow these steps:
- Pull GPUStack Docker images in an online environment.
- Publish Docker images to a private registry.
- Refer to the Docker Installation guide to run GPUStack using Docker.
"},{"location":"installation/air-gapped-installation/#manual-installation","title":"Manual Installation","text":"For manual installation, you need to prepare the required packages and tools in an online environment and then transfer them to the air-gapped environment.
"},{"location":"installation/air-gapped-installation/#prerequisites","title":"Prerequisites","text":"Set up an online environment identical to the air-gapped environment, including OS, architecture, and Python version.
"},{"location":"installation/air-gapped-installation/#step-1-download-the-required-packages","title":"Step 1: Download the Required Packages","text":"Run the following commands in an online environment:
# On Windows (PowerShell):\n# $PACKAGE_SPEC = \"gpustack\"\n\n# Optional: To include extra dependencies (vllm, audio, all) or install a specific version\n# PACKAGE_SPEC=\"gpustack[all]\"\n# PACKAGE_SPEC=\"gpustack==0.4.0\"\nPACKAGE_SPEC=\"gpustack\"\n\n# Download all required packages\npip wheel $PACKAGE_SPEC -w gpustack_offline_packages\n\n# Install GPUStack to access its CLI\npip install gpustack\n\n# Download dependency tools and save them as an archive\ngpustack download-tools --save-archive gpustack_offline_tools.tar.gz\n
Optional: Additional Dependencies for macOS.
# Deploying the speech-to-text CosyVoice model on macOS requires additional dependencies.\nbrew install openfst\nCPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include\nLIBRARY_PATH=$(brew --prefix openfst)/lib\n\nAUDIO_DEPENDENCY_PACKAGE_SPEC=\"wetextprocessing\"\npip wheel $AUDIO_DEPENDENCY_PACKAGE_SPEC -w gpustack_audio_dependency_offline_packages\nmv gpustack_audio_dependency_offline_packages/* gpustack_offline_packages/ && rm -rf gpustack_audio_dependency_offline_packages\n
Note
This instruction assumes that the online environment uses the same GPU type as the air-gapped environment. If the GPU types differ, use the --device
flag to specify the device type for the air-gapped environment. Refer to the download-tools command for more information.
"},{"location":"installation/air-gapped-installation/#step-2-transfer-the-packages","title":"Step 2: Transfer the Packages","text":"Transfer the following files from the online environment to the air-gapped environment.
gpustack_offline_packages
directory. gpustack_offline_tools.tar.gz
file.
"},{"location":"installation/air-gapped-installation/#step-3-install-gpustack","title":"Step 3: Install GPUStack","text":"In the air-gapped environment, run the following commands:
# Install GPUStack from the downloaded packages\npip install --no-index --find-links=gpustack_offline_packages gpustack\n\n# Load and apply the pre-downloaded tools archive\ngpustack download-tools --load-archive gpustack_offline_tools.tar.gz\n
Optional: Additional Dependencies for macOS.
# Install the additional dependencies for speech-to-text CosyVoice model on macOS.\nbrew install openfst\n\npip install --no-index --find-links=gpustack_offline_packages wetextprocessing\n
Now you can run GPUStack by following the instructions in the Manual Installation guide.
"},{"location":"installation/docker-installation/","title":"Docker Installation","text":"You can use the official Docker image to run GPUStack in a container. Installation using docker is supported on:
- Linux with Nvidia GPUs
"},{"location":"installation/docker-installation/#prerequisites","title":"Prerequisites","text":" - Docker
- Nvidia Container Toolkit
"},{"location":"installation/docker-installation/#run-gpustack-with-docker","title":"Run GPUStack with Docker","text":"Run the following command to start the GPUStack server:
docker run -d --gpus all -p 80:80 --ipc=host \\\n -v gpustack-data:/var/lib/gpustack gpustack/gpustack\n
Note
You can either use the --ipc=host
flag or --shm-size
flag to allow the container to access the host\u2019s shared memory. It is used by vLLM and PyTorch to share data between processes under the hood, particularly for tensor parallel inference.
You can set additional flags for the gpustack start
command by appending them to the docker run command.
For example, to start a GPUStack worker:
docker run -d --gpus all --ipc=host --network=host \\\n gpustack/gpustack --server-url http://myserver --token mytoken\n
Note
The --network=host
flag is used to ensure that the server is accessible to the worker and the inference services running on it. Alternatively, you can set --worker-ip <host-ip> -p 10150:10150 -p 40000-41024:40000-41024
to expose relevant ports.
For configuration details, please refer to the CLI Reference.
"},{"location":"installation/docker-installation/#run-gpustack-with-docker-compose","title":"Run GPUStack with Docker Compose","text":"Get the docker-compose file from GPUStack repository, run the following command to start the GPUStack server:
docker-compose up -d\n
You can update the docker-compose.yml
file to customize the command while starting a GPUStack worker.
"},{"location":"installation/docker-installation/#build-your-own-docker-image","title":"Build Your Own Docker Image","text":"The official Docker image is built with CUDA 12.4. If you want to use a different version of CUDA, you can build your own Docker image.
# Example Dockerfile\nARG CUDA_VERSION=12.4.1\n\nFROM nvidia/cuda:$CUDA_VERSION-cudnn-runtime-ubuntu22.04\n\nENV DEBIAN_FRONTEND=noninteractive\n\nRUN apt-get update && apt-get install -y \\\n wget \\\n tzdata \\\n python3 \\\n python3-pip \\\n && rm -rf /var/lib/apt/lists/*\n\n\nRUN pip3 install gpustack[all] && \\\n pip3 cache purge\n\nENTRYPOINT [ \"gpustack\", \"start\" ]\n
Run the following command to build the Docker image:
docker build -t my/gpustack --build-arg CUDA_VERSION=12.0.0 .\n
"},{"location":"installation/installation-requirements/","title":"Installation Requirements","text":"This page describes the software and networking requirements for the nodes where GPUStack will be installed.
"},{"location":"installation/installation-requirements/#python-requirements","title":"Python Requirements","text":"GPUStack requires Python version 3.10 or higher.
"},{"location":"installation/installation-requirements/#operating-system-requirements","title":"Operating System Requirements","text":"GPUStack is supported on the following operating systems:
- macOS
- Windows
- Linux
GPUStack has been tested and verified to work on the following operating systems:
OS Versions Windows 10, 11 Ubuntu >= 20.04 Debian >= 11 RHEL >= 8 Rocky >= 8 Fedora >= 36 OpenSUSE >= 15.3 (leap) OpenEuler >= 22.03 Note
The installation of GPUStack worker on a Linux system requires that the GLIBC version be 2.29 or higher. If your system uses a lower GLIBC version, consider using the Docker Installation method as an alternative.
"},{"location":"installation/installation-requirements/#supported-architectures","title":"Supported Architectures","text":"GPUStack supports both AMD64 and ARM64 architectures, with the following notes:
- On Linux and macOS, when using Python versions below 3.12, ensure that the installed Python distribution corresponds to your system architecture.
- On Windows, please use the AMD64 distribution of Python, as wheel packages for certain dependencies are unavailable for ARM64. If you use tools like
conda
, this will be handled automatically, as conda installs the AMD64 distribution by default.
"},{"location":"installation/installation-requirements/#accelerator-runtime-requirements","title":"Accelerator Runtime Requirements","text":"GPUStack supports the following accelerators:
- Apple Metal (M-series chips)
- NVIDIA CUDA (Compute Capability 6.0 and above)
- Ascend CANN
- Moore Threads MUSA
Ensure all necessary drivers and libraries are installed on the system prior to installing GPUStack.
"},{"location":"installation/installation-requirements/#nvidia-cuda","title":"NVIDIA CUDA","text":"To use NVIDIA CUDA as an accelerator, ensure the following components are installed:
- NVIDIA CUDA Toolkit
- NVIDIA cuBLAS (Optional, required for audio models)
- NVIDIA cuDNN (Optional, required for audio models)
- NVIDIA Container Toolkit (Optional, required for docker installation)
"},{"location":"installation/installation-requirements/#ascend-cann","title":"Ascend CANN","text":"For Ascend CANN as an accelerator, ensure the following components are installed:
- Ascend NPU driver & firmware
- Ascend CANN Toolkit & kernels
"},{"location":"installation/installation-requirements/#musa","title":"MUSA","text":"To use Moore Threads MUSA as an accelerator, ensure the following components are installed:
- MUSA SDK
- MT Container Toolkits (Optional, required for docker installation)
"},{"location":"installation/installation-requirements/#networking-requirements","title":"Networking Requirements","text":""},{"location":"installation/installation-requirements/#connectivity-requirements","title":"Connectivity Requirements","text":"The following network connectivity is required to ensure GPUStack functions properly:
Server-to-Worker: The server must be able to reach the workers for proxying inference requests.
Worker-to-Server: Workers must be able to reach the server to register themselves and send updates.
Worker-to-Worker: Necessary for distributed inference across multiple workers
"},{"location":"installation/installation-requirements/#port-requirements","title":"Port Requirements","text":"GPUStack uses the following ports for communication:
Server Ports
Port Description TCP 80 Default port for the GPUStack UI and API endpoints TCP 443 Default port for the GPUStack UI and API endpoints (when TLS is enabled) Worker Ports
Port Description TCP 10150 Default port for the GPUStack worker TCP 10151 Default port for exposing metrics TCP 40000-41024 Port range allocated for inference services"},{"location":"installation/installation-script/","title":"Installation Script","text":""},{"location":"installation/installation-script/#linux-and-macos","title":"Linux and macOS","text":"You can use the installation script available at https://get.gpustack.ai
to install GPUStack as a service on systemd and launchd based systems.
You can set additional environment variables and CLI flags when running the script. The following are examples running the installation script with different configurations:
# Run server.\ncurl -sfL https://get.gpustack.ai | sh -s -\n\n# Run server without the embedded worker.\ncurl -sfL https://get.gpustack.ai | sh -s - --disable-worker\n\n# Run server with TLS.\ncurl -sfL https://get.gpustack.ai | sh -s - --ssl-keyfile /path/to/keyfile --ssl-certfile /path/to/certfile\n\n# Run server with external postgresql database.\ncurl -sfL https://get.gpustack.ai | sh -s - --database-url \"postgresql://username:password@host:port/database_name\"\n\n# Run worker with specified IP.\ncurl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken --worker-ip 192.168.1.100\n\n# Install with a custom index URL.\ncurl -sfL https://get.gpustack.ai | INSTALL_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple sh -s -\n\n# Install a custom wheel package other than releases form pypi.org.\ncurl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=https://repo.mycompany.com/my-gpustack.whl sh -s -\n\n# Install a specific version with extra audio dependencies.\ncurl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=gpustack[audio]==0.4.0 sh -s -\n
"},{"location":"installation/installation-script/#windows","title":"Windows","text":"You can use the installation script available at https://get.gpustack.ai
to install GPUStack as a service on Windows Service Manager.
You can set additional environment variables and CLI flags when running the script. The following are examples running the installation script with different configurations:
# Run server.\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Run server without the embedded worker.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --disable-worker\"\n\n# Run server with TLS.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --ssl-keyfile 'C:\\path\\to\\keyfile' --ssl-certfile 'C:\\path\\to\\certfile'\"\n\n\n# Run server with external postgresql database.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --database-url 'postgresql://username:password@host:port/database_name'\"\n\n# Run worker with specified IP.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --worker-ip '192.168.1.100'\"\n\n# Run worker with customize reserved resource.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --system-reserved '{\"\"ram\"\":5, \"\"vram\"\":5}'\"\n\n# Install with a custom index URL.\n$env:INSTALL_INDEX_URL = \"https://pypi.tuna.tsinghua.edu.cn/simple\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Install a custom wheel package other than releases form pypi.org.\n$env:INSTALL_PACKAGE_SPEC = \"https://repo.mycompany.com/my-gpustack.whl\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Install a specific version with extra audio dependencies.\n$env:INSTALL_PACKAGE_SPEC = \"gpustack[audio]==0.4.0\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
Warning
Avoid using PowerShell ISE as it is not compatible with the installation script.
"},{"location":"installation/installation-script/#available-environment-variables-for-the-installation-script","title":"Available Environment Variables for the Installation Script","text":"Name Default Description INSTALL_INDEX_URL
(empty) Base URL of the Python Package Index. INSTALL_PACKAGE_SPEC
gpustack[all]
or gpustack[audio]
The package spec to install. The install script will automatically decide based on the platform. It supports PYPI package names, URLs, and local paths. See the pip install documentation for details. gpustack[all]
: With all inference backends: llama-box, vllm, vox-box.gpustack[vllm]
: With inference backends: llama-box, vllm.gpustack[audio]
: With inference backends: llama-box, vox-box.
INSTALL_SKIP_POST_CHECK
(empty) If set to 1, the installation script will skip the post-installation check."},{"location":"installation/installation-script/#set-environment-variables-for-the-gpustack-service","title":"Set Environment Variables for the GPUStack Service","text":"You can set environment variables for the GPUStack service in an environment file located at:
- Linux and macOS:
/etc/default/gpustack
- Windows:
$env:APPDATA\\gpustack\\gpustack.env
The following is an example of the content of the file:
HF_TOKEN=\"mytoken\"\nHF_ENDPOINT=\"https://my-hf-endpoint\"\n
Note
Unlike Systemd, Launchd and Windows services do not natively support reading environment variables from a file. Configuration via the environment file is implemented by the installation script. It reads the file and applies the variables to the service configuration. After modifying the environment file on Windows and macOS, you need to re-run the installation script to apply changes to the GPUStack service.
"},{"location":"installation/installation-script/#available-cli-flags","title":"Available CLI Flags","text":"The appended CLI flags of the installation script are passed directly as flags for the gpustack start
command. You can refer to the CLI Reference for details.
"},{"location":"installation/installation-script/#install-server","title":"Install Server","text":"To set up the GPUStack server (the management node), install GPUStack without the --server-url
flag. By default, the GPUStack server includes an embedded worker. To disable this embedded worker on the server, use the --disable-worker
flag.
"},{"location":"installation/installation-script/#install-worker","title":"Install Worker","text":"To form a cluster, you can add GPUStack workers on additional nodes. Install GPUStack with the --server-url
flag to specify the server' address and the --token
flag for worker authenticate.
Examples are as follows:
"},{"location":"installation/installation-script/#linux-or-macos","title":"Linux or macOS","text":"curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n
In the default setup, you can run the following on the server node to get the token used for adding workers:
cat /var/lib/gpustack/token\n
"},{"location":"installation/installation-script/#windows_1","title":"Windows","text":"Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url http://myserver --token mytoken\"\n
In the default setup, you can run the following on the server node to get the token used for adding workers:
Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n
"},{"location":"installation/manual-installation/","title":"Manual Installation","text":""},{"location":"installation/manual-installation/#prerequites","title":"Prerequites:","text":"Install python3.10 or above with pip.
"},{"location":"installation/manual-installation/#install-gpustack-cli","title":"Install GPUStack CLI","text":"Run the following to install GPUStack:
# You can add extra dependencies, options are \"vllm\", \"audio\" and \"all\".\n# e.g., gpustack[all]\npip install gpustack\n
To verify, run:
gpustack version\n
"},{"location":"installation/manual-installation/#run-gpustack","title":"Run GPUStack","text":"Run the following command to start the GPUStack server:
gpustack start\n
By default, GPUStack uses /var/lib/gpustack
as the data directory so you need sudo
or proper permission for that. You can also set a custom data directory by running:
gpustack start --data-dir mypath\n
"},{"location":"installation/manual-installation/#run-gpustack-as-a-system-service","title":"Run GPUStack as a System Service","text":"A recommended way is to run GPUStack as a startup service. For example, using systemd:
Create a service file in /etc/systemd/system/gpustack.service
:
[Unit]\nDescription=GPUStack Service\nWants=network-online.target\nAfter=network-online.target\n\n[Service]\nEnvironmentFile=-/etc/default/%N\nExecStart=gpustack start\nRestart=always\nRestartSec=3\nStandardOutput=append:/var/log/gpustack.log\nStandardError=append:/var/log/gpustack.log\n\n[Install]\nWantedBy=multi-user.target\n
Then start GPUStack:
systemctl daemon-reload\nsystemctl enable gpustack\n
"},{"location":"installation/uninstallation/","title":"Uninstallation","text":""},{"location":"installation/uninstallation/#uninstallation-script","title":"Uninstallation Script","text":"Warning
Uninstallation script deletes the data in local datastore(sqlite), configuration, model cache, and all of the scripts and CLI tools. It does not remove any data from external datastores.
If you installed GPUStack using the installation script, a script to uninstall GPUStack was generated during installation.
"},{"location":"installation/uninstallation/#linux-or-macos","title":"Linux or macOS","text":"Run the following command to uninstall GPUStack:
sudo /var/lib/gpustack/uninstall.sh\n
"},{"location":"installation/uninstallation/#windows","title":"Windows","text":"Run the following command in PowerShell to uninstall GPUStack:
Set-ExecutionPolicy Bypass -Scope Process -Force; & \"$env:APPDATA\\gpustack\\uninstall.ps1\"\n
"},{"location":"installation/uninstallation/#manual-uninstallation","title":"Manual Uninstallation","text":"If you install GPUStack manually, the followings are example commands to uninstall GPUStack. You can modify according to your setup:
# Stop and remove the service.\nsystemctl stop gpustack.service\nrm /etc/systemd/system/gpustack.service\nsystemctl daemon-reload\n# Uninstall the CLI.\npip uninstall gpustack\n# Remove the data directory.\nrm -rf /var/lib/gpustack\n
"},{"location":"tutorials/creating-text-embeddings/","title":"Creating Text Embeddings","text":"Text embeddings are numerical representations of text that capture semantic meaning, enabling machines to understand relationships and similarities between different pieces of text. In essence, they transform text into vectors in a continuous space, where texts with similar meanings are positioned closer together. Text embeddings are widely used in applications such as natural language processing, information retrieval, and recommendation systems.
In this tutorial, we will demonstrate how to deploy embedding models in GPUStack and generate text embeddings using the deployed models.
"},{"location":"tutorials/creating-text-embeddings/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
- Access to Hugging Face for downloading the model files.
"},{"location":"tutorials/creating-text-embeddings/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Enable the
GGUF
checkbox to filter models by GGUF format. - Use the search bar in the top left to search for the model name
CompendiumLabs/bge-small-en-v1.5-gguf
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/creating-text-embeddings/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"We will use the GPUStack API to generate text embeddings, and an API key is required:
- Navigate to the
API Keys
page in the GPUStack UI. - Click the
New API Key
button. - Enter a name for the API key and click the
Save
button. - Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/creating-text-embeddings/#step-3-generate-text-embeddings","title":"Step 3: Generate Text Embeddings","text":"With the model deployed and an API key, you can generate text embeddings via the GPUStack API. Here is an example script using curl
:
export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1-openai/embeddings \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"input\": \"The food was delicious and the waiter...\",\n \"model\": \"bge-small-en-v1.5\",\n \"encoding_format\": \"float\"\n }'\n
Replace <your-server-url>
with the URL of your GPUStack server and <your-api-key>
with the API key you generated in the previous step.
Example response:
{\n \"data\": [\n {\n \"embedding\": [\n -0.012189436703920364, 0.016934078186750412, 0.003965042531490326,\n -0.03453584015369415, -0.07623119652271271, -0.007116147316992283,\n 0.11278388649225235, 0.019714849069714546, 0.010370955802500248,\n -0.04219457507133484, -0.029902394860982895, 0.01122555136680603,\n 0.022912170737981796, 0.031186765059828758, 0.006303929258137941,\n # ... additional values\n ],\n \"index\": 0,\n \"object\": \"embedding\"\n }\n ],\n \"model\": \"bge-small-en-v1.5\",\n \"object\": \"list\",\n \"usage\": { \"prompt_tokens\": 12, \"total_tokens\": 12 }\n}\n
"},{"location":"tutorials/inference-on-cpus/","title":"Inference on CPUs","text":"GPUStack supports inference on CPUs, offering flexibility when GPU resources are limited or when model sizes exceed available GPU memory. The following CPU inference modes are available:
- CPU+GPU Hybrid Inference: Enables partial acceleration by offloading portions of large models to the CPU when VRAM capacity is insufficient.
- Full CPU Inference: Operates entirely on CPU when no GPU resources are available.
Note
CPU inference is supported when using the llama-box (llama.cpp) backend.
To deploy a model with CPU offloading, enable the Allow CPU Offloading
option in the deployment configuration (this setting is enabled by default).
After deployment, you can view the number of model layers offloaded to the CPU.
"},{"location":"tutorials/inference-with-function-calling/","title":"Inference with Function Calling","text":"Function calling allows you to connect models to external tools and systems. This is useful for many things such as empowering AI assistants with capabilities, or building deep integrations between your applications and the models.
In this tutorial, you\u2019ll learn how to set up and use function calling within GPUStack to extend your AI\u2019s capabilities.
Note
- Function calling is supported in the vLLM inference backend.
- Function calling is essentially achieved through prompt engineering: a model must be trained with the corresponding templates to internalize this capability. Therefore, not all LLMs support function calling.
"},{"location":"tutorials/inference-with-function-calling/#prerequisites","title":"Prerequisites","text":"Before proceeding, ensure the following:
- GPUStack is installed and running.
- A Linux worker node with a GPU is available. We'll use Qwen2.5-7B-Instruct as the model for this tutorial. The model requires a GPU with at least 18GB VRAM.
- Access to Hugging Face for downloading the model files.
"},{"location":"tutorials/inference-with-function-calling/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":" - Navigate to the
Models
page in the GPUStack UI and click the Deploy Model
button. In the dropdown, select Hugging Face
as the source for your model. - Use the search bar to find the
Qwen/Qwen2.5-7B-Instruct
model. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button and add the following parameters:
--enable-auto-tool-choice
--tool-call-parser=hermes
- Click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/inference-with-function-calling/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:
- Navigate to the
API Keys
page in the GPUStack UI. - Click the
New API Key
button. - Enter a name for the API key and click the
Save
button. - Copy the generated API key for later use.
"},{"location":"tutorials/inference-with-function-calling/#step-3-do-inference","title":"Step 3: Do Inference","text":"With the model deployed and an API key, you can call the model via the GPUStack API. Here is an example script using curl
(replace <your-server-url>
with your GPUStack server URL and <your-api-key>
with the API key generated in the previous step):
export GPUSTACK_SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $GPUSTACK_SERVER_URL/v1-openai/chat/completions \\\n-H \"Content-Type: application/json\" \\\n-H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n-d '{\n \"model\": \"qwen2.5-7b-instruct\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"What'\\''s the weather like in Boston today?\"\n }\n ],\n \"tools\": [\n {\n \"type\": \"function\",\n \"function\": {\n \"name\": \"get_current_weather\",\n \"description\": \"Get the current weather in a given location\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"location\": {\n \"type\": \"string\",\n \"description\": \"The city and state, e.g. San Francisco, CA\"\n },\n \"unit\": {\n \"type\": \"string\",\n \"enum\": [\"celsius\", \"fahrenheit\"]\n }\n },\n \"required\": [\"location\"]\n }\n }\n }\n ],\n \"tool_choice\": \"auto\"\n}'\n
Example response:
{\n \"model\": \"qwen2.5-7b-instruct\",\n \"choices\": [\n {\n \"index\": 0,\n \"message\": {\n \"role\": \"assistant\",\n \"content\": null,\n \"tool_calls\": [\n {\n \"id\": \"chatcmpl-tool-b99d32848b324eaea4bac5a5830d00b8\",\n \"type\": \"function\",\n \"function\": {\n \"name\": \"get_current_weather\",\n \"arguments\": \"{\\\"location\\\": \\\"Boston, MA\\\", \\\"unit\\\": \\\"fahrenheit\\\"}\"\n }\n }\n ]\n },\n \"finish_reason\": \"tool_calls\"\n }\n ],\n \"usage\": {\n \"prompt_tokens\": 212,\n \"total_tokens\": 242,\n \"completion_tokens\": 30\n }\n}\n
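The same request can also be issued from application code. Below is a minimal sketch using the openai Python client, which parses the returned tool calls for you; the server URL, API key, and tool definition mirror the curl example above and are placeholders:
from openai import OpenAI\n\nclient = OpenAI(base_url='http://your-gpustack-server/v1-openai', api_key='your-api-key')\n\n# Describe the tool the model is allowed to call.\ntools = [{\n    'type': 'function',\n    'function': {\n        'name': 'get_current_weather',\n        'description': 'Get the current weather in a given location',\n        'parameters': {\n            'type': 'object',\n            'properties': {\n                'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'},\n                'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']},\n            },\n            'required': ['location'],\n        },\n    },\n}]\n\ncompletion = client.chat.completions.create(\n    model='qwen2.5-7b-instruct',\n    messages=[{'role': 'user', 'content': \"What's the weather like in Boston today?\"}],\n    tools=tools,\n    tool_choice='auto',\n)\n\n# The model answers with a structured tool call instead of plain text.\nfor call in completion.choices[0].message.tool_calls:\n    print(call.function.name, call.function.arguments)\n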
"},{"location":"tutorials/performing-distributed-inference-across-workers/","title":"Performing Distributed Inference Across Workers","text":"This tutorial will guide you through the process of configuring and running distributed inference across multiple workers using GPUStack. Distributed inference allows you to handle larger language models by distributing the computational workload among multiple workers. This is particularly useful when individual workers do not have sufficient resources, such as VRAM, to run the entire model independently.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#prerequisites","title":"Prerequisites","text":"Before proceeding, ensure the following:
- GPUStack is installed and running. Refer to the Setting Up a Multi-node GPUStack Cluster tutorial if needed.
- Access to Hugging Face for downloading the model files.
In this tutorial, we\u2019ll assume a cluster with two nodes, each equipped with an NVIDIA P40 GPU (22GB VRAM), as shown in the following image:
We aim to run a large language model that requires more VRAM than a single worker can provide. For this tutorial, we\u2019ll use the Qwen/Qwen2.5-72B-Instruct
model with the q2_k
quantization format. The required resources for running this model can be estimated using the gguf-parser tool:
$ gguf-parser --hf-repo Qwen/Qwen2.5-72B-Instruct-GGUF --hf-file qwen2.5-72b-instruct-q2_k-00001-of-00007.gguf --ctx-size=8192 --in-short --skip-architecture --skip-metadata --skip-tokenizer\n\n+--------------------------------------------------------------------------------------+\n| ESTIMATE |\n+----------------------------------------------+---------------------------------------+\n| RAM | VRAM 0 |\n+--------------------+------------+------------+----------------+----------+-----------+\n| LAYERS (I + T + O) | UMA | NONUMA | LAYERS (T + O) | UMA | NONUMA |\n+--------------------+------------+------------+----------------+----------+-----------+\n| 1 + 0 + 0 | 243.89 MiB | 393.89 MiB | 80 + 1 | 2.50 GiB | 28.92 GiB |\n+--------------------+------------+------------+----------------+----------+-----------+\n
From the output, we can see that the estimated VRAM requirement for this model exceeds the 22GB VRAM available on each worker node. Thus, we need to distribute the inference across multiple workers to successfully run the model.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"Follow these steps to deploy the model from Hugging Face, enabling distributed inference:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Enable the
GGUF
checkbox to filter models by GGUF format. - Use the search bar in the top left to search for the model name
Qwen/Qwen2.5-72B-Instruct-GGUF
. - In the
Available Files
section, select the q2_k
quantization format. - Expand the
Advanced
section and scroll down. Disable the Allow CPU Offloading
option and verify that the Allow Distributed Inference Across Workers
option is enabled (it is enabled by default). GPUStack will evaluate the available resources in the cluster and run the model in a distributed manner if required. - Click the
Save
button to deploy the model.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-2-verify-the-model-deployment","title":"Step 2: Verify the Model Deployment","text":"Once the model is deployed, verify the deployment on the Models
page, where you can view details about how the model is running across multiple workers.
You can also check worker and GPU resource usage by navigating to the Resources
page.
Finally, go to the Playground
page to interact with the model and verify that everything is functioning correctly.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#conclusion","title":"Conclusion","text":"Congratulations! You have successfully configured and run distributed inference across multiple workers using GPUStack.
"},{"location":"tutorials/running-inference-with-ascend-npus/","title":"Running Inference With Ascend NPUs","text":"GPUStack supports running inference on Ascend NPUs. This tutorial will guide you through the configuration steps.
"},{"location":"tutorials/running-inference-with-ascend-npus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Status Verified Linux Support Ubuntu 20.04 Device Status Verified Ascend 910 Support Ascend 910B"},{"location":"tutorials/running-inference-with-ascend-npus/#setup-steps","title":"Setup Steps","text":""},{"location":"tutorials/running-inference-with-ascend-npus/#install-ascend-packages","title":"Install Ascend packages","text":" - Download Ascend packages
Choose the packages according to your system and hardware. GPUStack is compatible with CANN 8.x. Download the packages from the resources download center (links below).
Download the driver and firmware from here.
Package Name Description Ascend-hdk-{chiptype}-npu-driver{version}_linux-{arch}.run Ascend Driver (run format) Ascend-hdk-{chiptype}-npu-firmware{version}.run Ascend Firmware (run format) Download the toolkit and kernels from here.
Package Name Description Ascend-cann-toolkit_{version}_linux-{arch}.run CANN Toolkit (run format) Ascend-cann-kernels-{chiptype}_{version}_linux-{arch}.run CANN Kernels (run format) - Create the user and group for running the Ascend software
sudo groupadd HwHiAiUser\nsudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash\nsudo usermod -aG HwHiAiUser $USER\n
- Install driver
sudo chmod +x Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run\n# Driver installation, default installation path: \"/usr/local/Ascend\"\nsudo sh Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all\n
If you see the following message, the driver installation is complete:
Driver package installed successfully!\n
- Verify successful driver installation
After the driver is installed successfully, run the npu-smi info
command to check if the driver was installed correctly.
$npu-smi info\n+------------------------------------------------------------------------------------------------+\n| npu-smi 23.0.1 Version: 23.0.1 |\n+---------------------------+---------------+----------------------------------------------------+\n| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|\n| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |\n+===========================+===============+====================================================+\n| 4 910B3 | OK | 93.6 40 0 / 0 |\n| 0 | 0000:01:00.0 | 0 0 / 0 3161 / 65536 |\n+===========================+===============+====================================================+\n+---------------------------+---------------+----------------------------------------------------+\n| NPU Chip | Process id | Process name | Process memory(MB) |\n+===========================+===============+====================================================+\n| No running processes found in NPU 4 |\n+===========================+===============+====================================================+\n
- Install firmware
sudo chmod +x Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run\nsudo sh Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run --full\n
If you see the following message, the firmware installation is complete:
Firmware package installed successfully!\n
- Install toolkit and kernels
The following uses Ubuntu as an example; adapt the commands to your system.
Check for dependencies to ensure Python, GCC, and other required tools are installed.
gcc --version\ng++ --version\nmake --version\ncmake --version\ndpkg -l zlib1g| grep zlib1g| grep ii\ndpkg -l zlib1g-dev| grep zlib1g-dev| grep ii\ndpkg -l libsqlite3-dev| grep libsqlite3-dev| grep ii\ndpkg -l openssl| grep openssl| grep ii\ndpkg -l libssl-dev| grep libssl-dev| grep ii\ndpkg -l libffi-dev| grep libffi-dev| grep ii\ndpkg -l libbz2-dev| grep libbz2-dev| grep ii\ndpkg -l libxslt1-dev| grep libxslt1-dev| grep ii\ndpkg -l unzip| grep unzip| grep ii\ndpkg -l pciutils| grep pciutils| grep ii\ndpkg -l net-tools| grep net-tools| grep ii\ndpkg -l libblas-dev| grep libblas-dev| grep ii\ndpkg -l gfortran| grep gfortran| grep ii\ndpkg -l libblas3| grep libblas3| grep ii\n
If the commands return messages showing missing packages, install them as follows (adjust the command if only specific packages are missing):
sudo apt-get install -y gcc g++ make cmake zlib1g zlib1g-dev openssl libsqlite3-dev libssl-dev libffi-dev libbz2-dev libxslt1-dev unzip pciutils net-tools libblas-dev gfortran libblas3\n
Install Python dependencies:
pip3 install --upgrade pip\npip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions\n
Install the toolkit and kernels:
chmod +x Ascend-cann-toolkit_{version}_linux-{arch}.run\nchmod +x Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run\n\nsh Ascend-cann-toolkit_{version}_linux-{arch}.run --install\nsh Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run --install\n
Once installation completes, you should see a success message like this:
xxx install success\n
- Configure environment variables
echo \"source ~/Ascend/ascend-toolkit/set_env.sh\" >> ~/.bashrc\nsource ~/.bashrc\n
For more details, refer to the Ascend Documentation.
"},{"location":"tutorials/running-inference-with-ascend-npus/#installing-gpustack","title":"Installing GPUStack","text":"Once your environment is ready, you can install GPUStack following the installation guide.
Once installed, you should see that GPUStack successfully recognizes the Ascend device on the Resources page.
"},{"location":"tutorials/running-inference-with-ascend-npus/#running-inference","title":"Running Inference","text":"After installation, you can deploy models and run inference. Refer to the model management for usage details.
The Ascend NPU supports inference through the llama-box (llama.cpp) backend. For supported models, see the llama.cpp Ascend NPU supported models.
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/","title":"Running Inference With Moore Threads GPUs","text":"GPUStack supports running inference on Moore Threads GPUs. This tutorial provides a comprehensive guide to configuring your system for optimal performance.
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Architecture Status Verified Linux x86_64 Support Ubuntu 20.04/22.04 Device Status Verified MTT S80 Support Yes MTT S3000 Support Yes MTT S4000 Support Yes"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#prerequisites","title":"Prerequisites","text":"The following instructions are applicable for Ubuntu 20.04/22.04
systems with x86_64
architecture.
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#configure-the-container-runtime","title":"Configure the Container Runtime","text":"Follow these links to install and configure the container runtime:
- Install Docker: Docker Installation Guide
- Install the latest drivers for MTT S80/S3000/S4000 (currently rc3.1.0): MUSA SDK Download
- Install the MT Container Toolkits (currently v1.9.0): MT CloudNative Toolkits Download
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#verify-container-runtime-configuration","title":"Verify Container Runtime Configuration","text":"Ensure the output shows the default runtime as mthreads
.
$ (cd /usr/bin/musa && sudo ./docker setup $PWD)\n$ docker info | grep mthreads\n Runtimes: mthreads mthreads-experimental runc\n Default Runtime: mthreads\n
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#installing-gpustack","title":"Installing GPUStack","text":"To set up an isolated environment for GPUStack, we recommend using Docker.
docker run -d --name gpustack-musa -p 9009:80 --ipc=host -v gpustack-data:/var/lib/gpustack \\\n gpustack/gpustack:main-musa\n
This command will:
- Start a container with the GPUStack image.
- Expose the GPUStack web interface on port
9009
. - Mount the
gpustack-data
volume to store the GPUStack data.
To check the logs of the running container, use the following command:
docker logs -f gpustack-musa\n
If the following message appears, the GPUStack container is running successfully:
2024-11-15T23:37:46+00:00 - gpustack.server.server - INFO - Serving on 0.0.0.0:80.\n2024-11-15T23:37:46+00:00 - gpustack.worker.worker - INFO - Starting GPUStack worker.\n
Once the container is running, access the GPUStack web interface by navigating to http://localhost:9009
in your browser.
After the initial setup for GPUStack, you should see the following screen:
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#dashboard","title":"Dashboard","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#workers","title":"Workers","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#gpus","title":"GPUs","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#running-inference","title":"Running Inference","text":"After installation, you can deploy models and run inference. Refer to the model management for detailed usage instructions.
Moore Threads GPUs support inference through the llama-box (llama.cpp) backend. Most recent models are supported (e.g., llama3.2:1b, llama3.2-vision:11b, and qwen2.5:7b).
Use mthreads-gmi
to verify that the model has been offloaded to the GPU.
root@a414c45864ee:/# mthreads-gmi\nSat Nov 16 12:00:16 2024\n---------------------------------------------------------------\n mthreads-gmi:1.14.0 Driver Version:2.7.0\n---------------------------------------------------------------\nID Name |PCIe |%GPU Mem\n Device Type |Pcie Lane Width |Temp MPC Capable\n | ECC Mode\n+-------------------------------------------------------------+\n0 MTT S80 |00000000:01:00.0 |98% 1339MiB(16384MiB)\n Physical |16x(16x) |56C YES\n | N/A\n---------------------------------------------------------------\n\n---------------------------------------------------------------\nProcesses:\nID PID Process name GPU Memory\n Usage\n+-------------------------------------------------------------+\n0 120 ...ird_party/bin/llama-box/llama-box 2MiB\n0 2022 ...ird_party/bin/llama-box/llama-box 1333MiB\n---------------------------------------------------------------\n
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/","title":"Running Inference on Copilot+ PCs with Snapdragon X","text":"GPUStack supports running on ARM64 Windows, enabling use on Snapdragon X-based Copilot+ PCs.
Note
Only CPU-based inference is supported on Snapdragon X devices. GPUStack does not currently support GPU or NPU acceleration on this platform.
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#prerequisites","title":"Prerequisites","text":" - A Copilot+ PC with Snapdragon X. In this tutorial, we use the Dell XPS 13 9345.
- Install AMD64 Python (version 3.10 to 3.12). See details
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#installing-gpustack","title":"Installing GPUStack","text":"Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:
Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
After installation, follow the on-screen instructions to obtain credentials and log in to the GPUStack UI.
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#deploying-a-model","title":"Deploying a Model","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button and select Ollama Library
from the dropdown. - Enter
llama3.2
in the Name
field. - Select
llama3.2
from the Ollama Model
dropdown. - Click
Save
to deploy the model.
Once deployed, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#running-inference","title":"Running Inference","text":"Navigate to the Playground
page in the GPUStack UI, where you can interact with the deployed model.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/","title":"Setting Up a Multi-node GPUStack Cluster","text":"This tutorial will guide you through setting up a multi-node GPUStack cluster, where you can distribute your workloads across multiple GPU-enabled nodes. This guide assumes you have basic knowledge of running commands on Linux, macOS, or Windows systems.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#prerequisites","title":"Prerequisites","text":"Before starting, ensure you have the following:
- Multiple nodes with supported OS and GPUs for GPUStack installation. View supported platforms and supported accelerators for more information.
- Nodes are connected to the same network and can communicate with each other.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-1-install-gpustack-on-the-server-node","title":"Step 1: Install GPUStack on the Server Node","text":"First, you need to install GPUStack on one of the nodes to act as the server node. Follow the instructions below based on your operating system.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos","title":"Linux or macOS","text":"Run the following command on your server node:
curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows","title":"Windows","text":"Run PowerShell as administrator and execute the following command:
Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
Once GPUStack is installed, you can proceed to configure your cluster by adding worker nodes.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-2-retrieve-the-token-from-the-server-node","title":"Step 2: Retrieve the Token from the Server Node","text":"To add worker nodes to the cluster, you need the token generated by GPUStack on the server node. On the server node, run the following command to get the token:
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos_1","title":"Linux or macOS","text":"cat /var/lib/gpustack/token\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows_1","title":"Windows","text":"Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n
This token will be required in the next steps to authenticate worker nodes.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-3-add-worker-nodes-to-the-cluster","title":"Step 3: Add Worker Nodes to the Cluster","text":"Now, you will install GPUStack on additional nodes (worker nodes) and connect them to the server node using the token.
Linux or macOS Worker Nodes
Run the following command on each worker node, replacing http://myserver with the URL of your server node and mytoken with the token retrieved in Step 2:
curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n
Windows Worker Nodes
Run PowerShell as administrator on each worker node and use the following command, replacing http://myserver and mytoken with the server URL and token:
Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } --server-url http://myserver --token mytoken\"\n
Once the command is executed, each worker node will connect to the main server and become part of the GPUStack cluster.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-4-verify-the-cluster-setup","title":"Step 4: Verify the Cluster Setup","text":"After adding the worker nodes, you can verify that the cluster is set up correctly by accessing the GPUStack UI.
- Open a browser and navigate to
http://myserver
(replace myserver with the actual server URL). - Log in with the default credentials (username
admin
). To retrieve the default password, run the following command on the server node:
Linux or macOS
cat /var/lib/gpustack/initial_admin_password\n
Windows
Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
- After logging in, navigate to the
Resources
page in the UI to see all connected nodes and their GPUs. You should see your worker nodes listed and ready for serving LLMs.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#conclusion","title":"Conclusion","text":"Congratulations! You've successfully set up a multi-node GPUStack cluster! You can now scale your workloads across multiple nodes, making full use of your available GPUs to handle your tasks efficiently.
"},{"location":"tutorials/using-audio-models/","title":"Using Audio Models","text":"GPUStack supports running both speech-to-text and text-to-speech models. Speech-to-text models convert audio inputs in various languages into written text, while text-to-speech models transform written text into natural and expressive speech.
In this tutorial, we will walk you through deploying and using speech-to-text and text-to-speech models in GPUStack.
"},{"location":"tutorials/using-audio-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- A Linux system with amd64 architecture, or macOS.
- Access to Hugging Face for downloading the model files.
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
"},{"location":"tutorials/using-audio-models/#running-speech-to-text-model","title":"Running Speech-to-Text Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-speech-to-text-model","title":"Step 1: Deploy Speech-to-Text Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Use the search bar in the top left to search for the model name
Systran/faster-whisper-medium
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-audio-models/#step-2-interact-with-speech-to-text-model-models","title":"Step 2: Interact with Speech-to-Text Model Models","text":" - Navigate to the
Playground
> Audio
page in the GPUStack UI. - Select the
Speech to Text
Tab. - Select the deployed model from the top-right dropdown.
- Click the
Upload
button to upload an audio file, or click the
button to record audio. - Click the
Generate Text Content
button to generate the text.
"},{"location":"tutorials/using-audio-models/#running-text-to-speech-model","title":"Running Text-to-Speech Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-text-to-speech-model","title":"Step 1: Deploy Text-to-Speech Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Use the search bar in the top left to search for the model name
FunAudioLLM/CosyVoice-300M
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-audio-models/#step-2-interact-with-text-to-speech-model-models","title":"Step 2: Interact with Text to Speech Model Models","text":" - Navigate to the
Playground
> Audio
page in the GPUStack UI. - Select the
Text to Speech
Tab. - Choose the deployed model from the dropdown menu in the top-right corner. Then, configure the voice and output audio format.
- Input the text you want to convert to speech. - Click the
- Click the
Submit
button to generate the audio.
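Besides the Playground, the deployed audio models can be called programmatically; the vox-box backend exposes an OpenAI-compatible audio API. Below is a minimal sketch using the openai Python client, where the server URL, API key, model names, voice, and file paths are placeholders you must adjust to your own deployments:
from openai import OpenAI\n\nclient = OpenAI(base_url='http://your-gpustack-server/v1-openai', api_key='your-api-key')\n\n# Speech to text: transcribe a local audio file with the deployed Whisper model.\nwith open('speech.wav', 'rb') as audio_file:\n    transcription = client.audio.transcriptions.create(\n        model='faster-whisper-medium',  # placeholder: your deployed model name\n        file=audio_file,\n    )\nprint(transcription.text)\n\n# Text to speech: synthesize speech with the deployed CosyVoice model.\nspeech = client.audio.speech.create(\n    model='cosyvoice-300m',  # placeholder: your deployed model name\n    voice='default',         # placeholder: a voice supported by the model\n    input='Hello from GPUStack!',\n)\nwith open('output.mp3', 'wb') as f:\n    f.write(speech.content)\n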
"},{"location":"tutorials/using-image-generation-models/","title":"Using Image Generation Models","text":"GPUStack supports deploying and running state-of-the-art image generation models. These models allow you to generate stunning images from textual descriptions, enabling applications in design, content creation, and more.
In this tutorial, we will walk you through deploying and using image generation models in GPUStack.
"},{"location":"tutorials/using-image-generation-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- A GPU that has at least 12 GB of VRAM.
- Access to Hugging Face for downloading the model files.
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
"},{"location":"tutorials/using-image-generation-models/#step-1-deploy-the-stable-diffusion-model","title":"Step 1: Deploy the Stable Diffusion Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Use the search bar in the top left to search for the model name
gpustack/stable-diffusion-v3-5-medium-GGUF
. - In the
Available Files
section, select the stable-diffusion-v3-5-medium-Q4_0.gguf
file. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-image-generation-models/#step-2-use-the-model-for-image-generation","title":"Step 2: Use the Model for Image Generation","text":" - Navigate to the
Playground
> Image
page in the GPUStack UI. - Verify that the deployed model is selected from the top-right
Model
dropdown. - Enter a prompt describing the image you want to generate. For example:
a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She wears a dark-colored outfit with subtle patterns. The overall style of the artwork is reminiscent of fantasy or supernatural genres.\n
- Select
euler
in the Sampler
dropdown. - Set the
Sample Steps
to 20
. - Click the
Submit
button to create the image.
The generated image will be displayed in the UI. Your image may look different given the seed and randomness involved in the generation process.
"},{"location":"tutorials/using-image-generation-models/#conclusion","title":"Conclusion","text":"Congratulations! You\u2019ve successfully deployed and used an image generation model in GPUStack. With this setup, you can generate unique and visually compelling images from textual prompts. Experiment with different prompts and settings to push the boundaries of what\u2019s possible.
"},{"location":"tutorials/using-reranker-models/","title":"Using Reranker Models","text":"Reranker models are specialized models designed to improve the ranking of a list of items based on relevance to a given query. They are commonly used in information retrieval and search systems to refine initial search results, prioritizing items that are more likely to meet the user\u2019s intent. Reranker models take the initial document list and reorder items to enhance precision in applications such as search engines, recommendation systems, and question-answering tasks.
In this tutorial, we will guide you through deploying and using reranker models in GPUStack.
"},{"location":"tutorials/using-reranker-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
- Access to Hugging Face for downloading the model files.
"},{"location":"tutorials/using-reranker-models/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Enable the
GGUF
checkbox to filter models by GGUF format. - Use the search bar in the top left to search for the model name
gpustack/bge-reranker-v2-m3-GGUF
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-reranker-models/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:
- Navigate to the
API Keys
page in the GPUStack UI. - Click the
New API Key
button. - Enter a name for the API key and click the
Save
button. - Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/using-reranker-models/#step-3-reranking","title":"Step 3: Reranking","text":"With the model deployed and an API key, you can rerank a list of documents via the GPUStack API. Here is an example script using curl
:
export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1/rerank \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"bge-reranker-v2-m3\",\n \"query\": \"What is a panda?\",\n \"top_n\": 3,\n \"documents\": [\n \"hi\",\n \"it is a bear\",\n \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n ]\n }' | jq\n
Replace <your-server-url>
with the URL of your GPUStack server and <your-api-key>
with the API key you generated in the previous step.
Example response:
{\n \"model\": \"bge-reranker-v2-m3\",\n \"object\": \"list\",\n \"results\": [\n {\n \"document\": {\n \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n },\n \"index\": 2,\n \"relevance_score\": 1.951932668685913\n },\n {\n \"document\": {\n \"text\": \"it is a bear\"\n },\n \"index\": 1,\n \"relevance_score\": -3.7347371578216553\n },\n {\n \"document\": {\n \"text\": \"hi\"\n },\n \"index\": 0,\n \"relevance_score\": -6.157620906829834\n }\n ],\n \"usage\": {\n \"prompt_tokens\": 69,\n \"total_tokens\": 69\n }\n}\n
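The rerank endpoint is plain JSON over HTTP, so it is easy to call from application code as well. Below is a minimal Python sketch using the requests package; the server URL and API key are placeholders:
import requests\n\nSERVER_URL = 'http://your-gpustack-server'  # placeholder\nAPI_KEY = 'your-api-key'                    # placeholder\n\nresponse = requests.post(\n    f'{SERVER_URL}/v1/rerank',\n    headers={'Authorization': f'Bearer {API_KEY}'},\n    json={\n        'model': 'bge-reranker-v2-m3',\n        'query': 'What is a panda?',\n        'top_n': 3,\n        'documents': ['hi', 'it is a bear', 'The giant panda is a bear species endemic to China.'],\n    },\n)\nresponse.raise_for_status()\n\n# Results come back ordered by relevance_score, highest first.\nfor result in response.json()['results']:\n    print(result['relevance_score'], result['document']['text'])\n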
"},{"location":"tutorials/using-vision-language-models/","title":"Using Vision Language Models","text":"Vision Language Models can process both visual (image) and language (text) data simultaneously, making them versatile tools for various applications, such as image captioning, visual question answering, and more. In this tutorial, you will learn how to deploy and interact with Vision Language Models (VLMs) in GPUStack.
The procedure for deploying and interacting with these models in GPUStack is similar; the main difference is the parameters you need to set when deploying the models. For more information on the available parameters, please refer to Backend Parameters.
In this tutorial, we will cover the deployment of the following models:
- Llama3.2-Vision
- Qwen2-VL
- Pixtral
- Phi3.5-Vision
"},{"location":"tutorials/using-vision-language-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- A Linux machine with one or more GPUs providing at least 30 GB of VRAM in total. We will use the vLLM backend, which only supports Linux.
- Access to Hugging Face and a Hugging Face API key for downloading the model files.
- You have been granted access to the above models on Hugging Face. Llama3.2-Vision and Pixtral are gated models, and you need to request access to them.
Note
An Ubuntu node equipped with one H100 (80GB) GPU is used throughout this tutorial.
"},{"location":"tutorials/using-vision-language-models/#step-1-install-gpustack","title":"Step 1: Install GPUStack","text":"Run the following command to install GPUStack:
curl -sfL https://get.gpustack.ai | sh -s - --huggingface-token <Hugging Face API Key>\n
Replace <Hugging Face API Key>
with your Hugging Face API key. GPUStack will use this key to download the model files.
"},{"location":"tutorials/using-vision-language-models/#step-2-log-in-to-gpustack-ui","title":"Step 2: Log in to GPUStack UI","text":"Run the following command to get the default password:
cat /var/lib/gpustack/initial_admin_password\n
Open your browser and navigate to http://<your-server-ip>
. Replace <your-server-ip>
with the IP address of your server. Log in using the username admin
and the password you obtained in the previous step.
"},{"location":"tutorials/using-vision-language-models/#step-3-deploy-vision-language-models","title":"Step 3: Deploy Vision Language Models","text":""},{"location":"tutorials/using-vision-language-models/#deploy-llama32-vision","title":"Deploy Llama3.2-Vision","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
meta-llama/Llama-3.2-11B-Vision-Instruct
in the search bar. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button multiple times and add the following parameters:
--enforce-eager
--max-num-seqs=16
--max-model-len=8192
- Click the
Save
button.
"},{"location":"tutorials/using-vision-language-models/#deploy-qwen2-vl","title":"Deploy Qwen2-VL","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
Qwen/Qwen2-VL-7B-Instruct
in the search bar. - Click the
Save
button. The default configurations should work as long as you have enough GPU resources.
"},{"location":"tutorials/using-vision-language-models/#deploy-pixtral","title":"Deploy Pixtral","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
mistralai/Pixtral-12B-2409
in the search bar. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button multiple times and add the following parameters:
--tokenizer-mode=mistral
--limit-mm-per-prompt=image=4
- Click the
Save
button.
"},{"location":"tutorials/using-vision-language-models/#deploy-phi35-vision","title":"Deploy Phi3.5-Vision","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
microsoft/Phi-3.5-vision-instruct
in the search bar. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button and add the following parameter:
--trust-remote-code
- Click the
Save
button.
"},{"location":"tutorials/using-vision-language-models/#step-4-interact-with-vision-language-models","title":"Step 4: Interact with Vision Language Models","text":" - Navigate to the
Playground
page in the GPUStack UI. - Select the deployed model from the top-right dropdown.
- Click on the
Upload Image
button above the input text area and upload an image. - Enter a prompt in the input text area. For example, \"Describe the image.\"
- Click the
Submit
button to generate the output.
"},{"location":"tutorials/using-vision-language-models/#conclusion","title":"Conclusion","text":"In this tutorial, you learned how to deploy and interact with Vision Language Models in GPUStack. You can use the same approach to deploy other Vision Language Models not covered in this tutorial. If you have any questions or need further assistance, feel free to reach out to us.
"},{"location":"user-guide/api-key-management/","title":"API Key Management","text":"GPUStack supports authentication using API keys. Each GPUStack user can generate and manage their own API keys.
"},{"location":"user-guide/api-key-management/#create-api-key","title":"Create API Key","text":" - Navigate to the
API Keys
page. - Click the
New API Key
button. - Fill in the
Name
, Description
, and select the Expiration
of the API key. - Click the
Save
button. - Copy and store the key somewhere safe, then click the
Done
button.
Note
Please note that you can only see the generated API key once upon creation.
"},{"location":"user-guide/api-key-management/#delete-api-key","title":"Delete API Key","text":" - Navigate to the
API Keys
page. - Find the API key you want to delete.
- Click the
Delete
button in the Operations
column. - Confirm the deletion.
"},{"location":"user-guide/api-key-management/#use-api-key","title":"Use API Key","text":"GPUStack supports using the API key as a bearer token. The following is an example using curl:
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"llama3\",\n \"messages\": [\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant.\"\n },\n {\n \"role\": \"user\",\n \"content\": \"Hello!\"\n }\n ],\n \"stream\": true\n }'\n
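Because the API key is used as a standard bearer token, any OpenAI-compatible client can authenticate with it. For example, a minimal sketch with the openai Python client (the server URL, key, and model name are placeholders matching the curl example above):
from openai import OpenAI\n\nclient = OpenAI(\n    base_url='http://myserver/v1-openai',  # GPUStack's OpenAI-compatible endpoint\n    api_key='myapikey',                    # the GPUStack API key acts as the bearer token\n)\n\ncompletion = client.chat.completions.create(\n    model='llama3',\n    messages=[\n        {'role': 'system', 'content': 'You are a helpful assistant.'},\n        {'role': 'user', 'content': 'Hello!'},\n    ],\n)\nprint(completion.choices[0].message.content)\n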
"},{"location":"user-guide/image-generation-apis/","title":"Image Generation APIs","text":"GPUStack provides APIs for generating images given a prompt and/or an input image when running diffusion models.
Note
The image generation APIs are only available when using the llama-box inference backend.
"},{"location":"user-guide/image-generation-apis/#supported-models","title":"Supported Models","text":"The following models are available for image generation:
Tip
Please use the converted GGUF models provided by GPUStack. Check the model link for more details.
- stabilityai/stable-diffusion-3.5-large-turbo
- stabilityai/stable-diffusion-3.5-large
- stabilityai/stable-diffusion-3.5-medium
- stabilityai/stable-diffusion-3-medium
- TencentARC/FLUX.1-mini
- Freepik/FLUX.1-lite
- black-forest-labs/FLUX.1-dev
- black-forest-labs/FLUX.1-schnell
- stabilityai/sdxl-turbo
- stabilityai/stable-diffusion-xl-refiner-1.0
- stabilityai/stable-diffusion-xl-base-1.0
- stabilityai/sd-turbo
- stabilityai/stable-diffusion-2-1
- stable-diffusion-v1-5/stable-diffusion-v1-5
- CompVis/stable-diffusion-v1-4
"},{"location":"user-guide/image-generation-apis/#api-details","title":"API Details","text":"The image generation APIs adhere to OpenAI API specification. While OpenAI APIs for image generation are simple and opinionated, GPUStack extends these capabilities with additional features.
"},{"location":"user-guide/image-generation-apis/#create-image","title":"Create Image","text":""},{"location":"user-guide/image-generation-apis/#streaming","title":"Streaming","text":"This image generation API supports streaming responses to return the progressing of the generation. To enable streaming, set the stream
parameter to true
in the request body. Example:
REQUEST : (application/json)\n{\n \"n\": 1,\n \"response_format\": \"b64_json\",\n \"size\": \"512x512\",\n \"prompt\": \"A lovely cat\",\n \"quality\": \"standard\",\n \"stream\": true,\n \"stream_options\": {\n \"include_usage\": true, // return usage information\n }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#advanced-options","title":"Advanced Options","text":"This image generation API supports additional options to control the generation process. The following options are available:
REQUEST : (application/json)\n{\n \"n\": 1,\n \"response_format\": \"b64_json\",\n \"size\": \"512x512\",\n \"prompt\": \"A lovely cat\",\n \"sampler\": \"euler\", // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\n \"schedule\": \"default\", // optional, select from default;discrete;karras;exponential;ays;gits\n \"seed\": null, // optional, random seed\n \"cfg_scale\": 4.5, // optional, for sampler, the scale of classifier-free guidance in the output phase\n \"sample_steps\": 20, // optional, number of sample steps\n \"negative_prompt\": \"\", // optional, negative prompt\n \"stream\": true,\n \"stream_options\": {\n \"include_usage\": true, // return usage information\n }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#create-image-edit","title":"Create Image Edit","text":""},{"location":"user-guide/image-generation-apis/#streaming_1","title":"Streaming","text":"This image generation API supports streaming responses to return the progressing of the generation. To enable streaming, set the stream
parameter to true
in the request body. Example:
REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nquality=standard\nimage=... // required\nmask=... // optional\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n data: [DONE]\nCASE 2: illegal input image\n error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#advanced-options_1","title":"Advanced Options","text":"This image generation API supports additional options to control the generation process. The following options are available:
REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nimage=... // required\nmask=... // optional\nsampler=euler // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\nschedule=default // optional, select from default;discrete;karras;exponential;ays;gits\nseed=null // optional, random seed\ncfg_scale=4.5 // optional, for sampler, the scale of classifier-free guidance in the output phase\nsample_steps=20 // optional, number of sample steps\nnegative_prompt=\"\" // optional, negative prompt\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n data: [DONE]\nCASE 2: illegal input image\n error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#usage","title":"Usage","text":"The followings are examples using the image generation APIs:
"},{"location":"user-guide/image-generation-apis/#curl-create-image","title":"curl (Create Image)","text":"export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/generate \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"n\": 1,\n \"response_format\": \"b64_json\",\n \"size\": \"512x512\",\n \"prompt\": \"A lovely cat\",\n \"quality\": \"standard\",\n \"stream\": true,\n \"stream_options\": {\n \"include_usage\": true\n }\n }'\n
"},{"location":"user-guide/image-generation-apis/#curl-create-image-edit","title":"curl (Create Image Edit)","text":"export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/edit \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -F image=\"@otter.png\" \\\n -F mask=\"@mask.png\" \\\n -F prompt=\"A lovely cat\" \\\n -F n=1 \\\n -F size=\"512x512\"\n
"},{"location":"user-guide/inference-backends/","title":"Inference Backends","text":"GPUStack supports the following inference backends:
- llama-box
- vLLM
- vox-box
When users deploy a model, the backend is selected automatically based on the following criteria:
- If the model is a GGUF model,
llama-box
is used. - If the model is a known
text-to-speech
or speech-to-text
model, vox-box
is used. - Otherwise,
vLLM
is used.
"},{"location":"user-guide/inference-backends/#llama-box","title":"llama-box","text":"llama-box is a LM inference server based on llama.cpp and stable-diffusion.cpp.
"},{"location":"user-guide/inference-backends/#supported-platforms","title":"Supported Platforms","text":"The llama-box backend supports Linux, macOS and Windows (with CPU offloading only on Windows ARM architecture) platforms.
"},{"location":"user-guide/inference-backends/#supported-models","title":"Supported Models","text":" - LLMs: For supported LLMs, refer to the llama.cpp README.
- Difussion Models: Supported models are listed in this Hugging Face collection.
- Reranker Models: Supported models can be found in this Hugging Face collection.
"},{"location":"user-guide/inference-backends/#supported-features","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.
"},{"location":"user-guide/inference-backends/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more others workers, offloading computation tasks to them.
"},{"location":"user-guide/inference-backends/#parameters-reference","title":"Parameters Reference","text":"See the full list of supported parameters for llama-box here.
"},{"location":"user-guide/inference-backends/#vllm","title":"vLLM","text":"vLLM is a high-throughput and memory-efficient LLMs inference engine. It is a popular choice for running LLMs in production. vLLM seamlessly supports most state-of-the-art open-source models, including: Transformer-like LLMs (e.g., Llama), Mixture-of-Expert LLMs (e.g., Mixtral), Embedding Models (e.g. E5-Mistral), Multi-modal LLMs (e.g., LLaVA)
By default, GPUStack estimates the VRAM requirement for the model instance based on the model's metadata. You can customize the parameters to fit your needs. The following vLLM parameters might be useful:
--gpu-memory-utilization
(default: 0.9): The fraction of GPU memory to use for the model instance. --max-model-len
: Model context length. For large-context models, GPUStack automatically sets this parameter to 8192
to simplify model deployment, especially in resource constrained environments. You can customize this parameter to fit your needs. --tensor-parallel-size
: Number of tensor parallel replicas. By default, GPUStack sets this parameter given the GPU resources available and the estimation of the model's memory requirement. You can customize this parameter to fit your needs.
For more details, please refer to vLLM documentation.
"},{"location":"user-guide/inference-backends/#supported-platforms_1","title":"Supported Platforms","text":"The vLLM backend works on AMD Linux.
Note
- When users install GPUStack on amd64 Linux using the installation script, vLLM is automatically installed.
- When users deploy a model using the vLLM backend, GPUStack sets worker label selectors to
{\"os\": \"linux\", \"arch\": \"amd64\"}
by default to ensure the model instance is scheduled to proper workers. You can customize the worker label selectors in the model configuration.
"},{"location":"user-guide/inference-backends/#supported-models_1","title":"Supported Models","text":"Please refer to the vLLM documentation for supported models.
"},{"location":"user-guide/inference-backends/#supported-features_1","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#multimodal-language-models","title":"Multimodal Language Models","text":"vLLM supports multimodal language models listed here. When users deploy a vision language model using the vLLM backend, image inputs are supported in the chat completion API.
"},{"location":"user-guide/inference-backends/#parameters-reference_1","title":"Parameters Reference","text":"See the full list of supported parameters for vLLM here.
"},{"location":"user-guide/inference-backends/#vox-box","title":"vox-box","text":"vox-box is an inference engine designed for deploying text-to-speech and speech-to-text models. It also provides an API that is fully compatible with the OpenAI audio API.
"},{"location":"user-guide/inference-backends/#supported-platforms_2","title":"Supported Platforms","text":"The vox-box backend supports Linux, macOS and Windows platforms.
Note
- To use Nvidia GPUs, ensure the following NVIDIA libraries are installed on workers:
- cuBLAS for CUDA 12
- cuDNN 9 for CUDA 12
- When users install GPUStack on Linux, macOS and Windows using the installation script, vox-box is automatically installed.
- CosyVoice models are natively supported on amd64 Linux and macOS. They are not supported on ARM Linux or Windows.
"},{"location":"user-guide/inference-backends/#supported-models_2","title":"Supported Models","text":"Model Type Link Supported Platforms Faster-whisper-large-v3 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-large-v2 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-large-v1 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-medium speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-medium.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-small speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-small.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-distil-whisper-large-v3 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-distil-whisper-large-v2 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-distil-whisper-medium.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-tiny speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-tiny.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows CosyVoice-300M-Instruct text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-SFT text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-25Hz text-to-speech ModelScope Linux(ARM not supported), macOS, Windows(Not supported)"},{"location":"user-guide/inference-backends/#supported-features_2","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-gpucpu-offloading","title":"Allow GPU/CPU Offloading","text":"vox-box supports deploying models to NVIDIA GPUs. If GPU resources are insufficient, it will automatically deploy the models to the CPU.
"},{"location":"user-guide/model-management/","title":"Model Management","text":"You can manage large language models in GPUStack by navigating to the Models
page. A model in GPUStack contains one or multiple replicas of model instances. On deployment, GPUStack automatically computes resource requirements for the model instances from model metadata and schedules them to available workers accordingly.
"},{"location":"user-guide/model-management/#deploy-model","title":"Deploy Model","text":"Currently, models from Hugging Face, ModelScope, Ollama and local paths are supported.
"},{"location":"user-guide/model-management/#deploying-a-hugging-face-model","title":"Deploying a Hugging Face Model","text":" -
Click the Deploy Model
button, then select Hugging Face
in the dropdown.
-
Search the model by name from Hugging Face using the search bar in the top left. For example, microsoft/Phi-3-mini-4k-instruct-gguf
. If you only want to search for GGUF models, check the \"GGUF\" checkbox.
-
Select a file with the desired quantization format from Available Files
.
-
Adjust the Name
and Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#deploying-a-modelscope-model","title":"Deploying a ModelScope Model","text":" -
Click the Deploy Model
button, then select ModelScope
in the dropdown.
-
Search the model by name from ModelScope using the search bar in the top left. For example, Qwen/Qwen2-0.5B-Instruct
. If you only want to search for GGUF models, check the \"GGUF\" checkbox.
-
Select a file with the desired quantization format from Available Files
.
-
Adjust the Name
and Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#deploying-an-ollama-model","title":"Deploying an Ollama Model","text":" -
Click the Deploy Model
button, then select Ollama Library
in the dropdown.
-
Fill in the Name
of the model.
-
Select an Ollama Model
from the dropdown list, or input any Ollama model you need. For example, llama3
, llama3:70b
or youraccount/llama3:70b
.
-
Adjust the Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#deploying-a-local-path-model","title":"Deploying a Local Path Model","text":"You can deploy a model from a local path. The model path can be a directory (e.g., a downloaded Hugging Face model directory) or a file (e.g., a GGUF model file) located on workers. This is useful when running in an air-gapped environment.
Note
- GPUStack does not check the validity of the model path for scheduling, which may lead to deployment failure if the model path is inaccessible. It is recommended to ensure the model path is accessible on all workers (e.g., using NFS, rsync, etc.). You can also use the worker selector configuration to deploy the model to specific workers.
- GPUStack cannot evaluate the model's resource requirements unless the server has access to the same model path. Consequently, you may observe empty VRAM/RAM allocations for a deployed model. To mitigate this, it is recommended to make the model files available on the same path on the server. Alternatively, you can customize backend parameters, such as
tensor-split
, to configure how the model is distributed across the GPUs.
To deploy a local path model:
-
Click the Deploy Model
button, then select Local Path
in the dropdown.
-
Fill in the Name
of the model.
-
Fill in the Model Path
.
-
Adjust the Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#edit-model","title":"Edit Model","text":" - Find the model you want to edit on the model list page.
- Click the
Edit
button in the Operations
column. - Update the attributes as needed. For example, change the
Replicas
to scale up or down. - Click the
Save
button.
Note
After editing the model, the updated configuration is not applied to existing model instances. You need to delete the existing model instances, and GPUStack will recreate new instances based on the updated configuration.
"},{"location":"user-guide/model-management/#delete-model","title":"Delete Model","text":" - Find the model you want to delete on the model list page.
- Click the ellipsis button in the
Operations
column, then select Delete
. - Confirm the deletion.
"},{"location":"user-guide/model-management/#view-model-instance","title":"View Model Instance","text":" - Find the model you want to check on the model list page.
- Click the
>
symbol to view the instance list of the model.
"},{"location":"user-guide/model-management/#delete-model-instance","title":"Delete Model Instance","text":" - Find the model you want to check on the model list page.
- Click the
>
symbol to view the instance list of the model. - Find the model instance you want to delete.
- Click the ellipsis button for the model instance in the
Operations
column, then select Delete
. - Confirm the deletion.
Note
After a model instance is deleted, GPUStack will create a new instance if necessary to satisfy the expected number of replicas for the model.
"},{"location":"user-guide/model-management/#view-model-instance-logs","title":"View Model Instance Logs","text":" - Find the model you want to check on the model list page.
- Click the
>
symbol to view the instance list of the model. - Find the model instance you want to check.
- Click the
View Logs
button for the model instance in the Operations
column.
"},{"location":"user-guide/model-management/#use-self-hosted-ollama-models","title":"Use Self-hosted Ollama Models","text":"You can deploy self-hosted Ollama models by configuring the --ollama-library-base-url
option in the GPUStack server. The Ollama Library
URL should point to the base URL of the Ollama model registry. For example, https://registry.mycompany.com
.
Here is an example workflow to set up a registry, publish a model, and use it in GPUStack:
# Run a self-hosted OCI registry\ndocker run -d -p 5001:5000 --name registry registry:2\n\n# Push a model to the registry using Ollama\nollama pull llama3\nollama cp llama3 localhost:5001/library/llama3\nollama push localhost:5001/library/llama3 --insecure\n\n# Start GPUStack server with the custom Ollama library URL\ncurl -sfL https://get.gpustack.ai | sh -s - --ollama-library-base-url http://localhost:5001\n
That's it! You can now deploy the model llama3
from Ollama Library
source in GPUStack as usual, but the model will now be fetched from the self-hosted registry.
"},{"location":"user-guide/model-management/#advanced-model-configuration","title":"Advanced Model Configuration","text":"GPUStack supports tailored configurations for model deployment.
"},{"location":"user-guide/model-management/#schedule-type","title":"Schedule Type","text":""},{"location":"user-guide/model-management/#auto","title":"Auto","text":"GPUStack automatically schedules model instances to appropriate GPUs/Workers based on current resource availability.
-
Placement Strategy
-
Spread: Distribute resource usage as evenly as possible across all workers. This may produce more resource fragmentation on individual workers.
-
Binpack: Prioritize the overall utilization of cluster resources, reducing resource fragmentation on Workers/GPUs.
-
Worker Selector
When configured, the scheduler deploys the model instance only to workers that carry the specified labels.
-
Navigate to the Resources
page and edit the desired worker. Assign custom labels to the worker by adding them in the labels section.
-
Go to the Models
page and click on the Deploy Model
button. Expand the Advanced
section and input the previously assigned worker labels in the Worker Selector
configuration. During deployment, the Model Instance will be allocated to the corresponding worker based on these labels.
"},{"location":"user-guide/model-management/#manual","title":"Manual","text":"This schedule type allows users to specify which GPU to deploy the model instance on.
- GPU Selector
Select a GPU from the list. The model instance will attempt to deploy to this GPU if resources permit.
"},{"location":"user-guide/model-management/#backend","title":"Backend","text":"The inference backend. Currently, GPUStack supports three backends: llama-box, vLLM and vox-box. GPUStack automatically selects the backend based on the model's configuration.
For more details, please refer to the Inference Backends section.
"},{"location":"user-guide/model-management/#backend-version","title":"Backend Version","text":"Specify a backend version, such as v1.0.0
. The version format and availability depend on the selected backend. This option is useful for ensuring compatibility or taking advantage of features introduced in specific backend versions. Refer to the Pinned Backend Versions section for more information.
"},{"location":"user-guide/model-management/#backend-parameters","title":"Backend Parameters","text":"Input the parameters for the backend you want to customize when running the model. The parameter should be in the format --parameter=value
, --bool-parameter
or as separate fields for --parameter
and value
. For example, use --ctx-size=8192
for llama-box.
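A parameter list combining both forms might look like the following; this is only a sketch, and the boolean flag shown is an illustrative assumption whose availability depends on the backend and its version:
--ctx-size=8192\n--flash-attn\n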
For the full list of supported parameters, please refer to the Inference Backends section.
"},{"location":"user-guide/model-management/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"Note
Available for llama-box backend only.
After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.
"},{"location":"user-guide/model-management/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"Note
Available for llama-box backend only.
Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more other workers, offloading computation tasks to them.
"},{"location":"user-guide/openai-compatible-apis/","title":"OpenAI Compatible APIs","text":"GPUStack serves OpenAI-compatible APIs using the /v1-openai
path. Most of the APIs also work under the /v1
path as an alias, except for the models
endpoint, which is reserved for GPUStack management APIs.
"},{"location":"user-guide/openai-compatible-apis/#supported-endpoints","title":"Supported Endpoints","text":"The following API endpoints are supported:
- List Models
- Create Completion
- Create Chat Completion
- Create Embeddings
- Create Image
- Create Image Edit
- Create Speech
- Create Transcription
"},{"location":"user-guide/openai-compatible-apis/#usage","title":"Usage","text":"The following are examples using the APIs in different languages:
"},{"location":"user-guide/openai-compatible-apis/#curl","title":"curl","text":"export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"llama3\",\n \"messages\": [\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant.\"\n },\n {\n \"role\": \"user\",\n \"content\": \"Hello!\"\n }\n ],\n \"stream\": true\n }'\n
"},{"location":"user-guide/openai-compatible-apis/#openai-python-api-library","title":"OpenAI Python API library","text":"from openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\ncompletion = client.chat.completions.create(\n model=\"llama3\",\n messages=[\n {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n {\"role\": \"user\", \"content\": \"Hello!\"}\n ]\n)\n\nprint(completion.choices[0].message)\n
"},{"location":"user-guide/openai-compatible-apis/#openai-node-api-library","title":"OpenAI Node API library","text":"const OpenAI = require(\"openai\");\n\nconst openai = new OpenAI({\n apiKey: \"myapikey\",\n baseURL: \"http://myserver/v1-openai\",\n});\n\nasync function main() {\n const params = {\n model: \"llama3\",\n messages: [\n {\n role: \"system\",\n content: \"You are a helpful assistant.\",\n },\n {\n role: \"user\",\n content: \"Hello!\",\n },\n ],\n };\n const chatCompletion = await openai.chat.completions.create(params);\n console.log(chatCompletion.choices[0].message);\n}\nmain();\n
"},{"location":"user-guide/pinned-backend-versions/","title":"Pinned Backend Versions","text":"Inference engines in the generative AI domain are evolving rapidly to enhance performance and unlock new capabilities. This constant evolution provides exciting opportunities but also presents challenges for maintaining model compatibility and deployment stability.
GPUStack allows you to pin inference backend versions to specific releases, offering a balance between staying up-to-date with the latest advancements and ensuring a reliable runtime environment. This feature is particularly beneficial in the following scenarios:
- Leveraging the newest backend features without waiting for a GPUStack update.
- Locking in a specific backend version to maintain compatibility with existing models.
- Assigning different backend versions to models with varying requirements.
By pinning backend versions, you gain full control over your inference environment, enabling both flexibility and predictability in deployment.
"},{"location":"user-guide/pinned-backend-versions/#automatic-installation-of-pinned-backend-versions","title":"Automatic Installation of Pinned Backend Versions","text":"To simplify deployment, GPUStack supports the automatic installation of pinned backend versions when feasible. The process depends on the type of backend:
- Prebuilt Binaries For backends like
llama-box
, GPUStack downloads the specified version using the same mechanism as in GPUStack bootstrapping.
Tip
You can customize the download source using the --tools-download-base-url
configuration option.
- Python-based Backends For backends like
vLLM
and vox-box
, GPUStack uses pipx
to install the specified version in an isolated Python environment.
Tip
- Ensure that
pipx
is installed on the worker nodes. - If
pipx
is not in the system PATH, specify its location with the --pipx-path
configuration option.
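As a minimal sketch on a Linux worker (the installation method and the pipx path below are assumptions; adjust them to your environment):
# Install pipx for the user running GPUStack\npython3 -m pip install --user pipx\n\n# If pipx is not on the PATH of the GPUStack service, point GPUStack at it explicitly\ngpustack start --pipx-path ~/.local/bin/pipx\n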
This automation reduces manual intervention, allowing you to focus on deploying and using your models.
"},{"location":"user-guide/pinned-backend-versions/#manual-installation-of-pinned-backend-versions","title":"Manual Installation of Pinned Backend Versions","text":"When automatic installation is not feasible or preferred, GPUStack provides a straightforward way to manually install specific versions of inference backends. Follow these steps:
- Prepare the Executable Install the backend executable or link it under the GPUStack bin directory. The default locations are:
- Linux/macOS:
/var/lib/gpustack/bin
- Windows:
$env:AppData\\gpustack\\bin
Tip
You can customize the bin directory using the --bin-dir
configuration option.
- Name the Executable Ensure the executable is named in the following format:
- Linux/macOS:
<backend>_<version>
- Windows:
<backend>_<version>.exe
For example, the vLLM executable for version v0.6.4 should be named vllm_v0.6.4
on Linux.
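For example, a minimal sketch on Linux, assuming the vllm package places a vllm entry point on the PATH after installation; adjust the paths to your environment:
# Install the desired backend version\npip install vllm==0.6.4\n\n# Expose it under the GPUStack bin directory using the expected naming format\nln -s \"$(command -v vllm)\" /var/lib/gpustack/bin/vllm_v0.6.4\n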
By following these steps, you can maintain full control over the backend installation process, ensuring that the correct version is used for your deployment.
"},{"location":"user-guide/rerank-api/","title":"Rerank API","text":"In the context of Retrieval-Augmented Generation (RAG), reranking refers to the process of selecting the most relevant information from retrieved documents or knowledge sources before presenting them to the user or utilizing them for answer generation.
GPUStack serves a Jina-compatible Rerank API using the /v1/rerank
path.
Note
The Rerank API is only available when using the llama-box inference backend.
"},{"location":"user-guide/rerank-api/#supported-models","title":"Supported Models","text":"The following models are available for reranking:
- bce-reranker-base_v1
- jina-reranker-v1-turbo-en
- jina-reranker-v1-tiny-en
- bge-reranker-v2-m3
- gte-multilingual-reranker-base \ud83e\uddea
- jina-reranker-v2-base-multilingual \ud83e\uddea
"},{"location":"user-guide/rerank-api/#usage","title":"Usage","text":"The following is an example using the Rerank API:
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1/rerank \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"bge-reranker-v2-m3\",\n \"query\": \"What is a panda?\",\n \"top_n\": 3,\n \"documents\": [\n \"hi\",\n \"it is a bear\",\n \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n ]\n }' | jq\n
Example output:
{\n \"model\": \"bge-reranker-v2-m3\",\n \"object\": \"list\",\n \"results\": [\n {\n \"document\": {\n \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n },\n \"index\": 2,\n \"relevance_score\": 1.951932668685913\n },\n {\n \"document\": {\n \"text\": \"it is a bear\"\n },\n \"index\": 1,\n \"relevance_score\": -3.7347371578216553\n },\n {\n \"document\": {\n \"text\": \"hi\"\n },\n \"index\": 0,\n \"relevance_score\": -6.157620906829834\n }\n ],\n \"usage\": {\n \"prompt_tokens\": 69,\n \"total_tokens\": 69\n }\n}\n
"},{"location":"user-guide/user-management/","title":"User Management","text":"GPUStack supports users of two roles: Admin
and User
. Admins can monitor system status, manage models, users, and system settings. Users can manage their own API keys and use the completion API.
"},{"location":"user-guide/user-management/#default-admin","title":"Default Admin","text":"On bootstrap, GPUStack creates a default admin user. The initial password for the default admin is stored in <data-dir>/initial_admin_password
. In the default setup, it should be /var/lib/gpustack/initial_admin_password
. You can customize the default admin password by setting the --bootstrap-password
parameter when starting gpustack
.
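For example, a minimal sketch (the password value is a placeholder; combine the flag with whatever other startup options you already use):
gpustack start --bootstrap-password \"your-strong-password\"\n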
"},{"location":"user-guide/user-management/#create-user","title":"Create User","text":" - Navigate to the
Users
page. - Click the
Create User
button. - Fill in
Name
, Full Name
, Password
, and select Role
for the user. - Click the
Save
button.
"},{"location":"user-guide/user-management/#update-user","title":"Update User","text":" - Navigate to the
Users
page. - Find the user you want to edit.
- Click the
Edit
button in the Operations
column. - Update the attributes as needed.
- Click the
Save
button.
"},{"location":"user-guide/user-management/#delete-user","title":"Delete User","text":" - Navigate to the
Users
page. - Find the user you want to delete.
- Click the ellipsis button in the
Operations
column, then select Delete
. - Confirm the deletion.
"},{"location":"user-guide/playground/","title":"Playground","text":"GPUStack offers a playground UI where users can test and experiment with the APIs. Refer to each subpage for detailed instructions and information.
"},{"location":"user-guide/playground/audio/","title":"Audio Playground","text":"The Audio Playground is a dedicated space for testing and experimenting with GPUStack\u2019s text-to-speech (TTS) and speech-to-text (STT) APIs. It allows users to interactively convert text to audio and audio to text, customize parameters, and review code examples for seamless API integration.
"},{"location":"user-guide/playground/audio/#text-to-speech","title":"Text to Speech","text":"Switch to the \"Text to Speech\" tab to test TTS models.
"},{"location":"user-guide/playground/audio/#text-input","title":"Text Input","text":"Enter the text you want to convert, then click the Submit
button to generate the corresponding speech.
"},{"location":"user-guide/playground/audio/#clear-text","title":"Clear Text","text":"Click the Clear
button to reset the text input and remove the generated speech.
"},{"location":"user-guide/playground/audio/#select-model","title":"Select Model","text":"Select an available TTS model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/audio/#customize-parameters","title":"Customize Parameters","text":"Customize the voice and format of the audio output.
Tip
Supported voices may vary between models.
"},{"location":"user-guide/playground/audio/#view-code","title":"View Code","text":"After experimenting with input text and parameters, click the View Code
button to see how to call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
"},{"location":"user-guide/playground/audio/#speech-to-text","title":"Speech to Text","text":"Switch to the \"Speech to Text\" tab to test STT models.
"},{"location":"user-guide/playground/audio/#provide-audio-file","title":"Provide Audio File","text":"You can provide audio for transcription in two ways:
- Upload an audio file.
- Record audio online.
Note
If the online recording is not available, it could be due to one of the following reasons:
- For HTTPS or
http://localhost
access, microphone permissions must be enabled in your browser. -
For access via http://{host IP}
, the URL must be added to your browser's trusted list.
Example: In Chrome, navigate to chrome://flags/
, add the GPUStack URL to \"Insecure origins treated as secure,\" and enable this option.
"},{"location":"user-guide/playground/audio/#select-model_1","title":"Select Model","text":"Select an available STT model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/audio/#copy-text","title":"Copy Text","text":"Copy the transcription results generated by the model.
"},{"location":"user-guide/playground/audio/#customize-parameters_1","title":"Customize Parameters","text":"Select the appropriate language for your audio file to optimize transcription accuracy.
"},{"location":"user-guide/playground/audio/#view-code_1","title":"View Code","text":"After experimenting with audio files and parameters, click the View Code
button to see how to call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
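Outside the playground, the same transcription can be requested directly against the OpenAI-compatible API; the following curl sketch assumes a deployed speech-to-text model named faster-whisper-medium and a local audio file path (both are assumptions):
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/audio/transcriptions \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -F model=\"faster-whisper-medium\" \\\n -F file=\"@/path/to/audio.mp3\"\n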
"},{"location":"user-guide/playground/chat/","title":"Chat Playground","text":"Interact with the chat completions API. The following is an example screenshot:
"},{"location":"user-guide/playground/chat/#prompts","title":"Prompts","text":"You can adjust the prompt messages on the left side of the playground. There are three role types of prompt messages: system, user, and assistant.
- System: Typically a predefined instruction or guidance that sets the context, defines the behavior, or imposes specific constraints on how the model should generate its responses.
- User: The input or query provided by the user (the person interacting with the LLM).
- Assistant: The response generated by the LLM.
"},{"location":"user-guide/playground/chat/#edit-system-message","title":"Edit System Message","text":"You can add and edit the system message at the top of the playground.
"},{"location":"user-guide/playground/chat/#edit-user-and-assistant-messages","title":"Edit User and Assistant Messages","text":"To add a user or assistant message, click the New Message
button.
To remove a user or assistant message, click the minus button at the right corner of the message.
To change the role of a message, click the User
or Assistant
text at the beginning of the message.
"},{"location":"user-guide/playground/chat/#upload-image","title":"Upload Image","text":"You can add images to the prompt by clicking the Upload Image
button.
"},{"location":"user-guide/playground/chat/#clear-prompts","title":"Clear Prompts","text":"Click the Clear
button to clear all the prompts.
"},{"location":"user-guide/playground/chat/#select-model","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground. Please refer to Model Management to learn about how to manage models.
"},{"location":"user-guide/playground/chat/#customize-parameters","title":"Customize Parameters","text":"You can customize completion parameters in the Parameters
section.
"},{"location":"user-guide/playground/chat/#do-completion","title":"Do Completion","text":"You can do a completion by clicking the Submit
button.
"},{"location":"user-guide/playground/chat/#view-code","title":"View Code","text":"Once you've done experimenting with the prompts and parameters, you can click the View Code
button to check how you can call the API with the same input by code. Code examples in curl
, Python
, and Node.js
are provided.
"},{"location":"user-guide/playground/chat/#compare-playground","title":"Compare Playground","text":"You can compare multiple models in the playground. The following is an example screenshot:
"},{"location":"user-guide/playground/chat/#comparision-mode","title":"Comparision Mode","text":"You can choose the number of models to compare by clicking the comparison view buttons, including 2, 3, 4 and 6-model comparison.
"},{"location":"user-guide/playground/chat/#prompts_1","title":"Prompts","text":"You can adjust the prompt messages similar to the chat playground.
"},{"location":"user-guide/playground/chat/#upload-image_1","title":"Upload Image","text":"You can add images to the prompt by clicking the Upload Image
button.
"},{"location":"user-guide/playground/chat/#clear-prompts_1","title":"Clear Prompts","text":"Click the Clear
button to clear all the prompts.
"},{"location":"user-guide/playground/chat/#select-model_1","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-left corner of each model panel.
"},{"location":"user-guide/playground/chat/#customize-parameters_1","title":"Customize Parameters","text":"You can customize completion parameters by clicking the settings button of each model.
"},{"location":"user-guide/playground/embedding/","title":"Embedding Playground","text":"The Embedding Playground lets you test the model\u2019s ability to convert text into embeddings. It allows you to experiment with multiple text inputs, visualize embeddings, and review code examples for API integration.
"},{"location":"user-guide/playground/embedding/#add-text","title":"Add Text","text":"Add at least two text entries and click the Submit
button to generate embeddings.
"},{"location":"user-guide/playground/embedding/#batch-input-text","title":"Batch Input Text","text":"Enable Batch Input Mode
to automatically split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets in a single operation.
"},{"location":"user-guide/playground/embedding/#visualization","title":"Visualization","text":"Visualize the embedding results using PCA (Principal Component Analysis) to reduce dimensions and display them on a 2D plot. Results can be viewed in two formats:
- Chart - Display PCA results visually.
- JSON - View raw embeddings in JSON format.
In the chart, the distance between points represents the similarity between corresponding texts. Closer points indicate higher similarity.
"},{"location":"user-guide/playground/embedding/#clear","title":"Clear","text":"Click the Clear
button to reset text entries and clear the output.
"},{"location":"user-guide/playground/embedding/#select-model","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/embedding/#view-code","title":"View Code","text":"After experimenting with the text inputs, click the View Code
button to see how you can call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
"},{"location":"user-guide/playground/image/","title":"Image Playground","text":"The Image Playground is a dedicated space for testing and experimenting with GPUStack\u2019s image generation APIs. It allows users to interactively explore the capabilities of different models, customize parameters, and review code examples for seamless API integration.
"},{"location":"user-guide/playground/image/#prompt","title":"Prompt","text":"You can input or randomly generate a prompt, then click the Submit button to generate an image.
"},{"location":"user-guide/playground/image/#clear-prompt","title":"Clear Prompt","text":"Click the Clear
button to reset the prompt and remove the generated image.
"},{"location":"user-guide/playground/image/#select-model","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/image/#customize-parameters","title":"Customize Parameters","text":"You can customize the image generation parameters by switching between two API styles:
- OpenAI-compatible mode.
- Advanced mode.
"},{"location":"user-guide/playground/image/#advanced-parameters","title":"Advanced Parameters","text":"Parameter Default Description Counts
1
Number of images to generate. Size
512x512
The size of the generated image in 'widthxheight' format. Sampler
euler_a
The sampler algorithm for image generation. Options include 'euler_a', 'euler', 'heun', 'dpm2', 'dpm++2s_a', 'dpm++2m', 'dpm++2mv2', 'ipndm', 'ipndm_v', and 'lcm'. Schedule
discrete
The noise scheduling method. Sampler Steps
10
The number of sampling steps to perform. Higher values may improve image quality at the cost of longer processing time. CFG Scale
4.5
The scale for classifier-free guidance. A higher value increases adherence to the prompt. Negative Prompt
(empty) A negative prompt to specify what the image should avoid. Seed
(empty) Random seed. Note
The maximum image size is restricted by the model's deployment settings. See the diagram below:
"},{"location":"user-guide/playground/image/#view-code","title":"View Code","text":"After experimenting with prompts and parameters, click the View Code
button to see how to call the API with the same inputs. Code examples are provided in curl
, Python
, and Node.js
.
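The generated code targets the OpenAI-compatible image generation endpoint; a minimal curl sketch in OpenAI-compatible mode, assuming a deployed image model named stable-diffusion-v3-5-large-turbo (the model name is an assumption):
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/images/generations \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"stable-diffusion-v3-5-large-turbo\",\n \"prompt\": \"a minion holding a sign that says GPUStack\",\n \"n\": 1,\n \"size\": \"512x512\"\n }'\n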
"},{"location":"user-guide/playground/rerank/","title":"Rerank Playground","text":"The Rerank Playground allows you to test reranker models that reorder multiple texts based on their relevance to a query. Experiment with various input texts, customize parameters, and review code examples for API integration.
"},{"location":"user-guide/playground/rerank/#add-text","title":"Add Text","text":"Add multiple text entries to the document for reranking.
"},{"location":"user-guide/playground/rerank/#bach-input-text","title":"Bach Input Text","text":"Enable Batch Input Mode
to split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets efficiently.
"},{"location":"user-guide/playground/rerank/#clear","title":"Clear","text":"Click the Clear
button to reset the document and query results.
"},{"location":"user-guide/playground/rerank/#query","title":"Query","text":"Input a query and click the Submit
button to get a ranked list of texts based on their relevance to the query.
"},{"location":"user-guide/playground/rerank/#select-model","title":"Select Model","text":"Select an available reranker model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/rerank/#customize-parameters","title":"Customize Parameters","text":"In the parameter section, set Top N
to specify the number of matching texts to retrieve.
"},{"location":"user-guide/playground/rerank/#view-code","title":"View Code","text":"After experimenting with the input text and query, click the View Code
button to see how to call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"api-reference/","title":"API Reference","text":"GPUStack provides a built-in Swagger UI. You can access it by navigating to <gpustack-server-url>/docs
in your browser to view and interact with the APIs.
"},{"location":"architecture/","title":"Architecture","text":"The following diagram shows the architecture of GPUStack:
"},{"location":"architecture/#server","title":"Server","text":"The GPUStack server consists of the following components:
- API Server: Provides a RESTful interface for clients to interact with the system. It handles authentication and authorization.
- Scheduler: Responsible for assigning model instances to workers.
- Model Controller: Manages the rollout and scaling of model instances to match the desired model replicas.
- HTTP Proxy: Routes completion API requests to backend inference servers.
"},{"location":"architecture/#worker","title":"Worker","text":"GPUStack workers are responsible for:
- Running inference servers for model instances assigned to the worker.
- Reporting status to the server.
"},{"location":"architecture/#sql-database","title":"SQL Database","text":"The GPUStack server connects to a SQL database as the datastore. GPUStack uses SQLite by default, but you can configure it to use an external PostgreSQL as well.
"},{"location":"architecture/#inference-server","title":"Inference Server","text":"Inference servers are the backends that performs the inference tasks. GPUStack supports llama-box, vLLM and vox-box as the inference server.
"},{"location":"architecture/#rpc-server","title":"RPC Server","text":"The RPC server enables running llama-box backend on a remote host. The Inference Server communicates with one or several instances of RPC server, offloading computations to these remote hosts. This setup allows for distributed LLM inference across multiple workers, enabling the system to load larger models even when individual resources are limited.
"},{"location":"code-of-conduct/","title":"Contributor Code of Conduct","text":""},{"location":"code-of-conduct/#our-pledge","title":"Our Pledge","text":"We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
"},{"location":"code-of-conduct/#our-standards","title":"Our Standards","text":"Examples of behavior that contributes to a positive environment for our community include:
- Demonstrating empathy and kindness toward other people
- Being respectful of differing opinions, viewpoints, and experiences
- Giving and gracefully accepting constructive feedback
- Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
- Focusing on what is best not just for us as individuals, but for the overall community
Examples of unacceptable behavior include:
- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting or derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
"},{"location":"code-of-conduct/#enforcement-responsibilities","title":"Enforcement Responsibilities","text":"Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
"},{"location":"code-of-conduct/#scope","title":"Scope","text":"This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
"},{"location":"code-of-conduct/#enforcement","title":"Enforcement","text":"Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at contact@gpustack.ai. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the reporter of any incident.
"},{"location":"code-of-conduct/#enforcement-guidelines","title":"Enforcement Guidelines","text":"Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
"},{"location":"code-of-conduct/#1-correction","title":"1. Correction","text":"Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
"},{"location":"code-of-conduct/#2-warning","title":"2. Warning","text":"Community Impact: A violation through a single incident or series of actions.
Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
"},{"location":"code-of-conduct/#3-temporary-ban","title":"3. Temporary Ban","text":"Community Impact: A serious violation of community standards, including sustained inappropriate behavior.
Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
"},{"location":"code-of-conduct/#4-permanent-ban","title":"4. Permanent Ban","text":"Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
Consequence: A permanent ban from any sort of public interaction within the community.
"},{"location":"code-of-conduct/#attribution","title":"Attribution","text":"This Code of Conduct is adapted from the Contributor Covenant, version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.
"},{"location":"contributing/","title":"Contributing to GPUStack","text":"Thanks for taking the time to contribute to GPUStack!
Please review and follow the Code of Conduct.
"},{"location":"contributing/#filing-issues","title":"Filing Issues","text":"If you find any bugs or are having any trouble, please search the reported issue as someone may have experienced the same issue, or we are actively working on a solution.
If you can't find anything related to your issue, contact us by filing an issue. To help us diagnose and resolve, please include as much information as possible, including:
- Software: GPUStack version, installation method, operating system info, etc.
- Hardware: Node info, GPU info, etc.
- Steps to reproduce: Provide as much detail as possible on how you got into the reported situation.
- Logs: Please include any relevant logs, such as server logs, worker logs, etc.
"},{"location":"contributing/#contributing-code","title":"Contributing Code","text":"For setting up development environment, please refer to Development Guide.
If you're fixing a small issue, you can simply submit a PR. However, if you're planning to submit a bigger PR to implement a new feature or fix a relatively complex bug, please open an issue that explains the change and the motivation for it. If you're addressing a bug, please explain how to reproduce it.
"},{"location":"contributing/#updating-documentation","title":"Updating Documentation","text":"If you have any updates to our documentation, feel free to file an issue with the documentation
label or make a pull request.
"},{"location":"development/","title":"Development Guide","text":""},{"location":"development/#prerequisites","title":"Prerequisites","text":"Install Python (version 3.10 to 3.12).
"},{"location":"development/#set-up-environment","title":"Set Up Environment","text":"make install\n
"},{"location":"development/#run","title":"Run","text":"poetry run gpustack\n
"},{"location":"development/#build","title":"Build","text":"make build\n
And check artifacts in dist
.
"},{"location":"development/#test","title":"Test","text":"make test\n
"},{"location":"development/#update-dependencies","title":"Update Dependencies","text":"poetry add <something>\n
Or
poetry add --group dev <something>\n
For dev/testing dependencies.
"},{"location":"overview/","title":"GPUStack","text":"GPUStack is an open-source GPU cluster manager for running AI models.
"},{"location":"overview/#key-features","title":"Key Features","text":" - Broad Hardware Compatibility: Run with different brands of GPUs in Apple MacBooks, Windows PCs, and Linux servers.
- Broad Model Support: From LLMs to diffusion models, audio, embedding, and reranker models.
- Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
- Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
- Multiple Inference Backends: Supports llama-box (llama.cpp & stable-diffusion.cpp), vox-box and vLLM as the inference backend.
- Lightweight Python Package: Minimal dependencies and operational overhead.
- OpenAI-compatible APIs: Serve APIs that are compatible with OpenAI standards.
- User and API key management: Simplified management of users and API keys.
- GPU metrics monitoring: Monitor GPU performance and utilization in real-time.
- Token usage and rate metrics: Track token usage and manage rate limits effectively.
"},{"location":"overview/#supported-platforms","title":"Supported Platforms","text":" - macOS
- Windows
- Linux
"},{"location":"overview/#supported-accelerators","title":"Supported Accelerators","text":" - Apple Metal (M-series chips)
- NVIDIA CUDA (Compute Capability 6.0 and above)
- Ascend CANN
- Moore Threads MUSA
We plan to support the following accelerators in future releases.
- AMD ROCm
- Intel oneAPI
- Qualcomm AI Engine
"},{"location":"overview/#supported-models","title":"Supported Models","text":"GPUStack uses llama-box (bundled llama.cpp and stable-diffusion.cpp server), vLLM and vox-box as the backends and supports a wide range of models. Models from the following sources are supported:
-
Hugging Face
-
ModelScope
-
Ollama Library
-
Local File Path
"},{"location":"overview/#example-models","title":"Example Models:","text":"Category Models Large Language Models(LLMs) Qwen, LLaMA, Mistral, Deepseek, Phi, Yi Vision Language Models(VLMs) Llama3.2-Vision, Pixtral , Qwen2-VL, LLaVA, InternVL2 Diffusion Models Stable Diffusion, FLUX Rerankers GTE, BCE, BGE, Jina Audio Models Whisper (speech-to-text), CosyVoice (text-to-speech) For full list of supported models, please refer to the supported models section in the inference backends documentation.
"},{"location":"overview/#openai-compatible-apis","title":"OpenAI-Compatible APIs","text":"GPUStack serves OpenAI compatible APIs. For details, please refer to OpenAI Compatible APIs
"},{"location":"quickstart/","title":"Quickstart","text":""},{"location":"quickstart/#installation","title":"Installation","text":""},{"location":"quickstart/#linux-or-macos","title":"Linux or macOS","text":"GPUStack provides a script to install it as a service on systemd or launchd based systems. To install GPUStack using this method, just run:
curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"quickstart/#windows","title":"Windows","text":"Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:
Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
"},{"location":"quickstart/#other-installation-methods","title":"Other Installation Methods","text":"For manual installation, docker installation or detailed configuration options, please refer to the Installation Documentation.
"},{"location":"quickstart/#getting-started","title":"Getting Started","text":" - Run and chat with the llama3.2 model:
gpustack chat llama3.2 \"tell me a joke.\"\n
- Run and generate an image with the stable-diffusion-v3-5-large-turbo model:
Tip
This command downloads the model (~12GB) from Hugging Face. The download time depends on your network speed. Ensure you have enough disk space and VRAM (12GB) to run the model. If you encounter issues, you can skip this step and move to the next one.
gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf \\\n\"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors.\" \\\n--sample-steps 5 --show\n
Once the command completes, the generated image will appear in the default viewer. You can experiment with the prompt and CLI options to customize the output.
- Open
http://myserver
in the browser to access the GPUStack UI. Log in to GPUStack with username admin
and the default password. You can run the following command to get the password for the default setup:
Linux or macOS
cat /var/lib/gpustack/initial_admin_password\n
Windows
Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
- Click
Playground
in the navigation menu. Now you can chat with the LLM in the UI playground.
-
Click API Keys
in the navigation menu, then click the New API Key
button.
-
Fill in the Name
and click the Save
button.
-
Copy the generated API key and save it somewhere safe. Please note that you can only see it once on creation.
-
Now you can use the API key to access the OpenAI-compatible API. For example, use curl as the following:
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"llama3.2\",\n \"messages\": [\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant.\"\n },\n {\n \"role\": \"user\",\n \"content\": \"Hello!\"\n }\n ],\n \"stream\": true\n }'\n
"},{"location":"quickstart/#cleanup","title":"Cleanup","text":"After you complete using the deployed models, you can go to the Models
page in the GPUStack UI and delete the models to free up resources.
"},{"location":"scheduler/","title":"Scheduler","text":""},{"location":"scheduler/#summary","title":"Summary","text":"The scheduler's primary responsibility is to calculate the resources required by models instance and to evaluate and select the optimal workers/GPUs for model instances through a series of strategies. This ensures that model instances can run efficiently. This document provides a detailed overview of the policies and processes used by the scheduler.
"},{"location":"scheduler/#scheduling-process","title":"Scheduling Process","text":""},{"location":"scheduler/#filtering-phase","title":"Filtering Phase","text":"The filtering phase aims to narrow down the available workers or GPUs to those that meet specific criteria. The main policies involved are:
- Label Matching Policy
- Status Policy
- Resource Fit Policy
"},{"location":"scheduler/#label-matching-policy","title":"Label Matching Policy","text":"This policy filters workers based on the label selectors configured for the model. If no label selectors are defined for the model, all workers are considered. Otherwise, the system checks whether the labels of each worker node match the model's label selectors, retaining only those workers that match.
"},{"location":"scheduler/#status-policy","title":"Status Policy","text":"This policy filters workers based on their status, retaining only those that are in a READY state.
"},{"location":"scheduler/#resource-fit-policy","title":"Resource Fit Policy","text":"The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. The goal of this policy is to ensure that model instances can run on the selected nodes without exceeding resource limits. The Resource Fit Policy prioritizes candidates in the following order:
- Single Worker Node, Single GPU Full Offload: Identifies candidates where a single GPU on a single worker can fully offload the model, which usually offers the best performance.
- Single Worker Node, Multiple GPU Full Offload: Identifies candidates where multiple GPUs on a single worker can fully the offload the model.
- Single Worker Node Partial Offload: Identifies candidates on a single worker that can handle a partial offload, used only when partial offloading is allowed.
- Distributed Inference Across Multiple Workers: Identifies candidates where a combination of GPUs across multiple workers can handle full or partial offloading, used only when distributed inference across nodes is permitted.
- Single Worker Node, CPU: When no GPUs are available, the system will use the CPU for inference, identifying candidates where memory resources on a single worker are sufficient.
"},{"location":"scheduler/#scoring-phase","title":"Scoring Phase","text":"The scoring phase evaluates the filtered candidates, scoring them to select the optimal deployment location. The primary strategy involved is:
- Placement Strategy Policy
"},{"location":"scheduler/#placement-strategy-policy","title":"Placement Strategy Policy","text":" - Binpack
This strategy aims to \"pack\" as many model instances as possible into the fewest number of \"bins\" (e.g., Workers/GPUs) to optimize resource utilization. The goal is to minimize the number of bins used while maximizing resource efficiency, ensuring each bin is filled as efficiently as possible without exceeding its capacity. Model instances are placed in the bin with the least remaining space to minimize leftover capacity in each bin.
- Spread
This strategy seeks to distribute multiple model instances across different worker nodes as evenly as possible, improving system fault tolerance and load balancing.
"},{"location":"troubleshooting/","title":"Troubleshooting","text":""},{"location":"troubleshooting/#view-gpustack-logs","title":"View GPUStack Logs","text":"If you installed GPUStack using the installation script, you can view GPUStack logs at the following path:
"},{"location":"troubleshooting/#linux-or-macos","title":"Linux or macOS","text":"/var/log/gpustack.log\n
"},{"location":"troubleshooting/#windows","title":"Windows","text":"\"$env:APPDATA\\gpustack\\log\\gpustack.log\"\n
"},{"location":"troubleshooting/#configure-log-level","title":"Configure Log Level","text":"You can enable the DEBUG log level on gpustack start
by setting the --debug
parameter.
You can configure log level of GPUStack server at runtime by running the following command on the server node:
curl -X PUT http://localhost/debug/log_level -d \"debug\"\n
"},{"location":"troubleshooting/#reset-admin-password","title":"Reset Admin Password","text":"In case you forgot the admin password, you can reset it by running the following command on the server node:
gpustack reset-admin-password\n
"},{"location":"upgrade/","title":"Upgrade","text":"You can upgrade GPUStack using the installation script or by manually installing the desired version of the GPUStack Python package.
Note
When upgrading, upgrade the GPUStack server first, then upgrade the workers.
"},{"location":"upgrade/#upgrade-gpustack-using-the-installation-script","title":"Upgrade GPUStack Using the Installation Script","text":"To upgrade GPUStack from an older version, re-run the installation script using the same configuration options you originally used.
Running the installation script will:
- Install the latest version of the GPUStack Python package.
- Update the system service (systemd, launchd, or Windows) init script to reflect the arguments passed to the installation script.
- Restart the GPUStack service.
"},{"location":"upgrade/#linux-and-macos","title":"Linux and macOS","text":"For example, to upgrade GPUStack to the latest version on a Linux system and macOS:
curl -sfL https://get.gpustack.ai | <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n
To upgrade to a specific version, specify the INSTALL_PACKAGE_SPEC
environment variable similar to the pip install
command:
curl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=gpustack==x.y.z <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n
"},{"location":"upgrade/#windows","title":"Windows","text":"To upgrade GPUStack to the latest version on a Windows system:
$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
To upgrade to a specific version:
$env:INSTALL_PACKAGE_SPEC = gpustack==x.y.z\n$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } <EXISTING_GPUSTACK_ARGS>\"\n
"},{"location":"upgrade/#docker-upgrade","title":"Docker Upgrade","text":"If you installed GPUStack using Docker, upgrade to the a new version by pulling the Docker image with the desired version tag.
For example:
docker pull gpustack/gpustack:vX.Y.Z\n
Then restart the GPUStack service with the new image.
"},{"location":"upgrade/#manual-upgrade","title":"Manual Upgrade","text":"If you install GPUStack manually, upgrade using the common pip
workflow.
For example, to upgrade GPUStack to the latest version:
pip install --upgrade gpustack\n
Then restart the GPUStack service according to your setup.
"},{"location":"cli-reference/chat/","title":"gpustack chat","text":"Chat with a large language model.
gpustack chat model [prompt]\n
"},{"location":"cli-reference/chat/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for chat. prompt The prompt to send to the model. [Optional]"},{"location":"cli-reference/chat/#one-time-chat-with-a-prompt","title":"One-time Chat with a Prompt","text":"If a prompt is provided, it performs a one-time inference. For example:
gpustack chat llama3 \"tell me a joke.\"\n
Example output:
Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!\n
"},{"location":"cli-reference/chat/#interactive-chat","title":"Interactive Chat","text":"If the prompt
argument is not provided, you can chat with the large language model interactively. For example:
gpustack chat llama3\n
Example output:
>tell me a joke.\nHere's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you smile!\n>Do you have a better one?\nHere's another one:\n\nWhy did the scarecrow win an award?\n\n(think about it for a sec...)\n\nBecause he was outstanding in his field!\n\nHope that one stuck with you!\n\nDo you want to hear another one?\n>\\quit\n
"},{"location":"cli-reference/chat/#interactive-commands","title":"Interactive Commands","text":"Followings are available commands in interactive chat:
Commands:\n \\q or \\quit - Quit the chat\n \\c or \\clear - Clear chat context in prompt\n \\? or \\h or \\help - Print this help message\n
"},{"location":"cli-reference/chat/#connect-to-external-gpustack-server","title":"Connect to External GPUStack Server","text":"If you are not running gpustack chat
on the server node, or if you are serving on a custom host or port, you should provide the following environment variables:
Name Description GPUSTACK_SERVER_URL URL of the GPUStack server, e.g., http://myserver
. GPUSTACK_API_KEY GPUStack API key."},{"location":"cli-reference/download-tools/","title":"gpustack download-tools","text":"Download dependency tools, including llama-box, gguf-parser, and fastfetch.
gpustack download-tools [OPTIONS]\n
"},{"location":"cli-reference/download-tools/#configurations","title":"Configurations","text":"Flag Default Description ----tools-download-base-url
value (empty) Base URL to download dependency tools. --save-archive
value (empty) Path to save downloaded tools as a tar archive. --load-archive
value (empty) Path to load downloaded tools from a tar archive, instead of downloading. --system
value Default is the current OS. Operating system to download tools for. Options: linux
, windows
, macos
. --arch
value Default is the current architecture. Architecture to download tools for. Options: amd64
, arm64
. --device
value Default is the current device. Device to download tools for. Options: cuda
, mps
, npu
, musa
, cpu
."},{"location":"cli-reference/draw/","title":"gpustack draw","text":"Generate an image with a diffusion model.
gpustack draw [model] [prompt]\n
"},{"location":"cli-reference/draw/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for image generation. prompt Text prompt to use for image generation. The model
can be either of the following:
- Name of a GPUStack model. You need to create a model in GPUStack before using it here.
- Reference to a Hugging Face GGUF diffusion model in Ollama style. When using this option, the model will be deployed if it is not already available. When no tag is specified, the default
Q4_0
tag is used. Examples:
hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF
hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:FP16
hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf
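As an illustrative invocation combining a model reference with the flags described below (the prompt and output path are placeholders):
gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:FP16 \\\n    \"a cat sitting on a windowsill at sunset\" --size 512x512 --sample-steps 20 --output cat.png\n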
"},{"location":"cli-reference/draw/#configurations","title":"Configurations","text":"Flag Default Description --size
value 512x512
Size of the image to generate, specified as widthxheight
. --sampler
value euler
Sampling method. Options include: euler_a, euler, heun, dpm2, dpm++2s_a, dpm++2m, lcm, etc. --sample-steps
value (Empty) Number of sampling steps. --cfg-scale
value (Empty) Classifier-free guidance scale for balancing prompt adherence and creativity. --seed
value (Empty) Seed for random number generation. Useful for reproducibility. --negative-prompt
value (Empty) Text prompt for what to avoid in the image. --output
value (Empty) Path to save the generated image. --show
False
If True, opens the generated image in the default image viewer. -d
, --debug
False
Enable debug mode."},{"location":"cli-reference/start/","title":"gpustack start","text":"Run GPUStack server or worker.
gpustack start [OPTIONS]\n
"},{"location":"cli-reference/start/#configurations","title":"Configurations","text":""},{"location":"cli-reference/start/#common-options","title":"Common Options","text":"Flag Default Description --config-file
value (empty) Path to the YAML config file. -d
value, --debug
value False
Enable debug mode. The short flag -d is not supported on Windows because it is reserved by PowerShell for CommonParameters. --data-dir
value (empty) Directory to store data. Default is OS specific. --cache-dir
value (empty) Directory to store cache (e.g., model files). Defaults to /cache. -t
value, --token
value Auto-generated. Shared secret used to add a worker. --huggingface-token
value (empty) User Access Token to authenticate to the Hugging Face Hub. Can also be configured via the HF_TOKEN
environment variable."},{"location":"cli-reference/start/#server-options","title":"Server Options","text":"Flag Default Description --host
value 0.0.0.0
Host to bind the server to. --port
value 80
Port to bind the server to. --disable-worker
False
Disable embedded worker. --bootstrap-password
value Auto-generated. Initial password for the default admin user. --database-url
value sqlite:///<data-dir>/database.db
URL of the database. Example: postgresql://user:password@hostname:port/db_name --ssl-keyfile
value (empty) Path to the SSL key file. --ssl-certfile
value (empty) Path to the SSL certificate file. --force-auth-localhost
False
Force authentication for requests originating from localhost (127.0.0.1). When set to True, all requests from localhost will require authentication. --ollama-library-base-url
https://registry.ollama.ai
Base URL for the Ollama library. --disable-update-check
False
Disable update check."},{"location":"cli-reference/start/#worker-options","title":"Worker Options","text":"Flag Default Description -s
value, --server-url
value (empty) Server to connect to. --worker-ip
value (empty) IP address of the worker node. Auto-detected by default. --disable-metrics
False
Disable metrics. --disable-rpc-servers
False
Disable RPC servers. --metrics-port
value 10151
Port to expose metrics. --worker-port
value 10150
Port to bind the worker to. Use a consistent value for all workers. --log-dir
value (empty) Directory to store logs. --system-reserved
value \"{\\\"ram\\\": 2, \\\"vram\\\": 0}\"
The system reserves resources for the worker during scheduling, measured in GiB. By default, 2 GiB of RAM is reserved. Note: '{\"memory\": 2, \"gpu_memory\": 0}' is also supported, but it is deprecated and will be removed in future releases. --tools-download-base-url
value Base URL for downloading dependency tools."},{"location":"cli-reference/start/#available-environment-variables","title":"Available Environment Variables","text":"Most of the options can be set via environment variables. The environment variables are prefixed with GPUSTACK_
and are in uppercase. For example, --data-dir
can be set via the GPUSTACK_DATA_DIR
environment variable.
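For example, the following hypothetical invocation sets the data directory and port through environment variables instead of flags:
GPUSTACK_DATA_DIR=/data/gpustack GPUSTACK_PORT=8080 gpustack start\n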
Below are additional environment variables that can be set:
Flag Description HF_ENDPOINT
Hugging Face Hub endpoint. e.g., https://hf-mirror.com
"},{"location":"cli-reference/start/#config-file","title":"Config File","text":"You can configure start options using a YAML-format config file when starting GPUStack server or worker. Here is a complete example:
# Common Options\ndebug: false\ndata_dir: /path/to/data_dir\ncache_dir: /path/to/cache_dir\ntoken: mytoken\n\n# Server Options\nhost: 0.0.0.0\nport: 80\ndisable_worker: false\ndatabase_url: postgresql://user:password@hostname:port/db_name\nssl_keyfile: /path/to/keyfile\nssl_certfile: /path/to/certfile\nforce_auth_localhost: false\nbootstrap_password: myadminpassword\nollama_library_base_url: https://registry.mycompany.com\ndisable_update_check: false\n\n# Worker Options\nserver_url: http://myserver\nworker_ip: 192.168.1.101\ndisable_metrics: false\ndisable_rpc_servers: false\nmetrics_port: 10151\nworker_port: 10150\nlog_dir: /path/to/log_dir\nsystem_reserved:\n ram: 2\n vram: 0\ntools_download_base_url: https://mirror.mycompany.com\n
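To use such a file, pass it with the --config-file option described in Common Options, for example:
gpustack start --config-file /path/to/config.yaml\n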
"},{"location":"installation/air-gapped-installation/","title":"Air-Gapped Installation","text":"You can install GPUStack in an air-gapped environment. An air-gapped environment refers to a setup where GPUStack will be installed offline, behind a firewall, or behind a proxy.
The following methods are available for installing GPUStack in an air-gapped environment:
- Docker Installation
- Manual Installation
"},{"location":"installation/air-gapped-installation/#docker-installation","title":"Docker Installation","text":"When running GPUStack with Docker, it works out of the box in an air-gapped environment as long as the Docker images are available. To do this, follow these steps:
- Pull GPUStack Docker images in an online environment.
- Publish the Docker images to a private registry, as shown in the example below.
- Refer to the Docker Installation guide to run GPUStack using Docker.
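The first two steps might look like the following sketch, assuming a private registry at registry.mycompany.com (replace the registry and version tag with your own):
docker pull gpustack/gpustack:vX.Y.Z\ndocker tag gpustack/gpustack:vX.Y.Z registry.mycompany.com/gpustack/gpustack:vX.Y.Z\ndocker push registry.mycompany.com/gpustack/gpustack:vX.Y.Z\n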
"},{"location":"installation/air-gapped-installation/#manual-installation","title":"Manual Installation","text":"For manual installation, you need to prepare the required packages and tools in an online environment and then transfer them to the air-gapped environment.
"},{"location":"installation/air-gapped-installation/#prerequisites","title":"Prerequisites","text":"Set up an online environment identical to the air-gapped environment, including OS, architecture, and Python version.
"},{"location":"installation/air-gapped-installation/#step-1-download-the-required-packages","title":"Step 1: Download the Required Packages","text":"Run the following commands in an online environment:
# On Windows (PowerShell):\n# $PACKAGE_SPEC = \"gpustack\"\n\n# Optional: To include extra dependencies (vllm, audio, all) or install a specific version\n# PACKAGE_SPEC=\"gpustack[all]\"\n# PACKAGE_SPEC=\"gpustack==0.4.0\"\nPACKAGE_SPEC=\"gpustack\"\n\n# Download all required packages\npip wheel $PACKAGE_SPEC -w gpustack_offline_packages\n\n# Install GPUStack to access its CLI\npip install gpustack\n\n# Download dependency tools and save them as an archive\ngpustack download-tools --save-archive gpustack_offline_tools.tar.gz\n
Optional: Additional Dependencies for macOS.
# Deploying the speech-to-text CosyVoice model on macOS requires additional dependencies.\nbrew install openfst\nCPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include\nLIBRARY_PATH=$(brew --prefix openfst)/lib\n\nAUDIO_DEPENDENCY_PACKAGE_SPEC=\"wetextprocessing\"\npip wheel $AUDIO_DEPENDENCY_PACKAGE_SPEC -w gpustack_audio_dependency_offline_packages\nmv gpustack_audio_dependency_offline_packages/* gpustack_offline_packages/ && rm -rf gpustack_audio_dependency_offline_packages\n
Note
This instruction assumes that the online environment uses the same GPU type as the air-gapped environment. If the GPU types differ, use the --device
flag to specify the device type for the air-gapped environment. Refer to the download-tools command for more information.
"},{"location":"installation/air-gapped-installation/#step-2-transfer-the-packages","title":"Step 2: Transfer the Packages","text":"Transfer the following files from the online environment to the air-gapped environment.
gpustack_offline_packages
directory. gpustack_offline_tools.tar.gz
file.
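For example, you could copy both to the air-gapped host with scp (the hostname and destination path are placeholders):
scp -r gpustack_offline_packages gpustack_offline_tools.tar.gz user@airgapped-host:/opt/gpustack-offline/\n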
"},{"location":"installation/air-gapped-installation/#step-3-install-gpustack","title":"Step 3: Install GPUStack","text":"In the air-gapped environment, run the following commands:
# Install GPUStack from the downloaded packages\npip install --no-index --find-links=gpustack_offline_packages gpustack\n\n# Load and apply the pre-downloaded tools archive\ngpustack download-tools --load-archive gpustack_offline_tools.tar.gz\n
Optional: Additional Dependencies for macOS.
# Install the additional dependencies for speech-to-text CosyVoice model on macOS.\nbrew install openfst\n\npip install --no-index --find-links=gpustack_offline_packages wetextprocessing\n
Now you can run GPUStack by following the instructions in the Manual Installation guide.
"},{"location":"installation/docker-installation/","title":"Docker Installation","text":"You can use the official Docker image to run GPUStack in a container. Installation using docker is supported on:
- Linux with Nvidia GPUs
"},{"location":"installation/docker-installation/#prerequisites","title":"Prerequisites","text":" - Docker
- Nvidia Container Toolkit
"},{"location":"installation/docker-installation/#run-gpustack-with-docker","title":"Run GPUStack with Docker","text":"Run the following command to start the GPUStack server:
docker run -d --gpus all -p 80:80 --ipc=host \\\n -v gpustack-data:/var/lib/gpustack gpustack/gpustack\n
Note
You can either use the --ipc=host
flag or --shm-size
flag to allow the container to access the host\u2019s shared memory. Shared memory is used by vLLM and PyTorch to share data between processes under the hood, particularly for tensor parallel inference.
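For example, the server can be started with an explicit shared-memory size instead of --ipc=host (the 1g value is only illustrative; size it to your workload):
docker run -d --gpus all -p 80:80 --shm-size=1g \\\n    -v gpustack-data:/var/lib/gpustack gpustack/gpustack\n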
You can set additional flags for the gpustack start
command by appending them to the docker run command.
For example, to start a GPUStack worker:
docker run -d --gpus all --ipc=host --network=host \\\n gpustack/gpustack --server-url http://myserver --token mytoken\n
Note
The --network=host
flag is used to ensure that the server is accessible to the worker and inference services running on it. Alternatively, you can set --worker-ip <host-ip> -p 10150:10150 -p 40000-41024:40000-41024
to expose relevant ports.
For configuration details, please refer to the CLI Reference.
"},{"location":"installation/docker-installation/#run-gpustack-with-docker-compose","title":"Run GPUStack with Docker Compose","text":"Get the docker-compose file from GPUStack repository, run the following command to start the GPUStack server:
docker-compose up -d\n
You can update the docker-compose.yml
file to customize the command while starting a GPUStack worker.
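As a sketch, a customized Compose service for a worker could look like the following. This is an illustrative file rather than the one shipped in the repository; adjust the server URL, token, and GPU settings to your environment:
services:\n  gpustack-worker:\n    image: gpustack/gpustack\n    command: --server-url http://myserver --token mytoken\n    ipc: host\n    network_mode: host\n    volumes:\n      - gpustack-data:/var/lib/gpustack\n    deploy:\n      resources:\n        reservations:\n          devices:\n            - driver: nvidia\n              count: all\n              capabilities: [gpu]\nvolumes:\n  gpustack-data:\n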
"},{"location":"installation/docker-installation/#build-your-own-docker-image","title":"Build Your Own Docker Image","text":"The official Docker image is built with CUDA 12.4. If you want to use a different version of CUDA, you can build your own Docker image.
# Example Dockerfile\nARG CUDA_VERSION=12.4.1\n\nFROM nvidia/cuda:$CUDA_VERSION-cudnn-runtime-ubuntu22.04\n\nENV DEBIAN_FRONTEND=noninteractive\n\nRUN apt-get update && apt-get install -y \\\n wget \\\n tzdata \\\n python3 \\\n python3-pip \\\n && rm -rf /var/lib/apt/lists/*\n\n\nRUN pip3 install gpustack[all] && \\\n pip3 cache purge\n\nENTRYPOINT [ \"gpustack\", \"start\" ]\n
Run the following command to build the Docker image:
docker build -t my/gpustack --build-arg CUDA_VERSION=12.0.0 .\n
"},{"location":"installation/installation-requirements/","title":"Installation Requirements","text":"This page describes the software and networking requirements for the nodes where GPUStack will be installed.
"},{"location":"installation/installation-requirements/#python-requirements","title":"Python Requirements","text":"GPUStack requires Python version 3.10 to 3.12.
"},{"location":"installation/installation-requirements/#operating-system-requirements","title":"Operating System Requirements","text":"GPUStack is supported on the following operating systems:
- macOS
- Windows
- Linux
GPUStack has been tested and verified to work on the following operating systems:
OS Versions Windows 10, 11 Ubuntu >= 20.04 Debian >= 11 RHEL >= 8 Rocky >= 8 Fedora >= 36 OpenSUSE >= 15.3 (leap) OpenEuler >= 22.03 Note
The installation of GPUStack worker on a Linux system requires that the GLIBC version be 2.29 or higher. If your system uses a lower GLIBC version, consider using the Docker Installation method as an alternative.
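You can check the GLIBC version on a worker node with:
ldd --version\n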
"},{"location":"installation/installation-requirements/#supported-architectures","title":"Supported Architectures","text":"GPUStack supports both AMD64 and ARM64 architectures, with the following notes:
- On Linux and macOS, when using Python versions below 3.12, ensure that the installed Python distribution corresponds to your system architecture.
- On Windows, please use the AMD64 distribution of Python, as wheel packages for certain dependencies are unavailable for ARM64. If you use tools like
conda
, this will be handled automatically, as conda installs the AMD64 distribution by default.
"},{"location":"installation/installation-requirements/#accelerator-runtime-requirements","title":"Accelerator Runtime Requirements","text":"GPUStack supports the following accelerators:
- Apple Metal (M-series chips)
- NVIDIA CUDA (Compute Capability 6.0 and above)
- Ascend CANN
- Moore Threads MUSA
Ensure all necessary drivers and libraries are installed on the system prior to installing GPUStack.
"},{"location":"installation/installation-requirements/#nvidia-cuda","title":"NVIDIA CUDA","text":"To use NVIDIA CUDA as an accelerator, ensure the following components are installed:
- NVIDIA CUDA Toolkit
- NVIDIA cuBLAS (Optional, required for audio models)
- NVIDIA cuDNN (Optional, required for audio models)
- NVIDIA Container Toolkit (Optional, required for Docker installation)
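Once these components are installed, a quick sanity check (provided by the NVIDIA driver and CUDA Toolkit, not by GPUStack) is:
nvidia-smi\nnvcc --version\n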
"},{"location":"installation/installation-requirements/#ascend-cann","title":"Ascend CANN","text":"For Ascend CANN as an accelerator, ensure the following components are installed:
- Ascend NPU driver & firmware
- Ascend CANN Toolkit & kernels
"},{"location":"installation/installation-requirements/#musa","title":"MUSA","text":"To use Moore Threads MUSA as an accelerator, ensure the following components are installed:
- MUSA SDK
- MT Container Toolkits (Optional, required for Docker installation)
"},{"location":"installation/installation-requirements/#networking-requirements","title":"Networking Requirements","text":""},{"location":"installation/installation-requirements/#connectivity-requirements","title":"Connectivity Requirements","text":"The following network connectivity is required to ensure GPUStack functions properly:
Server-to-Worker: The server must be able to reach the workers for proxying inference requests.
Worker-to-Server: Workers must be able to reach the server to register themselves and send updates.
Worker-to-Worker: Necessary for distributed inference across multiple workers.
"},{"location":"installation/installation-requirements/#port-requirements","title":"Port Requirements","text":"GPUStack uses the following ports for communication:
Server Ports
Port Description TCP 80 Default port for the GPUStack UI and API endpoints TCP 443 Default port for the GPUStack UI and API endpoints (when TLS is enabled) Worker Ports
Port Description TCP 10150 Default port for the GPUStack worker TCP 10151 Default port for exposing metrics TCP 40000-41024 Port range allocated for inference services"},{"location":"installation/installation-script/","title":"Installation Script","text":""},{"location":"installation/installation-script/#linux-and-macos","title":"Linux and macOS","text":"You can use the installation script available at https://get.gpustack.ai
to install GPUStack as a service on systemd and launchd based systems.
You can set additional environment variables and CLI flags when running the script. The following are examples running the installation script with different configurations:
# Run server.\ncurl -sfL https://get.gpustack.ai | sh -s -\n\n# Run server without the embedded worker.\ncurl -sfL https://get.gpustack.ai | sh -s - --disable-worker\n\n# Run server with TLS.\ncurl -sfL https://get.gpustack.ai | sh -s - --ssl-keyfile /path/to/keyfile --ssl-certfile /path/to/certfile\n\n# Run server with external postgresql database.\ncurl -sfL https://get.gpustack.ai | sh -s - --database-url \"postgresql://username:password@host:port/database_name\"\n\n# Run worker with specified IP.\ncurl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken --worker-ip 192.168.1.100\n\n# Install with a custom index URL.\ncurl -sfL https://get.gpustack.ai | INSTALL_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple sh -s -\n\n# Install a custom wheel package other than releases from pypi.org.\ncurl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=https://repo.mycompany.com/my-gpustack.whl sh -s -\n\n# Install a specific version with extra audio dependencies.\ncurl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=gpustack[audio]==0.4.0 sh -s -\n
"},{"location":"installation/installation-script/#windows","title":"Windows","text":"You can use the installation script available at https://get.gpustack.ai
to install GPUStack as a service on Windows Service Manager.
You can set additional environment variables and CLI flags when running the script. The following are examples running the installation script with different configurations:
# Run server.\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Run server without the embedded worker.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --disable-worker\"\n\n# Run server with TLS.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --ssl-keyfile 'C:\\path\\to\\keyfile' --ssl-certfile 'C:\\path\\to\\certfile'\"\n\n\n# Run server with external postgresql database.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --database-url 'postgresql://username:password@host:port/database_name'\"\n\n# Run worker with specified IP.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --worker-ip '192.168.1.100'\"\n\n# Run worker with customized reserved resources.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --system-reserved '{\"\"ram\"\":5, \"\"vram\"\":5}'\"\n\n# Install with a custom index URL.\n$env:INSTALL_INDEX_URL = \"https://pypi.tuna.tsinghua.edu.cn/simple\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Install a custom wheel package other than releases from pypi.org.\n$env:INSTALL_PACKAGE_SPEC = \"https://repo.mycompany.com/my-gpustack.whl\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Install a specific version with extra audio dependencies.\n$env:INSTALL_PACKAGE_SPEC = \"gpustack[audio]==0.4.0\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
Warning
Avoid using PowerShell ISE as it is not compatible with the installation script.
"},{"location":"installation/installation-script/#available-environment-variables-for-the-installation-script","title":"Available Environment Variables for the Installation Script","text":"Name Default Description INSTALL_INDEX_URL
(empty) Base URL of the Python Package Index. INSTALL_PACKAGE_SPEC
gpustack[all]
or gpustack[audio]
The package spec to install. The install script will automatically decide based on the platform. It supports PYPI package names, URLs, and local paths. See the pip install documentation for details. gpustack[all]
: With all inference backends: llama-box, vllm, vox-box.gpustack[vllm]
: With inference backends: llama-box, vllm.gpustack[audio]
: With inference backends: llama-box, vox-box.
INSTALL_SKIP_POST_CHECK
(empty) If set to 1, the installation script will skip the post-installation check."},{"location":"installation/installation-script/#set-environment-variables-for-the-gpustack-service","title":"Set Environment Variables for the GPUStack Service","text":"You can set environment variables for the GPUStack service in an environment file located at:
- Linux and macOS:
/etc/default/gpustack
- Windows:
$env:APPDATA\\gpustack\\gpustack.env
The following is an example of the content of the file:
HF_TOKEN=\"mytoken\"\nHF_ENDPOINT=\"https://my-hf-endpoint\"\n
Note
Unlike systemd, launchd and Windows services do not natively support reading environment variables from a file. Configuration via the environment file is implemented by the installation script. It reads the file and applies the variables to the service configuration. After modifying the environment file on Windows and macOS, you need to re-run the installation script to apply changes to the GPUStack service.
"},{"location":"installation/installation-script/#available-cli-flags","title":"Available CLI Flags","text":"The appended CLI flags of the installation script are passed directly as flags for the gpustack start
command. You can refer to the CLI Reference for details.
"},{"location":"installation/installation-script/#install-server","title":"Install Server","text":"To set up the GPUStack server (the management node), install GPUStack without the --server-url
flag. By default, the GPUStack server includes an embedded worker. To disable this embedded worker on the server, use the --disable-worker
flag.
"},{"location":"installation/installation-script/#install-worker","title":"Install Worker","text":"To form a cluster, you can add GPUStack workers on additional nodes. Install GPUStack with the --server-url
flag to specify the server's address and the --token
flag for worker authentication.
Examples are as follows:
"},{"location":"installation/installation-script/#linux-or-macos","title":"Linux or macOS","text":"curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n
In the default setup, you can run the following on the server node to get the token used for adding workers:
cat /var/lib/gpustack/token\n
"},{"location":"installation/installation-script/#windows_1","title":"Windows","text":"Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url http://myserver --token mytoken\"\n
In the default setup, you can run the following on the server node to get the token used for adding workers:
Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n
"},{"location":"installation/manual-installation/","title":"Manual Installation","text":""},{"location":"installation/manual-installation/#prerequites","title":"Prerequites:","text":"Install Python version 3.10 to 3.12.
"},{"location":"installation/manual-installation/#install-gpustack-cli","title":"Install GPUStack CLI","text":"Run the following to install GPUStack:
# You can add extra dependencies, options are \"vllm\", \"audio\" and \"all\".\n# e.g., gpustack[all]\npip install gpustack\n
To verify, run:
gpustack version\n
"},{"location":"installation/manual-installation/#run-gpustack","title":"Run GPUStack","text":"Run the following command to start the GPUStack server:
gpustack start\n
By default, GPUStack uses /var/lib/gpustack
as the data directory so you need sudo
or proper permission for that. You can also set a custom data directory by running:
gpustack start --data-dir mypath\n
"},{"location":"installation/manual-installation/#run-gpustack-as-a-system-service","title":"Run GPUStack as a System Service","text":"A recommended way is to run GPUStack as a startup service. For example, using systemd:
Create a service file in /etc/systemd/system/gpustack.service
:
[Unit]\nDescription=GPUStack Service\nWants=network-online.target\nAfter=network-online.target\n\n[Service]\nEnvironmentFile=-/etc/default/%N\nExecStart=gpustack start\nRestart=always\nRestartSec=3\nStandardOutput=append:/var/log/gpustack.log\nStandardError=append:/var/log/gpustack.log\n\n[Install]\nWantedBy=multi-user.target\n
Then start GPUStack:
systemctl daemon-reload\nsystemctl enable gpustack\nsystemctl start gpustack\n
"},{"location":"installation/uninstallation/","title":"Uninstallation","text":""},{"location":"installation/uninstallation/#uninstallation-script","title":"Uninstallation Script","text":"Warning
The uninstallation script deletes the data in the local datastore (SQLite), configuration, model cache, and all of the scripts and CLI tools. It does not remove any data from external datastores.
If you installed GPUStack using the installation script, a script to uninstall GPUStack was generated during installation.
"},{"location":"installation/uninstallation/#linux-or-macos","title":"Linux or macOS","text":"Run the following command to uninstall GPUStack:
sudo /var/lib/gpustack/uninstall.sh\n
"},{"location":"installation/uninstallation/#windows","title":"Windows","text":"Run the following command in PowerShell to uninstall GPUStack:
Set-ExecutionPolicy Bypass -Scope Process -Force; & \"$env:APPDATA\\gpustack\\uninstall.ps1\"\n
"},{"location":"installation/uninstallation/#manual-uninstallation","title":"Manual Uninstallation","text":"If you install GPUStack manually, the followings are example commands to uninstall GPUStack. You can modify according to your setup:
# Stop and remove the service.\nsystemctl stop gpustack.service\nrm /etc/systemd/system/gpustack.service\nsystemctl daemon-reload\n# Uninstall the CLI.\npip uninstall gpustack\n# Remove the data directory.\nrm -rf /var/lib/gpustack\n
"},{"location":"tutorials/creating-text-embeddings/","title":"Creating Text Embeddings","text":"Text embeddings are numerical representations of text that capture semantic meaning, enabling machines to understand relationships and similarities between different pieces of text. In essence, they transform text into vectors in a continuous space, where texts with similar meanings are positioned closer together. Text embeddings are widely used in applications such as natural language processing, information retrieval, and recommendation systems.
In this tutorial, we will demonstrate how to deploy embedding models in GPUStack and generate text embeddings using the deployed models.
"},{"location":"tutorials/creating-text-embeddings/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
- Access to Hugging Face for downloading the model files.
"},{"location":"tutorials/creating-text-embeddings/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Enable the
GGUF
checkbox to filter models by GGUF format. - Use the search bar in the top left to search for the model name
CompendiumLabs/bge-small-en-v1.5-gguf
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/creating-text-embeddings/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"We will use the GPUStack API to generate text embeddings, and an API key is required:
- Navigate to the
API Keys
page in the GPUStack UI. - Click the
New API Key
button. - Enter a name for the API key and click the
Save
button. - Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/creating-text-embeddings/#step-3-generate-text-embeddings","title":"Step 3: Generate Text Embeddings","text":"With the model deployed and an API key, you can generate text embeddings via the GPUStack API. Here is an example script using curl
:
export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1-openai/embeddings \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"input\": \"The food was delicious and the waiter...\",\n \"model\": \"bge-small-en-v1.5\",\n \"encoding_format\": \"float\"\n }'\n
Replace <your-server-url>
with the URL of your GPUStack server and <your-api-key>
with the API key you generated in the previous step.
Example response:
{\n \"data\": [\n {\n \"embedding\": [\n -0.012189436703920364, 0.016934078186750412, 0.003965042531490326,\n -0.03453584015369415, -0.07623119652271271, -0.007116147316992283,\n 0.11278388649225235, 0.019714849069714546, 0.010370955802500248,\n -0.04219457507133484, -0.029902394860982895, 0.01122555136680603,\n 0.022912170737981796, 0.031186765059828758, 0.006303929258137941,\n # ... additional values\n ],\n \"index\": 0,\n \"object\": \"embedding\"\n }\n ],\n \"model\": \"bge-small-en-v1.5\",\n \"object\": \"list\",\n \"usage\": { \"prompt_tokens\": 12, \"total_tokens\": 12 }\n}\n
"},{"location":"tutorials/inference-on-cpus/","title":"Inference on CPUs","text":"GPUStack supports inference on CPUs, offering flexibility when GPU resources are limited or when model sizes exceed available GPU memory. The following CPU inference modes are available:
- CPU+GPU Hybrid Inference: Enables partial acceleration by offloading portions of large models to the CPU when VRAM capacity is insufficient.
- Full CPU Inference: Operates entirely on CPU when no GPU resources are available.
Note
CPU inference is supported when using the llama-box (llama.cpp) backend.
To deploy a model with CPU offloading, enable the Allow CPU Offloading
option in the deployment configuration (this setting is enabled by default).
After deployment, you can view the number of model layers offloaded to the CPU.
"},{"location":"tutorials/inference-with-function-calling/","title":"Inference with Function Calling","text":"Function calling allows you to connect models to external tools and systems. This is useful for many things such as empowering AI assistants with capabilities, or building deep integrations between your applications and the models.
In this tutorial, you\u2019ll learn how to set up and use function calling within GPUStack to extend your AI\u2019s capabilities.
Note
- Function calling is supported in the vLLM inference backend.
- Function calling is essentially achieved through prompt engineering, requiring models to be trained with internalized templates to enable this capability. Therefore, not all LLMs support function calling.
"},{"location":"tutorials/inference-with-function-calling/#prerequisites","title":"Prerequisites","text":"Before proceeding, ensure the following:
- GPUStack is installed and running.
- A Linux worker node with a GPU is available. We'll use Qwen2.5-7B-Instruct as the model for this tutorial. The model requires a GPU with at least 18GB VRAM.
- Access to Hugging Face for downloading the model files.
"},{"location":"tutorials/inference-with-function-calling/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":" - Navigate to the
Models
page in the GPUStack UI and click the Deploy Model
button. In the dropdown, select Hugging Face
as the source for your model. - Use the search bar to find the
Qwen/Qwen2.5-7B-Instruct
model. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button and add the following parameters:
--enable-auto-tool-choice
--tool-call-parser=hermes
- Click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/inference-with-function-calling/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:
- Navigate to the
API Keys
page in the GPUStack UI. - Click the
New API Key
button. - Enter a name for the API key and click the
Save
button. - Copy the generated API key for later use.
"},{"location":"tutorials/inference-with-function-calling/#step-3-do-inference","title":"Step 3: Do Inference","text":"With the model deployed and an API key, you can call the model via the GPUStack API. Here is an example script using curl
(replace <your-server-url>
with your GPUStack server URL and <your-api-key>
with the API key generated in the previous step):
export GPUSTACK_SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $GPUSTACK_SERVER_URL/v1-openai/chat/completions \\\n-H \"Content-Type: application/json\" \\\n-H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n-d '{\n \"model\": \"qwen2.5-7b-instruct\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"What'\\''s the weather like in Boston today?\"\n }\n ],\n \"tools\": [\n {\n \"type\": \"function\",\n \"function\": {\n \"name\": \"get_current_weather\",\n \"description\": \"Get the current weather in a given location\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"location\": {\n \"type\": \"string\",\n \"description\": \"The city and state, e.g. San Francisco, CA\"\n },\n \"unit\": {\n \"type\": \"string\",\n \"enum\": [\"celsius\", \"fahrenheit\"]\n }\n },\n \"required\": [\"location\"]\n }\n }\n }\n ],\n \"tool_choice\": \"auto\"\n}'\n
Example response:
{\n \"model\": \"qwen2.5-7b-instruct\",\n \"choices\": [\n {\n \"index\": 0,\n \"message\": {\n \"role\": \"assistant\",\n \"content\": null,\n \"tool_calls\": [\n {\n \"id\": \"chatcmpl-tool-b99d32848b324eaea4bac5a5830d00b8\",\n \"type\": \"function\",\n \"function\": {\n \"name\": \"get_current_weather\",\n \"arguments\": \"{\\\"location\\\": \\\"Boston, MA\\\", \\\"unit\\\": \\\"fahrenheit\\\"}\"\n }\n }\n ]\n },\n \"finish_reason\": \"tool_calls\"\n }\n ],\n \"usage\": {\n \"prompt_tokens\": 212,\n \"total_tokens\": 242,\n \"completion_tokens\": 30\n }\n}\n
"},{"location":"tutorials/performing-distributed-inference-across-workers/","title":"Performing Distributed Inference Across Workers","text":"This tutorial will guide you through the process of configuring and running distributed inference across multiple workers using GPUStack. Distributed inference allows you to handle larger language models by distributing the computational workload among multiple workers. This is particularly useful when individual workers do not have sufficient resources, such as VRAM, to run the entire model independently.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#prerequisites","title":"Prerequisites","text":"Before proceeding, ensure the following:
- GPUStack is installed and running. Refer to the Setting Up a Multi-node GPUStack Cluster tutorial if needed.
- Access to Hugging Face for downloading the model files.
In this tutorial, we\u2019ll assume a cluster with two nodes, each equipped with an NVIDIA P40 GPU (22GB VRAM), as shown in the following image:
We aim to run a large language model that requires more VRAM than a single worker can provide. For this tutorial, we\u2019ll use the Qwen/Qwen2.5-72B-Instruct
model with the q2_k
quantization format. The required resources for running this model can be estimated using the gguf-parser tool:
$ gguf-parser --hf-repo Qwen/Qwen2.5-72B-Instruct-GGUF --hf-file qwen2.5-72b-instruct-q2_k-00001-of-00007.gguf --ctx-size=8192 --in-short --skip-architecture --skip-metadata --skip-tokenizer\n\n+--------------------------------------------------------------------------------------+\n| ESTIMATE |\n+----------------------------------------------+---------------------------------------+\n| RAM | VRAM 0 |\n+--------------------+------------+------------+----------------+----------+-----------+\n| LAYERS (I + T + O) | UMA | NONUMA | LAYERS (T + O) | UMA | NONUMA |\n+--------------------+------------+------------+----------------+----------+-----------+\n| 1 + 0 + 0 | 243.89 MiB | 393.89 MiB | 80 + 1 | 2.50 GiB | 28.92 GiB |\n+--------------------+------------+------------+----------------+----------+-----------+\n
From the output, we can see that the estimated VRAM requirement for this model exceeds the 22GB VRAM available on each worker node. Thus, we need to distribute the inference across multiple workers to successfully run the model.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"Follow these steps to deploy the model from Hugging Face, enabling distributed inference:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Enable the
GGUF
checkbox to filter models by GGUF format. - Use the search bar in the top left to search for the model name
Qwen/Qwen2.5-72B-Instruct-GGUF
. - In the
Available Files
section, select the q2_k
quantization format. - Expand the
Advanced
section and scroll down. Disable the Allow CPU Offloading
option and verify that the Allow Distributed Inference Across Workers
option is enabled(this is enabled by default). GPUStack will evaluate the available resources in the cluster and run the model in a distributed manner if required. - Click the
Save
button to deploy the model.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-2-verify-the-model-deployment","title":"Step 2: Verify the Model Deployment","text":"Once the model is deployed, verify the deployment on the Models
page, where you can view details about how the model is running across multiple workers.
You can also check worker and GPU resource usage by navigating to the Resources
page.
Finally, go to the Playground
page to interact with the model and verify that everything is functioning correctly.
"},{"location":"tutorials/performing-distributed-inference-across-workers/#conclusion","title":"Conclusion","text":"Congratulations! You have successfully configured and run distributed inference across multiple workers using GPUStack.
"},{"location":"tutorials/running-inference-with-ascend-npus/","title":"Running Inference With Ascend NPUs","text":"GPUStack supports running inference on Ascend NPUs. This tutorial will guide you through the configuration steps.
"},{"location":"tutorials/running-inference-with-ascend-npus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Status Verified Linux Support Ubuntu 20.04 Device Status Verified Ascend 910 Support Ascend 910B"},{"location":"tutorials/running-inference-with-ascend-npus/#setup-steps","title":"Setup Steps","text":""},{"location":"tutorials/running-inference-with-ascend-npus/#install-ascend-packages","title":"Install Ascend packages","text":" - Download Ascend packages
Choose the packages according to your system, hardware and GPUStack is compatible with CANN 8.x from resources download center(links below).
Download the driver and firmware from here.
Package Name Description Ascend-hdk-{chiptype}-npu-driver{version}_linux-{arch}.run Ascend Driver (run format) Ascend-hdk-{chiptype}-npu-firmware{version}.run Ascend (run format) Download the toolkit and kernels from here.
Package Name Description Ascend-cann-toolkit_{version}_linux-{arch}.run CANN Toolkit (run format) Ascend-cann-kernels-{chiptype}{version}_linux-{arch}.run CANN Kernels (run format) - Create the user and group for running
sudo groupadd -g HwHiAiUser\nsudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash\nsudo usermod -aG HwHiAiUser $USER\n
- Install driver
sudo chmod +x Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run\n# Driver installation, default installation path: \"/usr/local/Ascend\"\nsudo sh Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all\n
If you see the following message, the firmware installation is complete:
Driver package installed successfully!\n
- Verify successful driver installation
After the driver successful installation, run the npu-smi info
command to check if the driver was installed correctly.
$npu-smi info\n+------------------------------------------------------------------------------------------------+\n| npu-smi 23.0.1 Version: 23.0.1 |\n+---------------------------+---------------+----------------------------------------------------+\n| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|\n| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |\n+===========================+===============+====================================================+\n| 4 910B3 | OK | 93.6 40 0 / 0 |\n| 0 | 0000:01:00.0 | 0 0 / 0 3161 / 65536 |\n+===========================+===============+====================================================+\n+---------------------------+---------------+----------------------------------------------------+\n| NPU Chip | Process id | Process name | Process memory(MB) |\n+===========================+===============+====================================================+\n| No running processes found in NPU 4 |\n+===========================+===============+====================================================+\n
- Install firmware
sudo chmod +x Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run\nsudo sh Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run --full\n
If you see the following message, the firmware installation is complete:
Firmware package installed successfully!\n
- Install toolkit and kernels
As an example for Ubuntu, adapt commands according to your system.
Check for dependencies to ensure Python, GCC, and other required tools are installed.
gcc --version\ng++ --version\nmake --version\ncmake --version\ndpkg -l zlib1g| grep zlib1g| grep ii\ndpkg -l zlib1g-dev| grep zlib1g-dev| grep ii\ndpkg -l libsqlite3-dev| grep libsqlite3-dev| grep ii\ndpkg -l openssl| grep openssl| grep ii\ndpkg -l libssl-dev| grep libssl-dev| grep ii\ndpkg -l libffi-dev| grep libffi-dev| grep ii\ndpkg -l libbz2-dev| grep libbz2-dev| grep ii\ndpkg -l libxslt1-dev| grep libxslt1-dev| grep ii\ndpkg -l unzip| grep unzip| grep ii\ndpkg -l pciutils| grep pciutils| grep ii\ndpkg -l net-tools| grep net-tools| grep ii\ndpkg -l libblas-dev| grep libblas-dev| grep ii\ndpkg -l gfortran| grep gfortran| grep ii\ndpkg -l libblas3| grep libblas3| grep ii\n
If the commands return messages showing missing packages, install them as follows (adjust the command if only specific packages are missing):
sudo apt-get install -y gcc g++ make cmake zlib1g zlib1g-dev openssl libsqlite3-dev libssl-dev libffi-dev libbz2-dev libxslt1-dev unzip pciutils net-tools libblas-dev gfortran libblas3\n
Install Python dependencies:
pip3 install --upgrade pip\npip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions\n
Install the toolkit and kernels:
chmod +x Ascend-cann-toolkit_{vesion}_linux-{arch}.run\nchmod +x Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run\n\nsh Ascend-cann-toolkit_{vesion}_linux-{arch}.run --install\nsh Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run --install\n
Once installation completes, you should see a success message like this:
xxx install success\n
- Configure environment variables
echo \"source ~/Ascend/ascend-toolkit/set_env.sh\" >> ~/.bashrc\nsource ~/.bashrc\n
For more details, refer to the Ascend Documentation.
"},{"location":"tutorials/running-inference-with-ascend-npus/#installing-gpustack","title":"Installing GPUStack","text":"Once your environment is ready, you can install GPUStack following the installation guide.
Once installed, you should see that GPUStack successfully recognizes the Ascend Device in the resources page.
"},{"location":"tutorials/running-inference-with-ascend-npus/#running-inference","title":"Running Inference","text":"After installation, you can deploy models and run inference. Refer to the model management for usage details.
The Ascend NPU supports inference through the llama-box (llama.cpp) backend. For supported models, see the llama.cpp Ascend NPU model supports.
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/","title":"Running Inference With Moore Threads GPUs","text":"GPUStack supports running inference on Moore Threads GPUs. This tutorial provides a comprehensive guide to configuring your system for optimal performance.
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Architecture Status Verified Linux x86_64 Support Ubuntu 20.04/22.04 Device Status Verified MTT S80 Support Yes MTT S3000 Support Yes MTT S4000 Support Yes"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#prerequisites","title":"Prerequisites","text":"The following instructions are applicable for Ubuntu 20.04/22.04
systems with x86_64
architecture.
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#configure-the-container-runtime","title":"Configure the Container Runtime","text":"Follow these links to install and configure the container runtime:
- Install Docker: Docker Installation Guide
- Install the latest drivers for MTT S80/S3000/S4000 (currently rc3.1.0): MUSA SDK Download
- Install the MT Container Toolkits (currently v1.9.0): MT CloudNative Toolkits Download
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#verify-container-runtime-configuration","title":"Verify Container Runtime Configuration","text":"Ensure the output shows the default runtime as mthreads
.
$ (cd /usr/bin/musa && sudo ./docker setup $PWD)\n$ docker info | grep mthreads\n Runtimes: mthreads mthreads-experimental runc\n Default Runtime: mthreads\n
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#installing-gpustack","title":"Installing GPUStack","text":"To set up an isolated environment for GPUStack, we recommend using Docker.
docker run -d --name gpustack-musa -p 9009:80 --ipc=host -v gpustack-data:/var/lib/gpustack \\\n gpustack/gpustack:main-musa\n
This command will:
- Start a container with the GPUStack image.
- Expose the GPUStack web interface on port
9009
. - Mount the
gpustack-data
volume to store the GPUStack data.
To check the logs of the running container, use the following command:
docker logs -f gpustack-musa\n
If the following message appears, the GPUStack container is running successfully:
2024-11-15T23:37:46+00:00 - gpustack.server.server - INFO - Serving on 0.0.0.0:80.\n2024-11-15T23:37:46+00:00 - gpustack.worker.worker - INFO - Starting GPUStack worker.\n
Once the container is running, access the GPUStack web interface by navigating to http://localhost:9009
in your browser.
After the initial setup for GPUStack, you should see the following screen:
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#dashboard","title":"Dashboard","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#workers","title":"Workers","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#gpus","title":"GPUs","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#running-inference","title":"Running Inference","text":"After installation, you can deploy models and run inference. Refer to the model management for detailed usage instructions.
Moore Threads GPUs support inference through the llama-box (llama.cpp) backend. Most recent models are supported (e.g., llama3.2:1b, llama3.2-vision:11b, qwen2.5:7b, etc.).
Use mthreads-gmi
to verify if the model is offloaded to the GPU.
root@a414c45864ee:/# mthreads-gmi\nSat Nov 16 12:00:16 2024\n---------------------------------------------------------------\n mthreads-gmi:1.14.0 Driver Version:2.7.0\n---------------------------------------------------------------\nID Name |PCIe |%GPU Mem\n Device Type |Pcie Lane Width |Temp MPC Capable\n | ECC Mode\n+-------------------------------------------------------------+\n0 MTT S80 |00000000:01:00.0 |98% 1339MiB(16384MiB)\n Physical |16x(16x) |56C YES\n | N/A\n---------------------------------------------------------------\n\n---------------------------------------------------------------\nProcesses:\nID PID Process name GPU Memory\n Usage\n+-------------------------------------------------------------+\n0 120 ...ird_party/bin/llama-box/llama-box 2MiB\n0 2022 ...ird_party/bin/llama-box/llama-box 1333MiB\n---------------------------------------------------------------\n
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/","title":"Running Inference on Copilot+ PCs with Snapdragon X","text":"GPUStack supports running on ARM64 Windows, enabling use on Snapdragon X-based Copilot+ PCs.
Note
Only CPU-based inference is supported on Snapdragon X devices. GPUStack does not currently support GPU or NPU acceleration on this platform.
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#prerequisites","title":"Prerequisites","text":" - A Copilot+ PC with Snapdragon X. In this tutorial, we use the Dell XPS 13 9345.
- Install AMD64 Python (version 3.10 to 3.12). See details
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#installing-gpustack","title":"Installing GPUStack","text":"Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:
Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
After installation, follow the on-screen instructions to obtain credentials and log in to the GPUStack UI.
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#deploying-a-model","title":"Deploying a Model","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button and select Ollama Library
from the dropdown. - Enter
llama3.2
in the Name
field. - Select
llama3.2
from the Ollama Model
dropdown. - Click
Save
to deploy the model.
Once deployed, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#running-inference","title":"Running Inference","text":"Navigate to the Playground
page in the GPUStack UI, where you can interact with the deployed model.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/","title":"Setting Up a Multi-node GPUStack Cluster","text":"This tutorial will guide you through setting up a multi-node GPUStack cluster, where you can distribute your workloads across multiple GPU-enabled nodes. This guide assumes you have basic knowledge of running commands on Linux, macOS, or Windows systems.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#prerequisites","title":"Prerequisites","text":"Before starting, ensure you have the following:
- Multiple nodes with supported OS and GPUs for GPUStack installation. View supported platforms and supported accelerators for more information.
- Nodes are connected to the same network and can communicate with each other.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-1-install-gpustack-on-the-server-node","title":"Step 1: Install GPUStack on the Server Node","text":"First, you need to install GPUStack on one of the nodes to act as the server node. Follow the instructions below based on your operating system.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos","title":"Linux or macOS","text":"Run the following command on your server node:
curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows","title":"Windows","text":"Run PowerShell as administrator and execute the following command:
Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
Once GPUStack is installed, you can proceed to configure your cluster by adding worker nodes.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-2-retrieve-the-token-from-the-server-node","title":"Step 2: Retrieve the Token from the Server Node","text":"To add worker nodes to the cluster, you need the token generated by GPUStack on the server node. On the server node, run the following command to get the token:
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos_1","title":"Linux or macOS","text":"cat /var/lib/gpustack/token\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows_1","title":"Windows","text":"Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n
This token will be required in the next steps to authenticate worker nodes.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-3-add-worker-nodes-to-the-cluster","title":"Step 3: Add Worker Nodes to the Cluster","text":"Now, you will install GPUStack on additional nodes (worker nodes) and connect them to the server node using the token.
Linux or macOS Worker Nodes
Run the following command on each worker node, replacing http://myserver with the URL of your server node and mytoken with the token retrieved in Step 2:
curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n
Windows Worker Nodes
Run PowerShell as administrator on each worker node and use the following command, replacing http://myserver and mytoken with the server URL and token:
Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } --server-url http://myserver --token mytoken\"\n
Once the command is executed, each worker node will connect to the main server and become part of the GPUStack cluster.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-4-verify-the-cluster-setup","title":"Step 4: Verify the Cluster Setup","text":"After adding the worker nodes, you can verify that the cluster is set up correctly by accessing the GPUStack UI.
- Open a browser and navigate to
http://myserver
(replace myserver with the actual server URL). - Log in with the default credentials (username
admin
). To retrieve the default password, run the following command on the server node:
Linux or macOS
cat /var/lib/gpustack/initial_admin_password\n
Windows
Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
- After logging in, navigate to the
Resources
page in the UI to see all connected nodes and their GPUs. You should see your worker nodes listed and ready for serving LLMs.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#conclusion","title":"Conclusion","text":"Congratulations! You've successfully set up a multi-node GPUStack cluster! You can now scale your workloads across multiple nodes, making full use of your available GPUs to handle your tasks efficiently.
"},{"location":"tutorials/using-audio-models/","title":"Using Audio Models","text":"GPUStack supports running both speech-to-text and text-to-speech models. Speech-to-text models convert audio inputs in various languages into written text, while text-to-speech models transform written text into natural and expressive speech.
In this tutorial, we will walk you through deploying and using speech-to-text and text-to-speech models in GPUStack.
"},{"location":"tutorials/using-audio-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
 - A Linux system with amd64 architecture, or macOS.
- Access to Hugging Face for downloading the model files.
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
"},{"location":"tutorials/using-audio-models/#running-speech-to-text-model","title":"Running Speech-to-Text Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-speech-to-text-model","title":"Step 1: Deploy Speech-to-Text Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Use the search bar in the top left to search for the model name
Systran/faster-whisper-medium
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-audio-models/#step-2-interact-with-speech-to-text-model-models","title":"Step 2: Interact with Speech-to-Text Model Models","text":" - Navigate to the
Playground
> Audio
page in the GPUStack UI. - Select the
Speech to Text
Tab. - Select the deployed model from the top-right dropdown.
- Click the
Upload
 button to upload an audio file, or click the Microphone
button to record audio. - Click the
Generate Text Content
button to generate the text.
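You can also call the OpenAI-compatible transcription API directly instead of using the UI. The following is a minimal sketch using the OpenAI Python library; the server URL, API key, and the model name faster-whisper-medium are assumptions based on the deployment above, so adjust them to match your setup:
from openai import OpenAI\n\n# Assumed server URL and API key; see the API Key Management guide\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\n# Transcribe a local audio file with the deployed speech-to-text model\nwith open(\"sample.wav\", \"rb\") as audio_file:\n    transcription = client.audio.transcriptions.create(\n        model=\"faster-whisper-medium\",  # assumed model name from the deployment above\n        file=audio_file,\n    )\n\nprint(transcription.text)\n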
"},{"location":"tutorials/using-audio-models/#running-text-to-speech-model","title":"Running Text-to-Speech Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-text-to-speech-model","title":"Step 1: Deploy Text-to-Speech Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Use the search bar in the top left to search for the model name
FunAudioLLM/CosyVoice-300M
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-audio-models/#step-2-interact-with-text-to-speech-model-models","title":"Step 2: Interact with Text to Speech Model Models","text":" - Navigate to the
Playground
> Audio
page in the GPUStack UI. - Select the
Text to Speech
Tab. - Choose the deployed model from the dropdown menu in the top-right corner. Then, configure the voice and output audio format.
- Input the text to generate.
- Click the
Submit
 button to generate the speech.
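Besides the UI, the deployed model can also be called through the OpenAI-compatible speech API. The following is a minimal sketch using the OpenAI Python library; the server URL, API key, model name cosyvoice-300m, and the voice value are assumptions, so replace them with the values shown in your Playground:
from openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\n# Generate speech from text and save the returned audio bytes to a file\nresponse = client.audio.speech.create(\n    model=\"cosyvoice-300m\",  # assumed model name from the deployment above\n    voice=\"default\",         # hypothetical voice; use one listed for the model in the Playground\n    input=\"Hello from GPUStack!\",\n)\n\nwith open(\"speech.mp3\", \"wb\") as f:\n    f.write(response.read())\n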
"},{"location":"tutorials/using-image-generation-models/","title":"Using Image Generation Models","text":"GPUStack supports deploying and running state-of-the-art image generation models. These models allow you to generate stunning images from textual descriptions, enabling applications in design, content creation, and more.
In this tutorial, we will walk you through deploying and using image generation models in GPUStack.
"},{"location":"tutorials/using-image-generation-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- A GPU that has at least 12 GB of VRAM.
- Access to Hugging Face for downloading the model files.
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
"},{"location":"tutorials/using-image-generation-models/#step-1-deploy-the-stable-diffusion-model","title":"Step 1: Deploy the Stable Diffusion Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Use the search bar in the top left to search for the model name
gpustack/stable-diffusion-v3-5-medium-GGUF
. - In the
Available Files
section, select the stable-diffusion-v3-5-medium-Q4_0.gguf
file. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-image-generation-models/#step-2-use-the-model-for-image-generation","title":"Step 2: Use the Model for Image Generation","text":" - Navigate to the
Playground
> Image
page in the GPUStack UI. - Verify that the deployed model is selected from the top-right
Model
dropdown. - Enter a prompt describing the image you want to generate. For example:
a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She wears a dark-colored outfit with subtle patterns. The overall style of the artwork is reminiscent of fantasy or supernatural genres.\n
- Select
euler
in the Sampler
dropdown. - Set the
Sample Steps
to 20
. - Click the
Submit
button to create the image.
The generated image will be displayed in the UI. Your image may look different given the seed and randomness involved in the generation process.
"},{"location":"tutorials/using-image-generation-models/#conclusion","title":"Conclusion","text":"Congratulations! You\u2019ve successfully deployed and used an image generation model in GPUStack. With this setup, you can generate unique and visually compelling images from textual prompts. Experiment with different prompts and settings to push the boundaries of what\u2019s possible.
"},{"location":"tutorials/using-reranker-models/","title":"Using Reranker Models","text":"Reranker models are specialized models designed to improve the ranking of a list of items based on relevance to a given query. They are commonly used in information retrieval and search systems to refine initial search results, prioritizing items that are more likely to meet the user\u2019s intent. Reranker models take the initial document list and reorder items to enhance precision in applications such as search engines, recommendation systems, and question-answering tasks.
In this tutorial, we will guide you through deploying and using reranker models in GPUStack.
"},{"location":"tutorials/using-reranker-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
- Access to Hugging Face for downloading the model files.
"},{"location":"tutorials/using-reranker-models/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"Follow these steps to deploy the model from Hugging Face:
- Navigate to the
Models
page in the GPUStack UI. - Click the
Deploy Model
button. - In the dropdown, select
Hugging Face
as the source for your model. - Enable the
GGUF
checkbox to filter models by GGUF format. - Use the search bar in the top left to search for the model name
gpustack/bge-reranker-v2-m3-GGUF
. - Leave everything as default and click the
Save
button to deploy the model.
After deployment, you can monitor the model's status on the Models
page.
"},{"location":"tutorials/using-reranker-models/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:
- Navigate to the
API Keys
page in the GPUStack UI. - Click the
New API Key
button. - Enter a name for the API key and click the
Save
button. - Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/using-reranker-models/#step-3-reranking","title":"Step 3: Reranking","text":"With the model deployed and an API key, you can rerank a list of documents via the GPUStack API. Here is an example script using curl
:
export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1/rerank \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"bge-reranker-v2-m3\",\n \"query\": \"What is a panda?\",\n \"top_n\": 3,\n \"documents\": [\n \"hi\",\n \"it is a bear\",\n \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n ]\n }' | jq\n
Replace <your-server-url>
with the URL of your GPUStack server and <your-api-key>
with the API key you generated in the previous step.
Example response:
{\n \"model\": \"bge-reranker-v2-m3\",\n \"object\": \"list\",\n \"results\": [\n {\n \"document\": {\n \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n },\n \"index\": 2,\n \"relevance_score\": 1.951932668685913\n },\n {\n \"document\": {\n \"text\": \"it is a bear\"\n },\n \"index\": 1,\n \"relevance_score\": -3.7347371578216553\n },\n {\n \"document\": {\n \"text\": \"hi\"\n },\n \"index\": 0,\n \"relevance_score\": -6.157620906829834\n }\n ],\n \"usage\": {\n \"prompt_tokens\": 69,\n \"total_tokens\": 69\n }\n}\n
"},{"location":"tutorials/using-vision-language-models/","title":"Using Vision Language Models","text":"Vision Language Models can process both visual (image) and language (text) data simultaneously, making them versatile tools for various applications, such as image captioning, visual question answering, and more. In this tutorial, you will learn how to deploy and interact with Vision Language Models (VLMs) in GPUStack.
The procedure for deploying and interacting with these models in GPUStack is similar. The main difference is the backend parameters you need to set when deploying each model. For more information on the available parameters, please refer to Backend Parameters.
In this tutorial, we will cover the deployment of the following models:
- Llama3.2-Vision
- Qwen2-VL
- Pixtral
- Phi3.5-Vision
"},{"location":"tutorials/using-vision-language-models/#prerequisites","title":"Prerequisites","text":"Before you begin, ensure that you have the following:
 - A Linux machine with one or more GPUs providing at least 30 GB of VRAM in total. We will use the vLLM backend, which only supports Linux.
- Access to Hugging Face and a Hugging Face API key for downloading the model files.
 - You have been granted access to the above models on Hugging Face. Llama3.2-Vision and Pixtral are gated models, and you need to request access to them.
Note
An Ubuntu node equipped with one H100 (80GB) GPU is used throughout this tutorial.
"},{"location":"tutorials/using-vision-language-models/#step-1-install-gpustack","title":"Step 1: Install GPUStack","text":"Run the following command to install GPUStack:
curl -sfL https://get.gpustack.ai | sh -s - --huggingface-token <Hugging Face API Key>\n
Replace <Hugging Face API Key>
with your Hugging Face API key. GPUStack will use this key to download the model files.
"},{"location":"tutorials/using-vision-language-models/#step-2-log-in-to-gpustack-ui","title":"Step 2: Log in to GPUStack UI","text":"Run the following command to get the default password:
cat /var/lib/gpustack/initial_admin_password\n
Open your browser and navigate to http://<your-server-ip>
. Replace <your-server-ip>
with the IP address of your server. Log in using the username admin
and the password you obtained in the previous step.
"},{"location":"tutorials/using-vision-language-models/#step-3-deploy-vision-language-models","title":"Step 3: Deploy Vision Language Models","text":""},{"location":"tutorials/using-vision-language-models/#deploy-llama32-vision","title":"Deploy Llama3.2-Vision","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
meta-llama/Llama-3.2-11B-Vision-Instruct
in the search bar. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button multiple times and add the following parameters:
--enforce-eager
--max-num-seqs=16
--max-model-len=8192
- Click the
Save
button.
"},{"location":"tutorials/using-vision-language-models/#deploy-qwen2-vl","title":"Deploy Qwen2-VL","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
Qwen/Qwen2-VL-7B-Instruct
in the search bar. - Click the
Save
button. The default configurations should work as long as you have enough GPU resources.
"},{"location":"tutorials/using-vision-language-models/#deploy-pixtral","title":"Deploy Pixtral","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
mistralai/Pixtral-12B-2409
in the search bar. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button multiple times and add the following parameters:
--tokenizer-mode=mistral
--limit-mm-per-prompt=image=4
- Click the
Save
button.
"},{"location":"tutorials/using-vision-language-models/#deploy-phi35-vision","title":"Deploy Phi3.5-Vision","text":" - Navigate to the
Models
page in the GPUStack UI. - Click on the
Deploy Model
button, then select Hugging Face
in the dropdown. - Search for
microsoft/Phi-3.5-vision-instruct
in the search bar. - Expand the
Advanced
section in configurations and scroll down to the Backend Parameters
section. - Click on the
Add Parameter
button and add the following parameter:
--trust-remote-code
- Click the
Save
button.
"},{"location":"tutorials/using-vision-language-models/#step-4-interact-with-vision-language-models","title":"Step 4: Interact with Vision Language Models","text":" - Navigate to the
Playground
page in the GPUStack UI. - Select the deployed model from the top-right dropdown.
- Click on the
Upload Image
button above the input text area and upload an image. - Enter a prompt in the input text area. For example, \"Describe the image.\"
- Click the
Submit
button to generate the output.
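The deployed vision language models can also be called programmatically through the OpenAI-compatible chat completion API with image inputs. The following is a minimal sketch using the OpenAI Python library; the server URL, API key, model name, and image path are assumptions, so adjust them to your deployment:
import base64\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\n# Encode a local image as a base64 data URL\nwith open(\"example.jpg\", \"rb\") as f:\n    image_b64 = base64.b64encode(f.read()).decode()\n\ncompletion = client.chat.completions.create(\n    model=\"llama-3.2-11b-vision-instruct\",  # assumed model name; use the name shown on the Models page\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": \"Describe the image.\"},\n                {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{image_b64}\"}},\n            ],\n        }\n    ],\n)\n\nprint(completion.choices[0].message.content)\n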
"},{"location":"tutorials/using-vision-language-models/#conclusion","title":"Conclusion","text":"In this tutorial, you learned how to deploy and interact with Vision Language Models in GPUStack. You can use the same approach to deploy other Vision Language Models not covered in this tutorial. If you have any questions or need further assistance, feel free to reach out to us.
"},{"location":"user-guide/api-key-management/","title":"API Key Management","text":"GPUStack supports authentication using API keys. Each GPUStack user can generate and manage their own API keys.
"},{"location":"user-guide/api-key-management/#create-api-key","title":"Create API Key","text":" - Navigate to the
API Keys
page. - Click the
New API Key
button. - Fill in the
Name
, Description
, and select the Expiration
of the API key. - Click the
Save
button. - Copy and store the key somewhere safe, then click the
Done
button.
Note
Please note that you can only see the generated API key once upon creation.
"},{"location":"user-guide/api-key-management/#delete-api-key","title":"Delete API Key","text":" - Navigate to the
API Keys
page. - Find the API key you want to delete.
- Click the
Delete
button in the Operations
column. - Confirm the deletion.
"},{"location":"user-guide/api-key-management/#use-api-key","title":"Use API Key","text":"GPUStack supports using the API key as a bearer token. The following is an example using curl:
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"llama3\",\n \"messages\": [\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant.\"\n },\n {\n \"role\": \"user\",\n \"content\": \"Hello!\"\n }\n ],\n \"stream\": true\n }'\n
"},{"location":"user-guide/image-generation-apis/","title":"Image Generation APIs","text":"GPUStack provides APIs for generating images given a prompt and/or an input image when running diffusion models.
Note
The image generation APIs are only available when using the llama-box inference backend.
"},{"location":"user-guide/image-generation-apis/#supported-models","title":"Supported Models","text":"The following models are available for image generation:
Tip
Please use the converted GGUF models provided by GPUStack. Check the model link for more details.
- stabilityai/stable-diffusion-3.5-large-turbo
- stabilityai/stable-diffusion-3.5-large
- stabilityai/stable-diffusion-3.5-medium
- stabilityai/stable-diffusion-3-medium
- TencentARC/FLUX.1-mini
- Freepik/FLUX.1-lite
- black-forest-labs/FLUX.1-dev
- black-forest-labs/FLUX.1-schnell
- stabilityai/sdxl-turbo
- stabilityai/stable-diffusion-xl-refiner-1.0
- stabilityai/stable-diffusion-xl-base-1.0
- stabilityai/sd-turbo
- stabilityai/stable-diffusion-2-1
- stable-diffusion-v1-5/stable-diffusion-v1-5
- CompVis/stable-diffusion-v1-4
"},{"location":"user-guide/image-generation-apis/#api-details","title":"API Details","text":"The image generation APIs adhere to OpenAI API specification. While OpenAI APIs for image generation are simple and opinionated, GPUStack extends these capabilities with additional features.
"},{"location":"user-guide/image-generation-apis/#create-image","title":"Create Image","text":""},{"location":"user-guide/image-generation-apis/#streaming","title":"Streaming","text":"This image generation API supports streaming responses to return the progressing of the generation. To enable streaming, set the stream
parameter to true
in the request body. Example:
REQUEST : (application/json)\n{\n \"n\": 1,\n \"response_format\": \"b64_json\",\n \"size\": \"512x512\",\n \"prompt\": \"A lovely cat\",\n \"quality\": \"standard\",\n \"stream\": true,\n \"stream_options\": {\n \"include_usage\": true, // return usage information\n }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#advanced-options","title":"Advanced Options","text":"This image generation API supports additional options to control the generation process. The following options are available:
REQUEST : (application/json)\n{\n \"n\": 1,\n \"response_format\": \"b64_json\",\n \"size\": \"512x512\",\n \"prompt\": \"A lovely cat\",\n \"sampler\": \"euler\", // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\n \"schedule\": \"default\", // optional, select from default;discrete;karras;exponential;ays;gits\n \"seed\": null, // optional, random seed\n \"cfg_scale\": 4.5, // optional, for sampler, the scale of classifier-free guidance in the output phase\n \"sample_steps\": 20, // optional, number of sample steps\n \"negative_prompt\": \"\", // optional, negative prompt\n \"stream\": true,\n \"stream_options\": {\n \"include_usage\": true, // return usage information\n }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#create-image-edit","title":"Create Image Edit","text":""},{"location":"user-guide/image-generation-apis/#streaming_1","title":"Streaming","text":"This image generation API supports streaming responses to return the progressing of the generation. To enable streaming, set the stream
parameter to true
in the request body. Example:
REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nquality=standard\nimage=... // required\nmask=... // optional\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n data: [DONE]\nCASE 2: illegal input image\n error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#advanced-options_1","title":"Advanced Options","text":"This image generation API supports additional options to control the generation process. The following options are available:
REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nimage=... // required\nmask=... // optional\nsampler=euler // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\nschedule=default // optional, select from default;discrete;karras;exponential;ays;gits\nseed=null // optional, random seed\ncfg_scale=4.5 // optional, for sampler, the scale of classifier-free guidance in the output phase\nsample_steps=20 // optional, number of sample steps\nnegative_prompt=\"\" // optional, negative prompt\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n ...\n data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n data: [DONE]\nCASE 2: illegal input image\n error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#usage","title":"Usage","text":"The followings are examples using the image generation APIs:
"},{"location":"user-guide/image-generation-apis/#curl-create-image","title":"curl (Create Image)","text":"export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/generate \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"n\": 1,\n \"response_format\": \"b64_json\",\n \"size\": \"512x512\",\n \"prompt\": \"A lovely cat\",\n \"quality\": \"standard\",\n \"stream\": true,\n \"stream_options\": {\n \"include_usage\": true\n }\n }'\n
"},{"location":"user-guide/image-generation-apis/#curl-create-image-edit","title":"curl (Create Image Edit)","text":"export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/edit \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -F image=\"@otter.png\" \\\n -F mask=\"@mask.png\" \\\n -F prompt=\"A lovely cat\" \\\n -F n=1 \\\n -F size=\"512x512\"\n
"},{"location":"user-guide/inference-backends/","title":"Inference Backends","text":"GPUStack supports the following inference backends:
- llama-box
- vLLM
- vox-box
When users deploy a model, the backend is selected automatically based on the following criteria:
- If the model is a GGUF model,
llama-box
is used. - If the model is a known
text-to-speech
or speech-to-text
model, vox-box
is used. - Otherwise,
vLLM
is used.
"},{"location":"user-guide/inference-backends/#llama-box","title":"llama-box","text":"llama-box is a LM inference server based on llama.cpp and stable-diffusion.cpp.
"},{"location":"user-guide/inference-backends/#supported-platforms","title":"Supported Platforms","text":"The llama-box backend supports Linux, macOS and Windows (with CPU offloading only on Windows ARM architecture) platforms.
"},{"location":"user-guide/inference-backends/#supported-models","title":"Supported Models","text":" - LLMs: For supported LLMs, refer to the llama.cpp README.
 - Diffusion Models: Supported models are listed in this Hugging Face collection.
- Reranker Models: Supported models can be found in this Hugging Face collection.
"},{"location":"user-guide/inference-backends/#supported-features","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.
"},{"location":"user-guide/inference-backends/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more others workers, offloading computation tasks to them.
"},{"location":"user-guide/inference-backends/#parameters-reference","title":"Parameters Reference","text":"See the full list of supported parameters for llama-box here.
"},{"location":"user-guide/inference-backends/#vllm","title":"vLLM","text":"vLLM is a high-throughput and memory-efficient LLMs inference engine. It is a popular choice for running LLMs in production. vLLM seamlessly supports most state-of-the-art open-source models, including: Transformer-like LLMs (e.g., Llama), Mixture-of-Expert LLMs (e.g., Mixtral), Embedding Models (e.g. E5-Mistral), Multi-modal LLMs (e.g., LLaVA)
By default, GPUStack estimates the VRAM requirement for the model instance based on the model's metadata. You can customize the parameters to fit your needs. The following vLLM parameters might be useful:
--gpu-memory-utilization
(default: 0.9): The fraction of GPU memory to use for the model instance. --max-model-len
: Model context length. For large-context models, GPUStack automatically sets this parameter to 8192
to simplify model deployment, especially in resource constrained environments. You can customize this parameter to fit your needs. --tensor-parallel-size
: Number of tensor parallel replicas. By default, GPUStack sets this parameter given the GPU resources available and the estimation of the model's memory requirement. You can customize this parameter to fit your needs.
For more details, please refer to vLLM documentation.
"},{"location":"user-guide/inference-backends/#supported-platforms_1","title":"Supported Platforms","text":"The vLLM backend works on AMD Linux.
Note
- When users install GPUStack on amd64 Linux using the installation script, vLLM is automatically installed.
- When users deploy a model using the vLLM backend, GPUStack sets worker label selectors to
{\"os\": \"linux\", \"arch\": \"amd64\"}
by default to ensure the model instance is scheduled to proper workers. You can customize the worker label selectors in the model configuration.
"},{"location":"user-guide/inference-backends/#supported-models_1","title":"Supported Models","text":"Please refer to the vLLM documentation for supported models.
"},{"location":"user-guide/inference-backends/#supported-features_1","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#multimodal-language-models","title":"Multimodal Language Models","text":"vLLM supports multimodal language models listed here. When users deploy a vision language model using the vLLM backend, image inputs are supported in the chat completion API.
"},{"location":"user-guide/inference-backends/#parameters-reference_1","title":"Parameters Reference","text":"See the full list of supported parameters for vLLM here.
"},{"location":"user-guide/inference-backends/#vox-box","title":"vox-box","text":"vox-box is an inference engine designed for deploying text-to-speech and speech-to-text models. It also provides an API that is fully compatible with the OpenAI audio API.
"},{"location":"user-guide/inference-backends/#supported-platforms_2","title":"Supported Platforms","text":"The vox-box backend supports Linux, macOS and Windows platforms.
Note
- To use Nvidia GPUs, ensure the following NVIDIA libraries are installed on workers:
- cuBLAS for CUDA 12
- cuDNN 9 for CUDA 12
- When users install GPUStack on Linux, macOS and Windows using the installation script, vox-box is automatically installed.
 - CosyVoice models are natively supported on Linux amd64 architecture and macOS. However, these models are not supported on Linux ARM or Windows.
"},{"location":"user-guide/inference-backends/#supported-models_2","title":"Supported Models","text":"Model Type Link Supported Platforms Faster-whisper-large-v3 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-large-v2 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-large-v1 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-medium speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-medium.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-small speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-small.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-distil-whisper-large-v3 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-distil-whisper-large-v2 speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-distil-whisper-medium.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-tiny speech-to-text Hugging Face, ModelScope Linux, macOS, Windows Faster-whisper-tiny.en speech-to-text Hugging Face, ModelScope Linux, macOS, Windows CosyVoice-300M-Instruct text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-SFT text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-25Hz text-to-speech ModelScope Linux(ARM not supported), macOS, Windows(Not supported)"},{"location":"user-guide/inference-backends/#supported-features_2","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-gpucpu-offloading","title":"Allow GPU/CPU Offloading","text":"vox-box supports deploying models to NVIDIA GPUs. If GPU resources are insufficient, it will automatically deploy the models to the CPU.
"},{"location":"user-guide/model-management/","title":"Model Management","text":"You can manage large language models in GPUStack by navigating to the Models
page. A model in GPUStack contains one or multiple replicas of model instances. On deployment, GPUStack automatically computes resource requirements for the model instances from model metadata and schedules them to available workers accordingly.
"},{"location":"user-guide/model-management/#deploy-model","title":"Deploy Model","text":"Currently, models from Hugging Face, ModelScope, Ollama and local paths are supported.
"},{"location":"user-guide/model-management/#deploying-a-hugging-face-model","title":"Deploying a Hugging Face Model","text":" -
Click the Deploy Model
button, then select Hugging Face
in the dropdown.
-
Search the model by name from Hugging Face using the search bar in the top left. For example, microsoft/Phi-3-mini-4k-instruct-gguf
. If you only want to search for GGUF models, check the \"GGUF\" checkbox.
-
Select a file with the desired quantization format from Available Files
.
-
Adjust the Name
and Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#deploying-a-modelscope-model","title":"Deploying a ModelScope Model","text":" -
Click the Deploy Model
button, then select ModelScope
in the dropdown.
-
Search the model by name from ModelScope using the search bar in the top left. For example, Qwen/Qwen2-0.5B-Instruct
. If you only want to search for GGUF models, check the \"GGUF\" checkbox.
-
Select a file with the desired quantization format from Available Files
.
-
Adjust the Name
and Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#deploying-an-ollama-model","title":"Deploying an Ollama Model","text":" -
Click the Deploy Model
button, then select Ollama Library
in the dropdown.
-
Fill in the Name
of the model.
-
Select an Ollama Model
from the dropdown list, or input any Ollama model you need. For example, llama3
, llama3:70b
or youraccount/llama3:70b
.
-
Adjust the Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#deploying-a-local-path-model","title":"Deploying a Local Path Model","text":"You can deploy a model from a local path. The model path can be a directory (e.g., a downloaded Hugging Face model directory) or a file (e.g., a GGUF model file) located on workers. This is useful when running in an air-gapped environment.
Note
- GPUStack does not check the validity of the model path for scheduling, which may lead to deployment failure if the model path is inaccessible. It is recommended to ensure the model path is accessible on all workers(e.g., using NFS, rsync, etc.). You can also use the worker selector configuration to deploy the model to specific workers.
- GPUStack cannot evaluate the model's resource requirements unless the server has access to the same model path. Consequently, you may observe empty VRAM/RAM allocations for a deployed model. To mitigate this, it is recommended to make the model files available on the same path on the server. Alternatively, you can customize backend parameters, such as
tensor-split
, to configure how the model is distributed across the GPUs.
To deploy a local path model:
-
Click the Deploy Model
button, then select Local Path
in the dropdown.
-
Fill in the Name
of the model.
-
Fill in the Model Path
.
-
Adjust the Replicas
as needed.
-
Expand the Advanced
section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.
-
Click the Save
button.
"},{"location":"user-guide/model-management/#edit-model","title":"Edit Model","text":" - Find the model you want to edit on the model list page.
- Click the
Edit
button in the Operations
column. - Update the attributes as needed. For example, change the
Replicas
to scale up or down. - Click the
Save
button.
Note
After editing the model, the configuration will not be applied to existing model instances. You need to delete the existing model instances. GPUStack will recreate new instances based on the updated model configuration.
"},{"location":"user-guide/model-management/#delete-model","title":"Delete Model","text":" - Find the model you want to delete on the model list page.
- Click the ellipsis button in the
Operations
column, then select Delete
. - Confirm the deletion.
"},{"location":"user-guide/model-management/#view-model-instance","title":"View Model Instance","text":" - Find the model you want to check on the model list page.
- Click the
>
symbol to view the instance list of the model.
"},{"location":"user-guide/model-management/#delete-model-instance","title":"Delete Model Instance","text":" - Find the model you want to check on the model list page.
- Click the
>
symbol to view the instance list of the model. - Find the model instance you want to delete.
- Click the ellipsis button for the model instance in the
Operations
column, then select Delete
. - Confirm the deletion.
Note
After a model instance is deleted, GPUStack will recreate a new instance to satisfy the expected replicas of the model if necessary.
"},{"location":"user-guide/model-management/#view-model-instance-logs","title":"View Model Instance Logs","text":" - Find the model you want to check on the model list page.
- Click the
>
symbol to view the instance list of the model. - Find the model instance you want to check.
- Click the
View Logs
button for the model instance in the Operations
column.
"},{"location":"user-guide/model-management/#use-self-hosted-ollama-models","title":"Use Self-hosted Ollama Models","text":"You can deploy self-hosted Ollama models by configuring the --ollama-library-base-url
option in the GPUStack server. The Ollama Library
URL should point to the base URL of the Ollama model registry. For example, https://registry.mycompany.com
.
Here is an example workflow to set up a registry, publish a model, and use it in GPUStack:
# Run a self-hosted OCI registry\ndocker run -d -p 5001:5000 --name registry registry:2\n\n# Push a model to the registry using Ollama\nollama pull llama3\nollama cp llama3 localhost:5001/library/llama3\nollama push localhost:5001/library/llama3 --insecure\n\n# Start GPUStack server with the custom Ollama library URL\ncurl -sfL https://get.gpustack.ai | sh -s - --ollama-library-base-url http://localhost:5001\n
That's it! You can now deploy the model llama3
from Ollama Library
source in GPUStack as usual, but the model will now be fetched from the self-hosted registry.
"},{"location":"user-guide/model-management/#advanced-model-configuration","title":"Advanced Model Configuration","text":"GPUStack supports tailored configurations for model deployment.
"},{"location":"user-guide/model-management/#schedule-type","title":"Schedule Type","text":""},{"location":"user-guide/model-management/#auto","title":"Auto","text":"GPUStack automatically schedules model instances to appropriate GPUs/Workers based on current resource availability.
-
Placement Strategy
-
Spread: Make the resources of the entire cluster relatively evenly distributed among all workers. It may produce more resource fragmentation on a single worker.
-
Binpack: Prioritize the overall utilization of cluster resources, reducing resource fragmentation on Workers/GPUs.
-
Worker Selector
When configured, the scheduler will deploy the model instance to the worker containing specified labels.
-
Navigate to the Resources
page and edit the desired worker. Assign custom labels to the worker by adding them in the labels section.
-
Go to the Models
page and click on the Deploy Model
button. Expand the Advanced
section and input the previously assigned worker labels in the Worker Selector
configuration. During deployment, the Model Instance will be allocated to the corresponding worker based on these labels.
"},{"location":"user-guide/model-management/#manual","title":"Manual","text":"This schedule type allows users to specify which GPU to deploy the model instance on.
- GPU Selector
Select a GPU from the list. The model instance will attempt to deploy to this GPU if resources permit.
"},{"location":"user-guide/model-management/#backend","title":"Backend","text":"The inference backend. Currently, GPUStack supports three backends: llama-box, vLLM and vox-box. GPUStack automatically selects the backend based on the model's configuration.
For more details, please refer to the Inference Backends section.
"},{"location":"user-guide/model-management/#backend-version","title":"Backend Version","text":"Specify a backend version, such as v1.0.0
. The version format and availability depend on the selected backend. This option is useful for ensuring compatibility or taking advantage of features introduced in specific backend versions. Refer to the Pinned Backend Versions section for more information.
"},{"location":"user-guide/model-management/#backend-parameters","title":"Backend Parameters","text":"Input the parameters for the backend you want to customize when running the model. The parameter should be in the format --parameter=value
, --bool-parameter
or as separate fields for --parameter
and value
. For example, use --ctx-size=8192
for llama-box.
For full list of supported parameters, please refer to the Inference Backends section.
"},{"location":"user-guide/model-management/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"Note
Available for llama-box backend only.
After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.
"},{"location":"user-guide/model-management/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"Note
Available for llama-box backend only.
Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more other workers, offloading computation tasks to them.
"},{"location":"user-guide/openai-compatible-apis/","title":"OpenAI Compatible APIs","text":"GPUStack serves OpenAI-compatible APIs using the /v1-openai
path. Most of the APIs also work under the /v1
path as an alias, except for the models
endpoint, which is reserved for GPUStack management APIs.
"},{"location":"user-guide/openai-compatible-apis/#supported-endpoints","title":"Supported Endpoints","text":"The following API endpoints are supported:
- List Models
- Create Completion
- Create Chat Completion
- Create Embeddings
- Create Image
- Create Image Edit
- Create Speech
- Create Transcription
"},{"location":"user-guide/openai-compatible-apis/#usage","title":"Usage","text":"The following are examples using the APIs in different languages:
"},{"location":"user-guide/openai-compatible-apis/#curl","title":"curl","text":"export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"llama3\",\n \"messages\": [\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant.\"\n },\n {\n \"role\": \"user\",\n \"content\": \"Hello!\"\n }\n ],\n \"stream\": true\n }'\n
"},{"location":"user-guide/openai-compatible-apis/#openai-python-api-library","title":"OpenAI Python API library","text":"from openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\ncompletion = client.chat.completions.create(\n model=\"llama3\",\n messages=[\n {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n {\"role\": \"user\", \"content\": \"Hello!\"}\n ]\n)\n\nprint(completion.choices[0].message)\n
"},{"location":"user-guide/openai-compatible-apis/#openai-node-api-library","title":"OpenAI Node API library","text":"const OpenAI = require(\"openai\");\n\nconst openai = new OpenAI({\n apiKey: \"myapikey\",\n baseURL: \"http://myserver/v1-openai\",\n});\n\nasync function main() {\n const params = {\n model: \"llama3\",\n messages: [\n {\n role: \"system\",\n content: \"You are a helpful assistant.\",\n },\n {\n role: \"user\",\n content: \"Hello!\",\n },\n ],\n };\n const chatCompletion = await openai.chat.completions.create(params);\n console.log(chatCompletion.choices[0].message);\n}\nmain();\n
"},{"location":"user-guide/pinned-backend-versions/","title":"Pinned Backend Versions","text":"Inference engines in the generative AI domain are evolving rapidly to enhance performance and unlock new capabilities. This constant evolution provides exciting opportunities but also presents challenges for maintaining model compatibility and deployment stability.
GPUStack allows you to pin inference backend versions to specific releases, offering a balance between staying up-to-date with the latest advancements and ensuring a reliable runtime environment. This feature is particularly beneficial in the following scenarios:
- Leveraging the newest backend features without waiting for a GPUStack update.
- Locking in a specific backend version to maintain compatibility with existing models.
- Assigning different backend versions to models with varying requirements.
By pinning backend versions, you gain full control over your inference environment, enabling both flexibility and predictability in deployment.
"},{"location":"user-guide/pinned-backend-versions/#automatic-installation-of-pinned-backend-versions","title":"Automatic Installation of Pinned Backend Versions","text":"To simplify deployment, GPUStack supports the automatic installation of pinned backend versions when feasible. The process depends on the type of backend:
- Prebuilt Binaries For backends like
llama-box
, GPUStack downloads the specified version using the same mechanism as in GPUStack bootstrapping.
Tip
You can customize the download source using the --tools-download-base-url
configuration option.
- Python-based Backends For backends like
vLLM
and vox-box
, GPUStack uses pipx
to install the specified version in an isolated Python environment.
Tip
- Ensure that
pipx
is installed on the worker nodes. - If
pipx
is not in the system PATH, specify its location with the --pipx-path
configuration option.
This automation reduces manual intervention, allowing you to focus on deploying and using your models.
"},{"location":"user-guide/pinned-backend-versions/#manual-installation-of-pinned-backend-versions","title":"Manual Installation of Pinned Backend Versions","text":"When automatic installation is not feasible or preferred, GPUStack provides a straightforward way to manually install specific versions of inference backends. Follow these steps:
- Prepare the Executable Install the backend executable or link it under the GPUStack bin directory. The default locations are:
- Linux/macOS:
/var/lib/gpustack/bin
- Windows:
$env:AppData\\gpustack\\bin
Tip
You can customize the bin directory using the --bin-dir
configuration option.
- Name the Executable Ensure the executable is named in the following format:
- Linux/macOS:
<backend>_<version>
- Windows:
<backend>_<version>.exe
For example, the vLLM executable for version v0.6.4 should be named vllm_v0.6.4
on Linux.
By following these steps, you can maintain full control over the backend installation process, ensuring that the correct version is used for your deployment.
"},{"location":"user-guide/rerank-api/","title":"Rerank API","text":"In the context of Retrieval-Augmented Generation (RAG), reranking refers to the process of selecting the most relevant information from retrieved documents or knowledge sources before presenting them to the user or utilizing them for answer generation.
GPUStack serves Jina compatible Rerank API using the /v1/rerank
path.
Note
The Rerank API is only available when using the llama-box inference backend.
"},{"location":"user-guide/rerank-api/#supported-models","title":"Supported Models","text":"The following models are available for reranking:
- bce-reranker-base_v1
- jina-reranker-v1-turbo-en
- jina-reranker-v1-tiny-en
- bge-reranker-v2-m3
- gte-multilingual-reranker-base \ud83e\uddea
- jina-reranker-v2-base-multilingual \ud83e\uddea
"},{"location":"user-guide/rerank-api/#usage","title":"Usage","text":"The following is an example using the Rerank API:
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1/rerank \\\n -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n -d '{\n \"model\": \"bge-reranker-v2-m3\",\n \"query\": \"What is a panda?\",\n \"top_n\": 3,\n \"documents\": [\n \"hi\",\n \"it is a bear\",\n \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n ]\n }' | jq\n
Example output:
{\n \"model\": \"bge-reranker-v2-m3\",\n \"object\": \"list\",\n \"results\": [\n {\n \"document\": {\n \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n },\n \"index\": 2,\n \"relevance_score\": 1.951932668685913\n },\n {\n \"document\": {\n \"text\": \"it is a bear\"\n },\n \"index\": 1,\n \"relevance_score\": -3.7347371578216553\n },\n {\n \"document\": {\n \"text\": \"hi\"\n },\n \"index\": 0,\n \"relevance_score\": -6.157620906829834\n }\n ],\n \"usage\": {\n \"prompt_tokens\": 69,\n \"total_tokens\": 69\n }\n}\n
"},{"location":"user-guide/user-management/","title":"User Management","text":"GPUStack supports users of two roles: Admin
and User
. Admins can monitor system status, manage models, users, and system settings. Users can manage their own API keys and use the completion API.
"},{"location":"user-guide/user-management/#default-admin","title":"Default Admin","text":"On bootstrap, GPUStack creates a default admin user. The initial password for the default admin is stored in <data-dir>/initial_admin_password
. In the default setup, it should be /var/lib/gpustack/initial_admin_password
. You can customize the default admin password by setting the --bootstrap-password
parameter when starting gpustack
.
"},{"location":"user-guide/user-management/#create-user","title":"Create User","text":" - Navigate to the
Users
page. - Click the
Create User
button. - Fill in
Name
, Full Name
, Password
, and select Role
for the user. - Click the
Save
button.
"},{"location":"user-guide/user-management/#update-user","title":"Update User","text":" - Navigate to the
Users
page. - Find the user you want to edit.
- Click the
Edit
button in the Operations
column. - Update the attributes as needed.
- Click the
Save
button.
"},{"location":"user-guide/user-management/#delete-user","title":"Delete User","text":" - Navigate to the
Users
page. - Find the user you want to delete.
- Click the ellipsis button in the
Operations
column, then select Delete
. - Confirm the deletion.
"},{"location":"user-guide/playground/","title":"Playground","text":"GPUStack offers a playground UI where users can test and experiment with the APIs. Refer to each subpage for detailed instructions and information.
"},{"location":"user-guide/playground/audio/","title":"Audio Playground","text":"The Audio Playground is a dedicated space for testing and experimenting with GPUStack\u2019s text-to-speech (TTS) and speech-to-text (STT) APIs. It allows users to interactively convert text to audio and audio to text, customize parameters, and review code examples for seamless API integration.
"},{"location":"user-guide/playground/audio/#text-to-speech","title":"Text to Speech","text":"Switch to the \"Text to Speech\" tab to test TTS models.
"},{"location":"user-guide/playground/audio/#text-input","title":"Text Input","text":"Enter the text you want to convert, then click the Submit
button to generate the corresponding speech.
"},{"location":"user-guide/playground/audio/#clear-text","title":"Clear Text","text":"Click the Clear
button to reset the text input and remove the generated speech.
"},{"location":"user-guide/playground/audio/#select-model","title":"Select Model","text":"Select an available TTS model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/audio/#customize-parameters","title":"Customize Parameters","text":"Customize the voice and format of the audio output.
Tip
Supported voices may vary between models.
"},{"location":"user-guide/playground/audio/#view-code","title":"View Code","text":"After experimenting with input text and parameters, click the View Code
button to see how to call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
"},{"location":"user-guide/playground/audio/#speech-to-text","title":"Speech to Text","text":"Switch to the \"Speech to Text\" tab to test STT models.
"},{"location":"user-guide/playground/audio/#provide-audio-file","title":"Provide Audio File","text":"You can provide audio for transcription in two ways:
- Upload an audio file.
- Record audio online.
Note
If the online recording is not available, it could be due to one of the following reasons:
- For HTTPS or
http://localhost
access, microphone permissions must be enabled in your browser. -
For access via http://{host IP}
, the URL must be added to your browser's trusted list.
Example: In Chrome, navigate to chrome://flags/
, add the GPUStack URL to \"Insecure origins treated as secure,\" and enable this option.
"},{"location":"user-guide/playground/audio/#select-model_1","title":"Select Model","text":"Select an available STT model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/audio/#copy-text","title":"Copy Text","text":"Copy the transcription results generated by the model.
"},{"location":"user-guide/playground/audio/#customize-parameters_1","title":"Customize Parameters","text":"Select the appropriate language for your audio file to optimize transcription accuracy.
"},{"location":"user-guide/playground/audio/#view-code_1","title":"View Code","text":"After experimenting with audio files and parameters, click the View Code
button to see how to call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
"},{"location":"user-guide/playground/chat/","title":"Chat Playground","text":"Interact with the chat completions API. The following is an example screenshot:
"},{"location":"user-guide/playground/chat/#prompts","title":"Prompts","text":"You can adjust the prompt messages on the left side of the playground. There are three role types of prompt messages: system, user, and assistant.
- System: Typically a predefined instruction or guidance that sets the context, defines the behavior, or imposes specific constraints on how the model should generate its responses.
- User: The input or query provided by the user (the person interacting with the LLM).
- Assistant: The response generated by the LLM.
"},{"location":"user-guide/playground/chat/#edit-system-message","title":"Edit System Message","text":"You can add and edit the system message at the top of the playground.
"},{"location":"user-guide/playground/chat/#edit-user-and-assistant-messages","title":"Edit User and Assistant Messages","text":"To add a user or assistant message, click the New Message
button.
To remove a user or assistant message, click the minus button at the right corner of the message.
To change the role of a message, click the User
or Assistant
text at the beginning of the message.
"},{"location":"user-guide/playground/chat/#upload-image","title":"Upload Image","text":"You can add images to the prompt by clicking the Upload Image
button.
"},{"location":"user-guide/playground/chat/#clear-prompts","title":"Clear Prompts","text":"Click the Clear
button to clear all the prompts.
"},{"location":"user-guide/playground/chat/#select-model","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground. Please refer to Model Management to learn about how to manage models.
"},{"location":"user-guide/playground/chat/#customize-parameters","title":"Customize Parameters","text":"You can customize completion parameters in the Parameters
section.
"},{"location":"user-guide/playground/chat/#do-completion","title":"Do Completion","text":"You can do a completion by clicking the Submit
button.
"},{"location":"user-guide/playground/chat/#view-code","title":"View Code","text":"Once you've done experimenting with the prompts and parameters, you can click the View Code
button to check how you can call the API with the same input by code. Code examples in curl
, Python
, and Node.js
are provided.
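As an illustration, a Python call equivalent to a playground completion might look like the following. This is a sketch using the OpenAI Python SDK; the base URL, API key, and model name llama3.2 are placeholder assumptions to be replaced with the values shown in View Code.
```python
from openai import OpenAI

# Placeholders: base URL, API key, and model name come from your own
# GPUStack deployment; copy the exact values from "View Code".
client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",
    api_key="YOUR_GPUSTACK_API_KEY",
)

# System, user, and assistant messages map directly to the playground roles.
response = client.chat.completions.create(
    model="llama3.2",  # hypothetical model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```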
"},{"location":"user-guide/playground/chat/#compare-playground","title":"Compare Playground","text":"You can compare multiple models in the playground. The following is an example screenshot:
"},{"location":"user-guide/playground/chat/#comparision-mode","title":"Comparision Mode","text":"You can choose the number of models to compare by clicking the comparison view buttons, including 2, 3, 4 and 6-model comparison.
"},{"location":"user-guide/playground/chat/#prompts_1","title":"Prompts","text":"You can adjust the prompt messages similar to the chat playground.
"},{"location":"user-guide/playground/chat/#upload-image_1","title":"Upload Image","text":"You can add images to the prompt by clicking the Upload Image
button.
"},{"location":"user-guide/playground/chat/#clear-prompts_1","title":"Clear Prompts","text":"Click the Clear
button to clear all the prompts.
"},{"location":"user-guide/playground/chat/#select-model_1","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-left corner of each model panel.
"},{"location":"user-guide/playground/chat/#customize-parameters_1","title":"Customize Parameters","text":"You can customize completion parameters by clicking the settings button of each model.
"},{"location":"user-guide/playground/embedding/","title":"Embedding Playground","text":"The Embedding Playground lets you test the model\u2019s ability to convert text into embeddings. It allows you to experiment with multiple text inputs, visualize embeddings, and review code examples for API integration.
"},{"location":"user-guide/playground/embedding/#add-text","title":"Add Text","text":"Add at least two text entries and click the Submit
button to generate embeddings.
"},{"location":"user-guide/playground/embedding/#batch-input-text","title":"Batch Input Text","text":"Enable Batch Input Mode
to automatically split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets in a single operation.
"},{"location":"user-guide/playground/embedding/#visualization","title":"Visualization","text":"Visualize the embedding results using PCA (Principal Component Analysis) to reduce dimensions and display them on a 2D plot. Results can be viewed in two formats:
- Chart - Display PCA results visually.
- JSON - View raw embeddings in JSON format.
In the chart, the distance between points represents the similarity between corresponding texts. Closer points indicate higher similarity.
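A minimal sketch of the same kind of 2D projection, assuming scikit-learn and matplotlib are installed; the example vectors and labels below are stand-ins for real embeddings returned by the API.
```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# `embeddings` is assumed to be a list of equal-length float vectors,
# one per input text (e.g. the `embedding` fields from the embeddings API).
embeddings = [
    [0.01, 0.12, -0.33, 0.45],
    [0.02, 0.10, -0.30, 0.44],
    [-0.40, 0.22, 0.18, -0.05],
]
labels = ["text A", "text B", "text C"]

# Reduce to two principal components for a 2D scatter plot.
points = PCA(n_components=2).fit_transform(embeddings)

for (x, y), label in zip(points, labels):
    plt.scatter(x, y)
    plt.annotate(label, (x, y))

plt.title("Embeddings projected with PCA")
plt.show()
```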
"},{"location":"user-guide/playground/embedding/#clear","title":"Clear","text":"Click the Clear
button to reset text entries and clear the output.
"},{"location":"user-guide/playground/embedding/#select-model","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/embedding/#view-code","title":"View Code","text":"After experimenting with the text inputs, click the View Code
button to see how you can call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
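For example, a Python sketch of the embeddings request could look like the following, assuming the OpenAI Python SDK; the server URL, API key, and the model name bge-m3 are placeholder assumptions, and View Code shows the exact values for your deployment.
```python
from openai import OpenAI

# Placeholders: replace with the values from "View Code".
client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",
    api_key="YOUR_GPUSTACK_API_KEY",
)

# At least two inputs, mirroring the playground requirement.
texts = [
    "GPUStack aggregates GPUs from multiple machines.",
    "The weather is nice today.",
]

result = client.embeddings.create(
    model="bge-m3",  # hypothetical embedding model name
    input=texts,
)

for text, item in zip(texts, result.data):
    print(text, "->", len(item.embedding), "dimensions")
```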
"},{"location":"user-guide/playground/image/","title":"Image Playground","text":"The Image Playground is a dedicated space for testing and experimenting with GPUStack\u2019s image generation APIs. It allows users to interactively explore the capabilities of different models, customize parameters, and review code examples for seamless API integration.
"},{"location":"user-guide/playground/image/#prompt","title":"Prompt","text":"You can input or randomly generate a prompt, then click the Submit button to generate an image.
"},{"location":"user-guide/playground/image/#clear-prompt","title":"Clear Prompt","text":"Click the Clear
button to reset the prompt and remove the generated image.
"},{"location":"user-guide/playground/image/#select-model","title":"Select Model","text":"You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/image/#customize-parameters","title":"Customize Parameters","text":"You can customize the image generation parameters by switching between two API styles:
- OpenAI-compatible mode.
- Advanced mode.
"},{"location":"user-guide/playground/image/#advanced-parameters","title":"Advanced Parameters","text":"Parameter Default Description Counts
1
Number of images to generate. Size
512x512
The size of the generated image in 'widthxheight' format. Sampler
euler_a
The sampler algorithm for image generation. Options include 'euler_a', 'euler', 'heun', 'dpm2', 'dpm++2s_a', 'dpm++2m', 'dpm++2mv2', 'ipndm', 'ipndm_v', and 'lcm'. Schedule
discrete
The noise scheduling method. Sampler Steps
10
The number of sampling steps to perform. Higher values may improve image quality at the cost of longer processing time. CFG Scale
4.5
The scale for classifier-free guidance. A higher value increases adherence to the prompt. Negative Prompt
(empty) A negative prompt to specify what the image should avoid. Seed
(empty) Random seed. Note
The maximum image size is restricted by the model's deployment settings. See the diagram below:
"},{"location":"user-guide/playground/image/#view-code","title":"View Code","text":"After experimenting with prompts and parameters, click the View Code
button to see how to call the API with the same inputs. Code examples are provided in curl
, Python
, and Node.js
.
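A rough Python sketch of the OpenAI-compatible image request is shown below. The base URL, API key, and the model name stable-diffusion-v3-5-medium are placeholder assumptions, the size must respect the deployment limit mentioned above, and advanced-mode parameters (sampler, schedule, CFG scale, and so on) are best copied verbatim from View Code rather than guessed.
```python
import base64

from openai import OpenAI

# Placeholders: replace with the values from "View Code".
client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",
    api_key="YOUR_GPUSTACK_API_KEY",
)

result = client.images.generate(
    model="stable-diffusion-v3-5-medium",  # hypothetical image model name
    prompt="A watercolor painting of a lighthouse at sunrise",
    n=1,                # "Counts" in the playground
    size="512x512",     # must not exceed the model's deployment limit
    response_format="b64_json",
)

# Decode the base64-encoded image and write it to disk.
with open("output.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```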
"},{"location":"user-guide/playground/rerank/","title":"Rerank Playground","text":"The Rerank Playground allows you to test reranker models that reorder multiple texts based on their relevance to a query. Experiment with various input texts, customize parameters, and review code examples for API integration.
"},{"location":"user-guide/playground/rerank/#add-text","title":"Add Text","text":"Add multiple text entries to the document for reranking.
"},{"location":"user-guide/playground/rerank/#bach-input-text","title":"Bach Input Text","text":"Enable Batch Input Mode
to split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets efficiently.
"},{"location":"user-guide/playground/rerank/#clear","title":"Clear","text":"Click the Clear
button to reset the document and query results.
"},{"location":"user-guide/playground/rerank/#query","title":"Query","text":"Input a query and click the Submit
button to get a ranked list of texts based on their relevance to the query.
"},{"location":"user-guide/playground/rerank/#select-model","title":"Select Model","text":"Select an available reranker model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.
"},{"location":"user-guide/playground/rerank/#customize-parameters","title":"Customize Parameters","text":"In the parameter section, set Top N
to specify the number of matching texts to retrieve.
"},{"location":"user-guide/playground/rerank/#view-code","title":"View Code","text":"After experimenting with the input text and query, click the View Code
button to see how to call the API with the same input. Code examples are provided in curl
, Python
, and Node.js
.
"}]}
\ No newline at end of file
diff --git a/0.4/sitemap.xml b/0.4/sitemap.xml
index 65a3268..6231bbb 100644
--- a/0.4/sitemap.xml
+++ b/0.4/sitemap.xml
@@ -2,237 +2,237 @@
https://docs.gpustack.ai/0.4/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/api-reference/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/architecture/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/code-of-conduct/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/contributing/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/development/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/overview/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/quickstart/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/scheduler/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/troubleshooting/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/upgrade/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/cli-reference/chat/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/cli-reference/download-tools/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/cli-reference/draw/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/cli-reference/start/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/installation/air-gapped-installation/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/installation/docker-installation/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/installation/installation-requirements/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/installation/installation-script/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/installation/manual-installation/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/installation/uninstallation/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/creating-text-embeddings/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/inference-on-cpus/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/inference-with-function-calling/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/performing-distributed-inference-across-workers/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/running-inference-with-ascend-npus/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/running-inference-with-moorethreads-gpus/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/setting-up-a-multi-node-gpustack-cluster/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/using-audio-models/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/using-image-generation-models/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/using-reranker-models/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/tutorials/using-vision-language-models/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/api-key-management/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/image-generation-apis/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/inference-backends/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/model-management/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/openai-compatible-apis/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/pinned-backend-versions/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/rerank-api/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/user-management/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/playground/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/playground/audio/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/playground/chat/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/playground/embedding/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/playground/image/
- 2024-12-13
+ 2024-12-17
daily
https://docs.gpustack.ai/0.4/user-guide/playground/rerank/
- 2024-12-13
+ 2024-12-17
daily
\ No newline at end of file
diff --git a/0.4/sitemap.xml.gz b/0.4/sitemap.xml.gz
index e23da6c..32fd431 100644
Binary files a/0.4/sitemap.xml.gz and b/0.4/sitemap.xml.gz differ
diff --git a/0.4/tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/index.html b/0.4/tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/index.html
index 4fb1e20..4eec4eb 100644
--- a/0.4/tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/index.html
+++ b/0.4/tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/index.html
@@ -1635,7 +1635,7 @@ Running Inference on
Prerequisites
- A Copilot+ PC with Snapdragon X. In this tutorial, we use the Dell XPS 13 9345.
-- Install AMD64 Python 3.10 or above. See details
+- Install AMD64 Python (version 3.10 to 3.12). See details
Installing GPUStack
Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack: