From 990b3398882e7ee3395ae15020c606f53523e052 Mon Sep 17 00:00:00 2001 From: gitlawr Date: Fri, 6 Dec 2024 11:59:51 +0800 Subject: [PATCH] Deployed c87fcfb to 0.4 with MkDocs 1.6.0 and mike 2.1.3 --- .../air-gapped-installation/index.html | 16 +++ 0.4/overview/index.html | 8 +- 0.4/search/search_index.json | 2 +- 0.4/sitemap.xml | 92 +++++++++--------- 0.4/sitemap.xml.gz | Bin 704 -> 704 bytes 5 files changed, 65 insertions(+), 53 deletions(-) diff --git a/0.4/installation/air-gapped-installation/index.html b/0.4/installation/air-gapped-installation/index.html index 3455beb..af2cbc1 100644 --- a/0.4/installation/air-gapped-installation/index.html +++ b/0.4/installation/air-gapped-installation/index.html @@ -1688,6 +1688,16 @@

Step 1: Download the Required Packages
# Download dependency tools and save them as an archive
gpustack download-tools --save-archive gpustack_offline_tools.tar.gz
+

Optional: Additional Dependencies for macOS.

+
# Deploying the speech-to-text CosyVoice model on macOS requires additional dependencies.
+brew install openfst
+export CPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include
+export LIBRARY_PATH=$(brew --prefix openfst)/lib
+
+AUDIO_DEPENDENCY_PACKAGE_SPEC="wetextprocessing"
+pip wheel $AUDIO_DEPENDENCY_PACKAGE_SPEC -w gpustack_audio_dependency_offline_packages
+mv gpustack_audio_dependency_offline_packages/* gpustack_offline_packages/ && rm -rf gpustack_audio_dependency_offline_packages
+

Note

This instruction assumes that the online environment uses the same GPU type as the air-gapped environment. If the GPU types differ, use the --device flag to specify the device type for the air-gapped environment. Refer to the download-tools command for more information.

@@ -1706,6 +1716,12 @@

Step 3: Install GPUStack

# Load and apply the pre-downloaded tools archive
gpustack download-tools --load-archive gpustack_offline_tools.tar.gz
+

Optional: Additional Dependencies for macOS.

+
+# Install the additional dependencies for the speech-to-text CosyVoice model on macOS.
+brew install openfst
+
+pip install --no-index --find-links=gpustack_offline_packages wetextprocessing
+

Now you can run GPUStack by following the instructions in the Manual Installation guide.

diff --git a/0.4/overview/index.html b/0.4/overview/index.html index 885e8e6..13d6baf 100644 --- a/0.4/overview/index.html +++ b/0.4/overview/index.html @@ -1762,12 +1762,8 @@

Supported Accelerators

We plan to support the following accelerators in future releases.

Supported Models

GPUStack uses llama-box (a bundled llama.cpp and stable-diffusion.cpp server), vLLM, and vox-box as backends and supports a wide range of models. Models from the following sources are supported:

diff --git a/0.4/search/search_index.json b/0.4/search/search_index.json index f0959f7..247f43a 100644 --- a/0.4/search/search_index.json +++ b/0.4/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"api-reference/","title":"API Reference","text":"

GPUStack provides a built-in Swagger UI. You can access it by navigating to <gpustack-server-url>/docs in your browser to view and interact with the APIs.

"},{"location":"architecture/","title":"Architecture","text":"

The following diagram shows the architecture of GPUStack:

"},{"location":"architecture/#server","title":"Server","text":"

The GPUStack server consists of the following components:

"},{"location":"architecture/#worker","title":"Worker","text":"

GPUStack workers are responsible for:

"},{"location":"architecture/#sql-database","title":"SQL Database","text":"

The GPUStack server connects to a SQL database as the datastore. GPUStack uses SQLite by default, but you can configure it to use an external PostgreSQL as well.
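For example, a minimal sketch of pointing GPUStack at an external PostgreSQL database via the documented --database-url flag (the host, credentials, and database name below are placeholders):

gpustack start --database-url \"postgresql://user:password@pg.example.com:5432/gpustack\"\n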

"},{"location":"architecture/#inference-server","title":"Inference Server","text":"

Inference servers are the backends that perform the inference tasks. GPUStack supports llama-box, vLLM, and vox-box as inference servers.

"},{"location":"architecture/#rpc-server","title":"RPC Server","text":"

The RPC server enables running the llama-box backend on a remote host. The Inference Server communicates with one or more RPC server instances, offloading computations to these remote hosts. This setup allows for distributed LLM inference across multiple workers, enabling the system to load larger models even when individual resources are limited.

"},{"location":"code-of-conduct/","title":"Contributor Code of Conduct","text":""},{"location":"code-of-conduct/#our-pledge","title":"Our Pledge","text":"

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

"},{"location":"code-of-conduct/#our-standards","title":"Our Standards","text":"

Examples of behavior that contributes to a positive environment for our community include:

Examples of unacceptable behavior include:

"},{"location":"code-of-conduct/#enforcement-responsibilities","title":"Enforcement Responsibilities","text":"

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

"},{"location":"code-of-conduct/#scope","title":"Scope","text":"

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

"},{"location":"code-of-conduct/#enforcement","title":"Enforcement","text":"

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at contact@gpustack.ai. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

"},{"location":"code-of-conduct/#enforcement-guidelines","title":"Enforcement Guidelines","text":"

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

"},{"location":"code-of-conduct/#1-correction","title":"1. Correction","text":"

Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

"},{"location":"code-of-conduct/#2-warning","title":"2. Warning","text":"

Community Impact: A violation through a single incident or series of actions.

Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

"},{"location":"code-of-conduct/#3-temporary-ban","title":"3. Temporary Ban","text":"

Community Impact: A serious violation of community standards, including sustained inappropriate behavior.

Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

"},{"location":"code-of-conduct/#4-permanent-ban","title":"4. Permanent Ban","text":"

Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

Consequence: A permanent ban from any sort of public interaction within the community.

"},{"location":"code-of-conduct/#attribution","title":"Attribution","text":"

This Code of Conduct is adapted from the Contributor Covenant, version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.

"},{"location":"contributing/","title":"Contributing to GPUStack","text":"

Thanks for taking the time to contribute to GPUStack!

Please review and follow the Code of Conduct.

"},{"location":"contributing/#filing-issues","title":"Filing Issues","text":"

If you find any bugs or run into any trouble, please search the reported issues first, as someone may have already experienced the same issue or we may be actively working on a solution.

If you can't find anything related to your issue, contact us by filing an issue. To help us diagnose and resolve it, please include as much information as possible, including:

"},{"location":"contributing/#contributing-code","title":"Contributing Code","text":"

To set up the development environment, please refer to the Development Guide.

If you're fixing a small issue, you can simply submit a PR. However, if you're planning to submit a bigger PR to implement a new feature or fix a relatively complex bug, please open an issue that explains the change and the motivation for it. If you're addressing a bug, please explain how to reproduce it.

"},{"location":"contributing/#updating-documentation","title":"Updating Documentation","text":"

If you have any updates to our documentation, feel free to file an issue with the documentation label or make a pull request.

"},{"location":"development/","title":"Development Guide","text":""},{"location":"development/#prerequisites","title":"Prerequisites","text":"

Install Python 3.10+.

"},{"location":"development/#set-up-environment","title":"Set Up Environment","text":"
make install\n
"},{"location":"development/#run","title":"Run","text":"
poetry run gpustack\n
"},{"location":"development/#build","title":"Build","text":"
make build\n

And check artifacts in dist.

"},{"location":"development/#test","title":"Test","text":"
make test\n
"},{"location":"development/#update-dependencies","title":"Update Dependencies","text":"
poetry add <something>\n

Or

poetry add --group dev <something>\n

For dev/testing dependencies.

"},{"location":"overview/","title":"GPUStack","text":"

GPUStack is an open-source GPU cluster manager for running AI models.

"},{"location":"overview/#key-features","title":"Key Features","text":""},{"location":"overview/#supported-platforms","title":"Supported Platforms","text":"

The following operating systems are verified to work with GPUStack:

OS Versions Windows 10, 11 Ubuntu >= 20.04 Debian >= 11 RHEL >= 8 Rocky >= 8 Fedora >= 36 OpenSUSE >= 15.3 (leap) OpenEuler >= 22.03

Note

The installation of GPUStack worker on a Linux system requires that the GLIBC version be 2.29 or higher.
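To check the GLIBC version on a worker node, you can run the standard glibc tool:

ldd --version\n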

"},{"location":"overview/#supported-architectures","title":"Supported Architectures","text":"

GPUStack supports both AMD64 and ARM64 architectures, with the following notes:

"},{"location":"overview/#supported-accelerators","title":"Supported Accelerators","text":"

We plan to support the following accelerators in future releases.

"},{"location":"overview/#supported-models","title":"Supported Models","text":"

GPUStack uses llama-box (a bundled llama.cpp and stable-diffusion.cpp server), vLLM, and vox-box as backends and supports a wide range of models. Models from the following sources are supported:

  1. Hugging Face

  2. ModelScope

  3. Ollama Library

  4. Local File Path

"},{"location":"overview/#example-models","title":"Example Models:","text":"Category Models Large Language Models(LLMs) Qwen, LLaMA, Mistral, Deepseek, Phi, Yi Vision Language Models(VLMs) Llama3.2-Vision, Pixtral , Qwen2-VL, LLaVA, InternVL2 Diffusion Models Stable Diffusion, FLUX Rerankers GTE, BCE, BGE, Jina Audio Models Whisper (speech-to-text), CosyVoice (text-to-speech)

For the full list of supported models, please refer to the supported models section in the inference backends documentation.

"},{"location":"overview/#openai-compatible-apis","title":"OpenAI-Compatible APIs","text":"

GPUStack serves OpenAI-compatible APIs. For details, please refer to OpenAI Compatible APIs.

"},{"location":"quickstart/","title":"Quickstart","text":""},{"location":"quickstart/#installation","title":"Installation","text":""},{"location":"quickstart/#linux-or-macos","title":"Linux or macOS","text":"

GPUStack provides a script to install it as a service on systemd- or launchd-based systems. To install GPUStack using this method, just run:

curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"quickstart/#windows","title":"Windows","text":"

Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:

Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
"},{"location":"quickstart/#other-installation-methods","title":"Other Installation Methods","text":"

For manual installation, docker installation or detailed configuration options, please refer to the Installation Documentation.

"},{"location":"quickstart/#getting-started","title":"Getting Started","text":"
  1. Run and chat with the llama3.2 model:
gpustack chat llama3.2 \"tell me a joke.\"\n
  2. Run and generate an image with the stable-diffusion-v3-5-large-turbo model:

Tip

This command downloads the model (~12GB) from Hugging Face. The download time depends on your network speed. Ensure you have enough disk space and VRAM (12GB) to run the model. If you encounter issues, you can skip this step and move to the next one.

gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf \\\n\"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors.\" \\\n--sample-steps 5 --show\n

Once the command completes, the generated image will appear in the default viewer. You can experiment with the prompt and CLI options to customize the output.

  3. Open http://myserver in the browser to access the GPUStack UI. Log in to GPUStack with username admin and the default password. You can run the following command to get the password for the default setup:

Linux or macOS

cat /var/lib/gpustack/initial_admin_password\n

Windows

Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
  4. Click Playground in the navigation menu. Now you can chat with the LLM in the UI playground.

  5. Click API Keys in the navigation menu, then click the New API Key button.

  6. Fill in the Name and click the Save button.

  7. Copy the generated API key and save it somewhere safe. Please note that you can only see it once on creation.

  8. Now you can use the API key to access the OpenAI-compatible API. For example, use curl as follows:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"llama3.2\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello!\"\n      }\n    ],\n    \"stream\": true\n  }'\n
"},{"location":"quickstart/#cleanup","title":"Cleanup","text":"

After you finish using the deployed models, you can go to the Models page in the GPUStack UI and delete the models to free up resources.

"},{"location":"scheduler/","title":"Scheduler","text":""},{"location":"scheduler/#summary","title":"Summary","text":"

The scheduler's primary responsibility is to calculate the resources required by model instances and to evaluate and select the optimal workers/GPUs for them through a series of strategies. This ensures that model instances can run efficiently. This document provides a detailed overview of the policies and processes used by the scheduler.

"},{"location":"scheduler/#scheduling-process","title":"Scheduling Process","text":""},{"location":"scheduler/#filtering-phase","title":"Filtering Phase","text":"

The filtering phase aims to narrow down the available workers or GPUs to those that meet specific criteria. The main policies involved are:

"},{"location":"scheduler/#label-matching-policy","title":"Label Matching Policy","text":"

This policy filters workers based on the label selectors configured for the model. If no label selectors are defined for the model, all workers are considered. Otherwise, the system checks whether the labels of each worker node match the model's label selectors, retaining only those workers that match.

"},{"location":"scheduler/#status-policy","title":"Status Policy","text":"

This policy filters workers based on their status, retaining only those that are in a READY state.

"},{"location":"scheduler/#resource-fit-policy","title":"Resource Fit Policy","text":"

The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. The goal of this policy is to ensure that model instances can run on the selected nodes without exceeding resource limits. The Resource Fit Policy prioritizes candidates in the following order:

"},{"location":"scheduler/#scoring-phase","title":"Scoring Phase","text":"

The scoring phase evaluates the filtered candidates, scoring them to select the optimal deployment location. The primary strategy involved is:

"},{"location":"scheduler/#placement-strategy-policy","title":"Placement Strategy Policy","text":"

Binpack: This strategy aims to \"pack\" as many model instances as possible into the fewest number of \"bins\" (e.g., Workers/GPUs) to optimize resource utilization. The goal is to minimize the number of bins used while maximizing resource efficiency, ensuring each bin is filled as efficiently as possible without exceeding its capacity. Model instances are placed in the bin with the least remaining space to minimize leftover capacity in each bin.

Spread: This strategy seeks to distribute multiple model instances across different worker nodes as evenly as possible, improving system fault tolerance and load balancing.

"},{"location":"troubleshooting/","title":"Troubleshooting","text":""},{"location":"troubleshooting/#view-gpustack-logs","title":"View GPUStack Logs","text":"

If you installed GPUStack using the installation script, you can view GPUStack logs at the following path:

"},{"location":"troubleshooting/#linux-or-macos","title":"Linux or macOS","text":"
/var/log/gpustack.log\n
"},{"location":"troubleshooting/#windows","title":"Windows","text":"
\"$env:APPDATA\\gpustack\\log\\gpustack.log\"\n
"},{"location":"troubleshooting/#configure-log-level","title":"Configure Log Level","text":"

You can enable the DEBUG log level when starting GPUStack by setting the --debug parameter.
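For example:

gpustack start --debug\n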

You can configure the log level of the GPUStack server at runtime by running the following command on the server node:

curl -X PUT http://localhost/debug/log_level -d \"debug\"\n
"},{"location":"troubleshooting/#reset-admin-password","title":"Reset Admin Password","text":"

In case you forgot the admin password, you can reset it by running the following command on the server node:

gpustack reset-admin-password\n
"},{"location":"upgrade/","title":"Upgrade","text":"

You can upgrade GPUStack using the installation script or by manually installing the desired version of the GPUStack Python package.

Note

When upgrading, upgrade the GPUStack server first, then upgrade the workers.

"},{"location":"upgrade/#upgrade-gpustack-using-the-installation-script","title":"Upgrade GPUStack Using the Installation Script","text":"

To upgrade GPUStack from an older version, re-run the installation script using the same configuration options you originally used.

Running the installation script will:

  1. Install the latest version of the GPUStack Python package.
  2. Update the system service (systemd, launchd, or Windows) init script to reflect the arguments passed to the installation script.
  3. Restart the GPUStack service.
"},{"location":"upgrade/#linux-and-macos","title":"Linux and macOS","text":"

For example, to upgrade GPUStack to the latest version on Linux or macOS:

curl -sfL https://get.gpustack.ai | <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n

To upgrade to a specific version, specify the INSTALL_PACKAGE_SPEC environment variable similar to the pip install command:

curl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=gpustack==x.y.z <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n
"},{"location":"upgrade/#windows","title":"Windows","text":"

To upgrade GPUStack to the latest version on a Windows system:

$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

To upgrade to a specific version:

$env:INSTALL_PACKAGE_SPEC = gpustack==x.y.z\n$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } <EXISTING_GPUSTACK_ARGS>\"\n
"},{"location":"upgrade/#docker-upgrade","title":"Docker Upgrade","text":"

If you installed GPUStack using Docker, upgrade to a new version by pulling the Docker image with the desired version tag.

For example:

docker pull gpustack/gpustack:vX.Y.Z\n

Then restart the GPUStack service with the new image.
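For example, a sketch assuming the old container was started with the name gpustack (the container name is an assumption; adjust it to your setup):

docker rm -f gpustack\ndocker run -d --gpus all -p 80:80 --ipc=host --name gpustack \\\n    -v gpustack-data:/var/lib/gpustack gpustack/gpustack:vX.Y.Z\n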

"},{"location":"upgrade/#manual-upgrade","title":"Manual Upgrade","text":"

If you installed GPUStack manually, upgrade it using the standard pip workflow.

For example, to upgrade GPUStack to the latest version:

pip install --upgrade gpustack\n

Then restart the GPUStack service according to your setup.

"},{"location":"cli-reference/chat/","title":"gpustack chat","text":"

Chat with a large language model.

gpustack chat model [prompt]\n
"},{"location":"cli-reference/chat/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for chat. prompt The prompt to send to the model. [Optional]"},{"location":"cli-reference/chat/#one-time-chat-with-a-prompt","title":"One-time Chat with a Prompt","text":"

If a prompt is provided, it performs a one-time inference. For example:

gpustack chat llama3 \"tell me a joke.\"\n

Example output:

Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!\n
"},{"location":"cli-reference/chat/#interactive-chat","title":"Interactive Chat","text":"

If the prompt argument is not provided, you can chat with the large language model interactively. For example:

gpustack chat llama3\n

Example output:

>tell me a joke.\nHere's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you smile!\n>Do you have a better one?\nHere's another one:\n\nWhy did the scarecrow win an award?\n\n(think about it for a sec...)\n\nBecause he was outstanding in his field!\n\nHope that one stuck with you!\n\nDo you want to hear another one?\n>\\quit\n
"},{"location":"cli-reference/chat/#interactive-commands","title":"Interactive Commands","text":"

The following commands are available in interactive chat:

Commands:\n  \\q or \\quit - Quit the chat\n  \\c or \\clear - Clear chat context in prompt\n  \\? or \\h or \\help - Print this help message\n
"},{"location":"cli-reference/chat/#connect-to-external-gpustack-server","title":"Connect to External GPUStack Server","text":"

If you are not running gpustack chat on the server node, or if you are serving on a custom host or port, you should provide the following environment variables:

Name Description GPUSTACK_SERVER_URL URL of the GPUStack server, e.g., http://myserver. GPUSTACK_API_KEY GPUStack API key."},{"location":"cli-reference/download-tools/","title":"gpustack download-tools","text":"

Download dependency tools, including llama-box, gguf-parser, and fastfetch.

gpustack download-tools [OPTIONS]\n
"},{"location":"cli-reference/download-tools/#configurations","title":"Configurations","text":"Flag Default Description ----tools-download-base-url value (empty) Base URL to download dependency tools. --save-archive value (empty) Path to save downloaded tools as a tar archive. --load-archive value (empty) Path to load downloaded tools from a tar archive, instead of downloading. --system value Default is the current OS. Operating system to download tools for. Options: linux, windows, macos. --arch value Default is the current architecture. Architecture to download tools for. Options: amd64, arm64. --device value Default is the current device. Device to download tools for. Options: cuda, mps, npu, musa, cpu."},{"location":"cli-reference/draw/","title":"gpustack draw","text":"

Generate an image with a diffusion model.

gpustack draw [model] [prompt]\n
"},{"location":"cli-reference/draw/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for image generation. prompt Text prompt to use for image generation.

The model can be either of the following:

  1. Name of a GPUStack model. You need to create a model in GPUStack before using it here.
  2. Reference to a Hugging Face GGUF diffusion model in Ollama style. When using this option, the model will be deployed if it is not already available. When no tag is specified, the default Q4_0 tag is used. Examples:
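For instance, reusing the repository from the Quickstart and relying on the default Q4_0 tag (the prompt is only illustrative):

gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF \"a lighthouse at sunset\" --show\n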
"},{"location":"cli-reference/draw/#configurations","title":"Configurations","text":"Flag Default Description --size value 512x512 Size of the image to generate, specified as widthxheight. --sampler value euler Sampling method. Options include: euler_a, euler, heun, dpm2, dpm++2s_a, dpm++2m, lcm, etc. --sample-steps value (Empty) Number of sampling steps. --cfg-scale value (Empty) Classifier-free guidance scale for balancing prompt adherence and creativity. --seed value (Empty) Seed for random number generation. Useful for reproducibility. --negative-prompt value (Empty) Text prompt for what to avoid in the image. --output value (Empty) Path to save the generated image. --show False If True, opens the generated image in the default image viewer. -d, --debug False Enable debug mode."},{"location":"cli-reference/start/","title":"gpustack start","text":"

Run GPUStack server or worker.

gpustack start [OPTIONS]\n
"},{"location":"cli-reference/start/#configurations","title":"Configurations","text":""},{"location":"cli-reference/start/#common-options","title":"Common Options","text":"Flag Default Description --config-file value (empty) Path to the YAML config file. -d value, --debug value False To enable debug mode, the short flag -d is not supported in Windows because this flag is reserved by PowerShell for CommonParameters. --data-dir value (empty) Directory to store data. Default is OS specific. --cache-dir value (empty) Directory to store cache (e.g., model files). Defaults to /cache. -t value, --token value Auto-generated. Shared secret used to add a worker. --huggingface-token value (empty) User Access Token to authenticate to the Hugging Face Hub. Can also be configured via the HF_TOKEN environment variable."},{"location":"cli-reference/start/#server-options","title":"Server Options","text":"Flag Default Description --host value 0.0.0.0 Host to bind the server to. --port value 80 Port to bind the server to. --disable-worker False Disable embedded worker. --bootstrap-password value Auto-generated. Initial password for the default admin user. --database-url value sqlite:///<data-dir>/database.db URL of the database. Example: postgresql://user:password@hostname:port/db_name --ssl-keyfile value (empty) Path to the SSL key file. --ssl-certfile value (empty) Path to the SSL certificate file. --force-auth-localhost False Force authentication for requests originating from localhost (127.0.0.1).When set to True, all requests from localhost will require authentication. --ollama-library-base-url https://registry.ollama.ai Base URL for the Ollama library. --disable-update-check False Disable update check."},{"location":"cli-reference/start/#worker-options","title":"Worker Options","text":"Flag Default Description -s value, --server-url value (empty) Server to connect to. --worker-ip value (empty) IP address of the worker node. Auto-detected by default. --disable-metrics False Disable metrics. --disable-rpc-servers False Disable RPC servers. --metrics-port value 10151 Port to expose metrics. --worker-port value 10150 Port to bind the worker to. Use a consistent value for all workers. --log-dir value (empty) Directory to store logs. --system-reserved value \"{\\\"ram\\\": 2, \\\"vram\\\": 0}\" The system reserves resources for the worker during scheduling, measured in GiB. By default, 2 GiB of RAM is reserved, Note: '{\\\"memory\\\": 2, \\\"gpu_memory\\\": 0}' is also supported, but it is deprecated and will be removed in future releases. --tools-download-base-url value Base URL for downloading dependency tools."},{"location":"cli-reference/start/#available-environment-variables","title":"Available Environment Variables","text":"

Most of the options can be set via environment variables. The environment variables are prefixed with GPUSTACK_ and are in uppercase. For example, --data-dir can be set via the GPUSTACK_DATA_DIR environment variable.
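For example, a quick sketch of setting the data directory through the environment instead of the CLI flag (the path is a placeholder):

export GPUSTACK_DATA_DIR=/data/gpustack\ngpustack start\n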

Below are additional environment variables that can be set:

Flag Description HF_ENDPOINT Hugging Face Hub endpoint. e.g., https://hf-mirror.com"},{"location":"cli-reference/start/#config-file","title":"Config File","text":"

You can configure start options using a YAML-format config file when starting GPUStack server or worker. Here is a complete example:

# Common Options\ndebug: false\ndata_dir: /path/to/data_dir\ncache_dir: /path/to/cache_dir\ntoken: mytoken\n\n# Server Options\nhost: 0.0.0.0\nport: 80\ndisable_worker: false\ndatabase_url: postgresql://user:password@hostname:port/db_name\nssl_keyfile: /path/to/keyfile\nssl_certfile: /path/to/certfile\nforce_auth_localhost: false\nbootstrap_password: myadminpassword\nollama_library_base_url: https://registry.mycompany.com\ndisable_update_check: false\n\n# Worker Options\nserver_url: http://myserver\nworker_ip: 192.168.1.101\ndisable_metrics: false\ndisable_rpc_servers: false\nmetrics_port: 10151\nworker_port: 10150\nlog_dir: /path/to/log_dir\nsystem_reserved:\n  ram: 2\n  vram: 0\ntools_download_base_url: https://mirror.mycompany.com\n
"},{"location":"installation/air-gapped-installation/","title":"Air-Gapped Installation","text":"

You can install GPUStack in an air-gapped environment. An air-gapped environment refers to a setup where GPUStack will be installed offline, behind a firewall, or behind a proxy.

The following methods are available for installing GPUStack in an air-gapped environment:

"},{"location":"installation/air-gapped-installation/#docker-installation","title":"Docker Installation","text":"

When running GPUStack with Docker, it works out of the box in an air-gapped environment as long as the Docker images are available. To do this, follow these steps:

  1. Pull GPUStack Docker images in an online environment.
  2. Publish Docker images to a private registry.
  3. Refer to the Docker Installation guide to run GPUStack using Docker.
"},{"location":"installation/air-gapped-installation/#manual-installation","title":"Manual Installation","text":"

For manual installation, you need to prepare the required packages and tools in an online environment and then transfer them to the air-gapped environment.

"},{"location":"installation/air-gapped-installation/#prerequisites","title":"Prerequisites","text":"

Set up an online environment identical to the air-gapped environment, including OS, architecture, and Python version.
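To confirm the two environments match, you can compare the output of standard commands on both machines, for example:

uname -s -m && python3 --version\n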

"},{"location":"installation/air-gapped-installation/#step-1-download-the-required-packages","title":"Step 1: Download the Required Packages","text":"

Run the following commands in an online environment:

# On Windows (PowerShell):\n# $PACKAGE_SPEC = \"gpustack\"\n\n# Optional: To include extra dependencies (vllm, audio, all) or install a specific version\n# PACKAGE_SPEC=\"gpustack[all]\"\n# PACKAGE_SPEC=\"gpustack==0.4.0\"\nPACKAGE_SPEC=\"gpustack\"\n\n# Download all required packages\npip wheel $PACKAGE_SPEC -w gpustack_offline_packages\n\n# Install GPUStack to access its CLI\npip install gpustack\n\n# Download dependency tools and save them as an archive\ngpustack download-tools --save-archive gpustack_offline_tools.tar.gz\n

Note

This instruction assumes that the online environment uses the same GPU type as the air-gapped environment. If the GPU types differ, use the --device flag to specify the device type for the air-gapped environment. Refer to the download-tools command for more information.
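For example, a sketch for preparing tools for CUDA-based air-gapped workers from an online machine without a GPU, using the documented --device option:

gpustack download-tools --device cuda --save-archive gpustack_offline_tools.tar.gz\n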

"},{"location":"installation/air-gapped-installation/#step-2-transfer-the-packages","title":"Step 2: Transfer the Packages","text":"

Transfer the following files from the online environment to the air-gapped environment.
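For example, a sketch using scp (the user, host, and destination path are placeholders):

scp -r gpustack_offline_packages gpustack_offline_tools.tar.gz user@airgapped-host:/path/to/workdir\n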

"},{"location":"installation/air-gapped-installation/#step-3-install-gpustack","title":"Step 3: Install GPUStack","text":"

In the air-gapped environment, run the following commands:

# Install GPUStack from the downloaded packages\npip install --no-index --find-links=gpustack_offline_packages gpustack\n\n# Load and apply the pre-downloaded tools archive\ngpustack download-tools --load-archive gpustack_offline_tools.tar.gz\n

Now you can run GPUStack by following the instructions in the Manual Installation guide.

"},{"location":"installation/docker-installation/","title":"Docker Installation","text":"

You can use the official Docker image to run GPUStack in a container. Installation using Docker is supported on:

"},{"location":"installation/docker-installation/#prerequisites","title":"Prerequisites","text":""},{"location":"installation/docker-installation/#run-gpustack-with-docker","title":"Run GPUStack with Docker","text":"

Run the following command to start the GPUStack server:

docker run -d --gpus all -p 80:80 --ipc=host \\\n    -v gpustack-data:/var/lib/gpustack gpustack/gpustack\n

Note

You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host\u2019s shared memory. It is used by vLLM and PyTorch to share data between processes under the hood, particularly for tensor parallel inference.
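For example, a sketch using --shm-size instead of --ipc=host (the 1g value is an arbitrary placeholder; size it to your workload):

docker run -d --gpus all -p 80:80 --shm-size=1g \\\n    -v gpustack-data:/var/lib/gpustack gpustack/gpustack\n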

You can set additional flags for the gpustack start command by appending them to the docker run command.

For example, to start a GPUStack worker:

docker run -d --gpus all --ipc=host --network=host \\\n    gpustack/gpustack --server-url http://myserver --token mytoken\n

Note

The --network=host flag is used to ensure that the server is accessible to the worker and the inference services running on it. Alternatively, you can set --worker-ip <host-ip> -p 10150:10150 -p 40000-41024:40000-41024 to expose relevant ports.

For configuration details, please refer to the CLI Reference.

"},{"location":"installation/docker-installation/#run-gpustack-with-docker-compose","title":"Run GPUStack with Docker Compose","text":"

Get the docker-compose file from the GPUStack repository, then run the following command to start the GPUStack server:

docker-compose up -d\n

You can update the docker-compose.yml file to customize the command while starting a GPUStack worker.

"},{"location":"installation/docker-installation/#build-your-own-docker-image","title":"Build Your Own Docker Image","text":"

The official Docker image is built with CUDA 12.4. If you want to use a different version of CUDA, you can build your own Docker image.

# Example Dockerfile\nARG CUDA_VERSION=12.4.1\n\nFROM nvidia/cuda:$CUDA_VERSION-cudnn-runtime-ubuntu22.04\n\nENV DEBIAN_FRONTEND=noninteractive\n\nRUN apt-get update && apt-get install -y \\\n    wget \\\n    tzdata \\\n    python3 \\\n    python3-pip \\\n    && rm -rf /var/lib/apt/lists/*\n\n\nRUN pip3 install gpustack[all] && \\\n    pip3 cache purge\n\nENTRYPOINT [ \"gpustack\", \"start\" ]\n

Run the following command to build the Docker image:

docker build -t my/gpustack --build-arg CUDA_VERSION=12.0.0 .\n
"},{"location":"installation/installation-script/","title":"Installation Script","text":""},{"location":"installation/installation-script/#linux-and-macos","title":"Linux and macOS","text":"

You can use the installation script available at https://get.gpustack.ai to install GPUStack as a service on systemd and launchd based systems.

You can set additional environment variables and CLI flags when running the script. The following are examples running the installation script with different configurations:

# Run server.\ncurl -sfL https://get.gpustack.ai | sh -s -\n\n# Run server without the embedded worker.\ncurl -sfL https://get.gpustack.ai | sh -s - --disable-worker\n\n# Run server with TLS.\ncurl -sfL https://get.gpustack.ai | sh -s - --ssl-keyfile /path/to/keyfile --ssl-certfile /path/to/certfile\n\n# Run server with external postgresql database.\ncurl -sfL https://get.gpustack.ai | sh -s - --database-url \"postgresql://username:password@host:port/database_name\"\n\n# Run worker with specified IP.\ncurl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken --worker-ip 192.168.1.100\n\n# Install with a custom index URL.\ncurl -sfL https://get.gpustack.ai | INSTALL_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple sh -s -\n\n# Install a custom wheel package other than releases form pypi.org.\ncurl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=https://repo.mycompany.com/my-gpustack.whl sh -s -\n
"},{"location":"installation/installation-script/#windows","title":"Windows","text":"

You can use the installation script available at https://get.gpustack.ai to install GPUStack as a service on Windows Service Manager.

You can set additional environment variables and CLI flags when running the script. The following are examples running the installation script with different configurations:

# Run server.\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Run server without the embedded worker.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --disable-worker\"\n\n# Run server with TLS.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --ssl-keyfile 'C:\\path\\to\\keyfile' --ssl-certfile 'C:\\path\\to\\certfile'\"\n\n\n# Run server with external postgresql database.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --database-url 'postgresql://username:password@host:port/database_name'\"\n\n# Run worker with specified IP.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --worker-ip '192.168.1.100'\"\n\n# Run worker with customize reserved resource.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --system-reserved '{\"\"ram\"\":5, \"\"vram\"\":5}'\"\n\n# Install with a custom index URL.\n$env:INSTALL_INDEX_URL = \"https://pypi.tuna.tsinghua.edu.cn/simple\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Install a custom wheel package other than releases form pypi.org.\n$env:INSTALL_PACKAGE_SPEC = \"https://repo.mycompany.com/my-gpustack.whl\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

Warning

Avoid using PowerShell ISE as it is not compatible with the installation script.

"},{"location":"installation/installation-script/#available-environment-variables-for-the-installation-script","title":"Available Environment Variables for the Installation Script","text":"Name Default Description INSTALL_INDEX_URL (empty) Base URL of the Python Package Index. INSTALL_PACKAGE_SPEC gpustack[all] or gpustack[audio] The package spec to install. The install script will automatically decide based on the platform. It supports PYPI package names, URLs, and local paths. See the pip install documentation for details. INSTALL_PRE_RELEASE (empty) If set to 1, pre-release packages will be installed. INSTALL_SKIP_POST_CHECK (empty) If set to 1, the installation script will skip the post-installation check."},{"location":"installation/installation-script/#set-environment-variables-for-the-gpustack-service","title":"Set Environment Variables for the GPUStack Service","text":"

You can set environment variables for the GPUStack service in an environment file located at:

The following is an example of the content of the file:

HF_TOKEN=\"mytoken\"\nHF_ENDPOINT=\"https://my-hf-endpoint\"\n

Note

Unlike systemd, launchd and Windows services do not natively support reading environment variables from a file. For these services, configuration via the environment file is implemented by the installation script, which reads the file and applies the variables to the service configuration. After modifying the environment file on Windows or macOS, you need to re-run the installation script to apply the changes to the GPUStack service.

"},{"location":"installation/installation-script/#available-cli-flags","title":"Available CLI Flags","text":"

The CLI flags appended to the installation script are passed directly to the gpustack start command. You can refer to the CLI Reference for details.

"},{"location":"installation/installation-script/#install-server","title":"Install Server","text":"

To set up the GPUStack server (the management node), install GPUStack without the --server-url flag. By default, the GPUStack server includes an embedded worker. To disable this embedded worker on the server, use the --disable-worker flag.

"},{"location":"installation/installation-script/#install-worker","title":"Install Worker","text":"

To form a cluster, you can add GPUStack workers on additional nodes. Install GPUStack with the --server-url flag to specify the server's address and the --token flag for worker authentication.

Examples are as follows:

"},{"location":"installation/installation-script/#linux-or-macos","title":"Linux or macOS","text":"
curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n

In the default setup, you can run the following on the server node to get the token used for adding workers:

cat /var/lib/gpustack/token\n
"},{"location":"installation/installation-script/#windows_1","title":"Windows","text":"
Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url http://myserver --token mytoken\"\n

In the default setup, you can run the following on the server node to get the token used for adding workers:

Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n
"},{"location":"installation/manual-installation/","title":"Manual Installation","text":""},{"location":"installation/manual-installation/#prerequites","title":"Prerequites:","text":"

Install Python 3.10 or above with pip.

"},{"location":"installation/manual-installation/#install-gpustack-cli","title":"Install GPUStack CLI","text":"

Run the following to install GPUStack:

# You can add extra dependencies, options are \"vllm\", \"audio\" and \"all\".\n# e.g., gpustack[all]\npip install gpustack\n

To verify, run:

gpustack version\n
"},{"location":"installation/manual-installation/#run-gpustack","title":"Run GPUStack","text":"

Run the following command to start the GPUStack server:

gpustack start\n

By default, GPUStack uses /var/lib/gpustack as the data directory, so you need sudo or proper permissions for it. You can also set a custom data directory by running:

gpustack start --data-dir mypath\n
"},{"location":"installation/manual-installation/#run-gpustack-as-a-system-service","title":"Run GPUStack as a System Service","text":"

The recommended way is to run GPUStack as a startup service. For example, using systemd:

Create a service file in /etc/systemd/system/gpustack.service:

[Unit]\nDescription=GPUStack Service\nWants=network-online.target\nAfter=network-online.target\n\n[Service]\nEnvironmentFile=-/etc/default/%N\nExecStart=gpustack start\nRestart=always\nRestartSec=3\nStandardOutput=append:/var/log/gpustack.log\nStandardError=append:/var/log/gpustack.log\n\n[Install]\nWantedBy=multi-user.target\n

Then start GPUStack:

systemctl daemon-reload\nsystemctl enable gpustack\nsystemctl start gpustack\n
"},{"location":"installation/uninstallation/","title":"Uninstallation","text":""},{"location":"installation/uninstallation/#uninstallation-script","title":"Uninstallation Script","text":"

Warning

The uninstallation script deletes the data in the local datastore (SQLite), the configuration, the model cache, and all of the scripts and CLI tools. It does not remove any data from external datastores.

If you installed GPUStack using the installation script, a script to uninstall GPUStack was generated during installation.

"},{"location":"installation/uninstallation/#linux-or-macos","title":"Linux or macOS","text":"

Run the following command to uninstall GPUStack:

sudo /var/lib/gpustack/uninstall.sh\n
"},{"location":"installation/uninstallation/#windows","title":"Windows","text":"

Run the following command in PowerShell to uninstall GPUStack:

Set-ExecutionPolicy Bypass -Scope Process -Force; & \"$env:APPDATA\\gpustack\\uninstall.ps1\"\n
"},{"location":"installation/uninstallation/#manual-uninstallation","title":"Manual Uninstallation","text":"

If you installed GPUStack manually, the following are example commands to uninstall it. You can modify them according to your setup:

# Stop and remove the service.\nsystemctl stop gpustack.service\nrm /etc/systemd/system/gpustack.service\nsystemctl daemon-reload\n# Uninstall the CLI.\npip uninstall gpustack\n# Remove the data directory.\nrm -rf /var/lib/gpustack\n
"},{"location":"tutorials/creating-text-embeddings/","title":"Creating Text Embeddings","text":"

Text embeddings are numerical representations of text that capture semantic meaning, enabling machines to understand relationships and similarities between different pieces of text. In essence, they transform text into vectors in a continuous space, where texts with similar meanings are positioned closer together. Text embeddings are widely used in applications such as natural language processing, information retrieval, and recommendation systems.

In this tutorial, we will demonstrate how to deploy embedding models in GPUStack and generate text embeddings using the deployed models.

"},{"location":"tutorials/creating-text-embeddings/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/creating-text-embeddings/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Enable the GGUF checkbox to filter models by GGUF format.
  5. Use the search bar in the top left to search for the model name CompendiumLabs/bge-small-en-v1.5-gguf.
  6. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/creating-text-embeddings/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"

We will use the GPUStack API to generate text embeddings, and an API key is required:

  1. Navigate to the API Keys page in the GPUStack UI.
  2. Click the New API Key button.
  3. Enter a name for the API key and click the Save button.
  4. Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/creating-text-embeddings/#step-3-generate-text-embeddings","title":"Step 3: Generate Text Embeddings","text":"

With the model deployed and an API key, you can generate text embeddings via the GPUStack API. Here is an example script using curl:

export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1-openai/embeddings \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"input\": \"The food was delicious and the waiter...\",\n    \"model\": \"bge-small-en-v1.5\",\n    \"encoding_format\": \"float\"\n  }'\n

Replace <your-server-url> with the URL of your GPUStack server and <your-api-key> with the API key you generated in the previous step.

Example response:

{\n  \"data\": [\n    {\n      \"embedding\": [\n        -0.012189436703920364, 0.016934078186750412, 0.003965042531490326,\n        -0.03453584015369415, -0.07623119652271271, -0.007116147316992283,\n        0.11278388649225235, 0.019714849069714546, 0.010370955802500248,\n        -0.04219457507133484, -0.029902394860982895, 0.01122555136680603,\n        0.022912170737981796, 0.031186765059828758, 0.006303929258137941,\n        # ... additional values\n      ],\n      \"index\": 0,\n      \"object\": \"embedding\"\n    }\n  ],\n  \"model\": \"bge-small-en-v1.5\",\n  \"object\": \"list\",\n  \"usage\": { \"prompt_tokens\": 12, \"total_tokens\": 12 }\n}\n
"},{"location":"tutorials/inference-on-cpus/","title":"Inference on CPUs","text":"

GPUStack supports inference on CPUs, offering flexibility when GPU resources are limited or when model sizes exceed available GPU memory. The following CPU inference modes are available:

Note

CPU inference is supported when using the llama-box (llama.cpp) backend.

To deploy a model with CPU offloading, enable the Allow CPU Offloading option in the deployment configuration (this setting is enabled by default).

After deployment, you can view the number of model layers offloaded to the CPU.

"},{"location":"tutorials/inference-with-function-calling/","title":"Inference with Function Calling","text":"

Function calling allows you to connect models to external tools and systems. This is useful for many things, such as empowering AI assistants with new capabilities or building deep integrations between your applications and the models.

In this tutorial, you\u2019ll learn how to set up and use function calling within GPUStack to extend your AI\u2019s capabilities.

Note

  1. Function calling is supported in the vLLM inference backend.
  2. Function calling is essentially achieved through prompt engineering, requiring models to be trained with internalized templates to enable this capability. Therefore, not all LLMs support function calling.
"},{"location":"tutorials/inference-with-function-calling/#prerequisites","title":"Prerequisites","text":"

Before proceeding, ensure the following:

"},{"location":"tutorials/inference-with-function-calling/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"
  1. Navigate to the Models page in the GPUStack UI and click the Deploy Model button. In the dropdown, select Hugging Face as the source for your model.
  2. Use the search bar to find the Qwen/Qwen2.5-7B-Instruct model.
  3. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  4. Click on the Add Parameter button and add the backend parameters required for tool calling (see the example after this list):
  5. Click the Save button to deploy the model.
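The exact parameters depend on the model and your vLLM version; a typical sketch for Qwen2.5 uses vLLM's tool-calling flags (verify them against your vLLM release):

--enable-auto-tool-choice\n--tool-call-parser hermes\n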

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/inference-with-function-calling/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"

We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:

  1. Navigate to the API Keys page in the GPUStack UI.
  2. Click the New API Key button.
  3. Enter a name for the API key and click the Save button.
  4. Copy the generated API key for later use.
"},{"location":"tutorials/inference-with-function-calling/#step-3-do-inference","title":"Step 3: Do Inference","text":"

With the model deployed and an API key, you can call the model via the GPUStack API. Here is an example script using curl (replace <your-server-url> with your GPUStack server URL and <your-api-key> with the API key generated in the previous step):

export GPUSTACK_SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $GPUSTACK_SERVER_URL/v1-openai/chat/completions \\\n-H \"Content-Type: application/json\" \\\n-H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n-d '{\n  \"model\": \"qwen2.5-7b-instruct\",\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"What'\\''s the weather like in Boston today?\"\n    }\n  ],\n  \"tools\": [\n    {\n      \"type\": \"function\",\n      \"function\": {\n        \"name\": \"get_current_weather\",\n        \"description\": \"Get the current weather in a given location\",\n        \"parameters\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"location\": {\n              \"type\": \"string\",\n              \"description\": \"The city and state, e.g. San Francisco, CA\"\n            },\n            \"unit\": {\n              \"type\": \"string\",\n              \"enum\": [\"celsius\", \"fahrenheit\"]\n            }\n          },\n          \"required\": [\"location\"]\n        }\n      }\n    }\n  ],\n  \"tool_choice\": \"auto\"\n}'\n

Example response:

{\n  \"model\": \"qwen2.5-7b-instruct\",\n  \"choices\": [\n    {\n      \"index\": 0,\n      \"message\": {\n        \"role\": \"assistant\",\n        \"content\": null,\n        \"tool_calls\": [\n          {\n            \"id\": \"chatcmpl-tool-b99d32848b324eaea4bac5a5830d00b8\",\n            \"type\": \"function\",\n            \"function\": {\n              \"name\": \"get_current_weather\",\n              \"arguments\": \"{\\\"location\\\": \\\"Boston, MA\\\", \\\"unit\\\": \\\"fahrenheit\\\"}\"\n            }\n          }\n        ]\n      },\n      \"finish_reason\": \"tool_calls\"\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 212,\n    \"total_tokens\": 242,\n    \"completion_tokens\": 30\n  }\n}\n
"},{"location":"tutorials/performing-distributed-inference-across-workers/","title":"Performing Distributed Inference Across Workers","text":"

This tutorial will guide you through the process of configuring and running distributed inference across multiple workers using GPUStack. Distributed inference allows you to handle larger language models by distributing the computational workload among multiple workers. This is particularly useful when individual workers do not have sufficient resources, such as VRAM, to run the entire model independently.

"},{"location":"tutorials/performing-distributed-inference-across-workers/#prerequisites","title":"Prerequisites","text":"

Before proceeding, ensure the following:

In this tutorial, we\u2019ll assume a cluster with two nodes, each equipped with an NVIDIA P40 GPU (22GB VRAM), as shown in the following image:

We aim to run a large language model that requires more VRAM than a single worker can provide. For this tutorial, we\u2019ll use the Qwen/Qwen2.5-72B-Instruct model with the q2_k quantization format. The required resources for running this model can be estimated using the gguf-parser tool:

$ gguf-parser --hf-repo Qwen/Qwen2.5-72B-Instruct-GGUF --hf-file qwen2.5-72b-instruct-q2_k-00001-of-00007.gguf --ctx-size=8192 --in-short --skip-architecture --skip-metadata --skip-tokenizer\n\n+--------------------------------------------------------------------------------------+\n| ESTIMATE                                                                             |\n+----------------------------------------------+---------------------------------------+\n|                      RAM                     |                 VRAM 0                |\n+--------------------+------------+------------+----------------+----------+-----------+\n| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |    UMA   |   NONUMA  |\n+--------------------+------------+------------+----------------+----------+-----------+\n|      1 + 0 + 0     | 243.89 MiB | 393.89 MiB |     80 + 1     | 2.50 GiB | 28.92 GiB |\n+--------------------+------------+------------+----------------+----------+-----------+\n

From the output, we can see that the estimated VRAM requirement for this model exceeds the 22GB VRAM available on each worker node. Thus, we need to distribute the inference across multiple workers to successfully run the model.

"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"

Follow these steps to deploy the model from Hugging Face, enabling distributed inference:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Enable the GGUF checkbox to filter models by GGUF format.
  5. Use the search bar in the top left to search for the model name Qwen/Qwen2.5-72B-Instruct-GGUF.
  6. In the Available Files section, select the q2_k quantization format.
  7. Expand the Advanced section and scroll down. Disable the Allow CPU Offloading option and verify that the Allow Distributed Inference Across Workers option is enabled (it is enabled by default). GPUStack will evaluate the available resources in the cluster and run the model in a distributed manner if required.
  8. Click the Save button to deploy the model.

"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-2-verify-the-model-deployment","title":"Step 2: Verify the Model Deployment","text":"

Once the model is deployed, verify the deployment on the Models page, where you can view details about how the model is running across multiple workers.

You can also check worker and GPU resource usage by navigating to the Resources page.

Finally, go to the Playground page to interact with the model and verify that everything is functioning correctly.
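
You can also verify the deployment through the OpenAI-compatible chat completions API. The following is a minimal sketch; it assumes the deployed model is named qwen2.5-72b-instruct and that you have generated an API key:

export GPUSTACK_SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $GPUSTACK_SERVER_URL/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"qwen2.5-72b-instruct\",\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"Hello!\"}\n    ]\n  }'\n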

"},{"location":"tutorials/performing-distributed-inference-across-workers/#conclusion","title":"Conclusion","text":"

Congratulations! You have successfully configured and run distributed inference across multiple workers using GPUStack.

"},{"location":"tutorials/running-inference-with-ascend-npus/","title":"Running Inference With Ascend NPUs","text":"

GPUStack supports running inference on Ascend NPUs. This tutorial will guide you through the configuration steps.

"},{"location":"tutorials/running-inference-with-ascend-npus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Status Verified Linux Support Ubuntu 20.04 Device Status Verified Ascend 910 Support Ascend 910B"},{"location":"tutorials/running-inference-with-ascend-npus/#setup-steps","title":"Setup Steps","text":""},{"location":"tutorials/running-inference-with-ascend-npus/#install-ascend-packages","title":"Install Ascend packages","text":"
  1. Download Ascend packages

Choose the packages from the resources download center (links below) according to your system and hardware. GPUStack is compatible with CANN 8.x.

Download the driver and firmware from here.

Package Name Description Ascend-hdk-{chiptype}-npu-driver_{version}_linux-{arch}.run Ascend Driver (run format) Ascend-hdk-{chiptype}-npu-firmware_{version}.run Ascend Firmware (run format)

Download the toolkit and kernels from here.

Package Name Description Ascend-cann-toolkit_{version}_linux-{arch}.run CANN Toolkit (run format) Ascend-cann-kernels-{chiptype}_{version}_linux-{arch}.run CANN Kernels (run format)
  1. Create the user and group for running the driver
sudo groupadd HwHiAiUser\nsudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash\nsudo usermod -aG HwHiAiUser $USER\n
  1. Install driver
sudo chmod +x Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run\n# Driver installation, default installation path: \"/usr/local/Ascend\"\nsudo sh Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all\n

If you see the following message, the driver installation is complete:

Driver package installed successfully!\n
  1. Verify successful driver installation

After installing the driver, run the npu-smi info command to verify that it was installed correctly.

$npu-smi info\n+------------------------------------------------------------------------------------------------+\n| npu-smi 23.0.1                   Version: 23.0.1                                               |\n+---------------------------+---------------+----------------------------------------------------+\n| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|\n| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |\n+===========================+===============+====================================================+\n| 4     910B3               | OK            | 93.6        40                0    / 0             |\n| 0                         | 0000:01:00.0  | 0           0    / 0          3161 / 65536         |\n+===========================+===============+====================================================+\n+---------------------------+---------------+----------------------------------------------------+\n| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |\n+===========================+===============+====================================================+\n| No running processes found in NPU 4                                                            |\n+===========================+===============+====================================================+\n
  1. Install firmware
sudo chmod +x Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run\nsudo sh Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run --full\n

If you see the following message, the firmware installation is complete:

Firmware package installed successfully!\n
  1. Install toolkit and kernels

The following uses Ubuntu as an example; adapt the commands to your system.

Check for dependencies to ensure Python, GCC, and other required tools are installed.

gcc --version\ng++ --version\nmake --version\ncmake --version\ndpkg -l zlib1g| grep zlib1g| grep ii\ndpkg -l zlib1g-dev| grep zlib1g-dev| grep ii\ndpkg -l libsqlite3-dev| grep libsqlite3-dev| grep ii\ndpkg -l openssl| grep openssl| grep ii\ndpkg -l libssl-dev| grep libssl-dev| grep ii\ndpkg -l libffi-dev| grep libffi-dev| grep ii\ndpkg -l libbz2-dev| grep libbz2-dev| grep ii\ndpkg -l libxslt1-dev| grep libxslt1-dev| grep ii\ndpkg -l unzip| grep unzip| grep ii\ndpkg -l pciutils| grep pciutils| grep ii\ndpkg -l net-tools| grep net-tools| grep ii\ndpkg -l libblas-dev| grep libblas-dev| grep ii\ndpkg -l gfortran| grep gfortran| grep ii\ndpkg -l libblas3| grep libblas3| grep ii\n

If the commands return messages showing missing packages, install them as follows (adjust the command if only specific packages are missing):

sudo apt-get install -y gcc g++ make cmake zlib1g zlib1g-dev openssl libsqlite3-dev libssl-dev libffi-dev libbz2-dev libxslt1-dev unzip pciutils net-tools libblas-dev gfortran libblas3\n

Install Python dependencies:

pip3 install --upgrade pip\npip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions\n

Install the toolkit and kernels:

chmod +x Ascend-cann-toolkit_{version}_linux-{arch}.run\nchmod +x Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run\n\nsh Ascend-cann-toolkit_{version}_linux-{arch}.run --install\nsh Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run --install\n

Once installation completes, you should see a success message like this:

xxx install success\n
  1. Configure environment variables
echo \"source ~/Ascend/ascend-toolkit/set_env.sh\" >> ~/.bashrc\nsource ~/.bashrc\n

For more details, refer to the Ascend Documentation.

"},{"location":"tutorials/running-inference-with-ascend-npus/#installing-gpustack","title":"Installing GPUStack","text":"

Once your environment is ready, you can install GPUStack following the installation guide.

Once installed, you should see that GPUStack successfully recognizes the Ascend devices on the Resources page.

"},{"location":"tutorials/running-inference-with-ascend-npus/#running-inference","title":"Running Inference","text":"

After installation, you can deploy models and run inference. Refer to the model management for usage details.

The Ascend NPU supports inference through the llama-box (llama.cpp) backend. For supported models, refer to the llama.cpp list of models supported on Ascend NPU.

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/","title":"Running Inference With Moore Threads GPUs","text":"

GPUStack supports running inference on Moore Threads GPUs. This tutorial provides a comprehensive guide to configuring your system for optimal performance.

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Architecture Status Verified Linux x86_64 Support Ubuntu 20.04/22.04 Device Status Verified MTT S80 Support Yes MTT S3000 Support Yes MTT S4000 Support Yes"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#prerequisites","title":"Prerequisites","text":"

The following instructions are applicable for Ubuntu 20.04/22.04 systems with x86_64 architecture.

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#configure-the-container-runtime","title":"Configure the Container Runtime","text":"

Follow these links to install and configure the container runtime:

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#verify-container-runtime-configuration","title":"Verify Container Runtime Configuration","text":"

Run the following commands and ensure the output shows mthreads as the default runtime.

$ (cd /usr/bin/musa && sudo ./docker setup $PWD)\n$ docker info | grep mthreads\n Runtimes: mthreads mthreads-experimental runc\n Default Runtime: mthreads\n
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#installing-gpustack","title":"Installing GPUStack","text":"

To set up an isolated environment for GPUStack, we recommend using Docker.

docker run -d --name gpustack-musa -p 9009:80 --ipc=host -v gpustack-data:/var/lib/gpustack \\\n    gpustack/gpustack:main-musa\n

This command starts the GPUStack MUSA container in the background, maps port 9009 on the host to port 80 in the container, shares the host IPC namespace, and persists GPUStack data in the gpustack-data volume.

To check the logs of the running container, use the following command:

docker logs -f gpustack-musa\n

If the following message appears, the GPUStack container is running successfully:

2024-11-15T23:37:46+00:00 - gpustack.server.server - INFO - Serving on 0.0.0.0:80.\n2024-11-15T23:37:46+00:00 - gpustack.worker.worker - INFO - Starting GPUStack worker.\n

Once the container is running, access the GPUStack web interface by navigating to http://localhost:9009 in your browser.

After the initial setup for GPUStack, you should see the following screen:

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#dashboard","title":"Dashboard","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#workers","title":"Workers","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#gpus","title":"GPUs","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#running-inference","title":"Running Inference","text":"

After installation, you can deploy models and run inference. Refer to the model management for detailed usage instructions.

Moore Threads GPUs support inference through the llama-box (llama.cpp) backend. Most recent models are supported (e.g., llama3.2:1b, llama3.2-vision:11b, qwen2.5:7b, etc.).

Use mthreads-gmi to verify if the model is offloaded to the GPU.

root@a414c45864ee:/# mthreads-gmi\nSat Nov 16 12:00:16 2024\n---------------------------------------------------------------\n    mthreads-gmi:1.14.0          Driver Version:2.7.0\n---------------------------------------------------------------\nID   Name           |PCIe                |%GPU  Mem\n     Device Type    |Pcie Lane Width     |Temp  MPC Capable\n                                         |      ECC Mode\n+-------------------------------------------------------------+\n0    MTT S80        |00000000:01:00.0    |98%   1339MiB(16384MiB)\n     Physical       |16x(16x)            |56C   YES\n                                         |      N/A\n---------------------------------------------------------------\n\n---------------------------------------------------------------\nProcesses:\nID   PID       Process name                         GPU Memory\n                                                         Usage\n+-------------------------------------------------------------+\n0    120       ...ird_party/bin/llama-box/llama-box       2MiB\n0    2022      ...ird_party/bin/llama-box/llama-box    1333MiB\n---------------------------------------------------------------\n
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/","title":"Running Inference on Copilot+ PCs with Snapdragon X","text":"

GPUStack supports running on ARM64 Windows, enabling use on Snapdragon X-based Copilot+ PCs.

Note

Only CPU-based inference is supported on Snapdragon X devices. GPUStack does not currently support GPU or NPU acceleration on this platform.

"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#prerequisites","title":"Prerequisites","text":""},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#installing-gpustack","title":"Installing GPUStack","text":"

Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:

Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

After installation, follow the on-screen instructions to obtain credentials and log in to the GPUStack UI.

"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#deploying-a-model","title":"Deploying a Model","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button and select Ollama Library from the dropdown.
  3. Enter llama3.2 in the Name field.
  4. Select llama3.2 from the Ollama Model dropdown.
  5. Click Save to deploy the model.

Once deployed, you can monitor the model's status on the Models page.

"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#running-inference","title":"Running Inference","text":"

Navigate to the Playground page in the GPUStack UI, where you can interact with the deployed model.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/","title":"Setting Up a Multi-node GPUStack Cluster","text":"

This tutorial will guide you through setting up a multi-node GPUStack cluster, where you can distribute your workloads across multiple GPU-enabled nodes. This guide assumes you have basic knowledge of running commands on Linux, macOS, or Windows systems.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#prerequisites","title":"Prerequisites","text":"

Before starting, ensure you have the following:

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-1-install-gpustack-on-the-server-node","title":"Step 1: Install GPUStack on the Server Node","text":"

First, you need to install GPUStack on one of the nodes to act as the server node. Follow the instructions below based on your operating system.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos","title":"Linux or macOS","text":"

Run the following command on your server node:

curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows","title":"Windows","text":"

Run PowerShell as administrator and execute the following command:

Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

Once GPUStack is installed, you can proceed to configure your cluster by adding worker nodes.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-2-retrieve-the-token-from-the-server-node","title":"Step 2: Retrieve the Token from the Server Node","text":"

To add worker nodes to the cluster, you need the token generated by GPUStack on the server node. On the server node, run the following command to get the token:

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos_1","title":"Linux or macOS","text":"
cat /var/lib/gpustack/token\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows_1","title":"Windows","text":"
Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n

This token will be required in the next steps to authenticate worker nodes.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-3-add-worker-nodes-to-the-cluster","title":"Step 3: Add Worker Nodes to the Cluster","text":"

Now, you will install GPUStack on additional nodes (worker nodes) and connect them to the server node using the token.

Linux or macOS Worker Nodes

Run the following command on each worker node, replacing http://myserver with the URL of your server node and mytoken with the token retrieved in Step 2:

curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n

Windows Worker Nodes

Run PowerShell as administrator on each worker node and use the following command, replacing http://myserver and mytoken with the server URL and token:

Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } --server-url http://myserver --token mytoken\"\n

Once the command is executed, each worker node will connect to the main server and become part of the GPUStack cluster.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-4-verify-the-cluster-setup","title":"Step 4: Verify the Cluster Setup","text":"

After adding the worker nodes, you can verify that the cluster is set up correctly by accessing the GPUStack UI.

  1. Open a browser and navigate to http://myserver (replace myserver with the actual server URL).
  2. Log in with the default credentials (username admin). To retrieve the default password, run the following command on the server node:

Linux or macOS

cat /var/lib/gpustack/initial_admin_password\n

Windows

Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
  1. After logging in, navigate to the Resources page in the UI to see all connected nodes and their GPUs. You should see your worker nodes listed and ready for serving LLMs.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#conclusion","title":"Conclusion","text":"

Congratulations! You've successfully set up a multi-node GPUStack cluster! You can now scale your workloads across multiple nodes, making full use of your available GPUs to handle your tasks efficiently.

"},{"location":"tutorials/using-audio-models/","title":"Using Audio Models","text":"

GPUStack supports running both speech-to-text and text-to-speech models. Speech-to-text models convert audio inputs in various languages into written text, while text-to-speech models transform written text into natural and expressive speech.

In this tutorial, we will walk you through deploying and using speech-to-text and text-to-speech models in GPUStack.

"},{"location":"tutorials/using-audio-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/using-audio-models/#running-speech-to-text-model","title":"Running Speech-to-Text Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-speech-to-text-model","title":"Step 1: Deploy Speech-to-Text Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Use the search bar in the top left to search for the model name Systran/faster-whisper-medium.
  5. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-audio-models/#step-2-interact-with-speech-to-text-model-models","title":"Step 2: Interact with Speech-to-Text Model Models","text":"
  1. Navigate to the Playground > Audio page in the GPUStack UI.
  2. Select the Speech to Text Tab.
  3. Select the deployed model from the top-right dropdown.
  4. Click the Upload button to upload an audio file, or click the Microphone button to record audio.
  5. Click the Generate Text Content button to generate the text.
"},{"location":"tutorials/using-audio-models/#running-text-to-speech-model","title":"Running Text-to-Speech Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-text-to-speech-model","title":"Step 1: Deploy Text-to-Speech Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Use the search bar in the top left to search for the model name FunAudioLLM/CosyVoice-300M.
  5. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-audio-models/#step-2-interact-with-text-to-speech-model-models","title":"Step 2: Interact with Text to Speech Model Models","text":"
  1. Navigate to the Playground > Audio page in the GPUStack UI.
  2. Select the Text to Speech Tab.
  3. Choose the deployed model from the dropdown menu in the top-right corner. Then, configure the voice and output audio format.
  4. Input the text to generate.
  5. Click the Submit button to generate the audio.
"},{"location":"tutorials/using-image-generation-models/","title":"Using Image Generation Models","text":"

GPUStack supports deploying and running state-of-the-art image generation models. These models allow you to generate stunning images from textual descriptions, enabling applications in design, content creation, and more.

In this tutorial, we will walk you through deploying and using image generation models in GPUStack.

"},{"location":"tutorials/using-image-generation-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/using-image-generation-models/#step-1-deploy-the-stable-diffusion-model","title":"Step 1: Deploy the Stable Diffusion Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Use the search bar in the top left to search for the model name gpustack/stable-diffusion-v3-5-medium-GGUF.
  5. In the Available Files section, select the stable-diffusion-v3-5-medium-Q4_0.gguf file.
  6. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-image-generation-models/#step-2-use-the-model-for-image-generation","title":"Step 2: Use the Model for Image Generation","text":"
  1. Navigate to the Playground > Image page in the GPUStack UI.
  2. Verify that the deployed model is selected from the top-right Model dropdown.
  3. Enter a prompt describing the image you want to generate. For example:
a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She wears a dark-colored outfit with subtle patterns. The overall style of the artwork is reminiscent of fantasy or supernatural genres.\n
  1. Select euler in the Sampler dropdown.
  2. Set the Sample Steps to 20.
  3. Click the Submit button to create the image.

The generated image will be displayed in the UI. Your image may look different given the seed and randomness involved in the generation process.
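
You can also generate images programmatically through the image generation API described in the Image Generation APIs guide. The following is a minimal sketch that mirrors the settings above (euler sampler, 20 sample steps); the server URL, API key, and shortened prompt are placeholders:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/generate \\\n    -H \"Content-Type: application/json\" \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -d '{\n        \"n\": 1,\n        \"response_format\": \"b64_json\",\n        \"size\": \"512x512\",\n        \"prompt\": \"a female character with long, flowing hair made of swirling aurora-like patterns\",\n        \"sampler\": \"euler\",\n        \"sample_steps\": 20\n    }'\n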

"},{"location":"tutorials/using-image-generation-models/#conclusion","title":"Conclusion","text":"

Congratulations! You\u2019ve successfully deployed and used an image generation model in GPUStack. With this setup, you can generate unique and visually compelling images from textual prompts. Experiment with different prompts and settings to push the boundaries of what\u2019s possible.

"},{"location":"tutorials/using-reranker-models/","title":"Using Reranker Models","text":"

Reranker models are specialized models designed to improve the ranking of a list of items based on relevance to a given query. They are commonly used in information retrieval and search systems to refine initial search results, prioritizing items that are more likely to meet the user\u2019s intent. Reranker models take the initial document list and reorder items to enhance precision in applications such as search engines, recommendation systems, and question-answering tasks.

In this tutorial, we will guide you through deploying and using reranker models in GPUStack.

"},{"location":"tutorials/using-reranker-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/using-reranker-models/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Enable the GGUF checkbox to filter models by GGUF format.
  5. Use the search bar in the top left to search for the model name gpustack/bge-reranker-v2-m3-GGUF.
  6. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-reranker-models/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"

We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:

  1. Navigate to the API Keys page in the GPUStack UI.
  2. Click the New API Key button.
  3. Enter a name for the API key and click the Save button.
  4. Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/using-reranker-models/#step-3-reranking","title":"Step 3: Reranking","text":"

With the model deployed and an API key, you can rerank a list of documents via the GPUStack API. Here is an example script using curl:

export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1/rerank \\\n    -H \"Content-Type: application/json\" \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -d '{\n        \"model\": \"bge-reranker-v2-m3\",\n        \"query\": \"What is a panda?\",\n        \"top_n\": 3,\n        \"documents\": [\n            \"hi\",\n            \"it is a bear\",\n            \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n        ]\n    }' | jq\n

Replace <your-server-url> with the URL of your GPUStack server and <your-api-key> with the API key you generated in the previous step.

Example response:

{\n  \"model\": \"bge-reranker-v2-m3\",\n  \"object\": \"list\",\n  \"results\": [\n    {\n      \"document\": {\n        \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n      },\n      \"index\": 2,\n      \"relevance_score\": 1.951932668685913\n    },\n    {\n      \"document\": {\n        \"text\": \"it is a bear\"\n      },\n      \"index\": 1,\n      \"relevance_score\": -3.7347371578216553\n    },\n    {\n      \"document\": {\n        \"text\": \"hi\"\n      },\n      \"index\": 0,\n      \"relevance_score\": -6.157620906829834\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 69,\n    \"total_tokens\": 69\n  }\n}\n
"},{"location":"tutorials/using-vision-language-models/","title":"Using Vision Language Models","text":"

Vision Language Models can process both visual (image) and language (text) data simultaneously, making them versatile tools for various applications, such as image captioning, visual question answering, and more. In this tutorial, you will learn how to deploy and interact with Vision Language Models (VLMs) in GPUStack.

The procedure for deploying and interacting with these models in GPUStack is similar. The main difference is the parameters you need to set when deploying the models. For more information on the parameters you can set, please refer to Backend Parameters.

In this tutorial, we will cover the deployment of the following models:

"},{"location":"tutorials/using-vision-language-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

Note

An Ubuntu node equipped with one H100 (80GB) GPU is used throughout this tutorial.

"},{"location":"tutorials/using-vision-language-models/#step-1-install-gpustack","title":"Step 1: Install GPUStack","text":"

Run the following command to install GPUStack:

curl -sfL https://get.gpustack.ai | sh -s - --huggingface-token <Hugging Face API Key>\n

Replace <Hugging Face API Key> with your Hugging Face API key. GPUStack will use this key to download the model files.

"},{"location":"tutorials/using-vision-language-models/#step-2-log-in-to-gpustack-ui","title":"Step 2: Log in to GPUStack UI","text":"

Run the following command to get the default password:

cat /var/lib/gpustack/initial_admin_password\n

Open your browser and navigate to http://<your-server-ip>. Replace <your-server-ip> with the IP address of your server. Log in using the username admin and the password you obtained in the previous step.

"},{"location":"tutorials/using-vision-language-models/#step-3-deploy-vision-language-models","title":"Step 3: Deploy Vision Language Models","text":""},{"location":"tutorials/using-vision-language-models/#deploy-llama32-vision","title":"Deploy Llama3.2-Vision","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for meta-llama/Llama-3.2-11B-Vision-Instruct in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button multiple times and add the following parameters:
  1. Click the Save button.
"},{"location":"tutorials/using-vision-language-models/#deploy-qwen2-vl","title":"Deploy Qwen2-VL","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for Qwen/Qwen2-VL-7B-Instruct in the search bar.
  4. Click the Save button. The default configurations should work as long as you have enough GPU resources.
"},{"location":"tutorials/using-vision-language-models/#deploy-pixtral","title":"Deploy Pixtral","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for mistralai/Pixtral-12B-2409 in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button multiple times and add the following parameters:
  1. Click the Save button.
"},{"location":"tutorials/using-vision-language-models/#deploy-phi35-vision","title":"Deploy Phi3.5-Vision","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for microsoft/Phi-3.5-vision-instruct in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button and add the following parameter:
  1. Click the Save button.
"},{"location":"tutorials/using-vision-language-models/#step-4-interact-with-vision-language-models","title":"Step 4: Interact with Vision Language Models","text":"
  1. Navigate to the Playground page in the GPUStack UI.
  2. Select the deployed model from the top-right dropdown.
  3. Click on the Upload Image button above the input text area and upload an image.
  4. Enter a prompt in the input text area. For example, \"Describe the image.\"
  5. Click the Submit button to generate the output.
"},{"location":"tutorials/using-vision-language-models/#conclusion","title":"Conclusion","text":"

In this tutorial, you learned how to deploy and interact with Vision Language Models in GPUStack. You can use the same approach to deploy other Vision Language Models not covered in this tutorial. If you have any questions or need further assistance, feel free to reach out to us.

"},{"location":"user-guide/api-key-management/","title":"API Key Management","text":"

GPUStack supports authentication using API keys. Each GPUStack user can generate and manage their own API keys.

"},{"location":"user-guide/api-key-management/#create-api-key","title":"Create API Key","text":"
  1. Navigate to the API Keys page.
  2. Click the New API Key button.
  3. Fill in the Name, Description, and select the Expiration of the API key.
  4. Click the Save button.
  5. Copy and store the key somewhere safe, then click the Done button.

Note

Please note that you can only see the generated API key once upon creation.

"},{"location":"user-guide/api-key-management/#delete-api-key","title":"Delete API Key","text":"
  1. Navigate to the API Keys page.
  2. Find the API key you want to delete.
  3. Click the Delete button in the Operations column.
  4. Confirm the deletion.
"},{"location":"user-guide/api-key-management/#use-api-key","title":"Use API Key","text":"

GPUStack supports using the API key as a bearer token. The following is an example using curl:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"llama3\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello!\"\n      }\n    ],\n    \"stream\": true\n  }'\n
"},{"location":"user-guide/image-generation-apis/","title":"Image Generation APIs","text":"

GPUStack provides APIs for generating images given a prompt and/or an input image when running diffusion models.

Note

The image generation APIs are only available when using the llama-box inference backend.

"},{"location":"user-guide/image-generation-apis/#supported-models","title":"Supported Models","text":"

The following models are available for image generation:

"},{"location":"user-guide/image-generation-apis/#api-details","title":"API Details","text":"

The image generation APIs adhere to OpenAI API specification. While OpenAI APIs for image generation are simple and opinionated, GPUStack extends these capabilities with additional features.

"},{"location":"user-guide/image-generation-apis/#create-image","title":"Create Image","text":""},{"location":"user-guide/image-generation-apis/#streaming","title":"Streaming","text":"

This image generation API supports streaming responses to return the progress of the generation. To enable streaming, set the stream parameter to true in the request body. Example:

REQUEST : (application/json)\n{\n  \"n\": 1,\n  \"response_format\": \"b64_json\",\n  \"size\": \"512x512\",\n  \"prompt\": \"A lovely cat\",\n  \"quality\": \"standard\",\n  \"stream\": true,\n  \"stream_options\": {\n    \"include_usage\": true, // return usage information\n  }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#advanced-options","title":"Advanced Options","text":"

This image generation API supports additional options to control the generation process. The following options are available:

REQUEST : (application/json)\n{\n  \"n\": 1,\n  \"response_format\": \"b64_json\",\n  \"size\": \"512x512\",\n  \"prompt\": \"A lovely cat\",\n  \"sampler\": \"euler\",      // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\n  \"schedule\": \"default\",   // optional, select from default;discrete;karras;exponential;ays;gits\n  \"seed\": null,            // optional, random seed\n  \"cfg_scale\": 4.5,        // optional, for sampler, the scale of classifier-free guidance in the output phase\n  \"sample_steps\": 20,      // optional, number of sample steps\n  \"negative_prompt\": \"\",   // optional, negative prompt\n  \"stream\": true,\n  \"stream_options\": {\n    \"include_usage\": true, // return usage information\n  }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#create-image-edit","title":"Create Image Edit","text":""},{"location":"user-guide/image-generation-apis/#streaming_1","title":"Streaming","text":"

This image generation API supports streaming responses to return the progress of the generation. To enable streaming, set the stream parameter to true in the request body. Example:

REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nquality=standard\nimage=...                         // required\nmask=...                          // optional\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n  data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n  data: [DONE]\nCASE 2: illegal input image\n  error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#advanced-options_1","title":"Advanced Options","text":"

This image generation API supports additional options to control the generation process. The following options are available:

REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nimage=...                         // required\nmask=...                          // optional\nsampler=euler                     // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\nschedule=default                  // optional, select from default;discrete;karras;exponential;ays;gits\nseed=null                         // optional, random seed\ncfg_scale=4.5                     // optional, for sampler, the scale of classifier-free guidance in the output phase\nsample_steps=20                   // optional, number of sample steps\nnegative_prompt=\"\"                // optional, negative prompt\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n  data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n  data: [DONE]\nCASE 2: illegal input image\n  error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#usage","title":"Usage","text":"

The following are examples of using the image generation APIs:

"},{"location":"user-guide/image-generation-apis/#curl-create-image","title":"curl (Create Image)","text":"
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/generate \\\n    -H \"Content-Type: application/json\" \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -d '{\n        \"n\": 1,\n        \"response_format\": \"b64_json\",\n        \"size\": \"512x512\",\n        \"prompt\": \"A lovely cat\",\n        \"quality\": \"standard\",\n        \"stream\": true,\n        \"stream_options\": {\n        \"include_usage\": true\n        }\n    }'\n
"},{"location":"user-guide/image-generation-apis/#curl-create-image-edit","title":"curl (Create Image Edit)","text":"
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/edit \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -F image=\"@otter.png\" \\\n    -F mask=\"@mask.png\" \\\n    -F prompt=\"A lovely cat\" \\\n    -F n=1 \\\n    -F size=\"512x512\"\n
"},{"location":"user-guide/inference-backends/","title":"Inference Backends","text":"

GPUStack supports the following inference backends:

When users deploy a model, the backend is selected automatically based on the following criteria:

"},{"location":"user-guide/inference-backends/#llama-box","title":"llama-box","text":"

llama-box is an LM inference server based on llama.cpp and stable-diffusion.cpp.

"},{"location":"user-guide/inference-backends/#supported-platforms","title":"Supported Platforms","text":"

The llama-box backend supports Linux, macOS, and Windows (CPU inference only on the Windows ARM architecture).

"},{"location":"user-guide/inference-backends/#supported-models","title":"Supported Models","text":""},{"location":"user-guide/inference-backends/#supported-features","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"

After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.

"},{"location":"user-guide/inference-backends/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"

Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more other workers, offloading computation tasks to them.

"},{"location":"user-guide/inference-backends/#parameters-reference","title":"Parameters Reference","text":"

See the full list of supported parameters for llama-box here.

"},{"location":"user-guide/inference-backends/#vllm","title":"vLLM","text":"

vLLM is a high-throughput and memory-efficient LLM inference engine. It is a popular choice for running LLMs in production. vLLM seamlessly supports most state-of-the-art open-source models, including Transformer-like LLMs (e.g., Llama), Mixture-of-Experts LLMs (e.g., Mixtral), embedding models (e.g., E5-Mistral), and multi-modal LLMs (e.g., LLaVA).

By default, GPUStack estimates the VRAM requirement for the model instance based on the model's metadata. You can customize the parameters to fit your needs. The following vLLM parameters might be useful:
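
For example, the following is a sketch of backend parameters you might set for a vLLM deployment. These are standard vLLM flags, but the values are illustrative and should be adjusted to your model and hardware: --gpu-memory-utilization controls the fraction of GPU memory vLLM may use, --max-model-len caps the context length, and --tensor-parallel-size shards the model across multiple GPUs.

--gpu-memory-utilization=0.9\n--max-model-len=8192\n--tensor-parallel-size=2\n
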

For more details, please refer to vLLM documentation.

"},{"location":"user-guide/inference-backends/#supported-platforms_1","title":"Supported Platforms","text":"

The vLLM backend works on amd64 Linux.

Note

  1. When users install GPUStack on amd64 Linux using the installation script, vLLM is automatically installed.
  2. When users deploy a model using the vLLM backend, GPUStack sets worker label selectors to {\"os\": \"linux\", \"arch\": \"amd64\"} by default to ensure the model instance is scheduled to proper workers. You can customize the worker label selectors in the model configuration.
"},{"location":"user-guide/inference-backends/#supported-models_1","title":"Supported Models","text":"

Please refer to the vLLM documentation for supported models.

"},{"location":"user-guide/inference-backends/#supported-features_1","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#multimodal-language-models","title":"Multimodal Language Models","text":"

vLLM supports multimodal language models listed here. When users deploy a vision language model using the vLLM backend, image inputs are supported in the chat completion API.
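
For example, an image can be passed as an OpenAI-style image_url content part in the chat completion request. The following is a minimal sketch; the model name (qwen2-vl-7b-instruct) and the image URL are placeholders:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"qwen2-vl-7b-instruct\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"text\", \"text\": \"Describe the image.\"},\n          {\"type\": \"image_url\", \"image_url\": {\"url\": \"https://example.com/image.jpg\"}}\n        ]\n      }\n    ]\n  }'\n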

"},{"location":"user-guide/inference-backends/#parameters-reference_1","title":"Parameters Reference","text":"

See the full list of supported parameters for vLLM here.

"},{"location":"user-guide/inference-backends/#vox-box","title":"vox-box","text":"

vox-box is an inference engine designed for deploying text-to-speech and speech-to-text models. It also provides an API that is fully compatible with the OpenAI audio API.
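
Because the API follows the OpenAI audio API, deployed audio models can be called through GPUStack's OpenAI-compatible endpoints. The following is a minimal sketch; the model names (faster-whisper-medium, cosyvoice-300m), the voice name, and the audio file are placeholders that depend on what you have deployed:

export GPUSTACK_API_KEY=myapikey\n# Speech-to-text: transcribe a local audio file\ncurl http://myserver/v1-openai/audio/transcriptions \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -F file=\"@speech.mp3\" \\\n  -F model=\"faster-whisper-medium\"\n\n# Text-to-speech: synthesize speech and save it as an MP3 file\ncurl http://myserver/v1-openai/audio/speech \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"cosyvoice-300m\",\n    \"input\": \"Hello from GPUStack!\",\n    \"voice\": \"alloy\"\n  }' \\\n  --output speech.mp3\n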

"},{"location":"user-guide/inference-backends/#supported-platforms_2","title":"Supported Platforms","text":"

The vox-box backend supports Linux, macOS and Windows platforms.

Note

  1. To use Nvidia GPUs, ensure the following NVIDIA libraries are installed on workers:
    • cuBLAS for CUDA 12
    • cuDNN 9 for CUDA 12
  2. When users install GPUStack on Linux, macOS and Windows using the installation script, vox-box is automatically installed.
  3. CosyVoice models are natively supported on amd64 Linux and macOS. They are not supported on ARM Linux or Windows.
"},{"location":"user-guide/inference-backends/#supported-models_2","title":"Supported Models","text":"Model Type Link Supported Platforms Faster-whisper-large-v3 speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-large-v2 speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-large-v1 speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-medium speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-medium.en speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-small speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-small.en speech-to-text Hugging Face Linux, macOS, Windows Faster-distil-whisper-large-v3 speech-to-text Hugging Face Linux, macOS, Windows Faster-distil-whisper-large-v2 speech-to-text Hugging Face Linux, macOS, Windows Faster-distil-whisper-medium.en speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-tiny speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-tiny.en speech-to-text Hugging Face Linux, macOS, Windows CosyVoice-300M-Instruct text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-SFT text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-25Hz text-to-speech ModelScope Linux(ARM not supported), macOS, Windows(Not supported)"},{"location":"user-guide/inference-backends/#supported-features_2","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-gpucpu-offloading","title":"Allow GPU/CPU Offloading","text":"

vox-box supports deploying models to NVIDIA GPUs. If GPU resources are insufficient, it will automatically deploy the models to the CPU.

"},{"location":"user-guide/model-management/","title":"Model Management","text":"

You can manage large language models in GPUStack by navigating to the Models page. A model in GPUStack contains one or multiple replicas of model instances. On deployment, GPUStack automatically computes resource requirements for the model instances from model metadata and schedules them to available workers accordingly.

"},{"location":"user-guide/model-management/#deploy-model","title":"Deploy Model","text":"

Currently, models from Hugging Face, ModelScope, Ollama and local paths are supported.

"},{"location":"user-guide/model-management/#deploying-a-hugging-face-model","title":"Deploying a Hugging Face Model","text":"
  1. Click the Deploy Model button, then select Hugging Face in the dropdown.

  2. Search the model by name from Hugging Face using the search bar in the top left. For example, microsoft/Phi-3-mini-4k-instruct-gguf. If you only want to search for GGUF models, check the \"GGUF\" checkbox.

  3. Select a file with the desired quantization format from Available Files.

  4. Adjust the Name and Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#deploying-a-modelscope-model","title":"Deploying a ModelScope Model","text":"
  1. Click the Deploy Model button, then select ModelScope in the dropdown.

  2. Search the model by name from ModelScope using the search bar in the top left. For example, Qwen/Qwen2-0.5B-Instruct. If you only want to search for GGUF models, check the \"GGUF\" checkbox.

  3. Select a file with the desired quantization format from Available Files.

  4. Adjust the Name and Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#deploying-an-ollama-model","title":"Deploying an Ollama Model","text":"
  1. Click the Deploy Model button, then select Ollama Library in the dropdown.

  2. Fill in the Name of the model.

  3. Select an Ollama Model from the dropdown list, or input any Ollama model you need. For example, llama3, llama3:70b or youraccount/llama3:70b.

  4. Adjust the Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#deploying-a-local-path-model","title":"Deploying a Local Path Model","text":"

You can deploy a model from a local path. The model path can be a directory (e.g., a downloaded Hugging Face model directory) or a file (e.g., a GGUF model file) located on workers. This is useful when running in an air-gapped environment.

Note

  1. GPUStack does not check the validity of the model path for scheduling, which may lead to deployment failure if the model path is inaccessible. It is recommended to ensure the model path is accessible on all workers (e.g., using NFS, rsync, etc.). You can also use the worker selector configuration to deploy the model to specific workers.
  2. GPUStack cannot evaluate the model's resource requirements unless the server has access to the same model path. Consequently, you may observe empty VRAM/RAM allocations for a deployed model. To mitigate this, it is recommended to make the model files available on the same path on the server. Alternatively, you can customize backend parameters, such as tensor-split, to configure how the model is distributed across the GPUs.

To deploy a local path model:

  1. Click the Deploy Model button, then select Local Path in the dropdown.

  2. Fill in the Name of the model.

  3. Fill in the Model Path.

  4. Adjust the Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#edit-model","title":"Edit Model","text":"
  1. Find the model you want to edit on the model list page.
  2. Click the Edit button in the Operations column.
  3. Update the attributes as needed. For example, change the Replicas to scale up or down.
  4. Click the Save button.

Note

After editing the model, the configuration will not be applied to existing model instances. You need to delete the existing model instances. GPUStack will recreate new instances based on the updated model configuration.

"},{"location":"user-guide/model-management/#delete-model","title":"Delete Model","text":"
  1. Find the model you want to delete on the model list page.
  2. Click the ellipsis button in the Operations column, then select Delete.
  3. Confirm the deletion.
"},{"location":"user-guide/model-management/#view-model-instance","title":"View Model Instance","text":"
  1. Find the model you want to check on the model list page.
  2. Click the > symbol to view the instance list of the model.
"},{"location":"user-guide/model-management/#delete-model-instance","title":"Delete Model Instance","text":"
  1. Find the model you want to check on the model list page.
  2. Click the > symbol to view the instance list of the model.
  3. Find the model instance you want to delete.
  4. Click the ellipsis button for the model instance in the Operations column, then select Delete.
  5. Confirm the deletion.

Note

After a model instance is deleted, GPUStack will recreate a new instance to satisfy the expected replicas of the model if necessary.

"},{"location":"user-guide/model-management/#view-model-instance-logs","title":"View Model Instance Logs","text":"
  1. Find the model you want to check on the model list page.
  2. Click the > symbol to view the instance list of the model.
  3. Find the model instance you want to check.
  4. Click the View Logs button for the model instance in the Operations column.
"},{"location":"user-guide/model-management/#use-self-hosted-ollama-models","title":"Use Self-hosted Ollama Models","text":"

You can deploy self-hosted Ollama models by configuring the --ollama-library-base-url option in the GPUStack server. The Ollama Library URL should point to the base URL of the Ollama model registry. For example, https://registry.mycompany.com.

Here is an example workflow to set up a registry, publish a model, and use it in GPUStack:

# Run a self-hosted OCI registry\ndocker run -d -p 5001:5000 --name registry registry:2\n\n# Push a model to the registry using Ollama\nollama pull llama3\nollama cp llama3 localhost:5001/library/llama3\nollama push localhost:5001/library/llama3 --insecure\n\n# Start GPUStack server with the custom Ollama library URL\ncurl -sfL https://get.gpustack.ai | sh -s - --ollama-library-base-url http://localhost:5001\n

That's it! You can now deploy the model llama3 from Ollama Library source in GPUStack as usual, but the model will now be fetched from the self-hosted registry.

"},{"location":"user-guide/model-management/#advanced-model-configuration","title":"Advanced Model Configuration","text":"

GPUStack supports tailored configurations for model deployment.

"},{"location":"user-guide/model-management/#schedule-type","title":"Schedule Type","text":""},{"location":"user-guide/model-management/#auto","title":"Auto","text":"

GPUStack automatically schedules model instances to appropriate GPUs/Workers based on current resource availability.

When a Worker Selector is configured, the scheduler will deploy the model instance to a worker that has the specified labels.

  1. Navigate to the Resources page and edit the desired worker. Assign custom labels to the worker by adding them in the labels section.

  2. Go to the Models page and click on the Deploy Model button. Expand the Advanced section and input the previously assigned worker labels in the Worker Selector configuration. During deployment, the Model Instance will be allocated to the corresponding worker based on these labels.

"},{"location":"user-guide/model-management/#manual","title":"Manual","text":"

This schedule type allows users to specify which GPU to deploy the model instance on.

Select a GPU from the list. The model instance will attempt to deploy to this GPU if resources permit.

"},{"location":"user-guide/model-management/#backend","title":"Backend","text":"

The inference backend. Currently, GPUStack supports three backends: llama-box, vLLM and vox-box. GPUStack automatically selects the backend based on the model's configuration.

For more details, please refer to the Inference Backends section.

"},{"location":"user-guide/model-management/#backend-version","title":"Backend Version","text":"

Specify a backend version, such as v1.0.0. The version format and availability depend on the selected backend. This option is useful for ensuring compatibility or taking advantage of features introduced in specific backend versions. Refer to the Pinned Backend Versions section for more information.

"},{"location":"user-guide/model-management/#backend-parameters","title":"Backend Parameters","text":"

Input the backend parameters you want to customize when running the model. Parameters should be specified in the format --parameter=value, as a boolean flag --bool-parameter, or as separate fields for --parameter and value. For example, use --ctx-size=8192 for llama-box.
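
For illustration, a few example entries are sketched below. The --ctx-size flag is the llama-box example from above; --max-model-len and --tensor-parallel-size are common vLLM flags and are included only as assumptions about your chosen backend and model.

# llama-box example parameter\n--ctx-size=8192\n\n# vLLM example parameters (assumption: the model runs on the vLLM backend)\n--max-model-len=8192\n--tensor-parallel-size=2\n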

For the full list of supported parameters, please refer to the Inference Backends section.

"},{"location":"user-guide/model-management/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"

Note

Available for llama-box backend only.

After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.

"},{"location":"user-guide/model-management/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"

Note

Available for llama-box backend only.

Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more other workers, offloading computation tasks to them.

"},{"location":"user-guide/openai-compatible-apis/","title":"OpenAI Compatible APIs","text":"

GPUStack serves OpenAI-compatible APIs using the /v1-openai path. Most of the APIs also work under the /v1 path as an alias, except for the models endpoint, which is reserved for GPUStack management APIs.

"},{"location":"user-guide/openai-compatible-apis/#supported-endpoints","title":"Supported Endpoints","text":"

The following API endpoints are supported:

"},{"location":"user-guide/openai-compatible-apis/#usage","title":"Usage","text":"

The following are examples using the APIs in different languages:

"},{"location":"user-guide/openai-compatible-apis/#curl","title":"curl","text":"
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"llama3\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello!\"\n      }\n    ],\n    \"stream\": true\n  }'\n
"},{"location":"user-guide/openai-compatible-apis/#openai-python-api-library","title":"OpenAI Python API library","text":"
from openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\ncompletion = client.chat.completions.create(\n  model=\"llama3\",\n  messages=[\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"Hello!\"}\n  ]\n)\n\nprint(completion.choices[0].message)\n
"},{"location":"user-guide/openai-compatible-apis/#openai-node-api-library","title":"OpenAI Node API library","text":"
const OpenAI = require(\"openai\");\n\nconst openai = new OpenAI({\n  apiKey: \"myapikey\",\n  baseURL: \"http://myserver/v1-openai\",\n});\n\nasync function main() {\n  const params = {\n    model: \"llama3\",\n    messages: [\n      {\n        role: \"system\",\n        content: \"You are a helpful assistant.\",\n      },\n      {\n        role: \"user\",\n        content: \"Hello!\",\n      },\n    ],\n  };\n  const chatCompletion = await openai.chat.completions.create(params);\n  console.log(chatCompletion.choices[0].message);\n}\nmain();\n
"},{"location":"user-guide/pinned-backend-versions/","title":"Pinned Backend Versions","text":"

Inference engines in the generative AI domain are evolving rapidly to enhance performance and unlock new capabilities. This constant evolution provides exciting opportunities but also presents challenges for maintaining model compatibility and deployment stability.

GPUStack allows you to pin inference backend versions to specific releases, offering a balance between staying up-to-date with the latest advancements and ensuring a reliable runtime environment. This feature is particularly beneficial in the following scenarios:

By pinning backend versions, you gain full control over your inference environment, enabling both flexibility and predictability in deployment.

"},{"location":"user-guide/pinned-backend-versions/#automatic-installation-of-pinned-backend-versions","title":"Automatic Installation of Pinned Backend Versions","text":"

To simplify deployment, GPUStack supports the automatic installation of pinned backend versions when feasible. The process depends on the type of backend:

  1. Prebuilt Binaries For backends like llama-box, GPUStack downloads the specified version using the same mechanism as in GPUStack bootstrapping.

Tip

You can customize the download source using the --tools-download-base-url configuration option.

  2. Python-based Backends For backends like vLLM and vox-box, GPUStack uses pipx to install the specified version in an isolated Python environment.

Tip

This automation reduces manual intervention, allowing you to focus on deploying and using your models.
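
For example, a sketch of pointing GPUStack at a mirror for the prebuilt-binary downloads, using the --tools-download-base-url option mentioned in the tip above (the URL is a placeholder):

gpustack start --tools-download-base-url https://mirror.mycompany.com\n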

"},{"location":"user-guide/pinned-backend-versions/#manual-installation-of-pinned-backend-versions","title":"Manual Installation of Pinned Backend Versions","text":"

When automatic installation is not feasible or preferred, GPUStack provides a straightforward way to manually install specific versions of inference backends. Follow these steps:

  1. Prepare the Executable Install the backend executable or link it under the GPUStack bin directory. The default locations are:

Tip

You can customize the bin directory using the --bin-dir configuration option.

  2. Name the Executable Ensure the executable is named in the following format:

For example, the vLLM executable for version v0.6.4 should be named vllm_v0.6.4 on Linux.
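
As a minimal sketch, the steps on Linux might look like the following, assuming pipx is available and assuming /var/lib/gpustack/bin as the GPUStack bin directory (adjust this to your actual --bin-dir setting):

# Install vLLM v0.6.4 into an isolated environment with a suffixed executable name\npipx install vllm==0.6.4 --suffix=_v0.6.4\n\n# Link the suffixed executable into the GPUStack bin directory (path is an assumption)\nln -s $(which vllm_v0.6.4) /var/lib/gpustack/bin/vllm_v0.6.4\n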

By following these steps, you can maintain full control over the backend installation process, ensuring that the correct version is used for your deployment.

"},{"location":"user-guide/rerank-api/","title":"Rerank API","text":"

In the context of Retrieval-Augmented Generation (RAG), reranking refers to the process of selecting the most relevant information from retrieved documents or knowledge sources before presenting them to the user or utilizing them for answer generation.

GPUStack serves a Jina-compatible Rerank API using the /v1/rerank path.

Note

The Rerank API is only available when using the llama-box inference backend.

"},{"location":"user-guide/rerank-api/#supported-models","title":"Supported Models","text":"

The following models are available for reranking:

"},{"location":"user-guide/rerank-api/#usage","title":"Usage","text":"

The following is an example using the Rerank API:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1/rerank \\\n    -H \"Content-Type: application/json\" \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -d '{\n        \"model\": \"bge-reranker-v2-m3\",\n        \"query\": \"What is a panda?\",\n        \"top_n\": 3,\n        \"documents\": [\n            \"hi\",\n            \"it is a bear\",\n            \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n        ]\n    }' | jq\n

Example output:

{\n  \"model\": \"bge-reranker-v2-m3\",\n  \"object\": \"list\",\n  \"results\": [\n    {\n      \"document\": {\n        \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n      },\n      \"index\": 2,\n      \"relevance_score\": 1.951932668685913\n    },\n    {\n      \"document\": {\n        \"text\": \"it is a bear\"\n      },\n      \"index\": 1,\n      \"relevance_score\": -3.7347371578216553\n    },\n    {\n      \"document\": {\n        \"text\": \"hi\"\n      },\n      \"index\": 0,\n      \"relevance_score\": -6.157620906829834\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 69,\n    \"total_tokens\": 69\n  }\n}\n
"},{"location":"user-guide/user-management/","title":"User Management","text":"

GPUStack supports two user roles: Admin and User. Admins can monitor system status and manage models, users, and system settings. Users can manage their own API keys and use the completion API.

"},{"location":"user-guide/user-management/#default-admin","title":"Default Admin","text":"

On bootstrap, GPUStack creates a default admin user. The initial password for the default admin is stored in <data-dir>/initial_admin_password. In the default setup, it should be /var/lib/gpustack/initial_admin_password. You can customize the default admin password by setting the --bootstrap-password parameter when starting gpustack.
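
For example, on a default Linux setup you can retrieve the initial password, or start the server with a custom one (the password value below is only a placeholder):

# Print the initial admin password in the default setup\ncat /var/lib/gpustack/initial_admin_password\n\n# Or set a custom admin password when starting the server\ngpustack start --bootstrap-password mysecurepassword\n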

"},{"location":"user-guide/user-management/#create-user","title":"Create User","text":"
  1. Navigate to the Users page.
  2. Click the Create User button.
  3. Fill in Name, Full Name, Password, and select Role for the user.
  4. Click the Save button.
"},{"location":"user-guide/user-management/#update-user","title":"Update User","text":"
  1. Navigate to the Users page.
  2. Find the user you want to edit.
  3. Click the Edit button in the Operations column.
  4. Update the attributes as needed.
  5. Click the Save button.
"},{"location":"user-guide/user-management/#delete-user","title":"Delete User","text":"
  1. Navigate to the Users page.
  2. Find the user you want to delete.
  3. Click the ellipsis button in the Operations column, then select Delete.
  4. Confirm the deletion.
"},{"location":"user-guide/playground/","title":"Playground","text":"

GPUStack offers a playground UI where users can test and experiment with the APIs. Refer to each subpage for detailed instructions and information.

"},{"location":"user-guide/playground/audio/","title":"Audio Playground","text":"

The Audio Playground is a dedicated space for testing and experimenting with GPUStack\u2019s text-to-speech (TTS) and speech-to-text (STT) APIs. It allows users to interactively convert text to audio and audio to text, customize parameters, and review code examples for seamless API integration.

"},{"location":"user-guide/playground/audio/#text-to-speech","title":"Text to Speech","text":"

Switch to the \"Text to Speech\" tab to test TTS models.

"},{"location":"user-guide/playground/audio/#text-input","title":"Text Input","text":"

Enter the text you want to convert, then click the Submit button to generate the corresponding speech.

"},{"location":"user-guide/playground/audio/#clear-text","title":"Clear Text","text":"

Click the Clear button to reset the text input and remove the generated speech.

"},{"location":"user-guide/playground/audio/#select-model","title":"Select Model","text":"

Select an available TTS model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/audio/#customize-parameters","title":"Customize Parameters","text":"

Customize the voice and format of the audio output.

Tip

Supported voices may vary between models.

"},{"location":"user-guide/playground/audio/#view-code","title":"View Code","text":"

After experimenting with input text and parameters, click the View Code button to see how to call the API with the same input. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/audio/#speech-to-text","title":"Speech to Text","text":"

Switch to the \"Speech to Text\" tab to test STT models.

"},{"location":"user-guide/playground/audio/#provide-audio-file","title":"Provide Audio File","text":"

You can provide audio for transcription in two ways:

  1. Upload an audio file.
  2. Record audio online.

Note

If the online recording is not available, it could be due to one of the following reasons:

  1. For HTTPS or http://localhost access, microphone permissions must be enabled in your browser.
  2. For access via http://{host IP}, the URL must be added to your browser's trusted list.

    Example: In Chrome, navigate to chrome://flags/, add the GPUStack URL to \"Insecure origins treated as secure,\" and enable this option.

"},{"location":"user-guide/playground/audio/#select-model_1","title":"Select Model","text":"

Select an available STT model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/audio/#copy-text","title":"Copy Text","text":"

Copy the transcription results generated by the model.

"},{"location":"user-guide/playground/audio/#customize-parameters_1","title":"Customize Parameters","text":"

Select the appropriate language for your audio file to optimize transcription accuracy.

"},{"location":"user-guide/playground/audio/#view-code_1","title":"View Code","text":"

After experimenting with audio files and parameters, click the View Code button to see how to call the API with the same input. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/chat/","title":"Chat Playground","text":"

Interact with the chat completions API. The following is an example screenshot:

"},{"location":"user-guide/playground/chat/#prompts","title":"Prompts","text":"

You can adjust the prompt messages on the left side of the playground. There are three role types of prompt messages: system, user, and assistant.

"},{"location":"user-guide/playground/chat/#edit-system-message","title":"Edit System Message","text":"

You can add and edit the system message at the top of the playground.

"},{"location":"user-guide/playground/chat/#edit-user-and-assistant-messages","title":"Edit User and Assistant Messages","text":"

To add a user or assistant message, click the New Message button.

To remove a user or assistant message, click the minus button at the right corner of the message.

To change the role of a message, click the User or Assistant text at the beginning of the message.

"},{"location":"user-guide/playground/chat/#upload-image","title":"Upload Image","text":"

You can add images to the prompt by clicking the Upload Image button.

"},{"location":"user-guide/playground/chat/#clear-prompts","title":"Clear Prompts","text":"

Click the Clear button to clear all the prompts.

"},{"location":"user-guide/playground/chat/#select-model","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground. Please refer to Model Management to learn about how to manage models.

"},{"location":"user-guide/playground/chat/#customize-parameters","title":"Customize Parameters","text":"

You can customize completion parameters in the Parameters section.

"},{"location":"user-guide/playground/chat/#do-completion","title":"Do Completion","text":"

You can do a completion by clicking the Submit button.

"},{"location":"user-guide/playground/chat/#view-code","title":"View Code","text":"

Once you're done experimenting with the prompts and parameters, you can click the View Code button to see how to call the API with the same input in code. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/chat/#compare-playground","title":"Compare Playground","text":"

You can compare multiple models in the playground. The following is an example screenshot:

"},{"location":"user-guide/playground/chat/#comparision-mode","title":"Comparision Mode","text":"

You can choose the number of models to compare by clicking the comparison view buttons, which offer 2-, 3-, 4-, and 6-model comparisons.

"},{"location":"user-guide/playground/chat/#prompts_1","title":"Prompts","text":"

You can adjust the prompt messages similar to the chat playground.

"},{"location":"user-guide/playground/chat/#upload-image_1","title":"Upload Image","text":"

You can add images to the prompt by clicking the Upload Image button.

"},{"location":"user-guide/playground/chat/#clear-prompts_1","title":"Clear Prompts","text":"

Click the Clear button to clear all the prompts.

"},{"location":"user-guide/playground/chat/#select-model_1","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-left corner of each model panel.

"},{"location":"user-guide/playground/chat/#customize-parameters_1","title":"Customize Parameters","text":"

You can customize completion parameters by clicking the settings button of each model.

"},{"location":"user-guide/playground/embedding/","title":"Embedding Playground","text":"

The Embedding Playground lets you test the model\u2019s ability to convert text into embeddings. It allows you to experiment with multiple text inputs, visualize embeddings, and review code examples for API integration.

"},{"location":"user-guide/playground/embedding/#add-text","title":"Add Text","text":"

Add at least two text entries and click the Submit button to generate embeddings.

"},{"location":"user-guide/playground/embedding/#batch-input-text","title":"Batch Input Text","text":"

Enable Batch Input Mode to automatically split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets in a single operation.

"},{"location":"user-guide/playground/embedding/#visualization","title":"Visualization","text":"

Visualize the embedding results using PCA (Principal Component Analysis) to reduce dimensions and display them on a 2D plot. Results can be viewed in two formats:

  1. Chart - Display PCA results visually.
  2. JSON - View raw embeddings in JSON format.

In the chart, the distance between points represents the similarity between corresponding texts. Closer points indicate higher similarity.

"},{"location":"user-guide/playground/embedding/#clear","title":"Clear","text":"

Click the Clear button to reset text entries and clear the output.

"},{"location":"user-guide/playground/embedding/#select-model","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/embedding/#view-code","title":"View Code","text":"

After experimenting with the text inputs, click the View Code button to see how you can call the API with the same input. Code examples are provided in curl, Python, and Node.js.
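
For reference, a minimal curl sketch of the kind of request the playground generates, assuming the OpenAI-compatible embeddings endpoint described in the OpenAI Compatible APIs section; the model name is a placeholder for an embedding model you have deployed:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/embeddings \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\"model\": \"bge-m3\", \"input\": [\"GPUStack is a GPU cluster manager.\", \"Pandas are bears.\"]}'\n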

"},{"location":"user-guide/playground/image/","title":"Image Playground","text":"

The Image Playground is a dedicated space for testing and experimenting with GPUStack\u2019s image generation APIs. It allows users to interactively explore the capabilities of different models, customize parameters, and review code examples for seamless API integration.

"},{"location":"user-guide/playground/image/#prompt","title":"Prompt","text":"

You can input or randomly generate a prompt, then click the Submit button to generate an image.

"},{"location":"user-guide/playground/image/#clear-prompt","title":"Clear Prompt","text":"

Click the Clear button to reset the prompt and remove the generated image.

"},{"location":"user-guide/playground/image/#select-model","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/image/#customize-parameters","title":"Customize Parameters","text":"

You can customize the image generation parameters by switching between two API styles:

  1. OpenAI-compatible mode.
  2. Advanced mode.

"},{"location":"user-guide/playground/image/#advanced-parameters","title":"Advanced Parameters","text":"Parameter Default Description Counts 1 Number of images to generate. Size 512x512 The size of the generated image in 'widthxheight' format. Sampler euler_a The sampler algorithm for image generation. Options include 'euler_a', 'euler', 'heun', 'dpm2', 'dpm++2s_a', 'dpm++2m', 'dpm++2mv2', 'ipndm', 'ipndm_v', and 'lcm'. Schedule discrete The noise scheduling method. Sampler Steps 10 The number of sampling steps to perform. Higher values may improve image quality at the cost of longer processing time. CFG Scale 4.5 The scale for classifier-free guidance. A higher value increases adherence to the prompt. Negative Prompt (empty) A negative prompt to specify what the image should avoid. Seed (empty) Random seed.

Note

The maximum image size is restricted by the model's deployment settings. See the diagram below:

"},{"location":"user-guide/playground/image/#view-code","title":"View Code","text":"

After experimenting with prompts and parameters, click the View Code button to see how to call the API with the same inputs. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/rerank/","title":"Rerank Playground","text":"

The Rerank Playground allows you to test reranker models that reorder multiple texts based on their relevance to a query. Experiment with various input texts, customize parameters, and review code examples for API integration.

"},{"location":"user-guide/playground/rerank/#add-text","title":"Add Text","text":"

Add multiple text entries to the document for reranking.

"},{"location":"user-guide/playground/rerank/#bach-input-text","title":"Bach Input Text","text":"

Enable Batch Input Mode to split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets efficiently.

"},{"location":"user-guide/playground/rerank/#clear","title":"Clear","text":"

Click the Clear button to reset the document and query results.

"},{"location":"user-guide/playground/rerank/#query","title":"Query","text":"

Input a query and click the Submit button to get a ranked list of texts based on their relevance to the query.

"},{"location":"user-guide/playground/rerank/#select-model","title":"Select Model","text":"

Select an available reranker model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/rerank/#customize-parameters","title":"Customize Parameters","text":"

In the parameter section, set Top N to specify the number of matching texts to retrieve.

"},{"location":"user-guide/playground/rerank/#view-code","title":"View Code","text":"

After experimenting with the input text and query, click the View Code button to see how to call the API with the same input. Code examples are provided in curl, Python, and Node.js.

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":""},{"location":"api-reference/","title":"API Reference","text":"

GPUStack provides a built-in Swagger UI. You can access it by navigating to <gpustack-server-url>/docs in your browser to view and interact with the APIs.

"},{"location":"architecture/","title":"Architecture","text":"

The following diagram shows the architecture of GPUStack:

"},{"location":"architecture/#server","title":"Server","text":"

The GPUStack server consists of the following components:

"},{"location":"architecture/#worker","title":"Worker","text":"

GPUStack workers are responsible for:

"},{"location":"architecture/#sql-database","title":"SQL Database","text":"

The GPUStack server connects to a SQL database as the datastore. GPUStack uses SQLite by default, but you can configure it to use an external PostgreSQL as well.
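
For example, a sketch of starting the server against an external PostgreSQL database (connection details are placeholders):

gpustack start --database-url postgresql://user:password@hostname:5432/db_name\n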

"},{"location":"architecture/#inference-server","title":"Inference Server","text":"

Inference servers are the backends that perform the inference tasks. GPUStack supports llama-box, vLLM and vox-box as inference servers.

"},{"location":"architecture/#rpc-server","title":"RPC Server","text":"

The RPC server enables running the llama-box backend on a remote host. The Inference Server communicates with one or more RPC server instances, offloading computation to these remote hosts. This setup allows for distributed LLM inference across multiple workers, enabling the system to load larger models even when individual resources are limited.

"},{"location":"code-of-conduct/","title":"Contributor Code of Conduct","text":""},{"location":"code-of-conduct/#our-pledge","title":"Our Pledge","text":"

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

"},{"location":"code-of-conduct/#our-standards","title":"Our Standards","text":"

Examples of behavior that contributes to a positive environment for our community include:

Examples of unacceptable behavior include:

"},{"location":"code-of-conduct/#enforcement-responsibilities","title":"Enforcement Responsibilities","text":"

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

"},{"location":"code-of-conduct/#scope","title":"Scope","text":"

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

"},{"location":"code-of-conduct/#enforcement","title":"Enforcement","text":"

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at contact@gpustack.ai. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

"},{"location":"code-of-conduct/#enforcement-guidelines","title":"Enforcement Guidelines","text":"

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

"},{"location":"code-of-conduct/#1-correction","title":"1. Correction","text":"

Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

"},{"location":"code-of-conduct/#2-warning","title":"2. Warning","text":"

Community Impact: A violation through a single incident or series of actions.

Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

"},{"location":"code-of-conduct/#3-temporary-ban","title":"3. Temporary Ban","text":"

Community Impact: A serious violation of community standards, including sustained inappropriate behavior.

Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

"},{"location":"code-of-conduct/#4-permanent-ban","title":"4. Permanent Ban","text":"

Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

Consequence: A permanent ban from any sort of public interaction within the community.

"},{"location":"code-of-conduct/#attribution","title":"Attribution","text":"

This Code of Conduct is adapted from the Contributor Covenant, version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.

"},{"location":"contributing/","title":"Contributing to GPUStack","text":"

Thanks for taking the time to contribute to GPUStack!

Please review and follow the Code of Conduct.

"},{"location":"contributing/#filing-issues","title":"Filing Issues","text":"

If you find any bugs or run into any trouble, please search the reported issues first, as someone may have experienced the same issue or we may already be working on a solution.

If you can't find anything related to your issue, contact us by filing an issue. To help us diagnose and resolve it, please include as much information as possible, including:

"},{"location":"contributing/#contributing-code","title":"Contributing Code","text":"

For setting up development environment, please refer to Development Guide.

If you're fixing a small issue, you can simply submit a PR. However, if you're planning to submit a bigger PR to implement a new feature or fix a relatively complex bug, please open an issue that explains the change and the motivation for it. If you're addressing a bug, please explain how to reproduce it.

"},{"location":"contributing/#updating-documentation","title":"Updating Documentation","text":"

If you have any updates to our documentation, feel free to file an issue with the documentation label or make a pull request.

"},{"location":"development/","title":"Development Guide","text":""},{"location":"development/#prerequisites","title":"Prerequisites","text":"

Install Python 3.10+.

"},{"location":"development/#set-up-environment","title":"Set Up Environment","text":"
make install\n
"},{"location":"development/#run","title":"Run","text":"
poetry run gpustack\n
"},{"location":"development/#build","title":"Build","text":"
make build\n

And check artifacts in dist.

"},{"location":"development/#test","title":"Test","text":"
make test\n
"},{"location":"development/#update-dependencies","title":"Update Dependencies","text":"
poetry add <something>\n

Or

poetry add --group dev <something>\n

For dev/testing dependencies.

"},{"location":"overview/","title":"GPUStack","text":"

GPUStack is an open-source GPU cluster manager for running AI models.

"},{"location":"overview/#key-features","title":"Key Features","text":""},{"location":"overview/#supported-platforms","title":"Supported Platforms","text":"

The following operating systems are verified to work with GPUStack:

OS Versions Windows 10, 11 Ubuntu >= 20.04 Debian >= 11 RHEL >= 8 Rocky >= 8 Fedora >= 36 OpenSUSE >= 15.3 (leap) OpenEuler >= 22.03

Note

The installation of GPUStack worker on a Linux system requires that the GLIBC version be 2.29 or higher.
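
You can check the GLIBC version on the target machine with ldd, which ships with glibc:

ldd --version\n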

"},{"location":"overview/#supported-architectures","title":"Supported Architectures","text":"

GPUStack supports both AMD64 and ARM64 architectures, with the following notes:

"},{"location":"overview/#supported-accelerators","title":"Supported Accelerators","text":"

We plan to support the following accelerators in future releases.

"},{"location":"overview/#supported-models","title":"Supported Models","text":"

GPUStack uses llama-box (bundled llama.cpp and stable-diffusion.cpp server), vLLM and vox-box as the backends and supports a wide range of models. Models from the following sources are supported:

  1. Hugging Face

  2. ModelScope

  3. Ollama Library

  4. Local File Path

"},{"location":"overview/#example-models","title":"Example Models:","text":"Category Models Large Language Models(LLMs) Qwen, LLaMA, Mistral, Deepseek, Phi, Yi Vision Language Models(VLMs) Llama3.2-Vision, Pixtral , Qwen2-VL, LLaVA, InternVL2 Diffusion Models Stable Diffusion, FLUX Rerankers GTE, BCE, BGE, Jina Audio Models Whisper (speech-to-text), CosyVoice (text-to-speech)

For the full list of supported models, please refer to the supported models section in the inference backends documentation.

"},{"location":"overview/#openai-compatible-apis","title":"OpenAI-Compatible APIs","text":"

GPUStack serves OpenAI-compatible APIs. For details, please refer to OpenAI Compatible APIs.

"},{"location":"quickstart/","title":"Quickstart","text":""},{"location":"quickstart/#installation","title":"Installation","text":""},{"location":"quickstart/#linux-or-macos","title":"Linux or macOS","text":"

GPUStack provides a script to install it as a service on systemd- or launchd-based systems. To install GPUStack using this method, just run:

curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"quickstart/#windows","title":"Windows","text":"

Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:

Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n
"},{"location":"quickstart/#other-installation-methods","title":"Other Installation Methods","text":"

For manual installation, Docker installation, or detailed configuration options, please refer to the Installation Documentation.

"},{"location":"quickstart/#getting-started","title":"Getting Started","text":"
  1. Run and chat with the llama3.2 model:
gpustack chat llama3.2 \"tell me a joke.\"\n
  1. Run and generate an image with the stable-diffusion-v3-5-large-turbo model:

Tip

This command downloads the model (~12GB) from Hugging Face. The download time depends on your network speed. Ensure you have enough disk space and VRAM (12GB) to run the model. If you encounter issues, you can skip this step and move to the next one.

gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf \\\n\"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors.\" \\\n--sample-steps 5 --show\n

Once the command completes, the generated image will appear in the default viewer. You can experiment with the prompt and CLI options to customize the output.

  1. Open http://myserver in the browser to access the GPUStack UI. Log in to GPUStack with username admin and the default password. You can run the following command to get the password for the default setup:

Linux or macOS

cat /var/lib/gpustack/initial_admin_password\n

Windows

Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
  1. Click Playground in the navigation menu. Now you can chat with the LLM in the UI playground.

  1. Click API Keys in the navigation menu, then click the New API Key button.

  2. Fill in the Name and click the Save button.

  3. Copy the generated API key and save it somewhere safe. Please note that you can only see it once on creation.

  4. Now you can use the API key to access the OpenAI-compatible API. For example, use curl as the following:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"llama3.2\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello!\"\n      }\n    ],\n    \"stream\": true\n  }'\n
"},{"location":"quickstart/#cleanup","title":"Cleanup","text":"

When you are done using the deployed models, you can go to the Models page in the GPUStack UI and delete them to free up resources.

"},{"location":"scheduler/","title":"Scheduler","text":""},{"location":"scheduler/#summary","title":"Summary","text":"

The scheduler's primary responsibility is to calculate the resources required by model instances and to evaluate and select the optimal workers/GPUs for them through a series of strategies. This ensures that model instances can run efficiently. This document provides a detailed overview of the policies and processes used by the scheduler.

"},{"location":"scheduler/#scheduling-process","title":"Scheduling Process","text":""},{"location":"scheduler/#filtering-phase","title":"Filtering Phase","text":"

The filtering phase aims to narrow down the available workers or GPUs to those that meet specific criteria. The main policies involved are:

"},{"location":"scheduler/#label-matching-policy","title":"Label Matching Policy","text":"

This policy filters workers based on the label selectors configured for the model. If no label selectors are defined for the model, all workers are considered. Otherwise, the system checks whether the labels of each worker node match the model's label selectors, retaining only those workers that match.

"},{"location":"scheduler/#status-policy","title":"Status Policy","text":"

This policy filters workers based on their status, retaining only those that are in a READY state.

"},{"location":"scheduler/#resource-fit-policy","title":"Resource Fit Policy","text":"

The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. The goal of this policy is to ensure that model instances can run on the selected nodes without exceeding resource limits. The Resource Fit Policy prioritizes candidates in the following order:

"},{"location":"scheduler/#scoring-phase","title":"Scoring Phase","text":"

The scoring phase evaluates the filtered candidates, scoring them to select the optimal deployment location. The primary strategy involved is:

"},{"location":"scheduler/#placement-strategy-policy","title":"Placement Strategy Policy","text":"

The first strategy is bin packing: it aims to \"pack\" as many model instances as possible into the fewest number of \"bins\" (e.g., Workers/GPUs) to optimize resource utilization. The goal is to minimize the number of bins used while maximizing resource efficiency, ensuring each bin is filled as efficiently as possible without exceeding its capacity. Model instances are placed in the bin with the least remaining space to minimize leftover capacity in each bin.

The second strategy is spreading: it seeks to distribute multiple model instances across different worker nodes as evenly as possible, improving system fault tolerance and load balancing.

"},{"location":"troubleshooting/","title":"Troubleshooting","text":""},{"location":"troubleshooting/#view-gpustack-logs","title":"View GPUStack Logs","text":"

If you installed GPUStack using the installation script, you can view GPUStack logs at the following path:

"},{"location":"troubleshooting/#linux-or-macos","title":"Linux or macOS","text":"
/var/log/gpustack.log\n
"},{"location":"troubleshooting/#windows","title":"Windows","text":"
\"$env:APPDATA\\gpustack\\log\\gpustack.log\"\n
"},{"location":"troubleshooting/#configure-log-level","title":"Configure Log Level","text":"

You can enable the DEBUG log level when starting GPUStack by setting the --debug parameter.

You can configure the log level of the GPUStack server at runtime by running the following command on the server node:

curl -X PUT http://localhost/debug/log_level -d \"debug\"\n
"},{"location":"troubleshooting/#reset-admin-password","title":"Reset Admin Password","text":"

In case you forgot the admin password, you can reset it by running the following command on the server node:

gpustack reset-admin-password\n
"},{"location":"upgrade/","title":"Upgrade","text":"

You can upgrade GPUStack using the installation script or by manually installing the desired version of the GPUStack Python package.

Note

When upgrading, upgrade the GPUStack server first, then upgrade the workers.

"},{"location":"upgrade/#upgrade-gpustack-using-the-installation-script","title":"Upgrade GPUStack Using the Installation Script","text":"

To upgrade GPUStack from an older version, re-run the installation script using the same configuration options you originally used.

Running the installation script will:

  1. Install the latest version of the GPUStack Python package.
  2. Update the system service (systemd, launchd, or Windows) init script to reflect the arguments passed to the installation script.
  3. Restart the GPUStack service.
"},{"location":"upgrade/#linux-and-macos","title":"Linux and macOS","text":"

For example, to upgrade GPUStack to the latest version on Linux or macOS:

curl -sfL https://get.gpustack.ai | <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n

To upgrade to a specific version, specify the INSTALL_PACKAGE_SPEC environment variable similar to the pip install command:

curl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=gpustack==x.y.z <EXISTING_INSTALL_ENV> sh -s - <EXISTING_GPUSTACK_ARGS>\n
"},{"location":"upgrade/#windows","title":"Windows","text":"

To upgrade GPUStack to the latest version on a Windows system:

$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

To upgrade to a specific version:

$env:INSTALL_PACKAGE_SPEC = gpustack==x.y.z\n$env:<EXISTING_INSTALL_ENV> = <EXISTING_INSTALL_ENV_VALUE>\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } <EXISTING_GPUSTACK_ARGS>\"\n
"},{"location":"upgrade/#docker-upgrade","title":"Docker Upgrade","text":"

If you installed GPUStack using Docker, upgrade to a new version by pulling the Docker image with the desired version tag.

For example:

docker pull gpustack/gpustack:vX.Y.Z\n

Then restart the GPUStack service with the new image.
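
A sketch of restarting the server with the new image, assuming it was started as in the Docker Installation guide so that data persists in the gpustack-data volume; replace CONTAINER_ID with the ID of the running GPUStack container (from docker ps):

docker stop CONTAINER_ID && docker rm CONTAINER_ID\ndocker run -d --gpus all -p 80:80 --ipc=host \\\n    -v gpustack-data:/var/lib/gpustack gpustack/gpustack:vX.Y.Z\n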

"},{"location":"upgrade/#manual-upgrade","title":"Manual Upgrade","text":"

If you installed GPUStack manually, upgrade it using the standard pip workflow.

For example, to upgrade GPUStack to the latest version:

pip install --upgrade gpustack\n

Then restart the GPUStack service according to your setup.
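
For example, on a systemd-based Linux setup the restart might look like this (the service name gpustack is an assumption; adjust it to match your setup):

sudo systemctl restart gpustack\n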

"},{"location":"cli-reference/chat/","title":"gpustack chat","text":"

Chat with a large language model.

gpustack chat model [prompt]\n
"},{"location":"cli-reference/chat/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for chat. prompt The prompt to send to the model. [Optional]"},{"location":"cli-reference/chat/#one-time-chat-with-a-prompt","title":"One-time Chat with a Prompt","text":"

If a prompt is provided, it performs a one-time inference. For example:

gpustack chat llama3 \"tell me a joke.\"\n

Example output:

Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!\n
"},{"location":"cli-reference/chat/#interactive-chat","title":"Interactive Chat","text":"

If the prompt argument is not provided, you can chat with the large language model interactively. For example:

gpustack chat llama3\n

Example output:

>tell me a joke.\nHere's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you smile!\n>Do you have a better one?\nHere's another one:\n\nWhy did the scarecrow win an award?\n\n(think about it for a sec...)\n\nBecause he was outstanding in his field!\n\nHope that one stuck with you!\n\nDo you want to hear another one?\n>\\quit\n
"},{"location":"cli-reference/chat/#interactive-commands","title":"Interactive Commands","text":"

Followings are available commands in interactive chat:

Commands:\n  \\q or \\quit - Quit the chat\n  \\c or \\clear - Clear chat context in prompt\n  \\? or \\h or \\help - Print this help message\n
"},{"location":"cli-reference/chat/#connect-to-external-gpustack-server","title":"Connect to External GPUStack Server","text":"

If you are not running gpustack chat on the server node, or if you are serving on a custom host or port, you should provide the following environment variables:

Name Description GPUSTACK_SERVER_URL URL of the GPUStack server, e.g., http://myserver. GPUSTACK_API_KEY GPUStack API key."},{"location":"cli-reference/download-tools/","title":"gpustack download-tools","text":"

Download dependency tools, including llama-box, gguf-parser, and fastfetch.

gpustack download-tools [OPTIONS]\n
"},{"location":"cli-reference/download-tools/#configurations","title":"Configurations","text":"Flag Default Description ----tools-download-base-url value (empty) Base URL to download dependency tools. --save-archive value (empty) Path to save downloaded tools as a tar archive. --load-archive value (empty) Path to load downloaded tools from a tar archive, instead of downloading. --system value Default is the current OS. Operating system to download tools for. Options: linux, windows, macos. --arch value Default is the current architecture. Architecture to download tools for. Options: amd64, arm64. --device value Default is the current device. Device to download tools for. Options: cuda, mps, npu, musa, cpu."},{"location":"cli-reference/draw/","title":"gpustack draw","text":"

Generate an image with a diffusion model.

gpustack draw [model] [prompt]\n
"},{"location":"cli-reference/draw/#positional-arguments","title":"Positional Arguments","text":"Name Description model The model to use for image generation. prompt Text prompt to use for image generation.

The model can be either of the following:

  1. Name of a GPUStack model. You need to create a model in GPUStack before using it here.
  2. Reference to a Hugging Face GGUF diffusion model in Ollama style. When using this option, the model will be deployed if it is not already available. When no tag is specified, the default Q4_0 tag is used. Examples:
"},{"location":"cli-reference/draw/#configurations","title":"Configurations","text":"Flag Default Description --size value 512x512 Size of the image to generate, specified as widthxheight. --sampler value euler Sampling method. Options include: euler_a, euler, heun, dpm2, dpm++2s_a, dpm++2m, lcm, etc. --sample-steps value (Empty) Number of sampling steps. --cfg-scale value (Empty) Classifier-free guidance scale for balancing prompt adherence and creativity. --seed value (Empty) Seed for random number generation. Useful for reproducibility. --negative-prompt value (Empty) Text prompt for what to avoid in the image. --output value (Empty) Path to save the generated image. --show False If True, opens the generated image in the default image viewer. -d, --debug False Enable debug mode."},{"location":"cli-reference/start/","title":"gpustack start","text":"

Run GPUStack server or worker.

gpustack start [OPTIONS]\n
"},{"location":"cli-reference/start/#configurations","title":"Configurations","text":""},{"location":"cli-reference/start/#common-options","title":"Common Options","text":"Flag Default Description --config-file value (empty) Path to the YAML config file. -d value, --debug value False To enable debug mode, the short flag -d is not supported in Windows because this flag is reserved by PowerShell for CommonParameters. --data-dir value (empty) Directory to store data. Default is OS specific. --cache-dir value (empty) Directory to store cache (e.g., model files). Defaults to /cache. -t value, --token value Auto-generated. Shared secret used to add a worker. --huggingface-token value (empty) User Access Token to authenticate to the Hugging Face Hub. Can also be configured via the HF_TOKEN environment variable."},{"location":"cli-reference/start/#server-options","title":"Server Options","text":"Flag Default Description --host value 0.0.0.0 Host to bind the server to. --port value 80 Port to bind the server to. --disable-worker False Disable embedded worker. --bootstrap-password value Auto-generated. Initial password for the default admin user. --database-url value sqlite:///<data-dir>/database.db URL of the database. Example: postgresql://user:password@hostname:port/db_name --ssl-keyfile value (empty) Path to the SSL key file. --ssl-certfile value (empty) Path to the SSL certificate file. --force-auth-localhost False Force authentication for requests originating from localhost (127.0.0.1).When set to True, all requests from localhost will require authentication. --ollama-library-base-url https://registry.ollama.ai Base URL for the Ollama library. --disable-update-check False Disable update check."},{"location":"cli-reference/start/#worker-options","title":"Worker Options","text":"Flag Default Description -s value, --server-url value (empty) Server to connect to. --worker-ip value (empty) IP address of the worker node. Auto-detected by default. --disable-metrics False Disable metrics. --disable-rpc-servers False Disable RPC servers. --metrics-port value 10151 Port to expose metrics. --worker-port value 10150 Port to bind the worker to. Use a consistent value for all workers. --log-dir value (empty) Directory to store logs. --system-reserved value \"{\\\"ram\\\": 2, \\\"vram\\\": 0}\" The system reserves resources for the worker during scheduling, measured in GiB. By default, 2 GiB of RAM is reserved, Note: '{\\\"memory\\\": 2, \\\"gpu_memory\\\": 0}' is also supported, but it is deprecated and will be removed in future releases. --tools-download-base-url value Base URL for downloading dependency tools."},{"location":"cli-reference/start/#available-environment-variables","title":"Available Environment Variables","text":"

Most of the options can be set via environment variables. The environment variables are prefixed with GPUSTACK_ and are in uppercase. For example, --data-dir can be set via the GPUSTACK_DATA_DIR environment variable.
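
For example, the following two invocations are equivalent (the path is illustrative):

gpustack start --data-dir /data/gpustack\n\nGPUSTACK_DATA_DIR=/data/gpustack gpustack start\n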

Below are additional environment variables that can be set:

Flag Description HF_ENDPOINT Hugging Face Hub endpoint. e.g., https://hf-mirror.com"},{"location":"cli-reference/start/#config-file","title":"Config File","text":"

You can configure start options using a YAML-format config file when starting GPUStack server or worker. Here is a complete example:

# Common Options\ndebug: false\ndata_dir: /path/to/data_dir\ncache_dir: /path/to/cache_dir\ntoken: mytoken\n\n# Server Options\nhost: 0.0.0.0\nport: 80\ndisable_worker: false\ndatabase_url: postgresql://user:password@hostname:port/db_name\nssl_keyfile: /path/to/keyfile\nssl_certfile: /path/to/certfile\nforce_auth_localhost: false\nbootstrap_password: myadminpassword\nollama_library_base_url: https://registry.mycompany.com\ndisable_update_check: false\n\n# Worker Options\nserver_url: http://myserver\nworker_ip: 192.168.1.101\ndisable_metrics: false\ndisable_rpc_servers: false\nmetrics_port: 10151\nworker_port: 10150\nlog_dir: /path/to/log_dir\nsystem_reserved:\n  ram: 2\n  vram: 0\ntools_download_base_url: https://mirror.mycompany.com\n
"},{"location":"installation/air-gapped-installation/","title":"Air-Gapped Installation","text":"

You can install GPUStack in an air-gapped environment. An air-gapped environment refers to a setup where GPUStack will be installed offline, behind a firewall, or behind a proxy.

The following methods are available for installing GPUStack in an air-gapped environment:

"},{"location":"installation/air-gapped-installation/#docker-installation","title":"Docker Installation","text":"

When running GPUStack with Docker, it works out of the box in an air-gapped environment as long as the Docker images are available. To do this, follow these steps:

  1. Pull GPUStack Docker images in an online environment.
  2. Publish Docker images to a private registry.
  3. Refer to the Docker Installation guide to run GPUStack using Docker.
"},{"location":"installation/air-gapped-installation/#manual-installation","title":"Manual Installation","text":"

For manual installation, you need to prepare the required packages and tools in an online environment and then transfer them to the air-gapped environment.

"},{"location":"installation/air-gapped-installation/#prerequisites","title":"Prerequisites","text":"

Set up an online environment identical to the air-gapped environment, including OS, architecture, and Python version.

"},{"location":"installation/air-gapped-installation/#step-1-download-the-required-packages","title":"Step 1: Download the Required Packages","text":"

Run the following commands in an online environment:

# On Windows (PowerShell):\n# $PACKAGE_SPEC = \"gpustack\"\n\n# Optional: To include extra dependencies (vllm, audio, all) or install a specific version\n# PACKAGE_SPEC=\"gpustack[all]\"\n# PACKAGE_SPEC=\"gpustack==0.4.0\"\nPACKAGE_SPEC=\"gpustack\"\n\n# Download all required packages\npip wheel $PACKAGE_SPEC -w gpustack_offline_packages\n\n# Install GPUStack to access its CLI\npip install gpustack\n\n# Download dependency tools and save them as an archive\ngpustack download-tools --save-archive gpustack_offline_tools.tar.gz\n

Optional: Additional Dependencies for macOS.

# Deploying the speech-to-text CosyVoice model on macOS requires additional dependencies.\nbrew install openfst\nCPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include\nLIBRARY_PATH=$(brew --prefix openfst)/lib\n\nAUDIO_DEPENDENCY_PACKAGE_SPEC=\"wetextprocessing\"\npip wheel $AUDIO_DEPENDENCY_PACKAGE_SPEC -w gpustack_audio_dependency_offline_packages\nmv gpustack_audio_dependency_offline_packages/* gpustack_offline_packages/ && rm -rf gpustack_audio_dependency_offline_packages\n

Note

This instruction assumes that the online environment uses the same GPU type as the air-gapped environment. If the GPU types differ, use the --device flag to specify the device type for the air-gapped environment. Refer to the download-tools command for more information.

"},{"location":"installation/air-gapped-installation/#step-2-transfer-the-packages","title":"Step 2: Transfer the Packages","text":"

Transfer the following files from the online environment to the air-gapped environment.
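
For example, one way to copy the artifacts produced in Step 1 to the air-gapped host (hostname and destination path are placeholders):

scp -r gpustack_offline_packages gpustack_offline_tools.tar.gz user@airgapped-host:~/\n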

"},{"location":"installation/air-gapped-installation/#step-3-install-gpustack","title":"Step 3: Install GPUStack","text":"

In the air-gapped environment, run the following commands:

# Install GPUStack from the downloaded packages\npip install --no-index --find-links=gpustack_offline_packages gpustack\n\n# Load and apply the pre-downloaded tools archive\ngpustack download-tools --load-archive gpustack_offline_tools.tar.gz\n

Optional: Additional Dependencies for macOS.

# Install the additional dependencies for speech-to-text CosyVoice model on macOS.\nbrew install openfst\n\npip install --no-index --find-links=gpustack_offline_packages wetextprocessing\n

Now you can run GPUStack by following the instructions in the Manual Installation guide.

"},{"location":"installation/docker-installation/","title":"Docker Installation","text":"

You can use the official Docker image to run GPUStack in a container. Installation using Docker is supported on:

"},{"location":"installation/docker-installation/#prerequisites","title":"Prerequisites","text":""},{"location":"installation/docker-installation/#run-gpustack-with-docker","title":"Run GPUStack with Docker","text":"

Run the following command to start the GPUStack server:

docker run -d --gpus all -p 80:80 --ipc=host \\\n    -v gpustack-data:/var/lib/gpustack gpustack/gpustack\n

Note

You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host\u2019s shared memory. It is used by vLLM and PyTorch to share data between processes under the hood, particularly for tensor parallel inference.

You can set additional flags for the gpustack start command by appending them to the docker run command.

For example, to start a GPUStack worker:

docker run -d --gpus all --ipc=host --network=host \\\n    gpustack/gpustack --server-url http://myserver --token mytoken\n

Note

The --network=host flag ensures that the worker and the inference services running on it are accessible to the server. Alternatively, you can set --worker-ip <host-ip> and publish the relevant ports with -p 10150:10150 -p 40000-41024:40000-41024.
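
For example, a sketch of the port-mapping alternative; replace <host-ip>, the server URL, and the token with your own values:

# Expose the worker and inference service ports instead of using host networking.\ndocker run -d --gpus all --ipc=host \\\n    -p 10150:10150 -p 40000-41024:40000-41024 \\\n    gpustack/gpustack --server-url http://myserver --token mytoken --worker-ip <host-ip>\n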

For configuration details, please refer to the CLI Reference.

"},{"location":"installation/docker-installation/#run-gpustack-with-docker-compose","title":"Run GPUStack with Docker Compose","text":"

Get the docker-compose file from the GPUStack repository, then run the following command to start the GPUStack server:

docker-compose up -d\n

You can update the docker-compose.yml file to customize the command while starting a GPUStack worker.

"},{"location":"installation/docker-installation/#build-your-own-docker-image","title":"Build Your Own Docker Image","text":"

The official Docker image is built with CUDA 12.4. If you want to use a different version of CUDA, you can build your own Docker image.

# Example Dockerfile\nARG CUDA_VERSION=12.4.1\n\nFROM nvidia/cuda:$CUDA_VERSION-cudnn-runtime-ubuntu22.04\n\nENV DEBIAN_FRONTEND=noninteractive\n\nRUN apt-get update && apt-get install -y \\\n    wget \\\n    tzdata \\\n    python3 \\\n    python3-pip \\\n    && rm -rf /var/lib/apt/lists/*\n\n\nRUN pip3 install gpustack[all] && \\\n    pip3 cache purge\n\nENTRYPOINT [ \"gpustack\", \"start\" ]\n

Run the following command to build the Docker image:

docker build -t my/gpustack --build-arg CUDA_VERSION=12.0.0 .\n
"},{"location":"installation/installation-script/","title":"Installation Script","text":""},{"location":"installation/installation-script/#linux-and-macos","title":"Linux and macOS","text":"

You can use the installation script available at https://get.gpustack.ai to install GPUStack as a service on systemd and launchd based systems.

You can set additional environment variables and CLI flags when running the script. The following are examples of running the installation script with different configurations:

# Run server.\ncurl -sfL https://get.gpustack.ai | sh -s -\n\n# Run server without the embedded worker.\ncurl -sfL https://get.gpustack.ai | sh -s - --disable-worker\n\n# Run server with TLS.\ncurl -sfL https://get.gpustack.ai | sh -s - --ssl-keyfile /path/to/keyfile --ssl-certfile /path/to/certfile\n\n# Run server with an external PostgreSQL database.\ncurl -sfL https://get.gpustack.ai | sh -s - --database-url \"postgresql://username:password@host:port/database_name\"\n\n# Run worker with specified IP.\ncurl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken --worker-ip 192.168.1.100\n\n# Install with a custom index URL.\ncurl -sfL https://get.gpustack.ai | INSTALL_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple sh -s -\n\n# Install a custom wheel package other than releases from pypi.org.\ncurl -sfL https://get.gpustack.ai | INSTALL_PACKAGE_SPEC=https://repo.mycompany.com/my-gpustack.whl sh -s -\n
"},{"location":"installation/installation-script/#windows","title":"Windows","text":"

You can use the installation script available at https://get.gpustack.ai to install GPUStack as a service on Windows Service Manager.

You can set additional environment variables and CLI flags when running the script. The following are examples of running the installation script with different configurations:

# Run server.\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Run server without the embedded worker.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --disable-worker\"\n\n# Run server with TLS.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --ssl-keyfile 'C:\\path\\to\\keyfile' --ssl-certfile 'C:\\path\\to\\certfile'\"\n\n# Run server with an external PostgreSQL database.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --database-url 'postgresql://username:password@host:port/database_name'\"\n\n# Run worker with specified IP.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --worker-ip '192.168.1.100'\"\n\n# Run worker with customized reserved resources.\nInvoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url 'http://myserver' --token 'mytoken' --system-reserved '{\"\"ram\"\":5, \"\"vram\"\":5}'\"\n\n# Install with a custom index URL.\n$env:INSTALL_INDEX_URL = \"https://pypi.tuna.tsinghua.edu.cn/simple\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n\n# Install a custom wheel package other than releases from pypi.org.\n$env:INSTALL_PACKAGE_SPEC = \"https://repo.mycompany.com/my-gpustack.whl\"\nInvoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

Warning

Avoid using PowerShell ISE as it is not compatible with the installation script.

"},{"location":"installation/installation-script/#available-environment-variables-for-the-installation-script","title":"Available Environment Variables for the Installation Script","text":"Name Default Description INSTALL_INDEX_URL (empty) Base URL of the Python Package Index. INSTALL_PACKAGE_SPEC gpustack[all] or gpustack[audio] The package spec to install. The install script will automatically decide based on the platform. It supports PYPI package names, URLs, and local paths. See the pip install documentation for details. INSTALL_PRE_RELEASE (empty) If set to 1, pre-release packages will be installed. INSTALL_SKIP_POST_CHECK (empty) If set to 1, the installation script will skip the post-installation check."},{"location":"installation/installation-script/#set-environment-variables-for-the-gpustack-service","title":"Set Environment Variables for the GPUStack Service","text":"

You can set environment variables for the GPUStack service in an environment file located at:

The following is an example of the content of the file:

HF_TOKEN=\"mytoken\"\nHF_ENDPOINT=\"https://my-hf-endpoint\"\n

Note

Unlike systemd, launchd and Windows services do not natively support reading environment variables from a file. Support for the environment file is implemented by the installation script, which reads the file and applies the variables to the service configuration. After modifying the environment file on Windows or macOS, re-run the installation script to apply the changes to the GPUStack service.
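
For example, on a Linux host running the systemd service shown in the Manual Installation guide (which reads /etc/default/gpustack via its EnvironmentFile entry), a minimal sketch looks like this:

# Assumes the systemd unit reads /etc/default/gpustack; adjust the path to your setup.\necho 'HF_TOKEN=\"mytoken\"' | sudo tee -a /etc/default/gpustack\nsudo systemctl restart gpustack\n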

"},{"location":"installation/installation-script/#available-cli-flags","title":"Available CLI Flags","text":"

CLI flags appended to the installation script are passed directly to the gpustack start command. Refer to the CLI Reference for details.

"},{"location":"installation/installation-script/#install-server","title":"Install Server","text":"

To set up the GPUStack server (the management node), install GPUStack without the --server-url flag. By default, the GPUStack server includes an embedded worker. To disable this embedded worker on the server, use the --disable-worker flag.

"},{"location":"installation/installation-script/#install-worker","title":"Install Worker","text":"

To form a cluster, you can add GPUStack workers on additional nodes. Install GPUStack with the --server-url flag to specify the server's address and the --token flag for worker authentication.

Examples are as follows:

"},{"location":"installation/installation-script/#linux-or-macos","title":"Linux or macOS","text":"
curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n

In the default setup, you can run the following on the server node to get the token used for adding workers:

cat /var/lib/gpustack/token\n
"},{"location":"installation/installation-script/#windows_1","title":"Windows","text":"
Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } -- --server-url http://myserver --token mytoken\"\n

In the default setup, you can run the following on the server node to get the token used for adding workers:

Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n
"},{"location":"installation/manual-installation/","title":"Manual Installation","text":""},{"location":"installation/manual-installation/#prerequites","title":"Prerequites:","text":"

Install Python 3.10 or above with pip.

"},{"location":"installation/manual-installation/#install-gpustack-cli","title":"Install GPUStack CLI","text":"

Run the following to install GPUStack:

# You can add extra dependencies, options are \"vllm\", \"audio\" and \"all\".\n# e.g., gpustack[all]\npip install gpustack\n
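
For example, to include all optional dependencies (quoting the spec keeps the shell from interpreting the brackets):

pip install \"gpustack[all]\"\n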

To verify, run:

gpustack version\n
"},{"location":"installation/manual-installation/#run-gpustack","title":"Run GPUStack","text":"

Run the following command to start the GPUStack server:

gpustack start\n

By default, GPUStack uses /var/lib/gpustack as the data directory, so you need sudo or the proper permissions for it. You can also set a custom data directory by running:

gpustack start --data-dir mypath\n
"},{"location":"installation/manual-installation/#run-gpustack-as-a-system-service","title":"Run GPUStack as a System Service","text":"

A recommended way is to run GPUStack as a startup service. For example, using systemd:

Create a service file in /etc/systemd/system/gpustack.service:

[Unit]\nDescription=GPUStack Service\nWants=network-online.target\nAfter=network-online.target\n\n[Service]\nEnvironmentFile=-/etc/default/%N\nExecStart=gpustack start\nRestart=always\nRestartSec=3\nStandardOutput=append:/var/log/gpustack.log\nStandardError=append:/var/log/gpustack.log\n\n[Install]\nWantedBy=multi-user.target\n

Then start GPUStack:

systemctl daemon-reload\nsystemctl enable gpustack\nsystemctl start gpustack\n
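
To confirm the service started, a quick check using the unit and log path defined above:

systemctl status gpustack\ntail -f /var/log/gpustack.log\n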
"},{"location":"installation/uninstallation/","title":"Uninstallation","text":""},{"location":"installation/uninstallation/#uninstallation-script","title":"Uninstallation Script","text":"

Warning

The uninstallation script deletes the data in the local datastore (SQLite), configuration, model cache, and all of the scripts and CLI tools. It does not remove any data from external datastores.

If you installed GPUStack using the installation script, a script to uninstall GPUStack was generated during installation.

"},{"location":"installation/uninstallation/#linux-or-macos","title":"Linux or macOS","text":"

Run the following command to uninstall GPUStack:

sudo /var/lib/gpustack/uninstall.sh\n
"},{"location":"installation/uninstallation/#windows","title":"Windows","text":"

Run the following command in PowerShell to uninstall GPUStack:

Set-ExecutionPolicy Bypass -Scope Process -Force; & \"$env:APPDATA\\gpustack\\uninstall.ps1\"\n
"},{"location":"installation/uninstallation/#manual-uninstallation","title":"Manual Uninstallation","text":"

If you installed GPUStack manually, the following are example commands to uninstall it. Modify them according to your setup:

# Stop and remove the service.\nsystemctl stop gpustack.service\nrm /etc/systemd/system/gpustack.service\nsystemctl daemon-reload\n# Uninstall the CLI.\npip uninstall gpustack\n# Remove the data directory.\nrm -rf /var/lib/gpustack\n
"},{"location":"tutorials/creating-text-embeddings/","title":"Creating Text Embeddings","text":"

Text embeddings are numerical representations of text that capture semantic meaning, enabling machines to understand relationships and similarities between different pieces of text. In essence, they transform text into vectors in a continuous space, where texts with similar meanings are positioned closer together. Text embeddings are widely used in applications such as natural language processing, information retrieval, and recommendation systems.

In this tutorial, we will demonstrate how to deploy embedding models in GPUStack and generate text embeddings using the deployed models.

"},{"location":"tutorials/creating-text-embeddings/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/creating-text-embeddings/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Enable the GGUF checkbox to filter models by GGUF format.
  5. Use the search bar in the top left to search for the model name CompendiumLabs/bge-small-en-v1.5-gguf.
  6. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/creating-text-embeddings/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"

We will use the GPUStack API to generate text embeddings, and an API key is required:

  1. Navigate to the API Keys page in the GPUStack UI.
  2. Click the New API Key button.
  3. Enter a name for the API key and click the Save button.
  4. Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/creating-text-embeddings/#step-3-generate-text-embeddings","title":"Step 3: Generate Text Embeddings","text":"

With the model deployed and an API key, you can generate text embeddings via the GPUStack API. Here is an example script using curl:

export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1-openai/embeddings \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"input\": \"The food was delicious and the waiter...\",\n    \"model\": \"bge-small-en-v1.5\",\n    \"encoding_format\": \"float\"\n  }'\n

Replace <your-server-url> with the URL of your GPUStack server and <your-api-key> with the API key you generated in the previous step.

Example response:

{\n  \"data\": [\n    {\n      \"embedding\": [\n        -0.012189436703920364, 0.016934078186750412, 0.003965042531490326,\n        -0.03453584015369415, -0.07623119652271271, -0.007116147316992283,\n        0.11278388649225235, 0.019714849069714546, 0.010370955802500248,\n        -0.04219457507133484, -0.029902394860982895, 0.01122555136680603,\n        0.022912170737981796, 0.031186765059828758, 0.006303929258137941,\n        # ... additional values\n      ],\n      \"index\": 0,\n      \"object\": \"embedding\"\n    }\n  ],\n  \"model\": \"bge-small-en-v1.5\",\n  \"object\": \"list\",\n  \"usage\": { \"prompt_tokens\": 12, \"total_tokens\": 12 }\n}\n
"},{"location":"tutorials/inference-on-cpus/","title":"Inference on CPUs","text":"

GPUStack supports inference on CPUs, offering flexibility when GPU resources are limited or when model sizes exceed available GPU memory. The following CPU inference modes are available:

Note

CPU inference is supported when using the llama-box (llama.cpp) backend.

To deploy a model with CPU offloading, enable the Allow CPU Offloading option in the deployment configuration (this setting is enabled by default).

After deployment, you can view the number of model layers offloaded to the CPU.

"},{"location":"tutorials/inference-with-function-calling/","title":"Inference with Function Calling","text":"

Function calling allows you to connect models to external tools and systems. This is useful for many things such as empowering AI assistants with capabilities, or building deep integrations between your applications and the models.

In this tutorial, you\u2019ll learn how to set up and use function calling within GPUStack to extend your AI\u2019s capabilities.

Note

  1. Function calling is supported in the vLLM inference backend.
  2. Function calling is essentially achieved through prompt engineering, requiring models to be trained with internalized templates to enable this capability. Therefore, not all LLMs support function calling.
"},{"location":"tutorials/inference-with-function-calling/#prerequisites","title":"Prerequisites","text":"

Before proceeding, ensure the following:

"},{"location":"tutorials/inference-with-function-calling/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"
  1. Navigate to the Models page in the GPUStack UI and click the Deploy Model button. In the dropdown, select Hugging Face as the source for your model.
  2. Use the search bar to find the Qwen/Qwen2.5-7B-Instruct model.
  3. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  4. Click on the Add Parameter button and add the following parameters:
  1. Click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/inference-with-function-calling/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"

We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:

  1. Navigate to the API Keys page in the GPUStack UI.
  2. Click the New API Key button.
  3. Enter a name for the API key and click the Save button.
  4. Copy the generated API key for later use.
"},{"location":"tutorials/inference-with-function-calling/#step-3-do-inference","title":"Step 3: Do Inference","text":"

With the model deployed and an API key, you can call the model via the GPUStack API. Here is an example script using curl (replace <your-server-url> with your GPUStack server URL and <your-api-key> with the API key generated in the previous step):

export GPUSTACK_SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $GPUSTACK_SERVER_URL/v1-openai/chat/completions \\\n-H \"Content-Type: application/json\" \\\n-H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n-d '{\n  \"model\": \"qwen2.5-7b-instruct\",\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"What'\\''s the weather like in Boston today?\"\n    }\n  ],\n  \"tools\": [\n    {\n      \"type\": \"function\",\n      \"function\": {\n        \"name\": \"get_current_weather\",\n        \"description\": \"Get the current weather in a given location\",\n        \"parameters\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"location\": {\n              \"type\": \"string\",\n              \"description\": \"The city and state, e.g. San Francisco, CA\"\n            },\n            \"unit\": {\n              \"type\": \"string\",\n              \"enum\": [\"celsius\", \"fahrenheit\"]\n            }\n          },\n          \"required\": [\"location\"]\n        }\n      }\n    }\n  ],\n  \"tool_choice\": \"auto\"\n}'\n

Example response:

{\n  \"model\": \"qwen2.5-7b-instruct\",\n  \"choices\": [\n    {\n      \"index\": 0,\n      \"message\": {\n        \"role\": \"assistant\",\n        \"content\": null,\n        \"tool_calls\": [\n          {\n            \"id\": \"chatcmpl-tool-b99d32848b324eaea4bac5a5830d00b8\",\n            \"type\": \"function\",\n            \"function\": {\n              \"name\": \"get_current_weather\",\n              \"arguments\": \"{\\\"location\\\": \\\"Boston, MA\\\", \\\"unit\\\": \\\"fahrenheit\\\"}\"\n            }\n          }\n        ]\n      },\n      \"finish_reason\": \"tool_calls\"\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 212,\n    \"total_tokens\": 242,\n    \"completion_tokens\": 30\n  }\n}\n
"},{"location":"tutorials/performing-distributed-inference-across-workers/","title":"Performing Distributed Inference Across Workers","text":"

This tutorial will guide you through the process of configuring and running distributed inference across multiple workers using GPUStack. Distributed inference allows you to handle larger language models by distributing the computational workload among multiple workers. This is particularly useful when individual workers do not have sufficient resources, such as VRAM, to run the entire model independently.

"},{"location":"tutorials/performing-distributed-inference-across-workers/#prerequisites","title":"Prerequisites","text":"

Before proceeding, ensure the following:

In this tutorial, we\u2019ll assume a cluster with two nodes, each equipped with an NVIDIA P40 GPU (22GB VRAM), as shown in the following image:

We aim to run a large language model that requires more VRAM than a single worker can provide. For this tutorial, we\u2019ll use the Qwen/Qwen2.5-72B-Instruct model with the q2_k quantization format. The required resources for running this model can be estimated using the gguf-parser tool:

$ gguf-parser --hf-repo Qwen/Qwen2.5-72B-Instruct-GGUF --hf-file qwen2.5-72b-instruct-q2_k-00001-of-00007.gguf --ctx-size=8192 --in-short --skip-architecture --skip-metadata --skip-tokenizer\n\n+--------------------------------------------------------------------------------------+\n| ESTIMATE                                                                             |\n+----------------------------------------------+---------------------------------------+\n|                      RAM                     |                 VRAM 0                |\n+--------------------+------------+------------+----------------+----------+-----------+\n| LAYERS (I + T + O) |     UMA    |   NONUMA   | LAYERS (T + O) |    UMA   |   NONUMA  |\n+--------------------+------------+------------+----------------+----------+-----------+\n|      1 + 0 + 0     | 243.89 MiB | 393.89 MiB |     80 + 1     | 2.50 GiB | 28.92 GiB |\n+--------------------+------------+------------+----------------+----------+-----------+\n

From the output, we can see that the estimated VRAM requirement for this model exceeds the 22GB VRAM available on each worker node. Thus, we need to distribute the inference across multiple workers to successfully run the model.

"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"

Follow these steps to deploy the model from Hugging Face, enabling distributed inference:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Enable the GGUF checkbox to filter models by GGUF format.
  5. Use the search bar in the top left to search for the model name Qwen/Qwen2.5-72B-Instruct-GGUF.
  6. In the Available Files section, select the q2_k quantization format.
  7. Expand the Advanced section and scroll down. Disable the Allow CPU Offloading option and verify that the Allow Distributed Inference Across Workers option is enabled (it is enabled by default). GPUStack will evaluate the available resources in the cluster and run the model in a distributed manner if required.
  8. Click the Save button to deploy the model.

"},{"location":"tutorials/performing-distributed-inference-across-workers/#step-2-verify-the-model-deployment","title":"Step 2: Verify the Model Deployment","text":"

Once the model is deployed, verify the deployment on the Models page, where you can view details about how the model is running across multiple workers.

You can also check worker and GPU resource usage by navigating to the Resources page.

Finally, go to the Playground page to interact with the model and verify that everything is functioning correctly.

"},{"location":"tutorials/performing-distributed-inference-across-workers/#conclusion","title":"Conclusion","text":"

Congratulations! You have successfully configured and run distributed inference across multiple workers using GPUStack.

"},{"location":"tutorials/running-inference-with-ascend-npus/","title":"Running Inference With Ascend NPUs","text":"

GPUStack supports running inference on Ascend NPUs. This tutorial will guide you through the configuration steps.

"},{"location":"tutorials/running-inference-with-ascend-npus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Status Verified Linux Support Ubuntu 20.04 Device Status Verified Ascend 910 Support Ascend 910B"},{"location":"tutorials/running-inference-with-ascend-npus/#setup-steps","title":"Setup Steps","text":""},{"location":"tutorials/running-inference-with-ascend-npus/#install-ascend-packages","title":"Install Ascend packages","text":"
  1. Download Ascend packages

Choose the packages according to your system and hardware from the resources download center (links below). GPUStack is compatible with CANN 8.x.

Download the driver and firmware from here.

Package Name Description Ascend-hdk-{chiptype}-npu-driver_{version}_linux-{arch}.run Ascend Driver (run format) Ascend-hdk-{chiptype}-npu-firmware_{version}.run Ascend Firmware (run format)

Download the toolkit and kernels from here.

Package Name Description Ascend-cann-toolkit_{version}_linux-{arch}.run CANN Toolkit (run format) Ascend-cann-kernels-{chiptype}_{version}_linux-{arch}.run CANN Kernels (run format)
  1. Create the user and group for running
sudo groupadd HwHiAiUser\nsudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash\nsudo usermod -aG HwHiAiUser $USER\n
  1. Install driver
sudo chmod +x Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run\n# Driver installation, default installation path: \"/usr/local/Ascend\"\nsudo sh Ascend-hdk-xxx-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all\n

If you see the following message, the driver installation is complete:

Driver package installed successfully!\n
  1. Verify successful driver installation

After the driver is installed successfully, run the npu-smi info command to check whether it was installed correctly.

$npu-smi info\n+------------------------------------------------------------------------------------------------+\n| npu-smi 23.0.1                   Version: 23.0.1                                               |\n+---------------------------+---------------+----------------------------------------------------+\n| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|\n| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |\n+===========================+===============+====================================================+\n| 4     910B3               | OK            | 93.6        40                0    / 0             |\n| 0                         | 0000:01:00.0  | 0           0    / 0          3161 / 65536         |\n+===========================+===============+====================================================+\n+---------------------------+---------------+----------------------------------------------------+\n| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |\n+===========================+===============+====================================================+\n| No running processes found in NPU 4                                                            |\n+===========================+===============+====================================================+\n
  1. Install firmware
sudo chmod +x Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run\nsudo sh Ascend-hdk-xxx-npu-firmware_x.x.x.x.X.run --full\n

If you see the following message, the firmware installation is complete:

Firmware package installed successfully!\n
  1. Install toolkit and kernels

The following uses Ubuntu as an example; adapt the commands to your system.

Check for dependencies to ensure Python, GCC, and other required tools are installed.

gcc --version\ng++ --version\nmake --version\ncmake --version\ndpkg -l zlib1g| grep zlib1g| grep ii\ndpkg -l zlib1g-dev| grep zlib1g-dev| grep ii\ndpkg -l libsqlite3-dev| grep libsqlite3-dev| grep ii\ndpkg -l openssl| grep openssl| grep ii\ndpkg -l libssl-dev| grep libssl-dev| grep ii\ndpkg -l libffi-dev| grep libffi-dev| grep ii\ndpkg -l libbz2-dev| grep libbz2-dev| grep ii\ndpkg -l libxslt1-dev| grep libxslt1-dev| grep ii\ndpkg -l unzip| grep unzip| grep ii\ndpkg -l pciutils| grep pciutils| grep ii\ndpkg -l net-tools| grep net-tools| grep ii\ndpkg -l libblas-dev| grep libblas-dev| grep ii\ndpkg -l gfortran| grep gfortran| grep ii\ndpkg -l libblas3| grep libblas3| grep ii\n

If the commands return messages showing missing packages, install them as follows (adjust the command if only specific packages are missing):

sudo apt-get install -y gcc g++ make cmake zlib1g zlib1g-dev openssl libsqlite3-dev libssl-dev libffi-dev libbz2-dev libxslt1-dev unzip pciutils net-tools libblas-dev gfortran libblas3\n

Install Python dependencies:

pip3 install --upgrade pip\npip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions\n

Install the toolkit and kernels:

chmod +x Ascend-cann-toolkit_{version}_linux-{arch}.run\nchmod +x Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run\n\nsh Ascend-cann-toolkit_{version}_linux-{arch}.run --install\nsh Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run --install\n

Once installation completes, you should see a success message like this:

xxx install success\n
  1. Configure environment variables
echo \"source ~/Ascend/ascend-toolkit/set_env.sh\" >> ~/.bashrc\nsource ~/.bashrc\n

For more details, refer to the Ascend Documentation.

"},{"location":"tutorials/running-inference-with-ascend-npus/#installing-gpustack","title":"Installing GPUStack","text":"

Once your environment is ready, you can install GPUStack following the installation guide.

Once installed, you should see that GPUStack successfully recognizes the Ascend device on the Resources page.

"},{"location":"tutorials/running-inference-with-ascend-npus/#running-inference","title":"Running Inference","text":"

After installation, you can deploy models and run inference. Refer to the model management for usage details.

The Ascend NPU supports inference through the llama-box (llama.cpp) backend. For supported models, see the llama.cpp Ascend NPU model support list.

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/","title":"Running Inference With Moore Threads GPUs","text":"

GPUStack supports running inference on Moore Threads GPUs. This tutorial provides a comprehensive guide to configuring your system for optimal performance.

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#system-and-hardware-support","title":"System and Hardware Support","text":"OS Architecture Status Verified Linux x86_64 Support Ubuntu 20.04/22.04 Device Status Verified MTT S80 Support Yes MTT S3000 Support Yes MTT S4000 Support Yes"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#prerequisites","title":"Prerequisites","text":"

The following instructions are applicable for Ubuntu 20.04/22.04 systems with x86_64 architecture.

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#configure-the-container-runtime","title":"Configure the Container Runtime","text":"

Follow these links to install and configure the container runtime:

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#verify-container-runtime-configuration","title":"Verify Container Runtime Configuration","text":"

Ensure the output shows the default runtime as mthreads.

$ (cd /usr/bin/musa && sudo ./docker setup $PWD)\n$ docker info | grep mthreads\n Runtimes: mthreads mthreads-experimental runc\n Default Runtime: mthreads\n
"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#installing-gpustack","title":"Installing GPUStack","text":"

To set up an isolated environment for GPUStack, we recommend using Docker.

docker run -d --name gpustack-musa -p 9009:80 --ipc=host -v gpustack-data:/var/lib/gpustack \\\n    gpustack/gpustack:main-musa\n

This command will:

To check the logs of the running container, use the following command:

docker logs -f gpustack-musa\n

If the following message appears, the GPUStack container is running successfully:

2024-11-15T23:37:46+00:00 - gpustack.server.server - INFO - Serving on 0.0.0.0:80.\n2024-11-15T23:37:46+00:00 - gpustack.worker.worker - INFO - Starting GPUStack worker.\n

Once the container is running, access the GPUStack web interface by navigating to http://localhost:9009 in your browser.

After the initial setup for GPUStack, you should see the following screen:

"},{"location":"tutorials/running-inference-with-moorethreads-gpus/#dashboard","title":"Dashboard","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#workers","title":"Workers","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#gpus","title":"GPUs","text":""},{"location":"tutorials/running-inference-with-moorethreads-gpus/#running-inference","title":"Running Inference","text":"

After installation, you can deploy models and run inference. Refer to the model management for detailed usage instructions.

Moore Threads GPUs support inference through the llama-box (llama.cpp) backend. Most recent models are supported (e.g., llama3.2:1b, llama3.2-vision:11b, qwen2.5:7b, etc.).

Use mthreads-gmi to verify if the model is offloaded to the GPU.

root@a414c45864ee:/# mthreads-gmi\nSat Nov 16 12:00:16 2024\n---------------------------------------------------------------\n    mthreads-gmi:1.14.0          Driver Version:2.7.0\n---------------------------------------------------------------\nID   Name           |PCIe                |%GPU  Mem\n     Device Type    |Pcie Lane Width     |Temp  MPC Capable\n                                         |      ECC Mode\n+-------------------------------------------------------------+\n0    MTT S80        |00000000:01:00.0    |98%   1339MiB(16384MiB)\n     Physical       |16x(16x)            |56C   YES\n                                         |      N/A\n---------------------------------------------------------------\n\n---------------------------------------------------------------\nProcesses:\nID   PID       Process name                         GPU Memory\n                                                         Usage\n+-------------------------------------------------------------+\n0    120       ...ird_party/bin/llama-box/llama-box       2MiB\n0    2022      ...ird_party/bin/llama-box/llama-box    1333MiB\n---------------------------------------------------------------\n
"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/","title":"Running Inference on Copilot+ PCs with Snapdragon X","text":"

GPUStack supports running on ARM64 Windows, enabling use on Snapdragon X-based Copilot+ PCs.

Note

Only CPU-based inference is supported on Snapdragon X devices. GPUStack does not currently support GPU or NPU acceleration on this platform.

"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#prerequisites","title":"Prerequisites","text":""},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#installing-gpustack","title":"Installing GPUStack","text":"

Run PowerShell as administrator (avoid using PowerShell ISE), then run the following command to install GPUStack:

Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

After installation, follow the on-screen instructions to obtain credentials and log in to the GPUStack UI.

"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#deploying-a-model","title":"Deploying a Model","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button and select Ollama Library from the dropdown.
  3. Enter llama3.2 in the Name field.
  4. Select llama3.2 from the Ollama Model dropdown.
  5. Click Save to deploy the model.

Once deployed, you can monitor the model's status on the Models page.

"},{"location":"tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/#running-inference","title":"Running Inference","text":"

Navigate to the Playground page in the GPUStack UI, where you can interact with the deployed model.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/","title":"Setting Up a Multi-node GPUStack Cluster","text":"

This tutorial will guide you through setting up a multi-node GPUStack cluster, where you can distribute your workloads across multiple GPU-enabled nodes. This guide assumes you have basic knowledge of running commands on Linux, macOS, or Windows systems.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#prerequisites","title":"Prerequisites","text":"

Before starting, ensure you have the following:

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-1-install-gpustack-on-the-server-node","title":"Step 1: Install GPUStack on the Server Node","text":"

First, you need to install GPUStack on one of the nodes to act as the server node. Follow the instructions below based on your operating system.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos","title":"Linux or macOS","text":"

Run the following command on your server node:

curl -sfL https://get.gpustack.ai | sh -s -\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows","title":"Windows","text":"

Run PowerShell as administrator and execute the following command:

Invoke-Expression (Invoke-WebRequest -Uri \"https://get.gpustack.ai\" -UseBasicParsing).Content\n

Once GPUStack is installed, you can proceed to configure your cluster by adding worker nodes.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-2-retrieve-the-token-from-the-server-node","title":"Step 2: Retrieve the Token from the Server Node","text":"

To add worker nodes to the cluster, you need the token generated by GPUStack on the server node. On the server node, run the following command to get the token:

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#linux-or-macos_1","title":"Linux or macOS","text":"
cat /var/lib/gpustack/token\n
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#windows_1","title":"Windows","text":"
Get-Content -Path \"$env:APPDATA\\gpustack\\token\" -Raw\n

This token will be required in the next steps to authenticate worker nodes.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-3-add-worker-nodes-to-the-cluster","title":"Step 3: Add Worker Nodes to the Cluster","text":"

Now, you will install GPUStack on additional nodes (worker nodes) and connect them to the server node using the token.

Linux or macOS Worker Nodes

Run the following command on each worker node, replacing http://myserver with the URL of your server node and mytoken with the token retrieved in Step 2:

curl -sfL https://get.gpustack.ai | sh -s - --server-url http://myserver --token mytoken\n

Windows Worker Nodes

Run PowerShell as administrator on each worker node and use the following command, replacing http://myserver and mytoken with the server URL and token:

Invoke-Expression \"& { $((Invoke-WebRequest -Uri 'https://get.gpustack.ai' -UseBasicParsing).Content) } --server-url http://myserver --token mytoken\"\n

Once the command is executed, each worker node will connect to the main server and become part of the GPUStack cluster.

"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#step-4-verify-the-cluster-setup","title":"Step 4: Verify the Cluster Setup","text":"

After adding the worker nodes, you can verify that the cluster is set up correctly by accessing the GPUStack UI.

  1. Open a browser and navigate to http://myserver (replace myserver with the actual server URL).
  2. Log in with the default credentials (username admin). To retrieve the default password, run the following command on the server node:

Linux or macOS

cat /var/lib/gpustack/initial_admin_password\n

Windows

Get-Content -Path \"$env:APPDATA\\gpustack\\initial_admin_password\" -Raw\n
  1. After logging in, navigate to the Resources page in the UI to see all connected nodes and their GPUs. You should see your worker nodes listed and ready for serving LLMs.
"},{"location":"tutorials/setting-up-a-multi-node-gpustack-cluster/#conclusion","title":"Conclusion","text":"

Congratulations! You've successfully set up a multi-node GPUStack cluster! You can now scale your workloads across multiple nodes, making full use of your available GPUs to handle your tasks efficiently.

"},{"location":"tutorials/using-audio-models/","title":"Using Audio Models","text":"

GPUStack supports running both speech-to-text and text-to-speech models. Speech-to-text models convert audio inputs in various languages into written text, while text-to-speech models transform written text into natural and expressive speech.

In this tutorial, we will walk you through deploying and using speech-to-text and text-to-speech models in GPUStack.

"},{"location":"tutorials/using-audio-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/using-audio-models/#running-speech-to-text-model","title":"Running Speech-to-Text Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-speech-to-text-model","title":"Step 1: Deploy Speech-to-Text Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Use the search bar in the top left to search for the model name Systran/faster-whisper-medium.
  5. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-audio-models/#step-2-interact-with-speech-to-text-model-models","title":"Step 2: Interact with Speech-to-Text Model Models","text":"
  1. Navigate to the Playground > Audio page in the GPUStack UI.
  2. Select the Speech to Text Tab.
  3. Select the deployed model from the top-right dropdown.
  4. Click the Upload button to upload an audio file, or click the Microphone button to record audio.
  5. Click the Generate Text Content button to generate the text.
"},{"location":"tutorials/using-audio-models/#running-text-to-speech-model","title":"Running Text-to-Speech Model","text":""},{"location":"tutorials/using-audio-models/#step-1-deploy-text-to-speech-model","title":"Step 1: Deploy Text-to-Speech Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Use the search bar in the top left to search for the model name FunAudioLLM/CosyVoice-300M.
  5. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-audio-models/#step-2-interact-with-text-to-speech-model-models","title":"Step 2: Interact with Text to Speech Model Models","text":"
  1. Navigate to the Playground > Audio page in the GPUStack UI.
  2. Select the Text to Speech Tab.
  3. Choose the deployed model from the dropdown menu in the top-right corner. Then, configure the voice and output audio format.
  4. Input the text to convert to speech.
  5. Click the Submit button to generate the audio.
"},{"location":"tutorials/using-image-generation-models/","title":"Using Image Generation Models","text":"

GPUStack supports deploying and running state-of-the-art image generation models. These models allow you to generate stunning images from textual descriptions, enabling applications in design, content creation, and more.

In this tutorial, we will walk you through deploying and using image generation models in GPUStack.

"},{"location":"tutorials/using-image-generation-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/using-image-generation-models/#step-1-deploy-the-stable-diffusion-model","title":"Step 1: Deploy the Stable Diffusion Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Use the search bar in the top left to search for the model name gpustack/stable-diffusion-v3-5-medium-GGUF.
  5. In the Available Files section, select the stable-diffusion-v3-5-medium-Q4_0.gguf file.
  6. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-image-generation-models/#step-2-use-the-model-for-image-generation","title":"Step 2: Use the Model for Image Generation","text":"
  1. Navigate to the Playground > Image page in the GPUStack UI.
  2. Verify that the deployed model is selected from the top-right Model dropdown.
  3. Enter a prompt describing the image you want to generate. For example:
a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She wears a dark-colored outfit with subtle patterns. The overall style of the artwork is reminiscent of fantasy or supernatural genres.\n
  1. Select euler in the Sampler dropdown.
  2. Set the Sample Steps to 20.
  3. Click the Submit button to create the image.

The generated image will be displayed in the UI. Your image may look different given the seed and randomness involved in the generation process.

"},{"location":"tutorials/using-image-generation-models/#conclusion","title":"Conclusion","text":"

Congratulations! You\u2019ve successfully deployed and used an image generation model in GPUStack. With this setup, you can generate unique and visually compelling images from textual prompts. Experiment with different prompts and settings to push the boundaries of what\u2019s possible.

"},{"location":"tutorials/using-reranker-models/","title":"Using Reranker Models","text":"

Reranker models are specialized models designed to improve the ranking of a list of items based on relevance to a given query. They are commonly used in information retrieval and search systems to refine initial search results, prioritizing items that are more likely to meet the user\u2019s intent. Reranker models take the initial document list and reorder items to enhance precision in applications such as search engines, recommendation systems, and question-answering tasks.

In this tutorial, we will guide you through deploying and using reranker models in GPUStack.

"},{"location":"tutorials/using-reranker-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

"},{"location":"tutorials/using-reranker-models/#step-1-deploy-the-model","title":"Step 1: Deploy the Model","text":"

Follow these steps to deploy the model from Hugging Face:

  1. Navigate to the Models page in the GPUStack UI.
  2. Click the Deploy Model button.
  3. In the dropdown, select Hugging Face as the source for your model.
  4. Enable the GGUF checkbox to filter models by GGUF format.
  5. Use the search bar in the top left to search for the model name gpustack/bge-reranker-v2-m3-GGUF.
  6. Leave everything as default and click the Save button to deploy the model.

After deployment, you can monitor the model's status on the Models page.

"},{"location":"tutorials/using-reranker-models/#step-2-generate-an-api-key","title":"Step 2: Generate an API Key","text":"

We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:

  1. Navigate to the API Keys page in the GPUStack UI.
  2. Click the New API Key button.
  3. Enter a name for the API key and click the Save button.
  4. Copy the generated API key. You can only view the API key once, so make sure to save it securely.
"},{"location":"tutorials/using-reranker-models/#step-3-reranking","title":"Step 3: Reranking","text":"

With the model deployed and an API key, you can rerank a list of documents via the GPUStack API. Here is an example script using curl:

export SERVER_URL=<your-server-url>\nexport GPUSTACK_API_KEY=<your-api-key>\ncurl $SERVER_URL/v1/rerank \\\n    -H \"Content-Type: application/json\" \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -d '{\n        \"model\": \"bge-reranker-v2-m3\",\n        \"query\": \"What is a panda?\",\n        \"top_n\": 3,\n        \"documents\": [\n            \"hi\",\n            \"it is a bear\",\n            \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n        ]\n    }' | jq\n

Replace <your-server-url> with the URL of your GPUStack server and <your-api-key> with the API key you generated in the previous step.

Example response:

{\n  \"model\": \"bge-reranker-v2-m3\",\n  \"object\": \"list\",\n  \"results\": [\n    {\n      \"document\": {\n        \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n      },\n      \"index\": 2,\n      \"relevance_score\": 1.951932668685913\n    },\n    {\n      \"document\": {\n        \"text\": \"it is a bear\"\n      },\n      \"index\": 1,\n      \"relevance_score\": -3.7347371578216553\n    },\n    {\n      \"document\": {\n        \"text\": \"hi\"\n      },\n      \"index\": 0,\n      \"relevance_score\": -6.157620906829834\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 69,\n    \"total_tokens\": 69\n  }\n}\n
"},{"location":"tutorials/using-vision-language-models/","title":"Using Vision Language Models","text":"

Vision Language Models can process both visual (image) and language (text) data simultaneously, making them versatile tools for various applications, such as image captioning, visual question answering, and more. In this tutorial, you will learn how to deploy and interact with Vision Language Models (VLMs) in GPUStack.

The procedure for deploying and interacting with these models in GPUStack is similar across models. The main difference is the parameters you need to set at deployment time. For more information on the available parameters, refer to Backend Parameters.

In this tutorial, we will cover the deployment of the following models:

"},{"location":"tutorials/using-vision-language-models/#prerequisites","title":"Prerequisites","text":"

Before you begin, ensure that you have the following:

Note

An Ubuntu node equipped with one H100 (80GB) GPU is used throughout this tutorial.

"},{"location":"tutorials/using-vision-language-models/#step-1-install-gpustack","title":"Step 1: Install GPUStack","text":"

Run the following command to install GPUStack:

curl -sfL https://get.gpustack.ai | sh -s - --huggingface-token <Hugging Face API Key>\n

Replace <Hugging Face API Key> with your Hugging Face API key. GPUStack will use this key to download the model files.

"},{"location":"tutorials/using-vision-language-models/#step-2-log-in-to-gpustack-ui","title":"Step 2: Log in to GPUStack UI","text":"

Run the following command to get the default password:

cat /var/lib/gpustack/initial_admin_password\n

Open your browser and navigate to http://<your-server-ip>. Replace <your-server-ip> with the IP address of your server. Log in using the username admin and the password you obtained in the previous step.

"},{"location":"tutorials/using-vision-language-models/#step-3-deploy-vision-language-models","title":"Step 3: Deploy Vision Language Models","text":""},{"location":"tutorials/using-vision-language-models/#deploy-llama32-vision","title":"Deploy Llama3.2-Vision","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for meta-llama/Llama-3.2-11B-Vision-Instruct in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button multiple times and add the following parameters:
  1. Click the Save button.
"},{"location":"tutorials/using-vision-language-models/#deploy-qwen2-vl","title":"Deploy Qwen2-VL","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for Qwen/Qwen2-VL-7B-Instruct in the search bar.
  4. Click the Save button. The default configurations should work as long as you have enough GPU resources.
"},{"location":"tutorials/using-vision-language-models/#deploy-pixtral","title":"Deploy Pixtral","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for mistralai/Pixtral-12B-2409 in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button multiple times and add the following parameters:
  1. Click the Save button.
"},{"location":"tutorials/using-vision-language-models/#deploy-phi35-vision","title":"Deploy Phi3.5-Vision","text":"
  1. Navigate to the Models page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for microsoft/Phi-3.5-vision-instruct in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button and add the following parameter:
  1. Click the Save button.
"},{"location":"tutorials/using-vision-language-models/#step-4-interact-with-vision-language-models","title":"Step 4: Interact with Vision Language Models","text":"
  1. Navigate to the Playground page in the GPUStack UI.
  2. Select the deployed model from the top-right dropdown.
  3. Click on the Upload Image button above the input text area and upload an image.
  4. Enter a prompt in the input text area. For example, \"Describe the image.\"
  5. Click the Submit button to generate the output.
"},{"location":"tutorials/using-vision-language-models/#conclusion","title":"Conclusion","text":"

In this tutorial, you learned how to deploy and interact with Vision Language Models in GPUStack. You can use the same approach to deploy other Vision Language Models not covered in this tutorial. If you have any questions or need further assistance, feel free to reach out to us.

"},{"location":"user-guide/api-key-management/","title":"API Key Management","text":"

GPUStack supports authentication using API keys. Each GPUStack user can generate and manage their own API keys.

"},{"location":"user-guide/api-key-management/#create-api-key","title":"Create API Key","text":"
  1. Navigate to the API Keys page.
  2. Click the New API Key button.
  3. Fill in the Name, Description, and select the Expiration of the API key.
  4. Click the Save button.
  5. Copy and store the key somewhere safe, then click the Done button.

Note

Please note that you can only see the generated API key once upon creation.

"},{"location":"user-guide/api-key-management/#delete-api-key","title":"Delete API Key","text":"
  1. Navigate to the API Keys page.
  2. Find the API key you want to delete.
  3. Click the Delete button in the Operations column.
  4. Confirm the deletion.
"},{"location":"user-guide/api-key-management/#use-api-key","title":"Use API Key","text":"

GPUStack supports using the API key as a bearer token. The following is an example using curl:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"llama3\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello!\"\n      }\n    ],\n    \"stream\": true\n  }'\n
"},{"location":"user-guide/image-generation-apis/","title":"Image Generation APIs","text":"

GPUStack provides APIs for generating images given a prompt and/or an input image when running diffusion models.

Note

The image generation APIs are only available when using the llama-box inference backend.

"},{"location":"user-guide/image-generation-apis/#supported-models","title":"Supported Models","text":"

The following models are available for image generation:

"},{"location":"user-guide/image-generation-apis/#api-details","title":"API Details","text":"

The image generation APIs adhere to the OpenAI API specification. While the OpenAI image generation APIs are simple and opinionated, GPUStack extends them with additional features.

"},{"location":"user-guide/image-generation-apis/#create-image","title":"Create Image","text":""},{"location":"user-guide/image-generation-apis/#streaming","title":"Streaming","text":"

This image generation API supports streaming responses to report the progress of the generation. To enable streaming, set the stream parameter to true in the request body. Example:

REQUEST : (application/json)\n{\n  \"n\": 1,\n  \"response_format\": \"b64_json\",\n  \"size\": \"512x512\",\n  \"prompt\": \"A lovely cat\",\n  \"quality\": \"standard\",\n  \"stream\": true,\n  \"stream_options\": {\n    \"include_usage\": true, // return usage information\n  }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#advanced-options","title":"Advanced Options","text":"

This image generation API supports additional options to control the generation process. The following options are available:

REQUEST : (application/json)\n{\n  \"n\": 1,\n  \"response_format\": \"b64_json\",\n  \"size\": \"512x512\",\n  \"prompt\": \"A lovely cat\",\n  \"sampler\": \"euler\",      // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\n  \"schedule\": \"default\",   // optional, select from default;discrete;karras;exponential;ays;gits\n  \"seed\": null,            // optional, random seed\n  \"cfg_scale\": 4.5,        // optional, for sampler, the scale of classifier-free guidance in the output phase\n  \"sample_steps\": 20,      // optional, number of sample steps\n  \"negative_prompt\": \"\",   // optional, negative prompt\n  \"stream\": true,\n  \"stream_options\": {\n    \"include_usage\": true, // return usage information\n  }\n}\n\nRESPONSE : (text/event-stream)\ndata: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n...\ndata: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\ndata: [DONE]\n
"},{"location":"user-guide/image-generation-apis/#create-image-edit","title":"Create Image Edit","text":""},{"location":"user-guide/image-generation-apis/#streaming_1","title":"Streaming","text":"

This image generation API supports streaming responses to report the progress of the generation. To enable streaming, set the stream parameter to true in the request body. Example:

REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nquality=standard\nimage=...                         // required\nmask=...                          // optional\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n  data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n  data: [DONE]\nCASE 2: illegal input image\n  error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#advanced-options_1","title":"Advanced Options","text":"

This image generation API supports additional options to control the generation process. The following options are available:

REQUEST: (multipart/form-data)\nn=1\nresponse_format=b64_json\nsize=512x512\nprompt=\"A lovely cat\"\nimage=...                         // required\nmask=...                          // optional\nsampler=euler                     // required, select from euler_a;euler;heun;dpm2;dpm++2s_a;dpm++2m;dpm++2mv2;ipndm;ipndm_v;lcm\nschedule=default                  // optional, select from default;discrete;karras;exponential;ays;gits\nseed=null                         // optional, random seed\ncfg_scale=4.5                     // optional, for sampler, the scale of classifier-free guidance in the output phase\nsample_steps=20                   // optional, number of sample steps\nnegative_prompt=\"\"                // optional, negative prompt\nstream=true\nstream_options_include_usage=true // return usage information\n\nRESPONSE : (text/event-stream)\nCASE 1: correct input image\n  data: {\"created\":1731916353,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":10.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":50.0}], ...}\n  ...\n  data: {\"created\":1731916371,\"data\":[{\"index\":0,\"object\":\"image.chunk\",\"progress\":100.0,\"b64_json\":\"...\"}], \"usage\":{\"generation_per_second\":...,\"time_per_generation_ms\":...,\"time_to_process_ms\":...}, ...}\n  data: [DONE]\nCASE 2: illegal input image\n  error: {\"code\": 400, \"message\": \"Invalid image\", \"type\": \"invalid_request_error\"}\n
"},{"location":"user-guide/image-generation-apis/#usage","title":"Usage","text":"

The following are examples of using the image generation APIs:

"},{"location":"user-guide/image-generation-apis/#curl-create-image","title":"curl (Create Image)","text":"
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/generate \\\n    -H \"Content-Type: application/json\" \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -d '{\n        \"n\": 1,\n        \"response_format\": \"b64_json\",\n        \"size\": \"512x512\",\n        \"prompt\": \"A lovely cat\",\n        \"quality\": \"standard\",\n        \"stream\": true,\n        \"stream_options\": {\n        \"include_usage\": true\n        }\n    }'\n
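
For programmatic use, the following is a rough Python sketch of consuming the streaming Create Image endpoint, assuming the requests package; the endpoint path, request fields, and event format follow the curl example and the streaming responses shown above, and the final chunk is saved to a PNG file.

import base64\nimport json\nimport os\nimport requests\n\n# Stream generation progress events; the final chunk carries the base64-encoded image.\napi_key = os.environ[\"GPUSTACK_API_KEY\"]\nresp = requests.post(\n    \"http://myserver/v1-openai/image/generate\",\n    headers={\"Authorization\": f\"Bearer {api_key}\"},\n    json={\n        \"n\": 1,\n        \"response_format\": \"b64_json\",\n        \"size\": \"512x512\",\n        \"prompt\": \"A lovely cat\",\n        \"stream\": True,\n        \"stream_options\": {\"include_usage\": True},\n    },\n    stream=True,\n)\nfor line in resp.iter_lines():\n    if not line or not line.startswith(b\"data: \"):\n        continue\n    payload = line[len(b\"data: \"):]\n    if payload == b\"[DONE]\":\n        break\n    chunk = json.loads(payload)[\"data\"][0]\n    print(f\"progress: {chunk.get('progress')}%\")\n    if \"b64_json\" in chunk:\n        with open(\"cat.png\", \"wb\") as f:\n            f.write(base64.b64decode(chunk[\"b64_json\"]))\n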
"},{"location":"user-guide/image-generation-apis/#curl-create-image-edit","title":"curl (Create Image Edit)","text":"
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/image/edit \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -F image=\"@otter.png\" \\\n    -F mask=\"@mask.png\" \\\n    -F prompt=\"A lovely cat\" \\\n    -F n=1 \\\n    -F size=\"512x512\"\n
"},{"location":"user-guide/inference-backends/","title":"Inference Backends","text":"

GPUStack supports the following inference backends:

When users deploy a model, the backend is selected automatically based on the following criteria:

"},{"location":"user-guide/inference-backends/#llama-box","title":"llama-box","text":"

llama-box is an LM inference server based on llama.cpp and stable-diffusion.cpp.

"},{"location":"user-guide/inference-backends/#supported-platforms","title":"Supported Platforms","text":"

The llama-box backend supports the Linux, macOS, and Windows platforms (with CPU offloading only on the Windows ARM architecture).

"},{"location":"user-guide/inference-backends/#supported-models","title":"Supported Models","text":""},{"location":"user-guide/inference-backends/#supported-features","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"

After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.

"},{"location":"user-guide/inference-backends/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"

Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more other workers, offloading computation tasks to them.

"},{"location":"user-guide/inference-backends/#parameters-reference","title":"Parameters Reference","text":"

See the full list of supported parameters for llama-box here.

"},{"location":"user-guide/inference-backends/#vllm","title":"vLLM","text":"

vLLM is a high-throughput, memory-efficient LLM inference engine. It is a popular choice for running LLMs in production. vLLM seamlessly supports most state-of-the-art open-source models, including Transformer-like LLMs (e.g., Llama), Mixture-of-Experts LLMs (e.g., Mixtral), embedding models (e.g., E5-Mistral), and multi-modal LLMs (e.g., LLaVA).

By default, GPUStack estimates the VRAM requirement for the model instance based on the model's metadata. You can customize the parameters to fit your needs. The following vLLM parameters might be useful:

For more details, please refer to vLLM documentation.

"},{"location":"user-guide/inference-backends/#supported-platforms_1","title":"Supported Platforms","text":"

The vLLM backend works on amd64 Linux.

Note

  1. When users install GPUStack on amd64 Linux using the installation script, vLLM is automatically installed.
  2. When users deploy a model using the vLLM backend, GPUStack sets worker label selectors to {\"os\": \"linux\", \"arch\": \"amd64\"} by default to ensure the model instance is scheduled to proper workers. You can customize the worker label selectors in the model configuration.
"},{"location":"user-guide/inference-backends/#supported-models_1","title":"Supported Models","text":"

Please refer to the vLLM documentation for supported models.

"},{"location":"user-guide/inference-backends/#supported-features_1","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#multimodal-language-models","title":"Multimodal Language Models","text":"

vLLM supports multimodal language models listed here. When users deploy a vision language model using the vLLM backend, image inputs are supported in the chat completion API.
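
As an illustration, here is a minimal Python sketch of sending an image through the OpenAI-compatible chat completion API; the model name and image path are placeholders for your own deployment, and the image_url content format follows the OpenAI convention for vision inputs.

import base64\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\n# Encode a local image as a data URL and pass it as an image_url content part.\nwith open(\"example.jpg\", \"rb\") as f:\n    image_b64 = base64.b64encode(f.read()).decode()\n\ncompletion = client.chat.completions.create(\n    model=\"my-vision-model\",  # the name of a vision model deployed in GPUStack\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": \"Describe the image.\"},\n                {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{image_b64}\"}},\n            ],\n        }\n    ],\n)\nprint(completion.choices[0].message.content)\n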

"},{"location":"user-guide/inference-backends/#parameters-reference_1","title":"Parameters Reference","text":"

See the full list of supported parameters for vLLM here.

"},{"location":"user-guide/inference-backends/#vox-box","title":"vox-box","text":"

vox-box is an inference engine designed for deploying text-to-speech and speech-to-text models. It also provides an API that is fully compatible with the OpenAI audio API.
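
Because the API is OpenAI-compatible, the OpenAI Python client can be pointed at the GPUStack server. The sketch below is illustrative only: it assumes the audio endpoints are exposed under the same /v1-openai path as the other OpenAI-compatible APIs, the model names must match models you have actually deployed (the names here come from the supported-model list), and the voice must be one supported by the TTS model.

from openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\n# Text to speech: generate audio with a deployed TTS model.\nspeech = client.audio.speech.create(\n    model=\"CosyVoice-300M-SFT\",\n    voice=\"default\",  # placeholder; use a voice supported by the deployed model\n    input=\"Hello from GPUStack.\",\n)\nwith open(\"speech.mp3\", \"wb\") as f:\n    f.write(speech.content)\n\n# Speech to text: transcribe the generated audio with a deployed STT model.\nwith open(\"speech.mp3\", \"rb\") as audio_file:\n    transcript = client.audio.transcriptions.create(\n        model=\"faster-whisper-medium\",\n        file=audio_file,\n    )\nprint(transcript.text)\n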

"},{"location":"user-guide/inference-backends/#supported-platforms_2","title":"Supported Platforms","text":"

The vox-box backend supports Linux, macOS and Windows platforms.

Note

  1. To use NVIDIA GPUs, ensure the following NVIDIA libraries are installed on workers:
    • cuBLAS for CUDA 12
    • cuDNN 9 for CUDA 12
  2. When users install GPUStack on Linux, macOS and Windows using the installation script, vox-box is automatically installed.
  3. CosyVoice models are natively supported on Linux (amd64 architecture) and macOS. They are not supported on ARM Linux or Windows.
"},{"location":"user-guide/inference-backends/#supported-models_2","title":"Supported Models","text":"Model Type Link Supported Platforms Faster-whisper-large-v3 speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-large-v2 speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-large-v1 speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-medium speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-medium.en speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-small speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-small.en speech-to-text Hugging Face Linux, macOS, Windows Faster-distil-whisper-large-v3 speech-to-text Hugging Face Linux, macOS, Windows Faster-distil-whisper-large-v2 speech-to-text Hugging Face Linux, macOS, Windows Faster-distil-whisper-medium.en speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-tiny speech-to-text Hugging Face Linux, macOS, Windows Faster-whisper-tiny.en speech-to-text Hugging Face Linux, macOS, Windows CosyVoice-300M-Instruct text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-SFT text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M text-to-speech Hugging Face, ModelScope Linux(ARM not supported), macOS, Windows(Not supported) CosyVoice-300M-25Hz text-to-speech ModelScope Linux(ARM not supported), macOS, Windows(Not supported)"},{"location":"user-guide/inference-backends/#supported-features_2","title":"Supported Features","text":""},{"location":"user-guide/inference-backends/#allow-gpucpu-offloading","title":"Allow GPU/CPU Offloading","text":"

vox-box supports deploying models to NVIDIA GPUs. If GPU resources are insufficient, it will automatically deploy the models to the CPU.

"},{"location":"user-guide/model-management/","title":"Model Management","text":"

You can manage large language models in GPUStack by navigating to the Models page. A model in GPUStack contains one or multiple replicas of model instances. On deployment, GPUStack automatically computes resource requirements for the model instances from model metadata and schedules them to available workers accordingly.

"},{"location":"user-guide/model-management/#deploy-model","title":"Deploy Model","text":"

Currently, models from Hugging Face, ModelScope, Ollama and local paths are supported.

"},{"location":"user-guide/model-management/#deploying-a-hugging-face-model","title":"Deploying a Hugging Face Model","text":"
  1. Click the Deploy Model button, then select Hugging Face in the dropdown.

  2. Search the model by name from Hugging Face using the search bar in the top left. For example, microsoft/Phi-3-mini-4k-instruct-gguf. If you only want to search for GGUF models, check the \"GGUF\" checkbox.

  3. Select a file with the desired quantization format from Available Files.

  4. Adjust the Name and Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#deploying-a-modelscope-model","title":"Deploying a ModelScope Model","text":"
  1. Click the Deploy Model button, then select ModelScope in the dropdown.

  2. Search the model by name from ModelScope using the search bar in the top left. For example, Qwen/Qwen2-0.5B-Instruct. If you only want to search for GGUF models, check the \"GGUF\" checkbox.

  3. Select a file with the desired quantization format from Available Files.

  4. Adjust the Name and Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#deploying-an-ollama-model","title":"Deploying an Ollama Model","text":"
  1. Click the Deploy Model button, then select Ollama Library in the dropdown.

  2. Fill in the Name of the model.

  3. Select an Ollama Model from the dropdown list, or input any Ollama model you need. For example, llama3, llama3:70b or youraccount/llama3:70b.

  4. Adjust the Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#deploying-a-local-path-model","title":"Deploying a Local Path Model","text":"

You can deploy a model from a local path. The model path can be a directory (e.g., a downloaded Hugging Face model directory) or a file (e.g., a GGUF model file) located on workers. This is useful when running in an air-gapped environment.

Note

  1. GPUStack does not check the validity of the model path for scheduling, which may lead to deployment failure if the model path is inaccessible. It is recommended to ensure the model path is accessible on all workers (e.g., using NFS, rsync, etc.). You can also use the worker selector configuration to deploy the model to specific workers.
  2. GPUStack cannot evaluate the model's resource requirements unless the server has access to the same model path. Consequently, you may observe empty VRAM/RAM allocations for a deployed model. To mitigate this, it is recommended to make the model files available on the same path on the server. Alternatively, you can customize backend parameters, such as tensor-split, to configure how the model is distributed across the GPUs.

To deploy a local path model:

  1. Click the Deploy Model button, then select Local Path in the dropdown.

  2. Fill in the Name of the model.

  3. Fill in the Model Path.

  4. Adjust the Replicas as needed.

  5. Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Model Configuration section for more details.

  6. Click the Save button.

"},{"location":"user-guide/model-management/#edit-model","title":"Edit Model","text":"
  1. Find the model you want to edit on the model list page.
  2. Click the Edit button in the Operations column.
  3. Update the attributes as needed. For example, change the Replicas to scale up or down.
  4. Click the Save button.

Note

After editing the model, the updated configuration will not be applied to existing model instances. Delete the existing model instances, and GPUStack will recreate new instances based on the updated model configuration.

"},{"location":"user-guide/model-management/#delete-model","title":"Delete Model","text":"
  1. Find the model you want to delete on the model list page.
  2. Click the ellipsis button in the Operations column, then select Delete.
  3. Confirm the deletion.
"},{"location":"user-guide/model-management/#view-model-instance","title":"View Model Instance","text":"
  1. Find the model you want to check on the model list page.
  2. Click the > symbol to view the instance list of the model.
"},{"location":"user-guide/model-management/#delete-model-instance","title":"Delete Model Instance","text":"
  1. Find the model you want to check on the model list page.
  2. Click the > symbol to view the instance list of the model.
  3. Find the model instance you want to delete.
  4. Click the ellipsis button for the model instance in the Operations column, then select Delete.
  5. Confirm the deletion.

Note

After a model instance is deleted, GPUStack will recreate a new instance to satisfy the expected replicas of the model if necessary.

"},{"location":"user-guide/model-management/#view-model-instance-logs","title":"View Model Instance Logs","text":"
  1. Find the model you want to check on the model list page.
  2. Click the > symbol to view the instance list of the model.
  3. Find the model instance you want to check.
  4. Click the View Logs button for the model instance in the Operations column.
"},{"location":"user-guide/model-management/#use-self-hosted-ollama-models","title":"Use Self-hosted Ollama Models","text":"

You can deploy self-hosted Ollama models by configuring the --ollama-library-base-url option in the GPUStack server. The Ollama Library URL should point to the base URL of the Ollama model registry. For example, https://registry.mycompany.com.

Here is an example workflow to set up a registry, publish a model, and use it in GPUStack:

# Run a self-hosted OCI registry\ndocker run -d -p 5001:5000 --name registry registry:2\n\n# Push a model to the registry using Ollama\nollama pull llama3\nollama cp llama3 localhost:5001/library/llama3\nollama push localhost:5001/library/llama3 --insecure\n\n# Start GPUStack server with the custom Ollama library URL\ncurl -sfL https://get.gpustack.ai | sh -s - --ollama-library-base-url http://localhost:5001\n

That's it! You can now deploy the llama3 model from the Ollama Library source in GPUStack as usual, and the model will now be fetched from the self-hosted registry.

"},{"location":"user-guide/model-management/#advanced-model-configuration","title":"Advanced Model Configuration","text":"

GPUStack supports tailored configurations for model deployment.

"},{"location":"user-guide/model-management/#schedule-type","title":"Schedule Type","text":""},{"location":"user-guide/model-management/#auto","title":"Auto","text":"

GPUStack automatically schedules model instances to appropriate GPUs/Workers based on current resource availability.

When a Worker Selector is configured, the scheduler will deploy the model instance to a worker that has the specified labels.

  1. Navigate to the Resources page and edit the desired worker. Assign custom labels to the worker by adding them in the labels section.

  2. Go to the Models page and click on the Deploy Model button. Expand the Advanced section and input the previously assigned worker labels in the Worker Selector configuration. During deployment, the Model Instance will be allocated to the corresponding worker based on these labels.

"},{"location":"user-guide/model-management/#manual","title":"Manual","text":"

This schedule type allows users to specify which GPU to deploy the model instance on.

Select a GPU from the list. The model instance will attempt to deploy to this GPU if resources permit.

"},{"location":"user-guide/model-management/#backend","title":"Backend","text":"

The inference backend. Currently, GPUStack supports three backends: llama-box, vLLM and vox-box. GPUStack automatically selects the backend based on the model's configuration.

For more details, please refer to the Inference Backends section.

"},{"location":"user-guide/model-management/#backend-version","title":"Backend Version","text":"

Specify a backend version, such as v1.0.0. The version format and availability depend on the selected backend. This option is useful for ensuring compatibility or taking advantage of features introduced in specific backend versions. Refer to the Pinned Backend Versions section for more information.

"},{"location":"user-guide/model-management/#backend-parameters","title":"Backend Parameters","text":"

Input the parameters for the backend you want to customize when running the model. The parameter should be in the format --parameter=value, --bool-parameter or as separate fields for --parameter and value. For example, use --ctx-size=8192 for llama-box.

For the full list of supported parameters, please refer to the Inference Backends section.

"},{"location":"user-guide/model-management/#allow-cpu-offloading","title":"Allow CPU Offloading","text":"

Note

Available for llama-box backend only.

After enabling CPU offloading, GPUStack prioritizes loading as many layers as possible onto the GPU to optimize performance. If GPU resources are limited, some layers will be offloaded to the CPU, with full CPU inference used only when no GPU is available.

"},{"location":"user-guide/model-management/#allow-distributed-inference-across-workers","title":"Allow Distributed Inference Across Workers","text":"

Note

Available for llama-box backend only.

Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more other workers, offloading computation tasks to them.

"},{"location":"user-guide/openai-compatible-apis/","title":"OpenAI Compatible APIs","text":"

GPUStack serves OpenAI-compatible APIs using the /v1-openai path. Most of the APIs also work under the /v1 path as an alias, except for the models endpoint, which is reserved for GPUStack management APIs.

"},{"location":"user-guide/openai-compatible-apis/#supported-endpoints","title":"Supported Endpoints","text":"

The following API endpoints are supported:

"},{"location":"user-guide/openai-compatible-apis/#usage","title":"Usage","text":"

The following are examples using the APIs in different languages:

"},{"location":"user-guide/openai-compatible-apis/#curl","title":"curl","text":"
export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1-openai/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n  -d '{\n    \"model\": \"llama3\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello!\"\n      }\n    ],\n    \"stream\": true\n  }'\n
"},{"location":"user-guide/openai-compatible-apis/#openai-python-api-library","title":"OpenAI Python API library","text":"
from openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\ncompletion = client.chat.completions.create(\n  model=\"llama3\",\n  messages=[\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"Hello!\"}\n  ]\n)\n\nprint(completion.choices[0].message)\n
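
To consume the stream from Python (matching the stream flag set to true in the curl example), a sketch along these lines should work with the same client:

from openai import OpenAI\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\n# Stream tokens as they are generated instead of waiting for the full response.\nstream = client.chat.completions.create(\n    model=\"llama3\",\n    messages=[\n        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n        {\"role\": \"user\", \"content\": \"Hello!\"},\n    ],\n    stream=True,\n)\nfor chunk in stream:\n    delta = chunk.choices[0].delta.content\n    if delta:\n        print(delta, end=\"\", flush=True)\nprint()\n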
"},{"location":"user-guide/openai-compatible-apis/#openai-node-api-library","title":"OpenAI Node API library","text":"
const OpenAI = require(\"openai\");\n\nconst openai = new OpenAI({\n  apiKey: \"myapikey\",\n  baseURL: \"http://myserver/v1-openai\",\n});\n\nasync function main() {\n  const params = {\n    model: \"llama3\",\n    messages: [\n      {\n        role: \"system\",\n        content: \"You are a helpful assistant.\",\n      },\n      {\n        role: \"user\",\n        content: \"Hello!\",\n      },\n    ],\n  };\n  const chatCompletion = await openai.chat.completions.create(params);\n  console.log(chatCompletion.choices[0].message);\n}\nmain();\n
"},{"location":"user-guide/pinned-backend-versions/","title":"Pinned Backend Versions","text":"

Inference engines in the generative AI domain are evolving rapidly to enhance performance and unlock new capabilities. This constant evolution provides exciting opportunities but also presents challenges for maintaining model compatibility and deployment stability.

GPUStack allows you to pin inference backend versions to specific releases, offering a balance between staying up-to-date with the latest advancements and ensuring a reliable runtime environment. This feature is particularly beneficial in the following scenarios:

By pinning backend versions, you gain full control over your inference environment, enabling both flexibility and predictability in deployment.

"},{"location":"user-guide/pinned-backend-versions/#automatic-installation-of-pinned-backend-versions","title":"Automatic Installation of Pinned Backend Versions","text":"

To simplify deployment, GPUStack supports the automatic installation of pinned backend versions when feasible. The process depends on the type of backend:

  1. Prebuilt Binaries: For backends like llama-box, GPUStack downloads the specified version using the same mechanism used during GPUStack bootstrapping.

Tip

You can customize the download source using the --tools-download-base-url configuration option.

  2. Python-based Backends: For backends like vLLM and vox-box, GPUStack uses pipx to install the specified version in an isolated Python environment.

Tip

This automation reduces manual intervention, allowing you to focus on deploying and using your models.

"},{"location":"user-guide/pinned-backend-versions/#manual-installation-of-pinned-backend-versions","title":"Manual Installation of Pinned Backend Versions","text":"

When automatic installation is not feasible or preferred, GPUStack provides a straightforward way to manually install specific versions of inference backends. Follow these steps:

  1. Prepare the Executable: Install the backend executable, or link it, under the GPUStack bin directory. The default location can be customized; see the tip below.

Tip

You can customize the bin directory using the --bin-dir configuration option.

  2. Name the Executable: Ensure the executable is named in the <backend>_<version> format.

For example, the vLLM executable for version v0.6.4 should be named vllm_v0.6.4 on Linux.

By following these steps, you can maintain full control over the backend installation process, ensuring that the correct version is used for your deployment.

"},{"location":"user-guide/rerank-api/","title":"Rerank API","text":"

In the context of Retrieval-Augmented Generation (RAG), reranking refers to the process of selecting the most relevant information from retrieved documents or knowledge sources before presenting them to the user or utilizing them for answer generation.

GPUStack serves a Jina-compatible Rerank API at the /v1/rerank path.

Note

The Rerank API is only available when using the llama-box inference backend.

"},{"location":"user-guide/rerank-api/#supported-models","title":"Supported Models","text":"

The following models are available for reranking:

"},{"location":"user-guide/rerank-api/#usage","title":"Usage","text":"

The following is an example using the Rerank API:

export GPUSTACK_API_KEY=myapikey\ncurl http://myserver/v1/rerank \\\n    -H \"Content-Type: application/json\" \\\n    -H \"Authorization: Bearer $GPUSTACK_API_KEY\" \\\n    -d '{\n        \"model\": \"bge-reranker-v2-m3\",\n        \"query\": \"What is a panda?\",\n        \"top_n\": 3,\n        \"documents\": [\n            \"hi\",\n            \"it is a bear\",\n            \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n        ]\n    }' | jq\n

Example output:

{\n  \"model\": \"bge-reranker-v2-m3\",\n  \"object\": \"list\",\n  \"results\": [\n    {\n      \"document\": {\n        \"text\": \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\"\n      },\n      \"index\": 2,\n      \"relevance_score\": 1.951932668685913\n    },\n    {\n      \"document\": {\n        \"text\": \"it is a bear\"\n      },\n      \"index\": 1,\n      \"relevance_score\": -3.7347371578216553\n    },\n    {\n      \"document\": {\n        \"text\": \"hi\"\n      },\n      \"index\": 0,\n      \"relevance_score\": -6.157620906829834\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 69,\n    \"total_tokens\": 69\n  }\n}\n
"},{"location":"user-guide/user-management/","title":"User Management","text":"

GPUStack supports users of two roles: Admin and User. Admins can monitor system status, manage models, users, and system settings. Users can manage their own API keys and use the completion API.

"},{"location":"user-guide/user-management/#default-admin","title":"Default Admin","text":"

On bootstrap, GPUStack creates a default admin user. The initial password for the default admin is stored in <data-dir>/initial_admin_password. In the default setup, it should be /var/lib/gpustack/initial_admin_password. You can customize the default admin password by setting the --bootstrap-password parameter when starting gpustack.

"},{"location":"user-guide/user-management/#create-user","title":"Create User","text":"
  1. Navigate to the Users page.
  2. Click the Create User button.
  3. Fill in Name, Full Name, Password, and select Role for the user.
  4. Click the Save button.
"},{"location":"user-guide/user-management/#update-user","title":"Update User","text":"
  1. Navigate to the Users page.
  2. Find the user you want to edit.
  3. Click the Edit button in the Operations column.
  4. Update the attributes as needed.
  5. Click the Save button.
"},{"location":"user-guide/user-management/#delete-user","title":"Delete User","text":"
  1. Navigate to the Users page.
  2. Find the user you want to delete.
  3. Click the ellipsis button in the Operations column, then select Delete.
  4. Confirm the deletion.
"},{"location":"user-guide/playground/","title":"Playground","text":"

GPUStack offers a playground UI where users can test and experiment with the APIs. Refer to each subpage for detailed instructions and information.

"},{"location":"user-guide/playground/audio/","title":"Audio Playground","text":"

The Audio Playground is a dedicated space for testing and experimenting with GPUStack\u2019s text-to-speech (TTS) and speech-to-text (STT) APIs. It allows users to interactively convert text to audio and audio to text, customize parameters, and review code examples for seamless API integration.

"},{"location":"user-guide/playground/audio/#text-to-speech","title":"Text to Speech","text":"

Switch to the \"Text to Speech\" tab to test TTS models.

"},{"location":"user-guide/playground/audio/#text-input","title":"Text Input","text":"

Enter the text you want to convert, then click the Submit button to generate the corresponding speech.

"},{"location":"user-guide/playground/audio/#clear-text","title":"Clear Text","text":"

Click the Clear button to reset the text input and remove the generated speech.

"},{"location":"user-guide/playground/audio/#select-model","title":"Select Model","text":"

Select an available TTS model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/audio/#customize-parameters","title":"Customize Parameters","text":"

Customize the voice and format of the audio output.

Tip

Supported voices may vary between models.

"},{"location":"user-guide/playground/audio/#view-code","title":"View Code","text":"

After experimenting with input text and parameters, click the View Code button to see how to call the API with the same input. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/audio/#speech-to-text","title":"Speech to Text","text":"

Switch to the \"Speech to Text\" tab to test STT models.

"},{"location":"user-guide/playground/audio/#provide-audio-file","title":"Provide Audio File","text":"

You can provide audio for transcription in two ways:

  1. Upload an audio file.
  2. Record audio online.

Note

If the online recording is not available, it could be due to one of the following reasons:

  1. For HTTPS or http://localhost access, microphone permissions must be enabled in your browser.
  2. For access via http://{host IP}, the URL must be added to your browser's trusted list.

    Example: In Chrome, navigate to chrome://flags/, add the GPUStack URL to \"Insecure origins treated as secure,\" and enable this option.

"},{"location":"user-guide/playground/audio/#select-model_1","title":"Select Model","text":"

Select an available STT model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/audio/#copy-text","title":"Copy Text","text":"

Copy the transcription results generated by the model.

"},{"location":"user-guide/playground/audio/#customize-parameters_1","title":"Customize Parameters","text":"

Select the appropriate language for your audio file to optimize transcription accuracy.

"},{"location":"user-guide/playground/audio/#view-code_1","title":"View Code","text":"

After experimenting with audio files and parameters, click the View Code button to see how to call the API with the same input. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/chat/","title":"Chat Playground","text":"

Interact with the chat completions API. The following is an example screenshot:

"},{"location":"user-guide/playground/chat/#prompts","title":"Prompts","text":"

You can adjust the prompt messages on the left side of the playground. There are three role types of prompt messages: system, user, and assistant.

"},{"location":"user-guide/playground/chat/#edit-system-message","title":"Edit System Message","text":"

You can add and edit the system message at the top of the playground.

"},{"location":"user-guide/playground/chat/#edit-user-and-assistant-messages","title":"Edit User and Assistant Messages","text":"

To add a user or assistant message, click the New Message button.

To remove a user or assistant message, click the minus button at the right corner of the message.

To change the role of a message, click the User or Assistant text at the beginning of the message.

"},{"location":"user-guide/playground/chat/#upload-image","title":"Upload Image","text":"

You can add images to the prompt by clicking the Upload Image button.

"},{"location":"user-guide/playground/chat/#clear-prompts","title":"Clear Prompts","text":"

Click the Clear button to clear all the prompts.

"},{"location":"user-guide/playground/chat/#select-model","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground. Please refer to Model Management to learn about how to manage models.

"},{"location":"user-guide/playground/chat/#customize-parameters","title":"Customize Parameters","text":"

You can customize completion parameters in the Parameters section.

"},{"location":"user-guide/playground/chat/#do-completion","title":"Do Completion","text":"

You can do a completion by clicking the Submit button.

"},{"location":"user-guide/playground/chat/#view-code","title":"View Code","text":"

Once you've finished experimenting with the prompts and parameters, you can click the View Code button to see how to call the API with the same input in code. Code examples in curl, Python, and Node.js are provided.

"},{"location":"user-guide/playground/chat/#compare-playground","title":"Compare Playground","text":"

You can compare multiple models in the playground. The following is an example screenshot:

"},{"location":"user-guide/playground/chat/#comparision-mode","title":"Comparision Mode","text":"

You can choose the number of models to compare by clicking the comparison view buttons, including 2, 3, 4 and 6-model comparison.

"},{"location":"user-guide/playground/chat/#prompts_1","title":"Prompts","text":"

You can adjust the prompt messages similar to the chat playground.

"},{"location":"user-guide/playground/chat/#upload-image_1","title":"Upload Image","text":"

You can add images to the prompt by clicking the Upload Image button.

"},{"location":"user-guide/playground/chat/#clear-prompts_1","title":"Clear Prompts","text":"

Click the Clear button to clear all the prompts.

"},{"location":"user-guide/playground/chat/#select-model_1","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-left corner of each model panel.

"},{"location":"user-guide/playground/chat/#customize-parameters_1","title":"Customize Parameters","text":"

You can customize completion parameters by clicking the settings button of each model.

"},{"location":"user-guide/playground/embedding/","title":"Embedding Playground","text":"

The Embedding Playground lets you test the model\u2019s ability to convert text into embeddings. It allows you to experiment with multiple text inputs, visualize embeddings, and review code examples for API integration.

"},{"location":"user-guide/playground/embedding/#add-text","title":"Add Text","text":"

Add at least two text entries and click the Submit button to generate embeddings.

"},{"location":"user-guide/playground/embedding/#batch-input-text","title":"Batch Input Text","text":"

Enable Batch Input Mode to automatically split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets in a single operation.

"},{"location":"user-guide/playground/embedding/#visualization","title":"Visualization","text":"

Visualize the embedding results using PCA (Principal Component Analysis) to reduce dimensions and display them on a 2D plot. Results can be viewed in two formats:

  1. Chart - Display PCA results visually.
  2. JSON - View raw embeddings in JSON format.

In the chart, the distance between points represents the similarity between corresponding texts. Closer points indicate higher similarity.
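
Outside the UI, the same kind of 2D view can be reproduced with a few lines of Python. The sketch below is an assumption-laden example: it uses scikit-learn and numpy, calls the OpenAI-compatible embeddings endpoint, and uses placeholder texts, model name, and server URL.

import numpy as np\nfrom openai import OpenAI\nfrom sklearn.decomposition import PCA\n\nclient = OpenAI(base_url=\"http://myserver/v1-openai\", api_key=\"myapikey\")\n\ntexts = [\n    \"The cat sits on the mat.\",\n    \"A kitten rests on a rug.\",\n    \"Quarterly revenue grew by twelve percent.\",\n]\nresult = client.embeddings.create(model=\"my-embedding-model\", input=texts)\nvectors = np.array([item.embedding for item in result.data])\n\n# Reduce to two dimensions with PCA, as the playground chart does, and print coordinates.\npoints = PCA(n_components=2).fit_transform(vectors)\nfor text, (x, y) in zip(texts, points):\n    print(f\"({x:.3f}, {y:.3f})  {text}\")\n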

"},{"location":"user-guide/playground/embedding/#clear","title":"Clear","text":"

Click the Clear button to reset text entries and clear the output.

"},{"location":"user-guide/playground/embedding/#select-model","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/embedding/#view-code","title":"View Code","text":"

After experimenting with the text inputs, click the View Code button to see how you can call the API with the same input. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/image/","title":"Image Playground","text":"

The Image Playground is a dedicated space for testing and experimenting with GPUStack\u2019s image generation APIs. It allows users to interactively explore the capabilities of different models, customize parameters, and review code examples for seamless API integration.

"},{"location":"user-guide/playground/image/#prompt","title":"Prompt","text":"

You can input or randomly generate a prompt, then click the Submit button to generate an image.

"},{"location":"user-guide/playground/image/#clear-prompt","title":"Clear Prompt","text":"

Click the Clear button to reset the prompt and remove the generated image.

"},{"location":"user-guide/playground/image/#select-model","title":"Select Model","text":"

You can select available models in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/image/#customize-parameters","title":"Customize Parameters","text":"

You can customize the image generation parameters by switching between two API styles:

  1. OpenAI-compatible mode.
  2. Advanced mode.

"},{"location":"user-guide/playground/image/#advanced-parameters","title":"Advanced Parameters","text":"Parameter Default Description Counts 1 Number of images to generate. Size 512x512 The size of the generated image in 'widthxheight' format. Sampler euler_a The sampler algorithm for image generation. Options include 'euler_a', 'euler', 'heun', 'dpm2', 'dpm++2s_a', 'dpm++2m', 'dpm++2mv2', 'ipndm', 'ipndm_v', and 'lcm'. Schedule discrete The noise scheduling method. Sampler Steps 10 The number of sampling steps to perform. Higher values may improve image quality at the cost of longer processing time. CFG Scale 4.5 The scale for classifier-free guidance. A higher value increases adherence to the prompt. Negative Prompt (empty) A negative prompt to specify what the image should avoid. Seed (empty) Random seed.

Note

The maximum image size is restricted by the model's deployment settings. See the diagram below:

"},{"location":"user-guide/playground/image/#view-code","title":"View Code","text":"

After experimenting with prompts and parameters, click the View Code button to see how to call the API with the same inputs. Code examples are provided in curl, Python, and Node.js.

"},{"location":"user-guide/playground/rerank/","title":"Rerank Playground","text":"

The Rerank Playground allows you to test reranker models that reorder multiple texts based on their relevance to a query. Experiment with various input texts, customize parameters, and review code examples for API integration.

"},{"location":"user-guide/playground/rerank/#add-text","title":"Add Text","text":"

Add multiple text entries to the document for reranking.

"},{"location":"user-guide/playground/rerank/#bach-input-text","title":"Bach Input Text","text":"

Enable Batch Input Mode to split multi-line text into separate entries based on line breaks. This is useful for processing multiple text snippets efficiently.

"},{"location":"user-guide/playground/rerank/#clear","title":"Clear","text":"

Click the Clear button to reset the document and query results.

"},{"location":"user-guide/playground/rerank/#query","title":"Query","text":"

Input a query and click the Submit button to get a ranked list of texts based on their relevance to the query.

"},{"location":"user-guide/playground/rerank/#select-model","title":"Select Model","text":"

Select an available reranker model in GPUStack by clicking the model dropdown at the top-right corner of the playground UI.

"},{"location":"user-guide/playground/rerank/#customize-parameters","title":"Customize Parameters","text":"

In the parameter section, set Top N to specify the number of matching texts to retrieve.

"},{"location":"user-guide/playground/rerank/#view-code","title":"View Code","text":"

After experimenting with the input text and query, click the View Code button to see how to call the API with the same input. Code examples are provided in curl, Python, and Node.js.

"}]} \ No newline at end of file diff --git a/0.4/sitemap.xml b/0.4/sitemap.xml index c5feca6..f37dec3 100644 --- a/0.4/sitemap.xml +++ b/0.4/sitemap.xml @@ -2,232 +2,232 @@ https://docs.gpustack.ai/0.4/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/api-reference/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/architecture/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/code-of-conduct/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/contributing/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/development/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/overview/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/quickstart/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/scheduler/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/troubleshooting/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/upgrade/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/cli-reference/chat/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/cli-reference/download-tools/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/cli-reference/draw/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/cli-reference/start/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/installation/air-gapped-installation/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/installation/docker-installation/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/installation/installation-script/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/installation/manual-installation/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/installation/uninstallation/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/creating-text-embeddings/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/inference-on-cpus/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/inference-with-function-calling/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/performing-distributed-inference-across-workers/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/running-inference-with-ascend-npus/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/running-inference-with-moorethreads-gpus/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/running-on-copilot-plus-pcs-with-snapdragon-x/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/setting-up-a-multi-node-gpustack-cluster/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/using-audio-models/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/using-image-generation-models/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/using-reranker-models/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/tutorials/using-vision-language-models/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/api-key-management/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/image-generation-apis/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/inference-backends/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/model-management/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/openai-compatible-apis/ - 2024-12-05 + 2024-12-06 daily 
https://docs.gpustack.ai/0.4/user-guide/pinned-backend-versions/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/rerank-api/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/user-management/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/playground/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/playground/audio/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/playground/chat/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/playground/embedding/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/playground/image/ - 2024-12-05 + 2024-12-06 daily https://docs.gpustack.ai/0.4/user-guide/playground/rerank/ - 2024-12-05 + 2024-12-06 daily \ No newline at end of file diff --git a/0.4/sitemap.xml.gz b/0.4/sitemap.xml.gz index 06536584e3b16dd85354c8fa407a66fd99fc97b9..e97a99d1c58c2a43e17b5bd19b02d70459844cb7 100644 GIT binary patch delta 30 mcmX@WdVrN(zMF%C;auQEc6r9l8`UQ=arC7H_M|E>FaQ9Gs0g0` delta 30 lcmX@WdVrN(zMF%ip(bD=yF8=yM)iqI9CnHoJ*f%|3;=|r2j&0(