Device mapping

In mistral.rs, device mapping is managed automatically to be as performant and easy to use as possible. Automatic device mapping is enabled by default in the CLI/server and Python API, and it changes nothing when the model fits entirely on the GPU.

Automatic device mapping works by loading as much of the model as possible into GPU memory; any remaining parts are loaded into CPU memory. Model architectures such as vision models, which greatly benefit from GPU acceleration, also automatically prioritize keeping those components on the GPU.
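The GPU-first behavior described above can be pictured as a greedy fill. The following is only a minimal sketch of that idea, not the algorithm mistral.rs actually uses: the `Device` class, the `map_layers` function, the device names, and the memory figures are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str        # hypothetical label, e.g. "cuda:0" or "cpu"
    free_bytes: int  # assumed memory budget available for model layers


def map_layers(layer_sizes, devices):
    """Assign layers in order, filling each device before moving to the next.

    `devices` should list GPUs first and the CPU last, so that layers only
    spill to the CPU once GPU memory is exhausted.
    """
    remaining = [d.free_bytes for d in devices]
    current = 0
    mapping = []
    for size in layer_sizes:
        # Move on to the next device once the current one cannot hold the layer.
        while current < len(devices) and remaining[current] < size:
            current += 1
        if current == len(devices):
            raise MemoryError("model does not fit on the available devices")
        remaining[current] -= size
        mapping.append(devices[current].name)
    return mapping


GIB = 1024**3
devices = [Device("cuda:0", 2 * GIB), Device("cpu", 16 * GIB)]

# Six layers of 0.5 GiB each: four fit on the GPU, two spill to the CPU.
mapping = map_layers([GIB // 2] * 6, devices)
print(mapping)
```

If every layer fits within the first device's budget, the sketch places the whole model there, which mirrors the statement that nothing changes when the model fits entirely on the GPU.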

To control the mapping across devices, you can set the following maximums that the model should expect in a prompt:

  • maximum sequence length
  • maximum batch size
  • (vision models) maximum image length (length refers to the edge length)
  • (vision models) maximum number of images

These parameters are not hard limits at runtime; they only control the mapping.

Note

The maximum sequence length is also used to ensure that a KV cache will fit, both with and without PagedAttention.
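To see why the maximum sequence length and batch size matter for memory planning, here is a rough estimate of a contiguous KV cache using the standard formula (one key tensor and one value tensor per layer). This is a sketch under stated assumptions, not how mistral.rs computes its budget: the function name `kv_cache_bytes` is hypothetical, the model dimensions are illustrative values for an 8B Llama-style model, and PagedAttention allocates the cache in blocks, so real usage will differ.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, max_batch_size, bytes_per_elem=2):
    """Estimate the size of a contiguous KV cache in bytes.

    bytes_per_elem=2 assumes a 16-bit cache dtype (e.g. f16 or bf16).
    """
    # Factor 2: each layer stores both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * max_batch_size * bytes_per_elem)


# Illustrative (assumed) dimensions: 32 layers, 8 KV heads, head dimension 128.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      max_seq_len=4096, max_batch_size=1)
print(size // 1024**2, "MiB")  # 512 MiB
```

The estimate grows linearly in both the maximum sequence length and the maximum batch size, so doubling either one doubles the memory that must be reserved alongside the model weights.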

Examples


If you want to manually device map the model (not recommended), please continue reading.

Note

Manual device mapping is deprecated in favor of automatic device mapping because manual mapping is prone to user error.

Manual device mapping

There are two ways to do manual device mapping:

  1. Specify the number of layers to put on the GPU - this uses the GPU with ordinal 0.
  2. Specify the ordinals and number of layers - this allows for cross-GPU device mapping.

The format for the ordinals and number of layers is ORD:NUM;... where ORD is the unique ordinal of a GPU and NUM is the number of layers to place on that GPU. The ORD:NUM pair may be repeated, separated by semicolons, as many times as necessary.
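As an illustration of the ORD:NUM;... format, the sketch below splits such a string into (ordinal, layer count) pairs. It is not the parser used by mistral.rs: the function name `parse_device_layers` is hypothetical, and the strict handling of malformed or duplicate entries is an assumption of this example.

```python
def parse_device_layers(spec):
    """Parse "ORD:NUM;ORD:NUM;..." into a list of (ordinal, num_layers) pairs."""
    pairs = []
    seen = set()
    for item in spec.split(";"):
        ord_text, sep, num_text = item.partition(":")
        if not sep:
            raise ValueError(f"expected ORD:NUM, got {item!r}")
        # int() raises ValueError if either part is not a whole number.
        ordinal, num_layers = int(ord_text), int(num_text)
        # Each ordinal is described as unique, so reject repeats (assumed rule).
        if ordinal in seen:
            raise ValueError(f"ordinal {ordinal} is listed more than once")
        seen.add(ordinal)
        pairs.append((ordinal, num_layers))
    return pairs


pairs = parse_device_layers("0:16;1:16")
print(pairs)                      # [(0, 16), (1, 16)]
print(sum(n for _, n in pairs))   # 32 device layers in total
```

Here "0:16;1:16" places 16 layers on the GPU with ordinal 0 and 16 layers on the GPU with ordinal 1.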

Note: We refer to GPU layers as "device layers" throughout mistral.rs.

Example of specifying ordinals

cargo run --release --features cuda -- -n "0:16;1:16" -i plain -m gradientai/Llama-3-8B-Instruct-262k -a llama

Note: In the Python API, the "0:16;1:16" string is passed as the list ["0:16", "1:16"].

Example of specifying the number of GPU layers

cargo run --release --features cuda -- -n 16 -i plain -m gradientai/Llama-3-8B-Instruct-262k -a llama