Arm AArch64: Documentation updates (ggerganov#9321)
* Arm AArch64: Documentation updates

* Update docs/build.md to include information on how to enable the Arm optimized gemm/gemv kernels

* Update examples/quantize/README.md with information on the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats

* Add newline to the end of docs/build.md
eddnjjn authored and arthw committed Nov 15, 2024
1 parent 2835fa8 commit 7eb355f
Showing 2 changed files with 8 additions and 0 deletions.
6 changes: 6 additions & 0 deletions docs/build.md
@@ -380,3 +380,9 @@ For detailed info, such as model/device supports, CANN install, please refer to
### Android
To read documentation for how to build on Android, [click here](./android.md)
### Arm CPU optimized mulmat kernels
Llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler CPU-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2-a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.
To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
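As a minimal sketch of such a build (assuming a standard CMake build of llama.cpp; adjust the `-march` string to your target CPU, and note that mirroring the flags into `CMAKE_CXX_FLAGS` is an assumption, not stated above):

```bash
# Sketch: build with the Arm int8mm/SVE mulmat kernels enabled and
# llamafile disabled so the Q4_0_4_4 path is usable.
# CMAKE_CXX_FLAGS mirrors the C flags; this is an assumption.
cmake -B build \
    -DGGML_LLAMAFILE=OFF \
    -DCMAKE_C_FLAGS="-march=armv8.2-a+i8mm+sve" \
    -DCMAKE_CXX_FLAGS="-march=armv8.2-a+i8mm+sve"
cmake --build build --config Release
```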
2 changes: 2 additions & 0 deletions examples/quantize/README.md
@@ -54,6 +54,8 @@ As the models are currently fully loaded into memory, you will need adequate disk

Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

The quantization formats `Q4_0_4_4`, `Q4_0_4_8` and `Q4_0_8_8` are block interleaved variants of the `Q4_0` format, providing a data layout that is better suited for specific implementations of optimized mulmat kernels. Since these formats differ only in data layout, they have the same quantized size as the `Q4_0` format.
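As an illustrative example (the `llama-quantize` binary name and model paths below are assumptions for this sketch, not taken from this diff), requantizing an `F16` model into the `Q4_0_4_8` layout might look like:

```bash
# Assumed invocation: produce a block-interleaved Q4_0_4_8 model;
# the output has the same quantized size as a plain Q4_0 model.
./llama-quantize ./models/mymodel/ggml-model-f16.gguf \
    ./models/mymodel/ggml-model-Q4_0_4_8.gguf Q4_0_4_8
```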

*(outdated)*

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
