GGUF Parser

tl;dr, Review/Check GGUF files and estimate the memory usage.


GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
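To make the binary layout concrete, here is a minimal sketch of parsing the fixed-size GGUF header (magic bytes, version, tensor count, metadata key-value count, all little-endian per the GGUF spec); the synthetic counts in the example are illustrative, not taken from a real file:

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    # Fixed GGUF header: 4-byte magic "GGUF", uint32 version,
    # uint64 tensor count, uint64 metadata key-value count (little-endian).
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# A synthetic header: version 3, 291 tensors, 24 metadata key-value pairs.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(header))
```

Everything the parser reports below (architecture, tokenizer, quantization) lives in the metadata key-value section that follows this header.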

GGUF Parser helps in reviewing and estimating the usage of a GGUF format model without downloading it.

Key Features

  • No File Required: GGUF Parser reads the metadata of a remote GGUF file in chunks, so you don't need to download and load the entire file.
  • Accurate Prediction: GGUF Parser's estimates usually deviate from actual usage by only about 100 MiB.
  • Fast: GGUF Parser is written in Go, which is fast and efficient.
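The chunked reading above boils down to standard HTTP Range requests: only the header and metadata bytes are fetched, never the tensor data. A rough sketch of the technique (not the parser's actual Go internals):

```python
import urllib.request

def range_header(start: int, length: int) -> dict:
    # HTTP Range uses inclusive byte bounds: bytes=start-end.
    return {"Range": f"bytes={start}-{start + length - 1}"}

def fetch_range(url: str, start: int, length: int) -> bytes:
    # Fetch only `length` bytes of a remote file starting at `start`.
    req = urllib.request.Request(url, headers=range_header(start, length))
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# e.g. only the 24-byte fixed GGUF header of a remote model:
# header = fetch_range("https://example.com/model.gguf", 0, 24)
print(range_header(0, 24))  # {'Range': 'bytes=0-23'}
```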

Agenda

Notes

  • Since v0.7.2, GGUF Parser supports retrieving a model's metadata via a split file, whose name carries a suffix like -00001-of-00009.gguf.
  • The table column UMA indicates memory usage on Apple macOS only.
  • Since v0.7.0, GGUF Parser supports estimating the usage of multiple GPUs.
    • The table column RAM means the system memory used when running LLaMA.Cpp or a LLaMA.Cpp-like application.
    • The table column VRAM 0 means the memory usage of the first visible GPU when serving the model.
    • For example, --tensor-split=1,1,1 splits the model into 3 parts of about 33% each, producing VRAM 0, VRAM 1 and VRAM 2 columns.
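The --tensor-split arithmetic is simple normalization — each GPU gets its ratio divided by the sum of all ratios:

```python
def split_fractions(ratios):
    # Normalize --tensor-split ratios into per-GPU fractions of the model.
    total = sum(ratios)
    return [r / total for r in ratios]

print(split_fractions([1, 1, 1]))  # three equal parts, ~33% each
print(split_fractions([3, 1]))    # 75% on GPU 0, 25% on GPU 1
```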

Installation

Install from releases, or run go install github.com/gpustack/gguf-parser-go/cmd/gguf-parser@latest.

Overview

Parse

Parse Local File

$ gguf-parser --path="~/.cache/lm-studio/models/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf"
+-----------------------------------------------------------------------------------+
| MODEL                                                                             |
+-------+-------+----------------+---------------+----------+------------+----------+
|  NAME |  ARCH |  QUANTIZATION  | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+-------+-------+----------------+---------------+----------+------------+----------+
| jeffq | llama | IQ3_XXS/Q5_K_M |      true     | 4.78 GiB |   7.24 B   | 5.67 bpw |
+-------+-------+----------------+---------------+----------+------------+----------+
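The BPW (bits per weight) column is simply the file size in bits divided by the parameter count, which you can sanity-check against the row above:

```python
def bits_per_weight(size_gib: float, params_billion: float) -> float:
    # Total bits in the file divided by total parameters.
    return size_gib * 2**30 * 8 / (params_billion * 1e9)

print(round(bits_per_weight(4.78, 7.24), 2))  # 5.67, matching the BPW column
```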

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      32768      |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |      32032     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama |  450.50 KiB |    32032   |        N/A       |     1     |   32000   |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+--------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |       VRAM 0       |
|       |              |                    |                 |           |                |                |                +------------+------------+--------+-----------+
|       |              |                    |                 |           |                |                |                |     UMA    |   NONUMA   |   UMA  |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Supported |       No       |   33 (32 + 1)  |       Yes      | 176.25 MiB | 326.25 MiB |  4 GiB | 11.16 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
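With full offload and mmap support, the UMA VRAM figure above appears to track the KV cache at full context. As a back-of-envelope check — assuming an f16 cache (2 bytes per element) and that the EMBEDDING GQA column divides the embedding length down to the per-token KV dimension, both of which are inferences from these tables rather than documented behavior:

```python
def kv_cache_gib(layers: int, ctx: int, embedding_len: int, gqa: int,
                 bytes_per_el: int = 2) -> float:
    # K and V each store ctx * (embedding_len / gqa) f16 elements per layer.
    kv_dim = embedding_len // gqa
    return 2 * layers * ctx * kv_dim * bytes_per_el / 2**30

print(kv_cache_gib(32, 32768, 4096, 4))   # 4.0  -> the 4 GiB UMA VRAM above
print(kv_cache_gib(80, 32768, 8192, 8))   # 10.0 -> the Qwen2-72B table below
```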

$ # Retrieve the model's metadata via a split file,
$ # which requires all split files to have been downloaded.
$ gguf-parser --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MODEL                                                                                                                                                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------+---------------+-----------+------------+----------+
|                                                                                       NAME                                                                                       |  ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE   | PARAMETERS |    BPW   |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------+---------------+-----------+------------+----------+
| 72b.5000B--cmix31-base100w-cpt32k_mega_v1_reflection_4_identity_2_if_ondare_beta0.09_lr_1e-6_bs128_epoch2-72B.qwen2B-bf16-mp8-pp4-lr-1e-6-minlr-1e-9-bs-128-seqlen-4096-step1350 | qwen2 |  IQ1_S/Q6_K  |      true     | 59.92 GiB |   72.71 B  | 7.08 bpw |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------+---------------+-----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      32768      |      8192     |       8       |       true       |         64         |   80   |       29568      |      0     |     152064     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |   2.47 MiB  |   152064   |        N/A       |   151643  |   151645  |    N/A    |    N/A    |      N/A      |       N/A       |     151643    |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+--------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |       VRAM 0       |
|       |              |                    |                 |           |                |                |                +------------+------------+--------+-----------+
|       |              |                    |                 |           |                |                |                |     UMA    |   NONUMA   |   UMA  |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
| qwen2 |     32768    |     2048 / 512     |     Disabled    | Supported |       No       |   81 (80 + 1)  |       Yes      | 307.38 MiB | 457.38 MiB | 10 GiB | 73.47 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+

Parse Remote File

$ gguf-parser --url="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/resolve/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf"
+----------------------------------------------------------------------------------+
| MODEL                                                                            |
+----------+-------+--------------+---------------+--------+------------+----------+
|   NAME   |  ARCH | QUANTIZATION | LITTLE ENDIAN |  SIZE  | PARAMETERS |    BPW   |
+----------+-------+--------------+---------------+--------+------------+----------+
| emozilla | llama |  Q4_K/Q3_K_M |      true     | 21 GiB |   46.70 B  | 3.86 bpw |
+----------+-------+--------------+---------------+--------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      32768      |      4096     |       4       |       true       |         32         |   32   |       14336      |      8     |      32002     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama |  449.91 KiB |    32002   |        N/A       |     1     |   32000   |    N/A    |    N/A    |       0       |       N/A       |       2       |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                         |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |         VRAM 0        |
|       |              |                    |                 |               |                |                |                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |                |                |     UMA    |   NONUMA   |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Not Supported |       No       |   33 (32 + 1)  |       Yes      | 174.54 MiB | 324.54 MiB | 24.94 GiB | 27.41 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+

$ # Retrieve the model's metadata via split file

$ gguf-parser --url="https://huggingface.co/MaziyarPanahi/Meta-Llama-3.1-405B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-405B-Instruct.Q2_K.gguf-00001-of-00009.gguf"
+----------------------------------------------------------------------------------------------------------------------------+
| MODEL                                                                                                                      |
+------------------------------------------------+-------+--------------+---------------+------------+------------+----------+
|                      NAME                      |  ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE    | PARAMETERS |    BPW   |
+------------------------------------------------+-------+--------------+---------------+------------+------------+----------+
| Models Meta Llama Meta Llama 3.1 405B Instruct | llama |     Q2_K     |      true     | 140.81 GiB |  410.08 B  | 2.95 bpw |
+------------------------------------------------+-------+--------------+---------------+------------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |     16384     |       8       |       true       |         128        |   126  |       53248      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                    |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |        VRAM 0        |
|       |              |                    |                 |           |                |                |                +------------+------------+---------+------------+
|       |              |                    |                 |           |                |                |                |     UMA    |   NONUMA   |   UMA   |   NONUMA   |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+
| llama |    131072    |     2048 / 512     |     Disabled    | Supported |       No       |  127 (126 + 1) |       Yes      | 684.53 MiB | 834.53 MiB | 126 GiB | 299.79 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+

Parse From HuggingFace

$ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-model-Q5_K_M.gguf" --hf-mmproj-file="mmproj-model-f16.gguf"
+-----------------------------------------------------------------------------------+
| MODEL                                                                             |
+-------+-------+----------------+---------------+----------+------------+----------+
|  NAME |  ARCH |  QUANTIZATION  | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+-------+-------+----------------+---------------+----------+------------+----------+
| model | llama | IQ3_XXS/Q5_K_M |      true     | 5.33 GiB |   8.03 B   | 5.70 bpw |
+-------+-------+----------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|       8192      |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128001  |    N/A    |    N/A    |     128002    |       N/A       |       0       |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                 |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+-------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |       VRAM 0      |
|       |              |                    |                 |           |                |                |                +------------+------------+--------+----------+
|       |              |                    |                 |           |                |                |                |     UMA    |   NONUMA   |   UMA  |  NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+----------+
| llama |     8192     |     2048 / 512     |     Disabled    | Supported |       No       |   33 (32 + 1)  |       Yes      | 184.85 MiB | 334.85 MiB |  1 GiB | 7.78 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+----------+

$ # Retrieve the model's metadata via split file

$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf"
+---------------------------------------------------------------------------------------------------------+
| MODEL                                                                                                   |
+------------------------------+-------+--------------+---------------+-----------+------------+----------+
|             NAME             |  ARCH | QUANTIZATION | LITTLE ENDIAN |    SIZE   | PARAMETERS |    BPW   |
+------------------------------+-------+--------------+---------------+-----------+------------+----------+
| Meta-Llama-3.1-405B-Instruct | llama |     IQ1_M    |      true     | 88.61 GiB |  410.08 B  | 1.86 bpw |
+------------------------------+-------+--------------+---------------+-----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |     16384     |       8       |       true       |         128        |   126  |       53248      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                    |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |        VRAM 0        |
|       |              |                    |                 |           |                |                |                +------------+------------+---------+------------+
|       |              |                    |                 |           |                |                |                |     UMA    |   NONUMA   |   UMA   |   NONUMA   |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+
| llama |    131072    |     2048 / 512     |     Disabled    | Supported |       No       |  127 (126 + 1) |       Yes      | 684.53 MiB | 834.53 MiB | 126 GiB | 247.59 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+

Parse From ModelScope

$ gguf-parser --ms-repo="shaowenchen/chinese-alpaca-2-13b-16k-gguf" --ms-file="chinese-alpaca-2-13b-16k.Q5_K.gguf"
+----------------------------------------------------------------------------------+
| MODEL                                                                            |
+------+-------+----------------+---------------+----------+------------+----------+
| NAME |  ARCH |  QUANTIZATION  | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+------+-------+----------------+---------------+----------+------------+----------+
|  ..  | llama | IQ3_XXS/Q5_K_M |      true     | 8.76 GiB |   13.25 B  | 5.68 bpw |
+------+-------+----------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      16384      |      5120     |       1       |       true       |         N/A        |   40   |       13824      |      0     |      55296     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama |  769.83 KiB |    55296   |        N/A       |     1     |     2     |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                    |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------------------+-----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM          |         VRAM 0        |
|       |              |                    |                 |           |                |                |                +-----------+------------+-----------+-----------+
|       |              |                    |                 |           |                |                |                |    UMA    |   NONUMA   |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-----------+------------+-----------+-----------+
| llama |     16384    |     2048 / 512     |     Disabled    | Supported |       No       |   41 (40 + 1)  |       Yes      | 60.95 MiB | 210.95 MiB | 12.50 GiB | 22.74 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-----------+------------+-----------+-----------+

Parse From Ollama Library

$ gguf-parser --ol-model="llama3.1"
+------------------------------------------------------------------------------------------------------+
| MODEL                                                                                                |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
|            NAME            |  ARCH | QUANTIZATION | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
| Meta Llama 3.1 8B Instruct | llama |     Q4_0     |      true     | 4.33 GiB |   8.03 B   | 4.64 bpw |
+----------------------------+-------+--------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+--------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |       VRAM 0       |
|       |              |                    |                 |           |                |                |                +------------+------------+--------+-----------+
|       |              |                    |                 |           |                |                |                |     UMA    |   NONUMA   |   UMA  |   NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
| llama |    131072    |     2048 / 512     |     Disabled    | Supported |       No       |   33 (32 + 1)  |       Yes      | 411.62 MiB | 561.62 MiB | 16 GiB | 29.08 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+

$ # An Ollama model ships with preset parameters and other artifacts, like multimodal projectors or LoRA adapters.
$ # Use the `--ol-usage` option to estimate the memory usage as Ollama would actually run the model.

$ gguf-parser --ol-model="llama3.1" --ol-usage
+------------------------------------------------------------------------------------------------------+
| MODEL                                                                                                |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
|            NAME            |  ARCH | QUANTIZATION | LITTLE ENDIAN |   SIZE   | PARAMETERS |    BPW   |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
| Meta Llama 3.1 8B Instruct | llama |     Q4_0     |      true     | 4.33 GiB |   8.03 B   | 4.64 bpw |
+----------------------------+-------+--------------+---------------+----------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      131072     |      4096     |       4       |       true       |         32         |   32   |       14336      |      0     |     128256     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |    2 MiB    |   128256   |        N/A       |   128000  |   128009  |    N/A    |    N/A    |      N/A      |       N/A       |      N/A      |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                     |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+-----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |         VRAM 0        |
|       |              |                    |                 |           |                |                |                +------------+------------+------------+----------+
|       |              |                    |                 |           |                |                |                |     UMA    |   NONUMA   |     UMA    |  NONUMA  |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+------------+----------+
| llama |     2048     |     2048 / 512     |     Disabled    | Supported |       No       |   33 (32 + 1)  |       Yes      | 159.62 MiB | 309.62 MiB | 256.50 MiB | 4.82 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+------------+----------+
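
The gap between the two VRAM estimates above (16 GiB at the full 131072-token context versus ~256 MiB at Ollama's preset 2048) is dominated by the KV cache, which grows linearly with the context size. A back-of-the-envelope sketch, assuming an f16 cache and that the KV dimension is the embedding length divided by the EMBEDDING GQA value from the ARCHITECTURE table (the parser's exact accounting differs slightly, hence 256.50 MiB rather than 256 MiB):

```python
# Rough KV-cache size for the Llama 3.1 8B figures above, assuming an
# f16 (2-byte) cache; KV dim per layer = embedding len / GQA group size.
layers = 32
embedding_len = 4096
gqa = 4
bytes_per_elem = 2  # f16

def kv_cache_bytes(ctx):
    # K and V each store ctx * (embedding_len / gqa) elements per layer.
    return 2 * layers * ctx * (embedding_len // gqa) * bytes_per_elem

gib = 1024**3
print(kv_cache_bytes(131072) / gib)  # 16.0 GiB, matching the default estimate
print(kv_cache_bytes(2048) / gib)    # 0.25 GiB, close to the --ol-usage estimate
```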

Estimate

Full Layers Offload (default)

$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                         |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |         VRAM 0        |
|       |              |                    |                 |               |                |                |                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |                |                |     UMA    |   NONUMA   |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Not Supported |       No       |   33 (32 + 1)  |       Yes      | 174.54 MiB | 324.54 MiB | 24.94 GiB | 27.41 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+

Zero Layers Offload

$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers=0
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                   |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------------------+-------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |          RAM          |       VRAM 0      |
|       |              |                    |                 |               |                |                |                +-----------+-----------+--------+----------+
|       |              |                    |                 |               |                |                |                |    UMA    |   NONUMA  |   UMA  |  NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+--------+----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Not Supported |       No       |        0       |       No       | 25.09 GiB | 25.24 GiB |   0 B  | 2.39 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+--------+----------+

Specific Layers Offload

$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers=10
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                      |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------------------+----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |          RAM          |        VRAM 0        |
|       |              |                    |                 |               |                |                |                +-----------+-----------+----------+-----------+
|       |              |                    |                 |               |                |                |                |    UMA    |   NONUMA  |    UMA   |   NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Not Supported |       No       |       10       |       No       | 17.38 GiB | 17.52 GiB | 7.73 GiB | 10.19 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+

Specific Context Size

By default, the context size is retrieved from the model's metadata.

Use --ctx-size to specify the context size.

$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --ctx-size=4096
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                         |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |         VRAM 0        |
|       |              |                    |                 |               |                |                |                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |                |                |     UMA    |   NONUMA   |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama |     4096     |     2048 / 512     |     Disabled    | Not Supported |       No       |   33 (32 + 1)  |       Yes      | 118.54 MiB | 268.54 MiB | 21.44 GiB | 21.99 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
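
Comparing two runs at different context sizes also gives a quick empirical per-token memory cost. A sketch using the rounded NONUMA VRAM figures from the default (32768-token) estimate above and this 4096-token one:

```python
# Estimate the per-token VRAM cost from two context sizes, using the rounded
# NONUMA VRAM figures reported for this Mixtral model.
vram_32k_mib = 27.41 * 1024  # MiB at ctx 32768
vram_4k_mib = 21.99 * 1024   # MiB at ctx 4096

per_token_mib = (vram_32k_mib - vram_4k_mib) / (32768 - 4096)
print(f"~{per_token_mib:.2f} MiB per token of context")
```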

Enable Flash Attention

By default, LLaMA.cpp disables Flash Attention.

Enabling Flash Attention reduces the VRAM usage, but also increases the GPU/CPU load.

Use --flash-attention to enable Flash Attention.

Please note that not all models support Flash Attention; if the model does not support it, the "FLASH ATTENTION" column shows "Disabled" even if you enable it.

$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --flash-attention
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                         |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |         VRAM 0        |
|       |              |                    |                 |               |                |                |                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |                |                |     UMA    |   NONUMA   |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama |     32768    |     2048 / 512     |     Enabled     | Not Supported |       No       |   33 (32 + 1)  |       Yes      | 158.54 MiB | 308.54 MiB | 24.94 GiB | 25.43 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+

Disable MMap

By default, LLaMA.cpp loads the model via memory mapping (mmap).

On Apple macOS, memory mapping is an efficient way to load the model and results in lower VRAM usage. On other platforms, it only affects the first-time model loading speed.

Use --no-mmap to disable loading the model via memory mapping.

Please note that some models require loading the whole weights into memory; if the model does not support mmap, the "MMAP LOAD" column shows "Not Supported".

$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers=10 --no-mmap
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                      |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------------------+----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |          RAM          |        VRAM 0        |
|       |              |                    |                 |               |                |                |                +-----------+-----------+----------+-----------+
|       |              |                    |                 |               |                |                |                |    UMA    |   NONUMA  |    UMA   |   NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Not Supported |       No       |       10       |       No       | 17.38 GiB | 17.52 GiB | 7.73 GiB | 10.19 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+

Get Proper Offload Layers

Use --gpu-layers-step to find a proper number of offload layers when the model is too large to fit entirely into GPU memory.

$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers-step=5
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                         |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
|  ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |   MMAP LOAD   | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED |           RAM           |         VRAM 0        |
|       |              |                    |                 |               |                |                |                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |                |                |     UMA    |   NONUMA   |    UMA    |   NONUMA  |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama |     32768    |     2048 / 512     |     Disabled    | Not Supported |       No       |        0       |       No       |  25.09 GiB |  25.24 GiB |    0 B    |  2.39 GiB |
|       |              |                    |                 |               |                +----------------+                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |        5       |                |  21.24 GiB |  21.39 GiB |  3.86 GiB |  6.33 GiB |
|       |              |                    |                 |               |                +----------------+                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |       10       |                |  17.38 GiB |  17.52 GiB |  7.73 GiB | 10.19 GiB |
|       |              |                    |                 |               |                +----------------+                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |       15       |                |  13.51 GiB |  13.66 GiB | 11.59 GiB | 14.06 GiB |
|       |              |                    |                 |               |                +----------------+                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |       20       |                |  9.65 GiB  |  9.79 GiB  | 15.46 GiB | 17.92 GiB |
|       |              |                    |                 |               |                +----------------+                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |       25       |                |  5.78 GiB  |  5.93 GiB  | 19.32 GiB | 21.79 GiB |
|       |              |                    |                 |               |                +----------------+                +------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |       30       |                |  1.92 GiB  |  2.06 GiB  | 23.19 GiB | 25.65 GiB |
|       |              |                    |                 |               |                +----------------+----------------+------------+------------+-----------+-----------+
|       |              |                    |                 |               |                |   33 (32 + 1)  |       Yes      | 174.54 MiB | 324.54 MiB | 24.94 GiB | 27.41 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
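
The step table above is roughly linear in the offload layer count, so it can be interpolated to pick the largest --gpu-layers value that fits a given VRAM budget. A sketch using the rounded NONUMA VRAM column (the final row also adds the output layer, so treat this as an approximation and verify with an exact run):

```python
# Pick the largest --gpu-layers value whose estimated NONUMA VRAM fits a budget,
# interpolating linearly between the rows of the --gpu-layers-step=5 table above.
steps = [(0, 2.39), (5, 6.33), (10, 10.19), (15, 14.06),
         (20, 17.92), (25, 21.79), (30, 25.65), (33, 27.41)]

def max_layers(budget_gib):
    best = 0
    for (l0, v0), (l1, v1) in zip(steps, steps[1:]):
        for layer in range(l0, l1 + 1):
            vram = v0 + (v1 - v0) * (layer - l0) / (l1 - l0)
            if vram <= budget_gib:
                best = max(best, layer)
    return best

print(max_layers(24))  # -> 27: about 27 layers fit a 24 GiB card here
```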

License

MIT