tl;dr: review/check GGUF files and estimate their memory usage.
GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
GGUF Parser helps you review a GGUF format model and estimate its memory usage without downloading it.
- No File Required: GGUF Parser reads the metadata of a remote GGUF file in chunks, so you don't need to download and load the entire file (see the sketch just below this list).
- Accurate Prediction: the estimates produced by GGUF Parser usually deviate from actual usage by only about 100 MiB.
- Fast: GGUF Parser is written in Go, which is fast and efficient.
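GGUF's fixed-size header is what makes chunked reading cheap. Here is a minimal, self-contained Go sketch (illustrative, not the parser's actual code) that fetches only the first 24 bytes of a remote GGUF v2/v3 file with an HTTP Range request and decodes the header, assuming a little-endian file; the URL is a placeholder:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder URL; any server that honors Range requests works.
	const url = "https://example.com/model.gguf"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// Ask for the 24-byte GGUF v2/v3 header only, not the whole file.
	req.Header.Set("Range", "bytes=0-23")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	buf, err := io.ReadAll(resp.Body)
	if err != nil || len(buf) < 24 {
		panic("short read")
	}

	// Fixed-size header: magic, version, tensor count, metadata KV count.
	if string(buf[:4]) != "GGUF" {
		panic("not a GGUF file")
	}
	fmt.Println("version:", binary.LittleEndian.Uint32(buf[4:8]))
	fmt.Println("tensors:", binary.LittleEndian.Uint64(buf[8:16]))
	fmt.Println("metadata kv pairs:", binary.LittleEndian.Uint64(buf[16:24]))
}
```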
- Since v0.7.2, GGUF Parser supports retrieving the model's metadata via a split file, whose name carries a suffix like `-00001-of-00009.gguf`.
- Since v0.7.0, GGUF Parser supports estimating the usage of multiple GPUs. For example, `--tensor-split=1,1,1` splits the model into 3 parts of roughly 33% each, which results in `VRAM 0`, `VRAM 1`, and `VRAM 2` cells.
- The table result `UMA` indicates the memory usage on Apple macOS only.
- The table result `RAM` means the system memory usage when running LLaMA.cpp or a LLaMA.cpp-like application.
- The table result `VRAM 0` means the memory usage of the first visible GPU when serving the model.

Install from releases, or run `go install github.com/gpustack/gguf-parser-go/cmd/gguf-parser@latest`.
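GGUF Parser can also be used as a Go library. Below is a minimal sketch; `ParseGGUFFile` is the package's local-file entry point, but the accessor and field names are assumptions, so check the package documentation for the version you install.

```go
package main

import (
	"fmt"

	gguf "github.com/gpustack/gguf-parser-go"
)

func main() {
	// Parse a local GGUF file; only the metadata is read, not the tensor data.
	f, err := gguf.ParseGGUFFile("Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf")
	if err != nil {
		panic(err)
	}

	// Assumed accessor: the general metadata shown in the MODEL tables below.
	m := f.Metadata()
	fmt.Println(m.Name, m.Architecture, m.Parameters, m.BitsPerWeight)
}
```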
$ gguf-parser --path="~/.cache/lm-studio/models/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf"
+-----------------------------------------------------------------------------------+
| MODEL |
+-------+-------+----------------+---------------+----------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+-------+-------+----------------+---------------+----------+------------+----------+
| jeffq | llama | IQ3_XXS/Q5_K_M | true | 4.78 GiB | 7.24 B | 5.67 bpw |
+-------+-------+----------------+---------------+----------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 32768 | 4096 | 4 | true | 32 | 32 | 14336 | 0 | 32032 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama | 450.50 KiB | 32032 | N/A | 1 | 32000 | N/A | N/A | N/A | N/A | N/A |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+--------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+--------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
| llama | 32768 | 2048 / 512 | Disabled | Supported | No | 33 (32 + 1) | Yes | 176.25 MiB | 326.25 MiB | 4 GiB | 11.16 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
$ # Retrieve the model's metadata via a split file,
$ # which requires all split files to have been downloaded.
$ gguf-parser --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MODEL |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------+---------------+-----------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------+---------------+-----------+------------+----------+
| 72b.5000B--cmix31-base100w-cpt32k_mega_v1_reflection_4_identity_2_if_ondare_beta0.09_lr_1e-6_bs128_epoch2-72B.qwen2B-bf16-mp8-pp4-lr-1e-6-minlr-1e-9-bs-128-seqlen-4096-step1350 | qwen2 | IQ1_S/Q6_K | true | 59.92 GiB | 72.71 B | 7.08 bpw |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------+---------------+-----------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 32768 | 8192 | 8 | true | 64 | 80 | 29568 | 0 | 152064 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| gpt2 | 2.47 MiB | 152064 | N/A | 151643 | 151645 | N/A | N/A | N/A | N/A | 151643 |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+--------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+--------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
| qwen2 | 32768 | 2048 / 512 | Disabled | Supported | No | 81 (80 + 1) | Yes | 307.38 MiB | 457.38 MiB | 10 GiB | 73.47 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
$ gguf-parser --url="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/resolve/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf"
+----------------------------------------------------------------------------------+
| MODEL |
+----------+-------+--------------+---------------+--------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+----------+-------+--------------+---------------+--------+------------+----------+
| emozilla | llama | Q4_K/Q3_K_M | true | 21 GiB | 46.70 B | 3.86 bpw |
+----------+-------+--------------+---------------+--------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 32768 | 4096 | 4 | true | 32 | 32 | 14336 | 8 | 32002 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama | 449.91 KiB | 32002 | N/A | 1 | 32000 | N/A | N/A | 0 | N/A | 2 |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+-----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama | 32768 | 2048 / 512 | Disabled | Not Supported | No | 33 (32 + 1) | Yes | 174.54 MiB | 324.54 MiB | 24.94 GiB | 27.41 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
$ # Retrieve the model's metadata via a split file
$ gguf-parser --url="https://huggingface.co/MaziyarPanahi/Meta-Llama-3.1-405B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-405B-Instruct.Q2_K.gguf-00001-of-00009.gguf"
+----------------------------------------------------------------------------------------------------------------------------+
| MODEL |
+------------------------------------------------+-------+--------------+---------------+------------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+------------------------------------------------+-------+--------------+---------------+------------+------------+----------+
| Models Meta Llama Meta Llama 3.1 405B Instruct | llama | Q2_K | true | 140.81 GiB | 410.08 B | 2.95 bpw |
+------------------------------------------------+-------+--------------+---------------+------------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 131072 | 16384 | 8 | true | 128 | 126 | 53248 | 0 | 128256 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| gpt2 | 2 MiB | 128256 | N/A | 128000 | 128009 | N/A | N/A | N/A | N/A | N/A |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+---------+------------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+
| llama | 131072 | 2048 / 512 | Disabled | Supported | No | 127 (126 + 1) | Yes | 684.53 MiB | 834.53 MiB | 126 GiB | 299.79 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+
$ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-model-Q5_K_M.gguf" --hf-mmproj-file="mmproj-model-f16.gguf"
+-----------------------------------------------------------------------------------+
| MODEL |
+-------+-------+----------------+---------------+----------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+-------+-------+----------------+---------------+----------+------------+----------+
| model | llama | IQ3_XXS/Q5_K_M | true | 5.33 GiB | 8.03 B | 5.70 bpw |
+-------+-------+----------------+---------------+----------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 8192 | 4096 | 4 | true | 32 | 32 | 14336 | 0 | 128256 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| gpt2 | 2 MiB | 128256 | N/A | 128000 | 128001 | N/A | N/A | 128002 | N/A | 0 |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+-------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+--------+----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+----------+
| llama | 8192 | 2048 / 512 | Disabled | Supported | No | 33 (32 + 1) | Yes | 184.85 MiB | 334.85 MiB | 1 GiB | 7.78 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+----------+
$ # Retrieve the model's metadata via a split file
$ gguf-parser --hf-repo="etemiz/Llama-3.1-405B-Inst-GGUF" --hf-file="llama-3.1-405b-IQ1_M-00019-of-00019.gguf"
+---------------------------------------------------------------------------------------------------------+
| MODEL |
+------------------------------+-------+--------------+---------------+-----------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+------------------------------+-------+--------------+---------------+-----------+------------+----------+
| Meta-Llama-3.1-405B-Instruct | llama | IQ1_M | true | 88.61 GiB | 410.08 B | 1.86 bpw |
+------------------------------+-------+--------------+---------------+-----------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 131072 | 16384 | 8 | true | 128 | 126 | 53248 | 0 | 128256 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| gpt2 | 2 MiB | 128256 | N/A | 128000 | 128009 | N/A | N/A | N/A | N/A | N/A |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+---------+------------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+
| llama | 131072 | 2048 / 512 | Disabled | Supported | No | 127 (126 + 1) | Yes | 684.53 MiB | 834.53 MiB | 126 GiB | 247.59 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+---------+------------+
$ gguf-parser --ms-repo="shaowenchen/chinese-alpaca-2-13b-16k-gguf" --ms-file="chinese-alpaca-2-13b-16k.Q5_K.gguf"
+----------------------------------------------------------------------------------+
| MODEL |
+------+-------+----------------+---------------+----------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+------+-------+----------------+---------------+----------+------------+----------+
| .. | llama | IQ3_XXS/Q5_K_M | true | 8.76 GiB | 13.25 B | 5.68 bpw |
+------+-------+----------------+---------------+----------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 16384 | 5120 | 1 | true | N/A | 40 | 13824 | 0 | 55296 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| llama | 769.83 KiB | 55296 | N/A | 1 | 2 | N/A | N/A | N/A | N/A | N/A |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------------------+-----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +-----------+------------+-----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-----------+------------+-----------+-----------+
| llama | 16384 | 2048 / 512 | Disabled | Supported | No | 41 (40 + 1) | Yes | 60.95 MiB | 210.95 MiB | 12.50 GiB | 22.74 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-----------+------------+-----------+-----------+
$ gguf-parser --ol-model="llama3.1"
+------------------------------------------------------------------------------------------------------+
| MODEL |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
| Meta Llama 3.1 8B Instruct | llama | Q4_0 | true | 4.33 GiB | 8.03 B | 4.64 bpw |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 131072 | 4096 | 4 | true | 32 | 32 | 14336 | 0 | 128256 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| gpt2 | 2 MiB | 128256 | N/A | 128000 | 128009 | N/A | N/A | N/A | N/A | N/A |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+--------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+--------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
| llama | 131072 | 2048 / 512 | Disabled | Supported | No | 33 (32 + 1) | Yes | 411.62 MiB | 561.62 MiB | 16 GiB | 29.08 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+--------+-----------+
$ # An Ollama model includes preset params and other artifacts, like multimodal projectors or LoRA adapters.
$ # Use the `--ol-usage` option to estimate the usage as Ollama would actually run it.
$ gguf-parser --ol-model="llama3.1" --ol-usage
+------------------------------------------------------------------------------------------------------+
| MODEL |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
| NAME | ARCH | QUANTIZATION | LITTLE ENDIAN | SIZE | PARAMETERS | BPW |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
| Meta Llama 3.1 8B Instruct | llama | Q4_0 | true | 4.33 GiB | 8.03 B | 4.64 bpw |
+----------------------------+-------+--------------+---------------+----------+------------+----------+
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| 131072 | 4096 | 4 | true | 32 | 32 | 14336 | 0 | 128256 |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| gpt2 | 2 MiB | 128256 | N/A | 128000 | 128009 | N/A | N/A | N/A | N/A | N/A |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+-------------------------+-----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+------------+----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+------------+----------+
| llama | 2048 | 2048 / 512 | Disabled | Supported | No | 33 (32 + 1) | Yes | 159.62 MiB | 309.62 MiB | 256.50 MiB | 4.82 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+----------------+----------------+------------+------------+------------+----------+
$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+-----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama | 32768 | 2048 / 512 | Disabled | Not Supported | No | 33 (32 + 1) | Yes | 174.54 MiB | 324.54 MiB | 24.94 GiB | 27.41 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers=0
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------------------+-------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +-----------+-----------+--------+----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+--------+----------+
| llama | 32768 | 2048 / 512 | Disabled | Not Supported | No | 0 | No | 25.09 GiB | 25.24 GiB | 0 B | 2.39 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+--------+----------+
$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers=10
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------------------+----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +-----------+-----------+----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+
| llama | 32768 | 2048 / 512 | Disabled | Not Supported | No | 10 | No | 17.38 GiB | 17.52 GiB | 7.73 GiB | 10.19 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+
By default, the context size is retrieved from the model's metadata.
Use `--ctx-size` to specify a different context size.
$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --ctx-size=4096
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+-----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama | 4096 | 2048 / 512 | Disabled | Not Supported | No | 33 (32 + 1) | Yes | 118.54 MiB | 268.54 MiB | 21.44 GiB | 21.99 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
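The context size matters because the KV cache scales linearly with it. Here is a back-of-envelope check in Go, assuming an f16 cache (2 bytes per element) and a per-token KV width of EMBEDDING LEN divided by EMBEDDING GQA; it reproduces the 3.5 GiB UMA VRAM gap between the default 32768 context and `--ctx-size=4096` above:

```go
package main

import "fmt"

func main() {
	const (
		layers   = 32   // LAYERS from the ARCHITECTURE table
		embd     = 4096 // EMBEDDING LEN
		gqa      = 4    // EMBEDDING GQA
		f16Bytes = 2    // assumed f16 KV cache
	)
	for _, ctx := range []int{4096, 32768} {
		// K and V each store ctx * (embd/gqa) elements per layer.
		kv := 2 * layers * ctx * (embd / gqa) * f16Bytes
		fmt.Printf("ctx=%5d -> KV cache = %.2f GiB\n", ctx, float64(kv)/(1<<30))
	}
	// Prints 0.50 GiB and 4.00 GiB; the 3.5 GiB difference matches the
	// UMA VRAM drop from 24.94 GiB (ctx 32768) to 21.44 GiB (ctx 4096).
}
```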
By default, LLaMA.cpp disables Flash Attention.
Enabling Flash Attention reduces VRAM usage, but it also increases GPU/CPU usage.
Use `--flash-attention` to enable it.
Please note that not all models support Flash Attention; if the model does not support it, the "FLASH ATTENTION" column shows "Disabled" even if you enable it.
$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --flash-attention
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+-----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama | 32768 | 2048 / 512 | Enabled | Not Supported | No | 33 (32 + 1) | Yes | 158.54 MiB | 308.54 MiB | 24.94 GiB | 25.43 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
By default, LLaMA.cpp loads the model via memory mapping (mmap).
On Apple macOS, memory mapping is an efficient way to load the model and results in lower VRAM usage; on other platforms, it only affects the first-time model loading speed.
Use `--no-mmap` to disable loading the model via memory mapping.
Please note that some models require loading all weights into memory; if a model does not support mmap, the "MMAP LOAD" column shows "Not Supported".
$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers=10 --no-mmap
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------------------+----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +-----------+-----------+----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+
| llama | 32768 | 2048 / 512 | Disabled | Not Supported | No | 10 | No | 17.38 GiB | 17.52 GiB | 7.73 GiB | 10.19 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-----------+-----------+----------+-----------+
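To see why memory mapping makes first-time loading cheap, here is a minimal, Unix-only Go sketch (illustrative, not LLaMA.cpp's actual loader). Mapping defers I/O until pages are actually touched, and the OS can share or evict those pages under memory pressure:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open(os.Args[1]) // e.g. a .gguf file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Mapping is nearly instant: no bytes are copied up front.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	// Touching a byte faults its page in on demand.
	fmt.Printf("mapped %d bytes; first byte: %#x\n", len(data), data[0])
}
```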
Use `--gpu-layers-step` to find a suitable number of offload layers when the model is too large to fit into GPU memory.
$ gguf-parser --hf-repo="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" --hf-file="Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf" --skip-model --skip-architecture --skip-tokenizer --gpu-layers-step=5
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+-------------------------+-----------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | +------------+------------+-----------+-----------+
| | | | | | | | | UMA | NONUMA | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
| llama | 32768 | 2048 / 512 | Disabled | Not Supported | No | 0 | No | 25.09 GiB | 25.24 GiB | 0 B | 2.39 GiB |
| | | | | | +----------------+ +------------+------------+-----------+-----------+
| | | | | | | 5 | | 21.24 GiB | 21.39 GiB | 3.86 GiB | 6.33 GiB |
| | | | | | +----------------+ +------------+------------+-----------+-----------+
| | | | | | | 10 | | 17.38 GiB | 17.52 GiB | 7.73 GiB | 10.19 GiB |
| | | | | | +----------------+ +------------+------------+-----------+-----------+
| | | | | | | 15 | | 13.51 GiB | 13.66 GiB | 11.59 GiB | 14.06 GiB |
| | | | | | +----------------+ +------------+------------+-----------+-----------+
| | | | | | | 20 | | 9.65 GiB | 9.79 GiB | 15.46 GiB | 17.92 GiB |
| | | | | | +----------------+ +------------+------------+-----------+-----------+
| | | | | | | 25 | | 5.78 GiB | 5.93 GiB | 19.32 GiB | 21.79 GiB |
| | | | | | +----------------+ +------------+------------+-----------+-----------+
| | | | | | | 30 | | 1.92 GiB | 2.06 GiB | 23.19 GiB | 25.65 GiB |
| | | | | | +----------------+----------------+------------+------------+-----------+-----------+
| | | | | | | 33 (32 + 1) | Yes | 174.54 MiB | 324.54 MiB | 24.94 GiB | 27.41 GiB |
+-------+--------------+--------------------+-----------------+---------------+----------------+----------------+----------------+------------+------------+-----------+-----------+
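The stepped output also enables rough capacity planning: for this model, the NONUMA VRAM column grows nearly linearly, about 0.78 GiB per offloaded layer on top of a ~2.4 GiB base. A small sketch using those table-derived constants (specific to this model; the 16 GiB budget is hypothetical):

```go
package main

import "fmt"

func main() {
	const (
		baseGiB     = 2.39 // NONUMA VRAM at 0 offloaded layers (from the table)
		perLayerGiB = 0.78 // ≈ (10.19 - 2.39) / 10, from the 10-layer row
		budgetGiB   = 16.0 // hypothetical GPU memory budget
	)
	layers := int((budgetGiB - baseGiB) / perLayerGiB)
	fmt.Printf("a %.0f GiB GPU fits about %d offloaded layers\n", budgetGiB, layers)
	// Sanity check against the table: 15 layers -> 14.06 GiB and
	// 20 layers -> 17.92 GiB NONUMA VRAM, so ~17 layers is consistent.
}
```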
License: MIT