feat: mmproj/draft support
Signed-off-by: thxCode <[email protected]>
thxCode committed Jul 22, 2024
1 parent 902a8c5 commit dc0fc66
Showing 5 changed files with 263 additions and 112 deletions.
102 changes: 76 additions & 26 deletions cmd/gguf-parser/README.md
@@ -11,18 +11,30 @@ Usage of gguf-parser ...:
Specify the size of prompt context, which is used to estimate the usage, default is equal to the model's maximum context size. (default -1)
-debug
Enable debugging, verbosity.
-draft-path string
Path where the GGUF file to load for the draft model, optional, e.g. ~/.cache/lm-studio/models/QuantFactory/Qwen2-1.5B-Instruct-GGUF/Qwen2-1.5B-Instruct.Q5_K_M.gguf
-draft-url string
Url where the GGUF file to load for the draft model, optional, e.g. https://huggingface.co/QuantFactory/Qwen2-1.5B-Instruct-GGUF/resolve/main/Qwen2-1.5B-Instruct.Q5_K_M.gguf. Note that gguf-parser does not need to download the entire GGUF file.
-flash-attention
Specify enabling Flash Attention, which is used to estimate the usage. Flash Attention can reduce the usage of RAM/VRAM.
-gpu-layers int
Specify how many layers to offload, which is used to estimate the usage, default is full offloaded. (default -1)
Specify how many layers of the main model to offload, which is used to estimate the usage, default is full offloaded. (default -1)
-gpu-layers-draft int
Specify how many layers of the draft model to offload, which is used to estimate the usage, default is full offloaded. (default -1)
-gpu-layers-step uint
Specify the step of layers to offload, works with --gpu-layers.
-hf-draft-file string
Model file below the --hf-draft-repo, optional, e.g. Qwen2-1.5B-Instruct.Q5_K_M.gguf.
-hf-draft-repo string
Repository of HuggingFace which the GGUF file store for the draft model, optional, e.g. QuantFactory/Qwen2-1.5B-Instruct-GGUF, works with --hf-draft-file.
-hf-file string
Model file below the --hf-repo, e.g. Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf.
Model file below the --hf-repo, e.g. Qwen2-7B-Instruct.Q5_K_M.gguf.
-hf-mmproj-file string
Multimodal projector file below the --hf-repo.
-hf-repo string
Repository of HuggingFace which the GGUF file store, e.g. NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF, works with --hf-file.
Repository of HuggingFace which the GGUF file store for the main model, e.g. QuantFactory/Qwen2-7B-Instruct-GGUF, works with --hf-file.
-hf-token string
User access token of HuggingFace, optional, works with --hf-repo/--hf-file. See https://huggingface.co/settings/tokens.
User access token of HuggingFace, optional, works with --hf-repo/--hf-file pair or --hf-draft-repo/--hf-draft-file pair. See https://huggingface.co/settings/tokens.
-in-max-ctx-size
Limit the context size to the maximum context size of the model, if the context size is larger than the maximum context size.
-in-mib
@@ -33,26 +45,34 @@ Usage of gguf-parser ...:
Output as pretty JSON. (default true)
-kv-type string
Specify the type of Key-Value cache, which is used to estimate the usage, select from [f32, f16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1], default is f16. Use quantization type means enabling --flash-attention as well. (default "f16")
-mmproj-path string
Path where the GGUF file to load for the multimodal projector, optional.
-mmproj-url string
Url where the GGUF file to load for the multimodal projector, optional.
-ms-draft-file string
Model file below the --ms-draft-repo, optional, e.g. qwen1_5-1_8b-chat-q5_k_m.gguf.
-ms-draft-repo string
Repository of ModelScope which the GGUF file store for the draft model, optional, e.g. qwen/Qwen1.5-1.8B-Chat-GGUF, works with --ms-draft-file.
-ms-file string
Model file below the --ms-repo, e.g. qwen1.5-0.5b-chat.gguf.
Model file below the --ms-repo, e.g. qwen1_5-7b-chat-q5_k_m.gguf.
-ms-mmproj-file string
Multimodal projector file below the --ms-repo.
-ms-repo string
Repository of ModelScope which the GGUF file store, e.g. qwen/Qwen1.5-0.5B-Chat-GGUF, works with --ms-file.
Repository of ModelScope which the GGUF file store for the main model, e.g. qwen/Qwen1.5-7B-Chat-GGUF, works with --ms-file.
-ms-token string
Git access token of ModelScope, optional, works with --ms-repo/--ms-file. See https://modelscope.cn/my/myaccesstoken.
Git access token of ModelScope, optional, works with --ms-repo/--ms-file pair or --ms-draft-repo/--ms-draft-file pair. See https://modelscope.cn/my/myaccesstoken.
-no-kv-offload
Specify disabling Key-Value offloading, which is used to estimate the usage. Key-Value offloading can reduce the usage of VRAM.
-no-mmap
Specify disabling Memory-Mapped using, which is used to estimate the usage. Memory-Mapped can avoid loading the entire model weights into RAM.
-ol-crawl
Crawl the Ollama model instead of blobs fetching, works with --ol-model, which will be more efficient and faster, but lossy. [Deprecated, as Ollama Model layer page has changed, will be removed in v0.4.0.]
-ol-model string
Model name of Ollama, e.g. gemma2.
-ol-usage
Specify respecting the extending layers introduced by Ollama, works with --ol-model, which affects the usage estimation.
-parallel-size int
Specify the number of parallel sequences to decode, which is used to estimate the usage, default is 1. (default 1)
-path string
Path where the GGUF file to load, e.g. ~/.cache/lm-studio/models/NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF/Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf.
Path where the GGUF file to load for the main model, e.g. ~/.cache/lm-studio/models/QuantFactory/Qwen2-7B-Instruct-GGUF/Qwen2-7B-Instruct.Q5_K_M.gguf.
-platform-footprint cudaMemGetInfo
Specify the platform footprint(RAM,VRAM) in MiB, which is used to estimate the NonUMA usage, default is 150,250. Different platform always gets different RAM and VRAM footprints, for example, within CUDA, cudaMemGetInfo would occupy some RAM and VRAM, see https://stackoverflow.com/questions/64854862/free-memory-occupied-by-cudamemgetinfo. (default "150,250")
-raw
@@ -76,11 +96,11 @@ Usage of gguf-parser ...:
-skip-tokenizer
Skip to display tokenizer metadata
-token string
Bearer auth token to load GGUF file, optional, works with --url.
Bearer auth token to load GGUF file, optional, works with --url/--draft-url.
-ubatch-size int
Specify the physical maximum batch size, which is used to estimate the usage, default is 512. (default 512)
-url string
Url where the GGUF file to load, e.g. https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf. Note that gguf-parser does not need to download the entire GGUF file.
Url where the GGUF file to load for the main model, e.g. https://huggingface.co/QuantFactory/Qwen2-7B-Instruct-GGUF/resolve/main/Qwen2-7B-Instruct.Q5_K_M.gguf. Note that gguf-parser does not need to download the entire GGUF file.
-version
Show gguf-parser version.
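
As an illustrative sketch of the new draft-model flags, a main model and a smaller draft model can be combined in a single estimate. The repository and file names below are simply the examples quoted in the `--hf-repo`/`--hf-file` and `--hf-draft-repo`/`--hf-draft-file` help text above, not an invocation taken from this commit.

```shell
# Sketch: estimate the main model together with a smaller draft model
# (e.g. for speculative decoding); repo/file names come from the flag help above.
$ gguf-parser \
    --hf-repo="QuantFactory/Qwen2-7B-Instruct-GGUF" \
    --hf-file="Qwen2-7B-Instruct.Q5_K_M.gguf" \
    --hf-draft-repo="QuantFactory/Qwen2-1.5B-Instruct-GGUF" \
    --hf-draft-file="Qwen2-1.5B-Instruct.Q5_K_M.gguf"
```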
@@ -110,11 +130,11 @@ $ gguf-parser --path="~/.cache/lm-studio/models/NousResearch/Hermes-2-Pro-Mistra
| TOKENIZER | llama | 450.50 KiB | 32032 | N/A | 1 | 32000 | N/A | N/A | N/A |
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| ESTIMATE | llama | 32768 | false | true | 33 (32 + 1) | Yes | 88.39 MiB + 8.59 GiB = 8.68 GiB | 238.39 MiB | 11.06 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+------------------------------+------------+-------------+
| ESTIMATE | llama | 32768 | false | true | 33 (32 + 1) | Yes | 88.39 MiB + 4 GiB = 4.09 GiB | 238.39 MiB | 11.06 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+------------------------------+------------+-------------+
```
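
A local-file variant is sketched below; it assumes the main and draft GGUF files referenced in the `--path` and `--draft-path` help text are already on disk, and is shown only for illustration.

```shell
# Sketch: the same estimate, loading both the main and the draft model from
# local paths (the example paths given in the --path and --draft-path help).
$ gguf-parser \
    --path="~/.cache/lm-studio/models/QuantFactory/Qwen2-7B-Instruct-GGUF/Qwen2-7B-Instruct.Q5_K_M.gguf" \
    --draft-path="~/.cache/lm-studio/models/QuantFactory/Qwen2-1.5B-Instruct-GGUF/Qwen2-1.5B-Instruct.Q5_K_M.gguf"
```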
@@ -151,7 +171,7 @@ $ gguf-parser --url="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8
#### Parse HuggingFace GGUF file
```shell
$ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-model-Q5_K_M.gguf"
$ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-model-Q5_K_M.gguf" --hf-mmproj-file="mmproj-model-f16.gguf"
+--------------+-------+-------+----------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+-------+-------+----------------+---------------+----------+------------+----------+
@@ -173,7 +193,7 @@ $ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-mode
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| ESTIMATE | llama | 8192 | false | true | 33 (32 + 1) | Yes | 84.61 MiB + 5.59 GiB = 5.68 GiB | 234.61 MiB | 6.49 GiB |
| ESTIMATE | llama | 8192 | false | true | 33 (32 + 1) | Yes | 97.36 MiB + 1.96 GiB = 2.06 GiB | 247.36 MiB | 7.45 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
```
@@ -203,7 +223,7 @@ $ gguf-parser --ms-repo="shaowenchen/chinese-alpaca-2-13b-16k-gguf" --ms-file="c
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+-----------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+-----------------------------------+------------+-------------+
| ESTIMATE | llama | 16384 | false | true | 41 (40 + 1) | Yes | 61.18 MiB + 20.87 GiB = 20.92 GiB | 211.18 MiB | 22.74 GiB |
| ESTIMATE | llama | 16384 | false | true | 41 (40 + 1) | Yes | 61.18 MiB + 12.50 GiB = 12.56 GiB | 211.18 MiB | 22.74 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+-----------------------------------+------------+-------------+
```
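
The draft-model flags also have ModelScope counterparts. The sketch below reuses the example repo/file names from the `--ms-repo`/`--ms-file` and `--ms-draft-repo`/`--ms-draft-file` help text above; it is an assumed illustration, not output captured in this commit.

```shell
# Sketch: ModelScope-hosted main model plus a ModelScope-hosted draft model,
# using the example names from the --ms-* flag help.
$ gguf-parser \
    --ms-repo="qwen/Qwen1.5-7B-Chat-GGUF" \
    --ms-file="qwen1_5-7b-chat-q5_k_m.gguf" \
    --ms-draft-repo="qwen/Qwen1.5-1.8B-Chat-GGUF" \
    --ms-draft-file="qwen1_5-1_8b-chat-q5_k_m.gguf"
```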
@@ -212,11 +232,11 @@ $ gguf-parser --ms-repo="shaowenchen/chinese-alpaca-2-13b-16k-gguf" --ms-file="c
```shell
$ gguf-parser --ol-model="gemma2"
+--------------+--------+--------+--------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+--------+--------+--------------+---------------+----------+------------+----------+
| MODEL | gemma2 | gemma2 | Q4_0 | true | 5.06 GiB | 9.24 B | 4.71 bpw |
+--------------+--------+--------+--------------+---------------+----------+------------+----------+
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| MODEL | gemma-2-9b-it | gemma2 | Q4_0 | true | 5.06 GiB | 9.24 B | 4.71 bpw |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
| \ | Max Context Len | Embedding Len | Embedding GQA | Attention Head Cnt | Layers | Feed Forward Len | Expert Cnt | Vocabulary Len |
@@ -233,11 +253,41 @@ $ gguf-parser --ol-model="gemma2"
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| ESTIMATE | gemma2 | 8192 | false | true | 43 (42 + 1) | Yes | 65.97 MiB + 6.99 GiB = 7.05 GiB | 215.97 MiB | 8.43 GiB |
| ESTIMATE | gemma2 | 8192 | false | true | 43 (42 + 1) | Yes | 65.97 MiB + 2.62 GiB = 2.69 GiB | 215.97 MiB | 8.43 GiB |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
```
##### Parse Ollama model with its extending layers
```shell
$ gguf-parser --ol-model="gemma2" --ol-usage
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| MODEL | gemma-2-9b-it | gemma2 | Q4_0 | true | 5.06 GiB | 9.24 B | 4.71 bpw |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
| \ | Max Context Len | Embedding Len | Embedding GQA | Attention Head Cnt | Layers | Feed Forward Len | Expert Cnt | Vocabulary Len |
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
| ARCHITECTURE | 8192 | 3584 | 2 | 16 | 42 | 14336 | 0 | 256000 |
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
| \ | Model | Tokens Size | Tokens Len | Added Tokens Len | BOS Token | EOS Token | Unknown Token | Separator Token | Padding Token |
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
| TOKENIZER | llama | N/A | 256000 | N/A | 2 | 1 | 3 | N/A | 0 |
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+----------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+----------------------------------+------------+-------------+
| ESTIMATE | gemma2 | 2048 | false | true | 43 (42 + 1) | Yes | 53.97 MiB + 672 MiB = 725.97 MiB | 203.97 MiB | 6.46 GiB |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+----------------------------------+------------+-------------+
```
#### Parse Clip model