feat: mmproj/draft support
Signed-off-by: thxCode <[email protected]>
thxCode committed Jul 22, 2024
1 parent 902a8c5 commit dc0fc66
Showing 5 changed files with 263 additions and 112 deletions.
102 changes: 76 additions & 26 deletions cmd/gguf-parser/README.md
@@ -11,18 +11,30 @@ Usage of gguf-parser ...:
Specify the size of prompt context, which is used to estimate the usage, default is equal to the model's maximum context size. (default -1)
-debug
Enable debugging, verbosity.
-draft-path string
Path where the GGUF file to load for the draft model, optional, e.g. ~/.cache/lm-studio/models/QuantFactory/Qwen2-1.5B-Instruct-GGUF/Qwen2-1.5B-Instruct.Q5_K_M.gguf
-draft-url string
Url where the GGUF file to load for the draft model, optional, e.g. https://huggingface.co/QuantFactory/Qwen2-1.5B-Instruct-GGUF/resolve/main/Qwen2-1.5B-Instruct.Q5_K_M.gguf. Note that gguf-parser does not need to download the entire GGUF file.
-flash-attention
Specify enabling Flash Attention, which is used to estimate the usage. Flash Attention can reduce the usage of RAM/VRAM.
-gpu-layers int
Specify how many layers to offload, which is used to estimate the usage, default is full offloaded. (default -1)
Specify how many layers of the main model to offload, which is used to estimate the usage, default is full offloaded. (default -1)
-gpu-layers-draft int
Specify how many layers of the draft model to offload, which is used to estimate the usage, default is full offloaded. (default -1)
-gpu-layers-step uint
Specify the step of layers to offload, works with --gpu-layers.
-hf-draft-file string
Model file below the --hf-draft-repo, optional, e.g. Qwen2-1.5B-Instruct.Q5_K_M.gguf.
-hf-draft-repo string
Repository of HuggingFace which the GGUF file store for the draft model, optional, e.g. QuantFactory/Qwen2-1.5B-Instruct-GGUF, works with --hf-draft-file.
-hf-file string
Model file below the --hf-repo, e.g. Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf.
Model file below the --hf-repo, e.g. Qwen2-7B-Instruct.Q5_K_M.gguf.
-hf-mmproj-file string
Multimodal projector file below the --hf-repo.
-hf-repo string
Repository of HuggingFace which the GGUF file store, e.g. NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF, works with --hf-file.
Repository of HuggingFace which the GGUF file store for the main model, e.g. QuantFactory/Qwen2-7B-Instruct-GGUF, works with --hf-file.
-hf-token string
User access token of HuggingFace, optional, works with --hf-repo/--hf-file. See https://huggingface.co/settings/tokens.
User access token of HuggingFace, optional, works with --hf-repo/--hf-file pair or --hf-draft-repo/--hf-draft-file pair. See https://huggingface.co/settings/tokens.
-in-max-ctx-size
Limit the context size to the maximum context size of the model, if the context size is larger than the maximum context size.
-in-mib
@@ -33,26 +45,34 @@ Usage of gguf-parser ...:
Output as pretty JSON. (default true)
-kv-type string
Specify the type of Key-Value cache, which is used to estimate the usage, select from [f32, f16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1], default is f16. Use quantization type means enabling --flash-attention as well. (default "f16")
-mmproj-path string
Path where the GGUF file to load for the multimodal projector, optional.
-mmproj-url string
Url where the GGUF file to load for the multimodal projector, optional.
-ms-draft-file string
Model file below the --ms-draft-repo, optional, e.g. qwen1_5-1_8b-chat-q5_k_m.gguf.
-ms-draft-repo string
Repository of ModelScope which the GGUF file store for the draft model, optional, e.g. qwen/Qwen1.5-1.8B-Chat-GGUF, works with --ms-draft-file.
-ms-file string
Model file below the --ms-repo, e.g. qwen1.5-0.5b-chat.gguf.
Model file below the --ms-repo, e.g. qwen1_5-7b-chat-q5_k_m.gguf.
-ms-mmproj-file string
Multimodal projector file below the --ms-repo.
-ms-repo string
Repository of ModelScope which the GGUF file store, e.g. qwen/Qwen1.5-0.5B-Chat-GGUF, works with --ms-file.
Repository of ModelScope which the GGUF file store for the main model, e.g. qwen/Qwen1.5-7B-Chat-GGUF, works with --ms-file.
-ms-token string
Git access token of ModelScope, optional, works with --ms-repo/--ms-file. See https://modelscope.cn/my/myaccesstoken.
Git access token of ModelScope, optional, works with --ms-repo/--ms-file pair or --ms-draft-repo/--ms-draft-file pair. See https://modelscope.cn/my/myaccesstoken.
-no-kv-offload
Specify disabling Key-Value offloading, which is used to estimate the usage. Key-Value offloading can reduce the usage of VRAM.
-no-mmap
Specify disabling Memory-Mapped using, which is used to estimate the usage. Memory-Mapped can avoid loading the entire model weights into RAM.
-ol-crawl
Crawl the Ollama model instead of blobs fetching, works with --ol-model, which will be more efficient and faster, but lossy. [Deprecated, as Ollama Model layer page has changed, will be removed in v0.4.0.]
-ol-model string
Model name of Ollama, e.g. gemma2.
-ol-usage
Specify respecting the extending layers introduced by Ollama, works with --ol-model, which affects the usage estimation.
-parallel-size int
Specify the number of parallel sequences to decode, which is used to estimate the usage, default is 1. (default 1)
-path string
Path where the GGUF file to load, e.g. ~/.cache/lm-studio/models/NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF/Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf.
Path where the GGUF file to load for the main model, e.g. ~/.cache/lm-studio/models/QuantFactory/Qwen2-7B-Instruct-GGUF/Qwen2-7B-Instruct.Q5_K_M.gguf.
-platform-footprint cudaMemGetInfo
Specify the platform footprint(RAM,VRAM) in MiB, which is used to estimate the NonUMA usage, default is 150,250. Different platform always gets different RAM and VRAM footprints, for example, within CUDA, cudaMemGetInfo would occupy some RAM and VRAM, see https://stackoverflow.com/questions/64854862/free-memory-occupied-by-cudamemgetinfo. (default "150,250")
-raw
@@ -76,11 +96,11 @@ Usage of gguf-parser ...:
-skip-tokenizer
Skip to display tokenizer metadata
-token string
Bearer auth token to load GGUF file, optional, works with --url.
Bearer auth token to load GGUF file, optional, works with --url/--draft-url.
-ubatch-size int
Specify the physical maximum batch size, which is used to estimate the usage, default is 512. (default 512)
-url string
Url where the GGUF file to load, e.g. https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf. Note that gguf-parser does not need to download the entire GGUF file.
Url where the GGUF file to load for the main model, e.g. https://huggingface.co/QuantFactory/Qwen2-7B-Instruct-GGUF/resolve/main/Qwen2-7B-Instruct.Q5_K_M.gguf. Note that gguf-parser does not need to download the entire GGUF file.
-version
Show gguf-parser version.
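
As an illustrative sketch of the new draft-model flags, a main model and a smaller draft model can be combined in a single estimate. The repository and file names below are simply the examples quoted in the `--hf-repo`/`--hf-file` and `--hf-draft-repo`/`--hf-draft-file` help text above, not an invocation taken from this commit.

```shell
# Sketch: estimate the main model together with a smaller draft model
# (e.g. for speculative decoding); repo/file names come from the flag help above.
$ gguf-parser \
    --hf-repo="QuantFactory/Qwen2-7B-Instruct-GGUF" \
    --hf-file="Qwen2-7B-Instruct.Q5_K_M.gguf" \
    --hf-draft-repo="QuantFactory/Qwen2-1.5B-Instruct-GGUF" \
    --hf-draft-file="Qwen2-1.5B-Instruct.Q5_K_M.gguf"
```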
@@ -110,11 +130,11 @@ $ gguf-parser --path="~/.cache/lm-studio/models/NousResearch/Hermes-2-Pro-Mistra
| TOKENIZER | llama | 450.50 KiB | 32032 | N/A | 1 | 32000 | N/A | N/A | N/A |
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| ESTIMATE | llama | 32768 | false | true | 33 (32 + 1) | Yes | 88.39 MiB + 8.59 GiB = 8.68 GiB | 238.39 MiB | 11.06 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+------------------------------+------------+-------------+
| ESTIMATE | llama | 32768 | false | true | 33 (32 + 1) | Yes | 88.39 MiB + 4 GiB = 4.09 GiB | 238.39 MiB | 11.06 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+------------------------------+------------+-------------+
```
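
A local-file variant is sketched below; it assumes the main and draft GGUF files referenced in the `--path` and `--draft-path` help text are already on disk, and is shown only for illustration.

```shell
# Sketch: the same estimate, loading both the main and the draft model from
# local paths (the example paths given in the --path and --draft-path help).
$ gguf-parser \
    --path="~/.cache/lm-studio/models/QuantFactory/Qwen2-7B-Instruct-GGUF/Qwen2-7B-Instruct.Q5_K_M.gguf" \
    --draft-path="~/.cache/lm-studio/models/QuantFactory/Qwen2-1.5B-Instruct-GGUF/Qwen2-1.5B-Instruct.Q5_K_M.gguf"
```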
@@ -151,7 +171,7 @@ $ gguf-parser --url="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8
#### Parse HuggingFace GGUF file
```shell
$ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-model-Q5_K_M.gguf"
$ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-model-Q5_K_M.gguf" --hf-mmproj-file="mmproj-model-f16.gguf"
+--------------+-------+-------+----------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+-------+-------+----------------+---------------+----------+------------+----------+
@@ -173,7 +193,7 @@ $ gguf-parser --hf-repo="openbmb/MiniCPM-Llama3-V-2_5-gguf" --hf-file="ggml-mode
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| ESTIMATE | llama | 8192 | false | true | 33 (32 + 1) | Yes | 84.61 MiB + 5.59 GiB = 5.68 GiB | 234.61 MiB | 6.49 GiB |
| ESTIMATE | llama | 8192 | false | true | 33 (32 + 1) | Yes | 97.36 MiB + 1.96 GiB = 2.06 GiB | 247.36 MiB | 7.45 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
```
@@ -203,7 +223,7 @@ $ gguf-parser --ms-repo="shaowenchen/chinese-alpaca-2-13b-16k-gguf" --ms-file="c
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+-----------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+-----------------------------------+------------+-------------+
| ESTIMATE | llama | 16384 | false | true | 41 (40 + 1) | Yes | 61.18 MiB + 20.87 GiB = 20.92 GiB | 211.18 MiB | 22.74 GiB |
| ESTIMATE | llama | 16384 | false | true | 41 (40 + 1) | Yes | 61.18 MiB + 12.50 GiB = 12.56 GiB | 211.18 MiB | 22.74 GiB |
+--------------+-------+--------------+-----------------+--------------+----------------+----------------+-----------------------------------+------------+-------------+
```
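
The draft-model flags also have ModelScope counterparts. The sketch below reuses the example repo/file names from the `--ms-repo`/`--ms-file` and `--ms-draft-repo`/`--ms-draft-file` help text above; it is an assumed illustration, not output captured in this commit.

```shell
# Sketch: ModelScope-hosted main model plus a ModelScope-hosted draft model,
# using the example names from the --ms-* flag help.
$ gguf-parser \
    --ms-repo="qwen/Qwen1.5-7B-Chat-GGUF" \
    --ms-file="qwen1_5-7b-chat-q5_k_m.gguf" \
    --ms-draft-repo="qwen/Qwen1.5-1.8B-Chat-GGUF" \
    --ms-draft-file="qwen1_5-1_8b-chat-q5_k_m.gguf"
```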
@@ -212,11 +232,11 @@ $ gguf-parser --ms-repo="shaowenchen/chinese-alpaca-2-13b-16k-gguf" --ms-file="c
```shell
$ gguf-parser --ol-model="gemma2"
+--------------+--------+--------+--------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+--------+--------+--------------+---------------+----------+------------+----------+
| MODEL | gemma2 | gemma2 | Q4_0 | true | 5.06 GiB | 9.24 B | 4.71 bpw |
+--------------+--------+--------+--------------+---------------+----------+------------+----------+
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| MODEL | gemma-2-9b-it | gemma2 | Q4_0 | true | 5.06 GiB | 9.24 B | 4.71 bpw |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
| \ | Max Context Len | Embedding Len | Embedding GQA | Attention Head Cnt | Layers | Feed Forward Len | Expert Cnt | Vocabulary Len |
@@ -233,11 +253,41 @@ $ gguf-parser --ol-model="gemma2"
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
| ESTIMATE | gemma2 | 8192 | false | true | 43 (42 + 1) | Yes | 65.97 MiB + 6.99 GiB = 7.05 GiB | 215.97 MiB | 8.43 GiB |
| ESTIMATE | gemma2 | 8192 | false | true | 43 (42 + 1) | Yes | 65.97 MiB + 2.62 GiB = 2.69 GiB | 215.97 MiB | 8.43 GiB |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+---------------------------------+------------+-------------+
```
##### Parse Ollama model with its extending layers
```shell
$ gguf-parser --ol-model="gemma2" --ol-usage
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| \ | Name | Arch | Quantization | Little Endian | Size | Parameters | BPW |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
| MODEL | gemma-2-9b-it | gemma2 | Q4_0 | true | 5.06 GiB | 9.24 B | 4.71 bpw |
+--------------+---------------+--------+--------------+---------------+----------+------------+----------+
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
| \ | Max Context Len | Embedding Len | Embedding GQA | Attention Head Cnt | Layers | Feed Forward Len | Expert Cnt | Vocabulary Len |
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
| ARCHITECTURE | 8192 | 3584 | 2 | 16 | 42 | 14336 | 0 | 256000 |
+--------------+-----------------+---------------+---------------+--------------------+--------+------------------+------------+----------------+
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
| \ | Model | Tokens Size | Tokens Len | Added Tokens Len | BOS Token | EOS Token | Unknown Token | Separator Token | Padding Token |
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
| TOKENIZER | llama | N/A | 256000 | N/A | 2 | 1 | 3 | N/A | 0 |
+--------------+-------+-------------+------------+------------------+-----------+-----------+---------------+-----------------+---------------+
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+----------------------------------+------------+-------------+
| \ | Arch | Context Size | Flash Attention | MMap Support | Offload Layers | Full Offloaded | UMA (RAM + VRAM) | NonUMA RAM | NonUMA VRAM |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+----------------------------------+------------+-------------+
| ESTIMATE | gemma2 | 2048 | false | true | 43 (42 + 1) | Yes | 53.97 MiB + 672 MiB = 725.97 MiB | 203.97 MiB | 6.46 GiB |
+--------------+--------+--------------+-----------------+--------------+----------------+----------------+----------------------------------+------------+-------------+
```
#### Parse Clip model