add chatglm3-6b and glm-4-9b-chat model support #6999
Conversation
force-pushed from 8aee20e to cb324f4
force-pushed from ed1d3ff to 9226518
force-pushed from d523390 to f3bc337
force-pushed from bccb68f to a096383
Is there any way to support glm-4? #7778
under development
…m/THUDM/chatglm3-6b Signed-off-by: XingXing Qiao <[email protected]>
force-pushed from a03cbca to bf430d6
Not sure if this is a model or an implementation issue, but computing the imatrix of
Edit: Looks like running it on CPU instead of CUDA gets it past chunk 21
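As a hedged sketch of the CPU workaround mentioned in the edit above (the binary name, file paths, and calibration file are placeholders, not values taken from this thread):

```python
# Sketch: rerun the importance-matrix computation entirely on the CPU
# (-ngl 0) to work around the CUDA failure described above.
# Binary name and paths are placeholders.
import subprocess

subprocess.run([
    "./llama-imatrix",                  # imatrix tool built from this branch
    "-m", "glm-4-9b-chat.Q5_K_S.gguf",  # model path (placeholder)
    "-f", "calibration.txt",            # calibration text file (placeholder)
    "-o", "imatrix.dat",                # where to write the matrix
    "-ngl", "0",                        # offload zero layers -> pure CPU run
], check=True)
```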
Will the vision model of glm-4 also be considered?
under development
Hello, I built your branch and ran inference on an NVIDIA GPU with the model glm-4-9b-chat.Q5_K_S.gguf. It can answer short prompts such as "hi", "who are you", and "write a poem", but once the prompt gets longer the reply turns into garbled text, for example: "将以下中文翻译为英文: 生活和天气一样，有晴，有阴，偶尔还会下点雨，自然规律，生活不简单尽量简单过。" (a request to translate a Chinese sentence about life and weather into English). Here is the execution log:

.\build\bin\Release\llama-cli.exe -m D:\models\glm-4-9b-chat.Q5_K_S.gguf -p "[gMASK]<|user|>hi<|assistant|>" -t 16 --keep -1 -c 1024 -b 1024 -n -1 -s 123 -ngl 18 --color -i

system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

== Running in interactive mode. ==
hi
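For reference, a minimal sketch (not from this PR) of assembling a longer prompt in the same "[gMASK]<|user|>…<|assistant|>" layout used in the command above, so the failing long-prompt case can be passed to llama-cli via -p. The token spelling and the absence of newlines around the role markers are assumptions based on the log, not on the template this PR finally installs:

```python
# Sketch: build a GLM-4-style prompt string in the layout shown in the log above.
def glm4_prompt(turns):
    """turns: list of (user_text, assistant_text_or_None) pairs, oldest first."""
    out = "[gMASK]"
    for user_text, assistant_text in turns:
        out += f"<|user|>{user_text}<|assistant|>"
        if assistant_text is not None:
            out += assistant_text
    return out

# The long prompt that reportedly produces garbled output.
long_question = ("将以下中文翻译为英文: 生活和天气一样，有晴，有阴，"
                 "偶尔还会下点雨，自然规律，生活不简单尽量简单过。")
print(glm4_prompt([("hi", "Hello! How can I help you today?"),
                   (long_question, None)]))
```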
I have already solved the incorrect-answers issue based on this PR. Here is the PR.
Someone has picked up this work; see PR #8031
Why is this still pending?
This pull request adds support for the chatglm3-6b and glm-4-9b-chat models. Fixes #7778
Some things I'm not sure about:
When I add my chat template to examples/server/public/prompt-formats.js, run llama-server, open http://localhost:8080/ in the browser, and switch the prompt style, the assistant always starts a new line before speaking (see the sketch after this list).
The inference results are incorrect with the CUDA version.
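As a hedged illustration of the newline symptom in the first point above, here is a small sketch that is not part of this PR; the helper name and the sample reply are made up, and the real template lives in examples/server/public/prompt-formats.js:

```python
# Sketch: one way a client could hide the leading newline the model emits
# before the assistant text. Helper name and sample string are hypothetical.
def strip_leading_newline(reply: str) -> str:
    """Drop a single leading newline from the assistant's reply."""
    return reply[1:] if reply.startswith("\n") else reply

raw_reply = "\nHello! How can I help you today?"
print(repr(strip_leading_newline(raw_reply)))  # 'Hello! How can I help you today?'
```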
Below are some links about the ChatGLM models: