add chatglm3-6b and glm-4-9b-chat model support #6999
Conversation
force-pushed from 8aee20e to cb324f4
force-pushed from ed1d3ff to 9226518
force-pushed from d523390 to f3bc337
force-pushed from bccb68f to a096383
Is there any way to support glm-4? #7778
under development
…m/THUDM/chatglm3-6b Signed-off-by: XingXing Qiao <[email protected]>
force-pushed from a03cbca to bf430d6
Not sure if this is a model or an implementation issue, but computing the imatrix of
Edit: Looks like running it on CPU instead of CUDA gets it past chunk 21
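As a hedged sketch of the CPU workaround mentioned in the edit above (the binary name, file paths, and calibration file are placeholders, not values taken from this thread):

```python
# Sketch: rerun the importance-matrix computation entirely on the CPU
# (-ngl 0) to work around the CUDA failure described above.
# Binary name and paths are placeholders.
import subprocess

subprocess.run([
    "./llama-imatrix",                  # imatrix tool built from this branch
    "-m", "glm-4-9b-chat.Q5_K_S.gguf",  # model path (placeholder)
    "-f", "calibration.txt",            # calibration text file (placeholder)
    "-o", "imatrix.dat",                # where to write the matrix
    "-ngl", "0",                        # offload zero layers -> pure CPU run
], check=True)
```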
Will the vision model of glm-4 also be considered?
under development
Hello, I built your branch and ran inference on an NVIDIA GPU with the model glm-4-9b-chat.Q5_K_S.gguf. It can answer short prompts such as "hi", "who are you", and "write a poem", but once the prompt gets longer the reply turns into garbled text, for example: "将以下中文翻译为英文: 生活和天气一样，有晴，有阴，偶尔还会下点雨，自然规律，生活不简单尽量简单过。" (a request to translate a Chinese sentence about life and weather into English). Here is the execution log:

.\build\bin\Release\llama-cli.exe -m D:\models\glm-4-9b-chat.Q5_K_S.gguf -p "[gMASK]<|user|>hi<|assistant|>" -t 16 --keep -1 -c 1024 -b 1024 -n -1 -s 123 -ngl 18 --color -i

system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

== Running in interactive mode. ==
hi
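For reference, a minimal sketch (not from this PR) of assembling a longer prompt in the same "[gMASK]<|user|>…<|assistant|>" layout used in the command above, so the failing long-prompt case can be passed to llama-cli via -p. The token spelling and the absence of newlines around the role markers are assumptions based on the log, not on the template this PR finally installs:

```python
# Sketch: build a GLM-4-style prompt string in the layout shown in the log above.
def glm4_prompt(turns):
    """turns: list of (user_text, assistant_text_or_None) pairs, oldest first."""
    out = "[gMASK]"
    for user_text, assistant_text in turns:
        out += f"<|user|>{user_text}<|assistant|>"
        if assistant_text is not None:
            out += assistant_text
    return out

# The long prompt that reportedly produces garbled output.
long_question = ("将以下中文翻译为英文: 生活和天气一样，有晴，有阴，"
                 "偶尔还会下点雨，自然规律，生活不简单尽量简单过。")
print(glm4_prompt([("hi", "Hello! How can I help you today?"),
                   (long_question, None)]))
```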
I have already solved the incorrect-answers issue based on this PR. Here is the PR.
Someone has picked up this work; see PR #8031
Why is this still pending?
This pull request adds support for the chatglm3-6b and glm-4-9b-chat models. Fixes #7778
Some things I'm not sure about:
When I add my chat template to examples/server/public/prompt-formats.js, run llama-server, open http://localhost:8080/ in the browser, and switch the prompt style, the assistant always starts a new line before speaking (see the sketch after this list).
The inference results are incorrect with the CUDA version.
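As a hedged illustration of the newline symptom in the first point above, here is a small sketch that is not part of this PR; the helper name and the sample reply are made up, and the real template lives in examples/server/public/prompt-formats.js:

```python
# Sketch: one way a client could hide the leading newline the model emits
# before the assistant text. Helper name and sample string are hypothetical.
def strip_leading_newline(reply: str) -> str:
    """Drop a single leading newline from the assistant's reply."""
    return reply[1:] if reply.startswith("\n") else reply

raw_reply = "\nHello! How can I help you today?"
print(repr(strip_leading_newline(raw_reply)))  # 'Hello! How can I help you today?'
```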
Below are some links about the ChatGLM models: