
b1601 #21

Merged
7 commits merged into Nexesenex:master_experimental on Dec 1, 2023
Conversation

Nexesenex
Owner

No description provided.

ggerganov and others added 7 commits on December 1, 2023 at 18:42
happens with multi-threaded quantization of Qwen-72B

ggml-ci
* enable qwen to llama.cpp

* llama : do not GPU split bias tensors

---------

Co-authored-by: Georgi Gerganov <[email protected]>
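
The "llama : do not GPU split bias tensors" change above is easiest to picture as a placement rule in the multi-GPU offload logic: large 2D weight matrices may be row-split across devices, while small 1D bias vectors are kept whole on a single backend. A minimal sketch of that rule, assuming a hypothetical BackendType enum and pick_backend() helper (the real llama.cpp loader logic is more involved):

```cpp
// Hypothetical illustration of "do not GPU split bias tensors".
// The enum and helper are invented for this sketch; they only mirror the
// GPU vs GPU_SPLIT distinction made when tensors are offloaded.
enum class BackendType { CPU, GPU, GPU_SPLIT };

struct TensorMeta {
    const char * name;   // e.g. "blk.0.attn_q.weight" or "blk.0.attn_q.bias"
    int          n_dims; // bias/norm vectors are 1D, projection weights are 2D
};

// Decide where a tensor should live when a layer is offloaded across GPUs.
static BackendType pick_backend(const TensorMeta & t, bool multi_gpu) {
    if (!multi_gpu) {
        return BackendType::GPU;
    }
    // Only 2D weight matrices are row-split across devices; 1D tensors
    // such as biases stay whole on one GPU.
    return t.n_dims == 2 ? BackendType::GPU_SPLIT : BackendType::GPU;
}
```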
* Support attention_bias on LLaMA architecture

QKVO bias, should fix InternLM (#3133) and works for LLaMAfied Qwen models (#3743 (comment)).

* check existence of qkvo bias while loading llama models

Tested on LLaMA2, CUDA and CPU.
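
To make the bias support in this commit concrete: the Q/K/V/O bias tensors are optional, so the loader keeps a null pointer when a model ships without them, and the attention graph adds the bias right after the corresponding projection. A minimal sketch using the public ggml ops (ggml_mul_mat, ggml_add); the layer_tensors struct is a placeholder for the real loader's per-layer state, shown here for the Q projection only:

```cpp
#include "ggml.h"

// Sketch: build the Q projection for one layer, applying the optional bias
// only when the tensor was found in the model file. layer_tensors is a
// hypothetical stand-in for llama.cpp's per-layer tensor bookkeeping.
struct layer_tensors {
    struct ggml_tensor * wq;   // attn_q.weight (always present)
    struct ggml_tensor * bq;   // attn_q.bias   (nullptr when the model has no QKVO bias)
};

static struct ggml_tensor * build_q_proj(
        struct ggml_context * ctx,
        const layer_tensors & layer,
        struct ggml_tensor  * cur) {
    // Q = Wq * cur
    struct ggml_tensor * q = ggml_mul_mat(ctx, layer.wq, cur);

    // Q = Q + bq, but only for architectures (e.g. InternLM, LLaMAfied Qwen)
    // whose checkpoints actually ship an attention bias.
    if (layer.bq != nullptr) {
        q = ggml_add(ctx, q, layer.bq);
    }
    return q;
}
```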

* Update llama.cpp
* Fix token_to_piece implementation in Swift

* Fix errors
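
On the "Fix token_to_piece implementation in Swift" commit above: the Swift example wraps the llama.cpp C API, which, as I read the headers of this era, returns the number of bytes written by llama_token_to_piece(model, token, buf, length), or the negative of the required size when the buffer is too small. Below is a C++ sketch of the grow-and-retry pattern a wrapper needs to follow; double-check the exact signature against the llama.h you build against, since later versions added parameters:

```cpp
#include <string>
#include <vector>

#include "llama.h"

// Convert one token id to its text piece, retrying with a larger buffer if
// the first call reports (as a negative return value) that more space is
// needed. Assumes the llama_token_to_piece(model, token, buf, length)
// signature from around the b1601 era.
static std::string token_to_piece(const llama_model * model, llama_token token) {
    std::vector<char> buf(8);
    int32_t n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size());
    if (n < 0) {
        buf.resize((size_t) -n);
        n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size());
    }
    return n > 0 ? std::string(buf.data(), (size_t) n) : std::string();
}
```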
Nexesenex merged commit 6632a22 into Nexesenex:master_experimental on Dec 1, 2023
20 of 22 checks passed
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
This allows for a better comparison between different models or different tensors of the same model where the magnitude of the model weights may differ.

Co-authored-by: Iwan Kawrakow <[email protected]>
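
The rationale given in this commit (comparing across models or tensors whose weight magnitudes differ) points at normalizing the reported statistic by the magnitude of the weights themselves. The patch itself is not shown on this page, so the following is only an illustrative sketch of that idea: a quantization error expressed relative to the root-mean-square of the original tensor, which puts large-magnitude and small-magnitude tensors on the same scale.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative only: a quantization error normalized by the magnitude of the
// original weights, so the number is comparable across tensors and models
// whose weight scales differ. Not the actual patch from the commit above.
static double normalized_rmse(const float * x, const float * q, size_t n) {
    double err2 = 0.0;  // sum of squared quantization errors
    double mag2 = 0.0;  // sum of squared original weights
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) x[i] - (double) q[i];
        err2 += d * d;
        mag2 += (double) x[i] * (double) x[i];
    }
    if (mag2 == 0.0) {
        return 0.0;
    }
    return std::sqrt(err2 / mag2);
}
```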