b2249 #89

Nexesenex · 2024-02-23T02:13:05Z

No description provided.

Update for MPT with optional bias parameters: to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.

* server: fallback to chatml * add new chat template * server: add AlphaMonarch to test chat template * server: only check model template if there is no custom tmpl * remove TODO

GitHub does not expose environment and repository variables to PRs coming from forks implies that we've been disabling the Nix CI actions for most PRs. The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.

* add gemma chat template * gemma: only apply system_prompt on non-model message

Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression. Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.

* ggml : 32-bit arm compat * ggml : add ggml_vqtbl1q_s8 impl * ggml : cont

* ggml : always define ggml_fp16_t as uint16_t ggml-ci * ggml : cont ggml-ci * ggml : cont * ggml : cont ggml-ci * ggml : cont ggml-ci * cuda : no longer ggml headers last ggml-ci * ggml : fix q6_K FP16 -> FP32 conversion ggml-ci * ggml : more FP16 -> FP32 conversion fixes ggml-ci

* py : add gemma conversion from HF models * Update convert-hf-to-gguf.py Co-authored-by: Aarni Koskela <[email protected]> * Update convert-hf-to-gguf.py Co-authored-by: Aarni Koskela <[email protected]> * Update convert-hf-to-gguf.py Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Jared Van Bortel <[email protected]>

* gemma : use Q8_0 for the token_embd.weight tensor * llama : quantize token_embd.weight using output type

* iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <[email protected]>

datquocnguyen and others added 15 commits February 22, 2024 10:15

mpt : add optional bias tensors (#5638)

4ef245a

Update for MPT with optional bias parameters: to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.

server : clarify some params in the docs (#5640)

c5688c6

server : fallback to chatml, add AlphaMonarch chat template (#5628)

a46f507

* server: fallback to chatml * add new chat template * server: add AlphaMonarch to test chat template * server: only check model template if there is no custom tmpl * remove TODO

readme : update hot topics

56d03d9

minor : fix trailing whitespace (#5638)

3a03541

Add Gemma chat template (#5665)

373ee3f

* add gemma chat template * gemma: only apply system_prompt on non-model message

py : minor fixes (#5668)

5a9e2f6

ggml : 32-bit arm compat (whisper/1891)

efd56b1

* ggml : 32-bit arm compat * ggml : add ggml_vqtbl1q_s8 impl * ggml : cont

sync : ggml

334f76f

gemma : use more bits for the token_embd.weight tensor (#5650)

96633ee

* gemma : use Q8_0 for the token_embd.weight tensor * llama : quantize token_embd.weight using output type

mpt : do not duplicate token_embd.weight on disk (#5670)

15499eb

Nexesenex merged commit 00dee4f into Nexesenex:_master_up Feb 23, 2024
38 of 54 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

b2249 #89

b2249 #89

Nexesenex commented Feb 23, 2024

b2249 #89

b2249 #89

Conversation

Nexesenex commented Feb 23, 2024