
b2249 #89

Merged
merged 15 commits on Feb 23, 2024

Conversation

Nexesenex
Owner

No description provided.

datquocnguyen and others added 15 commits February 22, 2024 10:15
Update MPT with optional bias parameters, so that it works with PhoGPT and SEA-LION models that were pre-trained with 'bias'.
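
A minimal sketch of the idea behind this change, assuming a toy `find_optional` helper and a placeholder `tensor` type (this is not the actual llama.cpp loader): bias tensors are looked up by name and simply skipped when the checkpoint does not provide them, so bias-free MPT checkpoints keep working while PhoGPT/SEA-LION get their biases applied.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Placeholder standing in for ggml_tensor; illustration only.
struct tensor {};

// Hypothetical helper: return the tensor if the checkpoint contains it, else nullptr.
static tensor * find_optional(std::map<std::string, tensor> & tensors, const std::string & name) {
    auto it = tensors.find(name);
    return it == tensors.end() ? nullptr : &it->second;
}

int main() {
    std::map<std::string, tensor> tensors;          // pretend these were read from a GGUF file
    tensors["blk.0.attn_qkv.weight"] = tensor{};    // the weight is always present
    // "blk.0.attn_qkv.bias" is intentionally absent for a bias-free MPT checkpoint

    tensor * bias = find_optional(tensors, "blk.0.attn_qkv.bias");
    std::printf("attn_qkv bias %s\n",
                bias ? "present: add it after the matmul" : "absent: run the layer without bias");
    return 0;
}
```
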
* server: fallback to chatml

* add new chat template

* server: add AlphaMonarch to test chat template

* server: only check model template if there is no custom tmpl

* remove TODO
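
A rough sketch of the fallback policy these server commits describe, with an assumed `pick_chat_template` helper (not the actual server code): a user-supplied custom template wins, otherwise the template embedded in the model is used, otherwise the server falls back to chatml.

```cpp
#include <cstdio>
#include <string>

// Assumed helper, not the real server function: choose which chat template to use.
static std::string pick_chat_template(const std::string & custom_tmpl,
                                       const std::string & model_tmpl) {
    if (!custom_tmpl.empty()) {
        return custom_tmpl;   // custom template set: the model template is not even checked
    }
    if (!model_tmpl.empty()) {
        return model_tmpl;    // template stored in the model metadata
    }
    return "chatml";          // fallback when neither is available
}

int main() {
    std::printf("%s\n", pick_chat_template("", "").c_str());        // -> chatml
    std::printf("%s\n", pick_chat_template("", "gemma").c_str());   // -> gemma
    return 0;
}
```
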
The fact that GitHub does not expose environment and repository variables to PRs coming from forks means that we have effectively been disabling the Nix CI actions for most PRs.

The `if:` condition also didn't make much sense: we can always pull from cachix, and there is no point (though also no risk) in pushing a cache for untrusted code.
* add gemma chat template

* gemma: only apply system_prompt on non-model message
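
The two gemma commits above can be illustrated with a small standalone sketch (the `format_gemma` helper is an assumed name, not the exact llama.cpp template code): Gemma has no system role, so the system prompt is injected only into a non-model (user) turn.

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

// Assumed helper, illustrative only: format messages in a Gemma-style turn layout.
static std::string format_gemma(const std::vector<chat_msg> & msgs) {
    std::string out, system_prompt;
    for (const auto & m : msgs) {
        if (m.role == "system") { system_prompt = m.content; continue; }
        const std::string role = (m.role == "assistant") ? "model" : m.role;
        out += "<start_of_turn>" + role + "\n";
        if (!system_prompt.empty() && role != "model") {
            out += system_prompt + "\n\n";   // system prompt goes on a non-model turn only
            system_prompt.clear();
        }
        out += m.content + "<end_of_turn>\n";
    }
    out += "<start_of_turn>model\n";         // ask the model to generate its reply
    return out;
}

int main() {
    std::printf("%s", format_gemma({{"system", "Be brief."}, {"user", "Hi"}}).c_str());
    return 0;
}
```
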
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images reusing llama.cpp's Nix expression.

Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}`; the builds are fast and effective.
* ggml : 32-bit arm compat

* ggml : add ggml_vqtbl1q_s8 impl

* ggml : cont
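
For context, here is a scalar reference of what `ggml_vqtbl1q_s8` has to compute on targets that lack the NEON `vqtbl1q_s8` intrinsic (such as 32-bit ARM): a 16-byte table lookup where out-of-range indices yield 0. This is an illustrative stand-in, not the ggml implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Scalar reference for the vqtbl1q_s8 semantics: out[i] = idx[i] < 16 ? a[idx[i]] : 0.
static void vqtbl1q_s8_ref(const int8_t a[16], const uint8_t idx[16], int8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        out[i] = idx[i] < 16 ? a[idx[i]] : 0;
    }
}

int main() {
    int8_t a[16];
    for (int i = 0; i < 16; ++i) a[i] = (int8_t)(i * 3);
    const uint8_t idx[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 255};
    int8_t out[16];
    vqtbl1q_s8_ref(a, idx, out);
    std::printf("out[0]=%d out[15]=%d\n", out[0], out[15]); // 45, and 0 because index 255 is out of range
    return 0;
}
```
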
* ggml : always define ggml_fp16_t as uint16_t

ggml-ci

* ggml : cont

ggml-ci

* ggml : cont

* ggml : cont

ggml-ci

* ggml : cont

ggml-ci

* cuda : no longer ggml headers last

ggml-ci

* ggml : fix q6_K FP16 -> FP32 conversion

ggml-ci

* ggml : more FP16 -> FP32 conversion fixes

ggml-ci
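
With `ggml_fp16_t` defined as a plain `uint16_t`, every f16 value read in C/C++ has to pass through an explicit conversion. Below is a self-contained reference half-to-float routine for illustration; ggml ships its own optimized conversion, and `fp16_to_fp32_ref` is only an assumed name here.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reference IEEE 754 half (stored as uint16_t) -> float conversion; illustration only.
static float fp16_to_fp32_ref(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t man  =  h        & 0x3FF;
    uint32_t bits;
    if (exp == 0) {
        // zero or subnormal: value is man * 2^-24, rebuilt via a float multiply
        float f = man * (1.0f / 16777216.0f);
        std::memcpy(&bits, &f, 4);
        bits |= sign;
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);          // inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (man << 13);  // rebias exponent 15 -> 127
    }
    float out;
    std::memcpy(&out, &bits, 4);
    return out;
}

int main() {
    std::printf("%f\n", fp16_to_fp32_ref(0x3C00)); // 1.0
    std::printf("%f\n", fp16_to_fp32_ref(0xC000)); // -2.0
    return 0;
}
```
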
* py : add gemma conversion from HF models

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <[email protected]>

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <[email protected]>

* Update convert-hf-to-gguf.py

Co-authored-by: Jared Van Bortel <[email protected]>

---------

Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Jared Van Bortel <[email protected]>
* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type
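
A small sketch of the per-tensor override these two commits describe, using assumed names (`pick_quant_type` and a local `quant_type` enum) rather than the actual llama.cpp quantizer code: the `token_embd.weight` tensor is quantized with the output type (e.g. Q8_0) instead of the default type used for other weights.

```cpp
#include <cstdio>
#include <string>

// Local stand-in for a few quantization types; illustration only.
enum quant_type { TYPE_Q4_K, TYPE_Q6_K, TYPE_Q8_0 };

// Assumed helper, not the actual quantizer: pick the target type for one tensor.
static quant_type pick_quant_type(const std::string & name,
                                  quant_type default_type,
                                  quant_type output_type) {
    if (name == "token_embd.weight") {
        return output_type;   // the embedding tensor gets the higher-precision output type
    }
    return default_type;
}

int main() {
    const quant_type t = pick_quant_type("token_embd.weight", TYPE_Q4_K, TYPE_Q8_0);
    std::printf("token_embd.weight -> %s\n", t == TYPE_Q8_0 ? "Q8_0" : "default");
    return 0;
}
```
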
Nexesenex merged commit 00dee4f into Nexesenex:_master_up on Feb 23, 2024
38 of 54 checks passed
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <[email protected]>