Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

b2128 #83

Merged
merged 12 commits into from
Feb 11, 2024
Merged

b2128 #83

merged 12 commits into from
Feb 11, 2024

Conversation

Nexesenex
Copy link
Owner

No description provided.

ngxson and others added 12 commits February 11, 2024 12:16
* server: add mistral chat template

* server: fix typo

* server: rename template mistral to llama2

* server: format_llama2: remove BOS

* server: validate "--chat-template" argument

* server: clean up using_chatml variable

Co-authored-by: Jared Van Bortel <[email protected]>

---------

Co-authored-by: Jared Van Bortel <[email protected]>
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
* server: allow to specify tokens as strings in logit_bias

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* common: use enums for sampler types

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* minor : spaces

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <[email protected]>

* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, force using
M-size with this vendor.

Signed-off-by: Sergio Lopez <[email protected]>

---------

Signed-off-by: Sergio Lopez <[email protected]>
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
  → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
@Nexesenex Nexesenex deleted the branch Nexesenex:_master_up February 11, 2024 23:28
@Nexesenex Nexesenex closed this Feb 11, 2024
@Nexesenex Nexesenex reopened this Feb 11, 2024
@Nexesenex Nexesenex merged commit c875614 into Nexesenex:_master_up Feb 11, 2024
14 of 23 checks passed
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
* iq4_k_xxs: basics

* WIP + adding iq3_kl quantization mix

* iq4_xxs: this looks very viable compared to iq4_xs

At the same 4.25 bpw PPL is always better, for some models
significantly better. I'll rename to iq4_ks and keep it.

* iq4_xxs: CUDA dot product

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

* iq4_xxs: scalar CPU dot product

Also fix the breakage I caused with the dedicated work buffer
quantization portion when the multiplication is not done
via iqk_mul_mat.

* iq4_xxs: Zen4

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
Again the same mistake of packing int32_t back to int16_t,
which overflows occasionally (just occasionally, that's why the
result doesn't look completely wrong, so I didn't notice).

* Fix iq4_xs (Zen4)

* iq4_xxs: AVX2

* iq4_xxs: ARM_NEON

* iq4_xxs: Metal

* iq4_xxs: slightly faster TG on Metal

* iq4_xxs: rename to iq4_ks

After all, tt is a smaller variant of iq4_k.

* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants