b2128 #83

Nexesenex · 2024-02-11T23:13:13Z

No description provided.

* server: add mistral chat template * server: fix typo * server: rename template mistral to llama2 * server: format_llama2: remove BOS * server: validate "--chat-template" argument * server: clean up using_chatml variable Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Jared Van Bortel <[email protected]>

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info

* server: allow to specify tokens as strings in logit_bias * Apply suggestions from code review Co-authored-by: Georgi Gerganov <[email protected]> --------- Co-authored-by: Georgi Gerganov <[email protected]>

* common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <[email protected]> * minor : spaces --------- Co-authored-by: Georgi Gerganov <[email protected]>

* vulkan: refactor guess_matmul_pipeline for vendor Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals. Signed-off-by: Sergio Lopez <[email protected]> * vulkan: only use M-sized matmul on Apple GPUs L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor. Signed-off-by: Sergio Lopez <[email protected]> --------- Signed-off-by: Sergio Lopez <[email protected]>

Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31) → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)

@xyzhang626

* BERT model graph construction (build_bert) * WordPiece tokenizer (llm_tokenize_wpm) * Add flag for non-causal attention models * Allow for models that only output embeddings * Support conversion of BERT models to GGUF * Based on prior work by @xyzhang626 and @skeskinen --------- Co-authored-by: Jared Van Bortel <[email protected]> Co-authored-by: Jared Van Bortel <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic Co-authored-by: slaren <[email protected]> --------- Co-authored-by: slaren <[email protected]>

* iq4_k_xxs: basics * WIP + adding iq3_kl quantization mix * iq4_xxs: this looks very viable compared to iq4_xs At the same 4.25 bpw PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it. * iq4_xxs: CUDA dot product We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0. * iq4_xxs: scalar CPU dot product Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat. * iq4_xxs: Zen4 I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, that's why the result doesn't look completely wrong, so I didn't notice). * Fix iq4_xs (Zen4) * iq4_xxs: AVX2 * iq4_xxs: ARM_NEON * iq4_xxs: Metal * iq4_xxs: slightly faster TG on Metal * iq4_xxs: rename to iq4_ks After all, tt is a smaller variant of iq4_k. * iq3_kl: use iq4_ks instead of iq4_k/iq4_xs --------- Co-authored-by: Iwan Kawrakow <[email protected]>

ngxson and others added 12 commits February 11, 2024 12:16

lookup: add print for drafting performance (#5450)

e4640d8

ggml : fix compile warnings (unused vars) (#4966)

0f2411f

common : fix compile warning

139b62a

main : ctrl+C print timing in non-interactive mode (#3873)

85910c5

server : allow to specify tokens as strings in logit_bias (#5003)

6847801

* server: allow to specify tokens as strings in logit_bias * Apply suggestions from code review Co-authored-by: Georgi Gerganov <[email protected]> --------- Co-authored-by: Georgi Gerganov <[email protected]>

common : use enums for sampler types (#5418)

a803333

* common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <[email protected]> * minor : spaces --------- Co-authored-by: Georgi Gerganov <[email protected]>

flake.lock: Update

97a3365

Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31) → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)

CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)

3bdc4cd

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic Co-authored-by: slaren <[email protected]> --------- Co-authored-by: slaren <[email protected]>

Nexesenex deleted the branch Nexesenex:_master_up February 11, 2024 23:28

Nexesenex closed this Feb 11, 2024

Nexesenex reopened this Feb 11, 2024

Nexesenex merged commit c875614 into Nexesenex:_master_up Feb 11, 2024
14 of 23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

b2128 #83

b2128 #83

Nexesenex commented Feb 11, 2024

b2128 #83

b2128 #83

Conversation

Nexesenex commented Feb 11, 2024