
b3028 #139

Merged: 9 commits merged into Nexesenex:downstream on May 28, 2024

Conversation

Nexesenex (Owner)

No description provided.

mofosyne and others added 9 commits May 28, 2024 20:27
* github: add refactor issue template [no ci]

* Update 07-refactor.yml
* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale the weights of the selected MoE experts) and w_scale (the numerical value of the scaling factor); a sketch follows below

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
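
For readers unfamiliar with the two new llm_build_moe_ffn() arguments, here is a minimal standalone sketch of what scaling the selected experts' routing weights amounts to. This is an illustration under assumptions, not the actual llm_build_moe_ffn() code; DeepSeek-V2 supplies the factor via the expert_weights_scale header parameter added above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Routing weights for the selected experts: softmax over their logits,
// then an optional multiplication by a constant factor, mirroring the
// scale_w / w_scale arguments described above.
std::vector<float> moe_routing_weights(const std::vector<float> & logits,
                                       bool scale_w, float w_scale) {
    const float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> w(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        w[i] = std::exp(logits[i] - mx); // max-subtraction for stability
        sum += w[i];
    }
    for (float & v : w) {
        v /= sum;                       // normalized softmax weight
        if (scale_w) { v *= w_scale; }  // e.g. expert_weights_scale from the header
    }
    return w;
}
```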
* rpc : resource management rework

* address review comments
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses /issues/7116
Still needs some more changes to properly support Granite.
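
As a rough, hedged sketch of what an optional MLP bias means in the graph build; the variable and tensor names here are placeholders for illustration, not the commit's exact identifiers:

```cpp
// Sketch of an optional bias on the FFN up-projection: apply it only when
// the model file actually ships the bias tensor. `ctx`, `cur`, and the
// `layer.*` names are placeholders.
cur = ggml_mul_mat(ctx, layer.ffn_up, cur);
if (layer.ffn_up_b != nullptr) {
    cur = ggml_add(ctx, cur, layer.ffn_up_b); // optional MLP bias
}
```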

* llama: honor add_space_prefix from the model configuration

Propagate the add_space_prefix configuration from the HF model
configuration to the GGUF file and honor it with the gpt2 tokenizer
(a sketch follows below).

Signed-off-by: Giuseppe Scrivano <[email protected]>
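
Honoring the flag at tokenization time is conceptually simple. A hedged sketch, where the helper is hypothetical and the flag corresponds to the propagated add_space_prefix value:

```cpp
#include <string>

// Hypothetical helper: when add_space_prefix is set, prepend a space so
// that word-initial pieces tokenize the way the model was trained.
std::string apply_space_prefix(const std::string & text, bool add_space_prefix) {
    if (add_space_prefix && !text.empty() && text.front() != ' ') {
        return " " + text;
    }
    return text;
}
```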

* llama: add support for small granite models

It works only for the small models (3B and 8B).

The convert-hf-to-gguf.py script uses the vocabulary size of the
Granite models to detect Granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <[email protected]>

---------

Signed-off-by: Giuseppe Scrivano <[email protected]>
Co-authored-by: Steffen Roecker <[email protected]>
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass (see the sketch after this list).
* Discard all tokens when no match is found.
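
A minimal sketch of the single-pass whitespace split mentioned in the list above; ASCII-only for brevity, whereas the real WPM preprocessing also has to classify Unicode whitespace:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split the input on whitespace in one pass, collecting non-space runs.
std::vector<std::string> split_whitespace(const std::string & s) {
    std::vector<std::string> out;
    std::string cur;
    for (const unsigned char c : s) {
        if (std::isspace(c)) {
            if (!cur.empty()) { out.push_back(cur); cur.clear(); }
        } else {
            cur.push_back(static_cast<char>(c));
        }
    }
    if (!cur.empty()) { out.push_back(cur); }
    return out;
}
```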
@Nexesenex Nexesenex merged commit 469040a into Nexesenex:downstream May 28, 2024
58 of 67 checks passed
Nexesenex pushed a commit that referenced this pull request on Dec 22, 2024
* q3_k_r4: faster Zen4

256.2 -> 272.7 t/s for PP-512

* q6_k_r4: faster Zen4

243.2 -> 261.3 t/s for PP-512

* q4_k_r4: slightly faster Zen4

262.4 -> 268.1 t/s

* q5_k_r4: slightly faster Zen4

248.3 -> 256.7 t/s

* iq4_xs_r4: slightly faster Zen4

256.8 -> 272.0 t/s

---------

Co-authored-by: Iwan Kawrakow <[email protected]>