
b3028 #139

Merged: 9 commits merged into Nexesenex:downstream on May 28, 2024

Conversation

Nexesenex (Owner)

No description provided.

mofosyne and others added 9 commits May 28, 2024 20:27
* github: add refactor issue template [no ci]

* Update 07-refactor.yml
* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale the weights of the selected MoE experts) and w_scale (the numerical value of the scaling factor); a sketch follows below

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
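
For readers unfamiliar with the two new llm_build_moe_ffn() arguments, here is a minimal standalone sketch of what scaling the selected experts' routing weights amounts to. This is an illustration under assumptions, not the actual llm_build_moe_ffn() code; DeepSeek-V2 supplies the factor via the expert_weights_scale header parameter added above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Routing weights for the selected experts: softmax over their logits,
// then an optional multiplication by a constant factor, mirroring the
// scale_w / w_scale arguments described above.
std::vector<float> moe_routing_weights(const std::vector<float> & logits,
                                       bool scale_w, float w_scale) {
    const float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> w(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        w[i] = std::exp(logits[i] - mx); // max-subtraction for stability
        sum += w[i];
    }
    for (float & v : w) {
        v /= sum;                       // normalized softmax weight
        if (scale_w) { v *= w_scale; }  // e.g. expert_weights_scale from the header
    }
    return w;
}
```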
* rpc : resource management rework

* address review comments
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses /issues/7116
Still needs some more changes to properly support Granite.
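
As a rough, hedged sketch of what an optional MLP bias means in the graph build; the variable and tensor names here are placeholders for illustration, not the commit's exact identifiers:

```cpp
// Sketch of an optional bias on the FFN up-projection: apply it only when
// the model file actually ships the bias tensor. `ctx`, `cur`, and the
// `layer.*` names are placeholders.
cur = ggml_mul_mat(ctx, layer.ffn_up, cur);
if (layer.ffn_up_b != nullptr) {
    cur = ggml_add(ctx, cur, layer.ffn_up_b); // optional MLP bias
}
```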

* llama: honor add_space_prefix from the model configuration

Propagate the add_space_prefix configuration from the HF model
configuration to the GGUF file and honor it with the gpt2 tokenizer
(a sketch follows below).

Signed-off-by: Giuseppe Scrivano <[email protected]>
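
Honoring the flag at tokenization time is conceptually simple. A hedged sketch, where the helper is hypothetical and the flag corresponds to the propagated add_space_prefix value:

```cpp
#include <string>

// Hypothetical helper: when add_space_prefix is set, prepend a space so
// that word-initial pieces tokenize the way the model was trained.
std::string apply_space_prefix(const std::string & text, bool add_space_prefix) {
    if (add_space_prefix && !text.empty() && text.front() != ' ') {
        return " " + text;
    }
    return text;
}
```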

* llama: add support for small granite models

It works only for the small models (3B and 8B).

The convert-hf-to-gguf.py script uses the vocabulary size of the
Granite models to detect Granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <[email protected]>

---------

Signed-off-by: Giuseppe Scrivano <[email protected]>
Co-authored-by: Steffen Roecker <[email protected]>
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass (see the sketch after this list).
* Discard all tokens when no match is found.
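
A minimal sketch of the single-pass whitespace split mentioned in the list above; ASCII-only for brevity, whereas the real WPM preprocessing also has to classify Unicode whitespace:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split the input on whitespace in one pass, collecting non-space runs.
std::vector<std::string> split_whitespace(const std::string & s) {
    std::vector<std::string> out;
    std::string cur;
    for (const unsigned char c : s) {
        if (std::isspace(c)) {
            if (!cur.empty()) { out.push_back(cur); cur.clear(); }
        } else {
            cur.push_back(static_cast<char>(c));
        }
    }
    if (!cur.empty()) { out.push_back(cur); }
    return out;
}
```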
@Nexesenex Nexesenex merged commit 469040a into Nexesenex:downstream May 28, 2024
58 of 67 checks passed
Nexesenex pushed a commit that referenced this pull request on Dec 22, 2024
* q3_k_r4: faster Zen4

256.2 -> 272.7 t/s for PP-512

* q6_k_r4: faster Zen4

243.2 -> 261.3 t/s for PP-512

* q4_k_r4: slightly faster Zen4

262.4 -> 268.1 t/s

* q5_k_r4: slightly faster Zen4

248.3 -> 256.7 t/s

* iq4_xs_r4: slightly faster Zen4

256.8 -> 272.0 t/s

---------

Co-authored-by: Iwan Kawrakow <[email protected]>