update: support Qwen2-57B-A14B #7835

Merged (4 commits), Jun 17, 2024
Changes from 1 commit
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 2 additions & 43 deletions convert-hf-to-gguf.py
@@ -1624,54 +1624,13 @@ class Qwen2MoeModel(Model):
     model_arch = gguf.MODEL_ARCH.QWEN2MOE

     def set_gguf_parameters(self):
-        self.gguf_writer.add_name(self.dir_model.name)
-        self.gguf_writer.add_block_count(self.block_count)

-        if (n_ctx := self.find_hparam(["max_position_embeddings", "n_ctx"], optional=True)) is not None:
-            self.gguf_writer.add_context_length(n_ctx)
-            logger.info(f"gguf: context length = {n_ctx}")

-        n_embd = self.find_hparam(["hidden_size", "n_embd"])
-        self.gguf_writer.add_embedding_length(n_embd)
-        logger.info(f"gguf: embedding length = {n_embd}")

-        n_head = self.find_hparam(["num_attention_heads", "n_head"])
-        self.gguf_writer.add_head_count(n_head)
-        logger.info(f"gguf: head count = {n_head}")

-        if (n_head_kv := self.hparams.get("num_key_value_heads")) is not None:
-            self.gguf_writer.add_head_count_kv(n_head_kv)
-            logger.info(f"gguf: key-value head count = {n_head_kv}")

-        if (rope_theta := self.hparams.get("rope_theta")) is not None:
-            self.gguf_writer.add_rope_freq_base(rope_theta)
-            logger.info(f"gguf: rope theta = {rope_theta}")
-        if (f_rms_eps := self.hparams.get("rms_norm_eps")) is not None:
-            self.gguf_writer.add_layer_norm_rms_eps(f_rms_eps)
-            logger.info(f"gguf: rms norm epsilon = {f_rms_eps}")
-        if (f_norm_eps := self.find_hparam(["layer_norm_eps", "layer_norm_epsilon", "norm_epsilon"], optional=True)) is not None:
-            self.gguf_writer.add_layer_norm_eps(f_norm_eps)
-            logger.info(f"gguf: layer norm epsilon = {f_norm_eps}")

-        if (n_experts_used := self.hparams.get("num_experts_per_tok")) is not None:
-            self.gguf_writer.add_expert_used_count(n_experts_used)
-            logger.info(f"gguf: experts used count = {n_experts_used}")

-        if (n_experts := self.find_hparam(["num_experts", "num_local_experts"])) is not None:
+        super().set_gguf_parameters()
+        if (n_experts := self.hparams.get("num_experts")) is not None:
             self.gguf_writer.add_expert_count(n_experts)

         if (moe_intermediate_size := self.hparams.get("moe_intermediate_size")) is not None:
             self.gguf_writer.add_expert_feed_forward_length(moe_intermediate_size)
             logger.info(f"gguf: expert feed forward length = {moe_intermediate_size}")

-        if (shared_expert_intermediate_size := self.find_hparam(["shared_expert_intermediate_size","intermediate_size", "n_inner"])) is not None:
-            self.gguf_writer.add_feed_forward_length(shared_expert_intermediate_size)
-            logger.info(f"gguf: feed forward length = {shared_expert_intermediate_size}")
Contributor:

Why was this removed?

@legraphista (Contributor, Author), Jun 12, 2024:

Not removed, but cleaned up.

add_feed_forward_length is called by super().set_gguf_parameters(), as it was before my PR.

add_expert_feed_forward_length is the only new thing added, and it is the one taken into account here https://github.com/ggerganov/llama.cpp/pull/7835/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR5814 and here https://github.com/ggerganov/llama.cpp/pull/7835/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR5820

Since LLM_KV_EXPERT_FEED_FORWARD_LENGTH is the only one taken into account, it felt irrelevant to set LLM_KV_FEED_FORWARD_LENGTH from shared_expert_intermediate_size.
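
For reference, a sketch of what the cleaned-up method in this commit reduces to, reconstructed from the added and retained lines in the diff above (surrounding context may differ slightly):

    class Qwen2MoeModel(Model):
        model_arch = gguf.MODEL_ARCH.QWEN2MOE

        def set_gguf_parameters(self):
            # common hparams (context length, embedding length, head counts, rope theta,
            # norm eps, feed_forward_length, file type, ...) are written by the base Model
            # implementation, per the comment above
            super().set_gguf_parameters()
            if (n_experts := self.hparams.get("num_experts")) is not None:
                self.gguf_writer.add_expert_count(n_experts)

            if (moe_intermediate_size := self.hparams.get("moe_intermediate_size")) is not None:
                self.gguf_writer.add_expert_feed_forward_length(moe_intermediate_size)
                logger.info(f"gguf: expert feed forward length = {moe_intermediate_size}")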

Contributor:

I see.

However, looking at Qwen1.5-MoE-A2.7B's config.json:

    "intermediate_size": 5632,
    "moe_intermediate_size": 1408,
    "shared_expert_intermediate_size": 5632,
    "num_experts_per_tok": 4,

and Qwen2-57B-A14B has the following values:

    "intermediate_size": 18944,
    "moe_intermediate_size": 2560,
    "shared_expert_intermediate_size": 20480,
    "num_experts_per_tok": 8,

Although I'm still not sure what Qwen2's intermediate_size refers to, since we are now keeping that as feed_forward_length, would it not make sense to also store shared_expert_intermediate_size?
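
To make the open question concrete, here is a rough mapping of the Qwen2-57B-A14B values to GGUF metadata as this commit stands; the comments are one reading of the discussion above, not statements from the PR itself:

    hparams = {
        "intermediate_size": 18944,                # kept by the base class as the plain feed_forward_length
        "moe_intermediate_size": 2560,             # written via add_expert_feed_forward_length (per-expert FFN width)
        "shared_expert_intermediate_size": 20480,  # not written anywhere yet; the value under discussion
        "num_experts_per_tok": 8,                  # expert used count, presumably handled by the base class like the removed lines above
    }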

Contributor:

Not sure as what, though. @ggerganov, any thoughts?

Trying to see if there are any good docs on these values...

Contributor (Author):

I'm unsure what intermediate_size is either. For Qwen1.5-MoE-A2.7B the values look correct, which is why I'm inclined to say that intermediate_size = 18944 could have been a mistake.

As for storing shared_expert_intermediate_size, I see no specific slot for it, apart from overriding LLM_KV_FEED_FORWARD_LENGTH, which might not be the best option (even though it's unused).

@legraphista (Contributor, Author), Jun 12, 2024:

@CISC What do you think the next steps should be?

We could leave this PR as is, since it works, and then start a new one with a proper implementation?

Otherwise, adding mlp_only_layers: int[], shared_expert_intermediate_size: int and decoder_sparse_step: int to the config (and converter) is probably out of scope.

Contributor:

I think adding a shared_expert_feed_forward_length is within the scope of this PR.

The sparse layer stuff should be another PR.

Contributor (Author):

Ok, would you like to simply override LLM_KV_FEED_FORWARD_LENGTH with shared_expert_intermediate_size, or create a new one, say LLM_KV_SHARED_EXPERT_FEED_FORWARD_LENGTH?

@CISC (Contributor), Jun 12, 2024:

The latter, since you need to keep the original value for the other PR. Or, well, at least it will make things a little less confusing.
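
A minimal sketch of what that could look like on the converter side, assuming a gguf-py writer helper is added alongside add_expert_feed_forward_length; the helper name here is a placeholder derived from the proposed key, not taken from this PR:

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        if (n_experts := self.hparams.get("num_experts")) is not None:
            self.gguf_writer.add_expert_count(n_experts)
        if (moe_intermediate_size := self.hparams.get("moe_intermediate_size")) is not None:
            self.gguf_writer.add_expert_feed_forward_length(moe_intermediate_size)
        if (shared_ffn := self.hparams.get("shared_expert_intermediate_size")) is not None:
            # hypothetical helper mirroring add_expert_feed_forward_length; it would back
            # a new key such as the proposed LLM_KV_SHARED_EXPERT_FEED_FORWARD_LENGTH
            self.gguf_writer.add_shared_expert_feed_forward_length(shared_ffn)
            logger.info(f"gguf: shared expert feed forward length = {shared_ffn}")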

@compilade (Collaborator), Jun 12, 2024:

mlp_only_layers: int[], and decoder_sparse_step: int

@legraphista I've had to deal with models which have both MoE layers and MLP-only layers in #7531 (Jamba). A new metadata key-value pair is not needed to identify these layers. The easy way is to check for the presence of layers.ffn_gate_inp, which is only there on MoE layers; when building the compute graph, build_llama can be a good inspiration for how to check this per-layer.
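
As a standalone illustration of that per-layer check, done from the Python/gguf side rather than llama.cpp's C++ graph builder and not part of this PR, blocks of a converted GGUF can be classified by the presence of the ffn_gate_inp tensor:

    from gguf import GGUFReader

    def moe_block_indices(gguf_path: str) -> list[int]:
        # A block is a MoE block iff it has a blk.{i}.ffn_gate_inp.weight tensor;
        # all other blocks use a plain MLP, so no extra metadata key is needed.
        reader = GGUFReader(gguf_path)
        names = {t.name for t in reader.tensors}
        block_ids = {int(n.split(".")[1]) for n in names if n.startswith("blk.")}
        return sorted(i for i in block_ids if f"blk.{i}.ffn_gate_inp.weight" in names)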


-        self.gguf_writer.add_file_type(self.ftype)
-        logger.info(f"gguf: file type = {self.ftype}")

     _experts: list[dict[str, Tensor]] | None = None

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]: