IBM Granite MoE Architecture #9438
Conversation
Force-pushed from 5f37be3 to 3219f58
Force-pushed from 1b235d0 to 2615459
Force-pushed from 2615459 to 474c7fb
Hi @compilade @ggerganov! This PR is now ready for full review. We're eager to get the … (also, thanks for the great project and all the work you do!)
It looks like the failing test is on the Windows server's …
Yes, this is unrelated to the PR; no need to investigate.
Force-pushed from 31ed122 to 1349625
Thanks for the detailed review @compilade! I believe I have addressed all of the comments at this point.
From the first few chunks of wikitext-2-raw with llama-perplexity and https://huggingface.co/ibm/PowerMoE-3b at Q8_0, I get [1]4.4570, [2]5.1116, [3]5.3469, [4]5.9955, so this does appear to work correctly.
feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers. These may not be correct at this point due to the need for the hack in gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral expert implementation is equivalent to the vectorized parallel experts in granite. The difference is that in granite, the w1 and w3 are concatenated into a single tensor "input_linear". Rather than reimplementing all of the math on the llama.cpp side, the much simpler route is to just split this tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
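For illustration, here is a minimal sketch of that split, assuming the merged `input_linear` expert tensor stacks w1 (gate) ahead of w3 (up) along the feed-forward dimension. The shapes and variable names are invented for the example, not taken from the conversion script, and it folds in the merged-size sanity check mentioned in a later commit.

```python
import torch

# Hypothetical shapes for illustration only.
num_experts, hidden_size, ffn_dim = 8, 1024, 512

# Merged granite-style expert tensor: w1 (gate) and w3 (up) are assumed to be
# concatenated along the feed-forward dimension.
input_linear = torch.randn(num_experts, 2 * ffn_dim, hidden_size)

# Sanity check on the merged FFN tensor size before splitting.
assert input_linear.shape[-2] == 2 * ffn_dim, "merged FFN tensor must be 2 * ffn_dim"

# Split back into the two standard mixtral-style expert tensors.
gate_exps = input_linear[..., :ffn_dim, :]  # w1 -> ffn_gate_exps
up_exps = input_linear[..., ffn_dim:, :]    # w3 -> ffn_up_exps

assert gate_exps.shape == up_exps.shape == (num_experts, ffn_dim, hidden_size)
```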
feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers are split into gate_exps/up_exps). The main delta is the addition of the same four multipliers used in Granite.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
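As a rough sketch of where those four multipliers sit relative to a standard llama/mixtral forward pass, assuming the same semantics as the dense Granite models (scaled embeddings, a fixed attention scale, scaled residual additions, and down-scaled logits); `model` and `mult` are hypothetical containers here, not llama.cpp or transformers APIs:

```python
def granite_style_forward(tokens, model, mult):
    """Toy forward pass showing where the Granite multipliers are assumed to apply."""
    # embeddings_multiplier scales the token embeddings.
    h = model.embed(tokens) * mult.embeddings_multiplier

    for layer in model.layers:
        # attention_multiplier is assumed to replace the usual 1/sqrt(head_dim) attention
        # scale, and residual_multiplier scales each branch before it is added back.
        h = h + mult.residual_multiplier * layer.attn(layer.norm1(h), scale=mult.attention_multiplier)
        h = h + mult.residual_multiplier * layer.moe_ffn(layer.norm2(h))  # mixtral-style experts

    # logits_scale divides the final logits.
    return model.lm_head(model.final_norm(h)) / mult.logits_scale
```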
Typo fix in docstring

Co-Authored-By: [email protected]
Co-authored-by: Georgi Gerganov <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>
fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
Force-pushed from e071bc8 to 1c8b3e4
fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
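In gguf-py terms, allowing the output head for an architecture amounts to listing it among that architecture's tensors. The snippet below is only an illustrative check, and it assumes enum and table names (`MODEL_ARCH.GRANITE`, `MODEL_TENSOR.OUTPUT`, `MODEL_TENSORS`) that match a gguf-py version which already includes the Granite architecture:

```python
# Hypothetical sanity check: the granite architecture should now list an output
# head tensor (the projection that HF checkpoints store as lm_head).
from gguf.constants import MODEL_ARCH, MODEL_TENSOR, MODEL_TENSORS

assert MODEL_TENSOR.OUTPUT in MODEL_TENSORS[MODEL_ARCH.GRANITE]
```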
@compilade After you pointed out that I was missing …
This was added recently to llama.cpp: ggerganov/llama.cpp#9438

Signed-off-by: Eric Curtin <[email protected]>
This is a direct port of the work done in llama.cpp (ggerganov/llama.cpp#9438).

Branch: GraniteThreeSupport
Signed-off-by: Gabe Goodhart <[email protected]>
Dependencies

GraniteLM PR (IBM Granite Architecture #9412)

Description

This PR introduces the `granitemoe` model architecture from IBM. It emulates the `transformers` changes in this PR.

The `granitemoe` architecture follows a very similar pattern to the `granite` architecture and its changes relative to `llama`. For the MoE variant, the base architecture is `mixtral` (the MoE branch of `llama` here in `llama.cpp`). The same four additional multipliers are added (`embeddings_multiplier`, `attention_multiplier`, `residual_multiplier`, and `logits_scale`).

Testing

This PR can be tested using ibm/PowerMoE-3b from huggingface, following the same testing steps used for `granite` (here).
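For reference, a minimal sketch of those testing steps might look like the following. The local paths and the Q8_0 quantization choice are assumptions for the example (Q8_0 mirrors the perplexity spot-check in the review comments), not requirements of the PR.

```python
import subprocess

# Hypothetical local paths; adjust to your checkout and downloads.
hf_model_dir = "PowerMoE-3b"            # cloned from https://huggingface.co/ibm/PowerMoE-3b
f16_gguf = "powermoe-3b-f16.gguf"
q8_gguf = "powermoe-3b-q8_0.gguf"
wikitext = "wikitext-2-raw/wiki.test.raw"

# 1. Convert the HF checkpoint to GGUF (this is where input_linear gets split).
subprocess.run(["python", "convert_hf_to_gguf.py", hf_model_dir,
                "--outfile", f16_gguf, "--outtype", "f16"], check=True)

# 2. Optionally quantize, e.g. to Q8_0.
subprocess.run(["./llama-quantize", f16_gguf, q8_gguf, "Q8_0"], check=True)

# 3. Spot-check perplexity on wikitext-2-raw.
subprocess.run(["./llama-perplexity", "-m", q8_gguf, "-f", wikitext], check=True)
```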