IBM Granite MoE Architecture #9438
Conversation
Force-pushed from 5f37be3 to 3219f58
Force-pushed from 1b235d0 to 2615459
Force-pushed from 2615459 to 474c7fb
Hi @compilade @ggerganov! This PR is now ready for full review. We're eager to get the … (also, thanks for the great project and all the work you do!)
It looks like the failing test is on the Windows server's …
Yes, this is unrelated to the PR; no need to investigate.
Force-pushed from 31ed122 to 1349625
Thanks for the detailed review @compilade! I believe I have addressed all of the comments at this point.
From the first few chunks of wikitext-2-raw with llama-perplexity and https://huggingface.co/ibm/PowerMoE-3b at Q8_0, I get [1]4.4570, [2]5.1116, [3]5.3469, [4]5.9955, so this does appear to work correctly.
feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers. These may not be correct at this point due to the need for the hack in gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral expert implementation is equivalent to the vectorized parallel experts in granite. The difference is that in granite, the w1 and w3 are concatenated into a single tensor "input_linear". Rather than reimplementing all of the math on the llama.cpp side, the much simpler route is to just split this tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
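For illustration, here is a minimal sketch of that split, assuming the merged `input_linear` expert tensor stacks w1 (gate) ahead of w3 (up) along the feed-forward dimension. The shapes and variable names are invented for the example, not taken from the conversion script, and it folds in the merged-size sanity check mentioned in a later commit.

```python
import torch

# Hypothetical shapes for illustration only.
num_experts, hidden_size, ffn_dim = 8, 1024, 512

# Merged granite-style expert tensor: w1 (gate) and w3 (up) are assumed to be
# concatenated along the feed-forward dimension.
input_linear = torch.randn(num_experts, 2 * ffn_dim, hidden_size)

# Sanity check on the merged FFN tensor size before splitting.
assert input_linear.shape[-2] == 2 * ffn_dim, "merged FFN tensor must be 2 * ffn_dim"

# Split back into the two standard mixtral-style expert tensors.
gate_exps = input_linear[..., :ffn_dim, :]  # w1 -> ffn_gate_exps
up_exps = input_linear[..., ffn_dim:, :]    # w3 -> ffn_up_exps

assert gate_exps.shape == up_exps.shape == (num_experts, ffn_dim, hidden_size)
```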
feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers are split into gate_exps/up_exps). The main delta is the addition of the same four multipliers used in Granite.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
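As a rough sketch of where those four multipliers sit relative to a standard llama/mixtral forward pass, assuming the same semantics as the dense Granite models (scaled embeddings, a fixed attention scale, scaled residual additions, and down-scaled logits); `model` and `mult` are hypothetical containers here, not llama.cpp or transformers APIs:

```python
def granite_style_forward(tokens, model, mult):
    """Toy forward pass showing where the Granite multipliers are assumed to apply."""
    # embeddings_multiplier scales the token embeddings.
    h = model.embed(tokens) * mult.embeddings_multiplier

    for layer in model.layers:
        # attention_multiplier is assumed to replace the usual 1/sqrt(head_dim) attention
        # scale, and residual_multiplier scales each branch before it is added back.
        h = h + mult.residual_multiplier * layer.attn(layer.norm1(h), scale=mult.attention_multiplier)
        h = h + mult.residual_multiplier * layer.moe_ffn(layer.norm2(h))  # mixtral-style experts

    # logits_scale divides the final logits.
    return model.lm_head(model.final_norm(h)) / mult.logits_scale
```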
Typo fix in docstring

Co-Authored-By: [email protected]
Co-authored-by: Georgi Gerganov <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>
fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE
Co-Authored-By: [email protected]
Signed-off-by: Gabe Goodhart <[email protected]>
Force-pushed from e071bc8 to 1c8b3e4
fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture.

Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <[email protected]>
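In gguf-py terms, allowing the output head for an architecture amounts to listing it among that architecture's tensors. The snippet below is only an illustrative check, and it assumes enum and table names (`MODEL_ARCH.GRANITE`, `MODEL_TENSOR.OUTPUT`, `MODEL_TENSORS`) that match a gguf-py version which already includes the Granite architecture:

```python
# Hypothetical sanity check: the granite architecture should now list an output
# head tensor (the projection that HF checkpoints store as lm_head).
from gguf.constants import MODEL_ARCH, MODEL_TENSOR, MODEL_TENSORS

assert MODEL_TENSOR.OUTPUT in MODEL_TENSORS[MODEL_ARCH.GRANITE]
```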
@compilade After you pointed out that I was missing …
This was added recently to llama.cpp: ggerganov/llama.cpp#9438

Signed-off-by: Eric Curtin <[email protected]>
This is a direct port of the work done in llama.cpp (ggerganov/llama.cpp#9438).

Branch: GraniteThreeSupport
Signed-off-by: Gabe Goodhart <[email protected]>
Dependencies

GraniteLM PR (IBM Granite Architecture #9412)

Description

This PR introduces the `granitemoe` model architecture from IBM. It emulates the `transformers` changes in this PR.

The `granitemoe` architecture follows a very similar pattern to the `granite` architecture and its changes relative to `llama`. For the MoE variant, the base architecture is `mixtral` (the MoE branch of `llama` here in `llama.cpp`). The same four additional multipliers are added (`embeddings_multiplier`, `attention_multiplier`, `residual_multiplier`, and `logits_scale`).

Testing

This PR can be tested using ibm/PowerMoE-3b from huggingface, following the same testing steps used for `granite` (here).
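For reference, a minimal sketch of those testing steps might look like the following. The local paths and the Q8_0 quantization choice are assumptions for the example (Q8_0 mirrors the perplexity spot-check in the review comments), not requirements of the PR.

```python
import subprocess

# Hypothetical local paths; adjust to your checkout and downloads.
hf_model_dir = "PowerMoE-3b"            # cloned from https://huggingface.co/ibm/PowerMoE-3b
f16_gguf = "powermoe-3b-f16.gguf"
q8_gguf = "powermoe-3b-q8_0.gguf"
wikitext = "wikitext-2-raw/wiki.test.raw"

# 1. Convert the HF checkpoint to GGUF (this is where input_linear gets split).
subprocess.run(["python", "convert_hf_to_gguf.py", hf_model_dir,
                "--outfile", f16_gguf, "--outtype", "f16"], check=True)

# 2. Optionally quantize, e.g. to Q8_0.
subprocess.run(["./llama-quantize", f16_gguf, q8_gguf, "Q8_0"], check=True)

# 3. Spot-check perplexity on wikitext-2-raw.
subprocess.run(["./llama-perplexity", "-m", q8_gguf, "-f", wikitext], check=True)
```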