
[WIP] Enable GPTQModel to handle GraniteMoeParallelExperts #122

Draft
wants to merge 12 commits into base: main
@@ -22,6 +22,7 @@
from .gpt_bigcode import GPTBigCodeGPTQ
from .gpt_neox import GPTNeoXGPTQ
from .granite import GraniteGPTQ
from .granitemoe import GraniteMoeGPTQ
from .llama import LlamaGPTQ
from .mistral import MistralGPTQ
from .mixtral import MixtralGPTQ
@@ -28,6 +28,7 @@
"granite",
"gemma",
"dbrx_converted",
"granitemoe"
]

EXLLAMA_DEFAULT_MAX_INPUT_LENGTH = 2048
@@ -29,6 +29,7 @@
from .gpt_bigcode import GPTBigCodeGPTQ
from .gpt_neox import GPTNeoXGPTQ
from .granite import GraniteGPTQ
from .granitemoe import GraniteMoeGPTQ
from .llama import LlamaGPTQ
from .mistral import MistralGPTQ
from .mixtral import MixtralGPTQ
@@ -43,6 +44,7 @@
"granite": GraniteGPTQ,
"dbrx": DbrxGPTQ,
"dbrx_converted": DbrxConvertedGPTQ,
"granitemoe": GraniteMoeGPTQ
}

at_least_one_cuda_v6 = any(
@@ -558,7 +558,7 @@ def save_quantized(
self.quantize_config.meta_set_versionable(
key=META_FIELD_QUANTIZER,
value=META_QUANTIZER_GPTQMODEL,
-version=__version__,
+version="1.0.0",
Contributor:
why does this need to be changed?

)

# The config, quantize_config and model may be edited in place in save_quantized.
@@ -0,0 +1,30 @@
###############################################################################
# Adapted from https://github.com/ModelCloud/GPTQModel
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
###############################################################################
# Local
from .base import BaseGPTQModel


class GraniteMoeGPTQ(BaseGPTQModel):
    base_modules = ["model.embed_tokens", "model.norm"]

    layers_node = "model.layers"
    layer_type = "GraniteMoeDecoderLayer"
Contributor:
I suggest you add some simple key to inform the format of input_linear and output_linear, i.e. that these are 3D tensors.

Also, in the granitemoe case, another complication is that input_linear fuses w1 and w3. It might be OK for a first cut just to leave them as fused.
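
For context, a minimal sketch of the 3D layout under discussion, assuming the Hugging Face GraniteMoeParallelExperts convention of a single weight parameter shaped (num_experts, output_size, input_size); the sizes and names below are illustrative only, not taken from the PR:

import torch

# illustrative sizes only; not taken from any specific Granite MoE checkpoint
num_experts, hidden_size, intermediate_size = 8, 1024, 2048

# input_linear fuses w1 and w3, so its per-expert output dimension is 2 * intermediate_size
input_linear_weight = torch.empty(num_experts, 2 * intermediate_size, hidden_size)

for expert_idx in range(num_experts):
    fused = input_linear_weight[expert_idx]  # 2D slice: (2 * intermediate_size, hidden_size)
    w1, w3 = fused.chunk(2, dim=0)           # the two fused halves, if they ever need separating
    # for a first cut, as suggested above, the fused 2D slice could be quantized as-is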

Contributor @fabianlim (Jan 28, 2025):
So basically the simple key needs to know what to look for to convert the 3D tensor, and then when you write layer_modules you write it as though they have been converted:

class GraniteMoeGPTQ(BaseGPTQModel):

    convert3dToModuleList = ["block_sparse_moe.input_linear", "block_sparse_moe.output_linear"]

    layer_modules = [
        [
            "block_sparse_moe.input_linear.0.weight",
            "block_sparse_moe.input_linear.1.weight",
            ...
        ],
        [
            "block_sparse_moe.output_linear.0.weight",
            "block_sparse_moe.output_linear.1.weight",
            ...
        ]
    ]
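
convert3dToModuleList is a proposed key rather than an existing GPTQModel attribute. A rough sketch of one way the implied conversion could work, assuming the parallel-experts module stores a single 3D weight of shape (num_experts, out_features, in_features); the helper name below is hypothetical:

import torch
from torch import nn


def split_parallel_experts(module: nn.Module) -> nn.ModuleList:
    # Copy each expert's 2D slice of the 3D weight into its own nn.Linear
    # so that per-expert submodules (".0", ".1", ...) exist by name.
    num_experts, out_features, in_features = module.weight.shape
    experts = nn.ModuleList()
    for idx in range(num_experts):
        linear = nn.Linear(in_features, out_features, bias=False)
        with torch.no_grad():
            linear.weight.copy_(module.weight[idx])
        experts.append(linear)
    return experts

After replacing block_sparse_moe.input_linear / output_linear with such a ModuleList (and packing the quantized weights back into the 3D tensor afterwards), names like "block_sparse_moe.input_linear.0" would resolve to ordinary Linear modules of the kind GPTQ already quantizes.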

    layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
        ["block_sparse_moe.input_linear", "block_sparse_moe.output_linear"],
Contributor @fabianlim (Jan 28, 2025):
Reference MixtralGPTQ: you will see that they split up w1 + w3 and w2, which means we should split "block_sparse_moe.input_linear" and "block_sparse_moe.output_linear"; see above.

["input_layernorm", "post_attention_layernorm"]
]
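
Following the comment above about mirroring MixtralGPTQ's split of w1 + w3 and w2, a hedged sketch of how layer_modules might look once per-expert modules exist; the two-expert ".0"/".1" entries are illustrative and assume the 3D-to-ModuleList conversion sketched earlier:

layer_modules = [
    ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
    ["self_attn.o_proj"],
    # fused w1 + w3, one entry per expert
    ["block_sparse_moe.input_linear.0", "block_sparse_moe.input_linear.1"],
    # w2, one entry per expert
    ["block_sparse_moe.output_linear.0", "block_sparse_moe.output_linear.1"],
    ["input_layernorm", "post_attention_layernorm"],
]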