
HIGGS Quantization Support #34997

Open · wants to merge 30 commits into base: main

Conversation

@BlackSamorez (Contributor) commented Nov 28, 2024

HIGGS 0-Shot Quantization

HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.

Runtime support for HIGGS is implemented through FLUTE and its library (flute-kernel).

This PR adds support for HIGGS+FLUTE to transformers, allowing for low-error 0-shot quantization and fast LLM inference.
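
Below is a minimal usage sketch, assuming the HiggsConfig quantization config introduced in this PR; the bits/p values and device_map shown are illustrative, not prescribed defaults:

from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

# Illustrative settings; HiggsConfig exposes bits, p and linear_weights_not_to_quantize
quantization_config = HiggsConfig(bits=4, p=2)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto",  # HIGGS/FLUTE kernels run on CUDA devices
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")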

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 (Member):
cc @SunMarc @MekkCyber

@SunMarc (Member) commented Nov 28, 2024:

cc @MekkCyber

@BlackSamorez (Contributor, Author):
Failed tests look like a problem on the runner's end

@SunMarc (Member) left a comment:

Thanks for integrating this new quantization method so fast! I left some comments; also, don't forget to update the documentation so that users know how to use it!

Comment on lines 320 to 323
if weight.device.type != "cuda":
raise ValueError(
"You are attempting to load a HIGGS model with a device_map that contains a CPU or disk device."
)
Member:
not necessary to put it here. The check on device_map when we initialize the quantizer would be enough.

Contributor Author:
Removed

src/transformers/integrations/higgs.py (outdated; resolved)
src/transformers/quantizers/quantizer_higgs.py (outdated; resolved)
Comment on lines 75 to 78
else:
raise NotImplementedError(
"HIGGS quantization is only supported on GPU. Please use a different quantizer."
)
Member:
let's check if cuda is available in validate_environment instead

Contributor Author:
Done
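
For reference, a rough sketch of the kind of early check being suggested (illustrative only, not the exact code in this PR):

import torch

def validate_environment(self, *args, **kwargs):
    # HIGGS/FLUTE kernels only run on CUDA GPUs, so fail early if none is available
    if not torch.cuda.is_available():
        raise NotImplementedError("HIGGS quantization is only supported on GPU. Please use a different quantizer.")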

Comment on lines +137 to +140
flute_workspaces[module.weight.device] = flute.utils.make_workspace_streamk(
device=module.weight.device
)
module.workspace = flute_workspaces[module.weight.device]
Member:

Could you add a comment on what we are doing here?

Contributor Author:

Added comments to this and to the possible repacking that happens afterwards.

self.bits = bits
self.p = p
self.linear_weights_not_to_quantize = linear_weights_not_to_quantize
self.num_sms_packed = 128
Member:

Can you add a description of what this is used for? The user shouldn't have to worry about that.

Contributor Author:
Updated the docstring to better reflect what those are

Comment on lines 1267 to 1272
def post_init(self):
r"""
Safety checker that arguments are correct - also replaces some NoneType arguments with their default values.
"""
return

Member:

Add post_init checks for bits and p.

Contributor Author:
Done
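
A possible shape for those checks (illustrative; the accepted values for bits and p below are assumptions, not taken from this PR):

def post_init(self):
    r"""
    Safety checker that arguments are correct.
    """
    # Placeholder value sets; the real checks should mirror what the FLUTE kernels support
    if self.bits not in (2, 3, 4):
        raise ValueError(f"bits must be one of (2, 3, 4), got {self.bits}")
    if self.p not in (1, 2):
        raise ValueError(f"p must be one of (1, 2), got {self.p}")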

Comment on lines 39 to 60
# @require_torch_gpu
# class HiggsConfigTest(unittest.TestCase):
# def test_to_dict(self):
# """
# Simple test that checks if one uses a config and converts it to a dict, the dict is the same as the config object
# """
# quantization_config = HiggsConfig()
# config_to_dict = quantization_config.to_dict()

# for key in config_to_dict:
# self.assertEqual(getattr(quantization_config, key), config_to_dict[key])

# def test_from_dict(self):
# """
# Simple test that checks if one uses a dict and converts it to a config object, the config object is the same as the dict
# """
# dict = {"linear_weights_not_to_quantize": ["embed_tokens.weight", "lm_head.weight"], "quant_method": "higgs"}
# quantization_config = HiggsConfig.from_dict(dict)

# self.assertEqual(dict["linear_weights_not_to_quantize"], quantization_config.linear_weights_not_to_quantize)
# self.assertEqual(dict["quant_method"], quantization_config.quant_method)

Member:
to remove or uncomment

Contributor Author:
Uncommented this

@require_accelerate
# @require_read_token
class HiggsTest(unittest.TestCase):
model_name = "meta-llama/Meta-Llama-3.1-8B"
Member:

Can we use a smaller model, like a tiny llama? This would be better for our CI, thanks!

Contributor Author:
Sadly, no. FLUTE is only compiled for specific matrix shapes for now.
TinyLlama's shapes are not among them, nor are those of any model smaller than 8B.

Comment on lines 77 to 113
offload_device_map = {
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": 0,
"model.layers.15": 0,
"model.layers.16": "cpu",
"model.layers.17": "cpu",
"model.layers.18": "cpu",
"model.layers.19": "cpu",
"model.layers.20": "disk",
"model.layers.21": "disk",
"model.layers.22": "disk",
"model.layers.23": "disk",
"model.layers.24": "disk",
"model.layers.25": "disk",
"model.layers.26": "disk",
"model.layers.27": "disk",
"model.layers.28": "disk",
"model.layers.29": "disk",
"model.layers.30": "disk",
"model.layers.31": "disk",
"model.norm": "disk",
"lm_head": "disk",
}
Member:
To remove, or to perform some tests with this device_map. I think we shouldn't allow users to pass this kind of device_map, and a check should be added in validate_environment. See, for example, the awq integration code.

Contributor Author:
Removed this from the tests and added device_map assertions to validate_environment.
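
A sketch of the kind of device_map assertion described here, loosely modeled on the awq integration (illustrative, not the exact code added):

def validate_environment(self, *args, device_map=None, **kwargs):
    if isinstance(device_map, dict) and ("cpu" in device_map.values() or "disk" in device_map.values()):
        raise ValueError(
            "You are attempting to load a HIGGS model with a device_map that contains a CPU or disk device."
            " This is not supported; please remove the CPU and disk entries from the device_map."
        )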

@MekkCyber (Contributor):

Hey @BlackSamorez, thanks for adding this quantization method so quickly! I added some very small nits.

@@ -66,6 +66,10 @@ RUN python3 -m pip install --no-cache-dir optimum-quanto
# Add eetq for quantization testing
RUN python3 -m pip install git+https://github.com/NetEase-FuXi/EETQ.git

# Add flute-kernel and fast_hadamard_transform for quantization testing
RUN python3 -m pip install --no-cache-dir flute-kernel==0.2.6
Contributor:
The docker image will be deployed on an instance with CUDA 11.8, but on the FLUTE GitHub I noticed that you need to specify https://flute-ai.github.io/whl/cu118 in that case.

Contributor Author:
Thanks, updated.
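
For instance, the install line could end up looking something like this, assuming the cu118 wheel index above is passed via pip's index option (a sketch, not the exact diff):

# Add flute-kernel and fast_hadamard_transform for quantization testing (CUDA 11.8 wheels)
RUN python3 -m pip install --no-cache-dir flute-kernel==0.2.6 -i https://flute-ai.github.io/whl/cu118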

class HiggsHfQuantizer(HfQuantizer):
"""
Quantizer of the HIGGS method. Enables the loading of prequantized models.
"""
Contributor:
Just a small nit: I think we should specify that it enables both loading and quantization of models, because there are other quantizers that only enable loading.

Contributor Author:
I added "and in-flight quantization of full-precision models".

src/transformers/quantizers/quantizer_higgs.py (outdated; resolved)
module.num_sms_packed = torch.nn.Parameter(
torch.tensor(get_num_sms_from_device(target_device), device=target_device, dtype=torch.int32),
requires_grad=False,
)
Contributor:
Just for my understanding, why do we need num_sms_packed?

Contributor Author:
Code packing is SM-dependent. We need to remember the SM count of the machine on which the codes were packed (num_sms_packed) to be able to check whether we need to repack. Moreover, we need num_sms_packed to do the repacking itself.
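
A tiny sketch of that check in plain PyTorch (illustrative; the FLUTE-specific repacking call itself is not shown):

import torch

# Compare the current GPU's SM count with the count the codes were packed for
current_sms = torch.cuda.get_device_properties(module.weight.device).multi_processor_count
needs_repack = current_sms != module.num_sms_packed.item()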

num_bits=module.num_bits,
group_size=256,
num_sms_packed=module.num_sms_packed.item(),
)
Contributor:
Just a small question: is group_size a constant?

Contributor Author:
Yes, there are a few hard-coded constants right now, including the group size. I think I will do a small refactoring to spell them out more explicitly.
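
For example, the hard-coded values could be pulled into named module-level constants (a sketch; only the group size of 256 comes from the snippet above, the constant name is made up):

# Fixed group size currently assumed by the FLUTE kernels used for HIGGS
HIGGS_GROUP_SIZE = 256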

module, tensor_name = get_module_from_name(model, param_name)
if isinstance(module, HiggsLinear) and tensor_name == "weight" and param_value.dtype != torch.int16:
# Add here check for loaded components' dtypes once serialization is implemented
return True
Contributor:
Do you mean that serialization is not implemented yet, so we can't save a quantized model and load it?

Contributor Author:
No, serialization is fully functional. This message got copied over with the bnb code I borrowed, and I forgot to remove it.
By the way, bnb implemented serialization quite some time ago as well.

nb_fbgemm_linear = 0
for module in model.modules():
if isinstance(module, HiggsLinear):
nb_fbgemm_linear += 1
Contributor:
I think you meant nb_higgs_linear 😉

Contributor Author:
Sure. Fixed

for m in module_tree:
parent = parent._modules[m]
return parent

Contributor:
Sorry if I'm mistaken, but I don't believe we use this function anywhere.

Contributor Author:
Removed the unused function. Thanks!

@BlackSamorez (Contributor, Author):
@SunMarc @MekkCyber thanks for your feedback!
I think I addressed all of your concerns.
