
Replace auto_gptq by gptqmodel in HuggingFace/Optimum #536

Open
jiqing-feng opened this issue Nov 6, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@jiqing-feng
Contributor

Hi @Qubitium. Since the CPU path is already in gptqmodel, when do you plan to replace auto_gptq with gptqmodel in HuggingFace/optimum? I think we should open an issue in Optimum to let the maintainers know as early as possible.

Please let me know if there is anything I can do to move this toward the goal. Thanks.

jiqing-feng added the bug (Something isn't working) label on Nov 6, 2024
@Qubitium
Collaborator

Qubitium commented Nov 7, 2024

Version 1.2 with IPEX should be released within the next 24 hours, after I merge some PR changes that will affect/simplify the core API for end users when loading and saving models. v1.2 should be stable enough for us to move forward with the Optimum PR.

@jiqing-feng
Contributor Author

> Version 1.2 with IPEX should be released within the next 24 hours, after I merge some PR changes that will affect/simplify the core API for end users when loading and saving models. v1.2 should be stable enough for us to move forward with the Optimum PR.

Great. I only left some minor fixes for the examples; please merge #540.
Please let me know when the stable version is ready. Thanks!

@Qubitium
Collaborator

v1.2.1 released. We now need to map out which code/features in optimum and transformers depend on the old auto-gptq so we can create a to-do list and check off each item.

@jiqing-feng
Contributor Author

The core function is here: huggingface/optimum/blob/main/optimum/gptq/quantizer.py. The rest are mostly library checks or guidance in the README or code comments.

@Qubitium
Collaborator

Qubitium commented Nov 14, 2024

@jiqing-feng transformers calls optimum, so we need to PR both at the same time.

We have another issue: the hf gptq loading code in from_pretrained, and how GPTQConfig is used, is very detached from reality in my view and quite messy. From the perspective of a dev or user who does both quantization and loading of quants, the current code in transformers doesn't make much sense in how it uses GPTQConfig and performs strange config merges. Once a model is quantized, there is no reason, nor is it possible, to override the model's quantization config other than to select the backend kernel.

We are looking at this right now and planning out which code we need to change first in gptqmodel so we can adapt to any changes in transformers/optimum.

@jiqing-feng
Contributor Author

> @jiqing-feng transformers calls optimum, so we need to PR both at the same time.
>
> We have another issue: the hf gptq loading code in from_pretrained, and how GPTQConfig is used, is very detached from reality in my view and quite messy. From the perspective of a dev or user who does both quantization and loading of quants, the current code in transformers doesn't make much sense in how it uses GPTQConfig and performs strange config merges. Once a model is quantized, there is no reason, nor is it possible, to override the model's quantization config other than to select the backend kernel.
>
> We are looking at this right now and planning out which code we need to change first in gptqmodel so we can adapt to any changes in transformers/optimum.

Actually, for IPEX, we definitely need to rewrite the quantization config so we can use our IPEX API. The IPEX API adopts the original GPTQ weight format even if you quantize the model on the cuda backend.

If it is not easy to understand, we can discuss it in a Teams meeting if that's convenient for you; please give me your email and your available time slots.

@Qubitium
Collaborator

Qubitium commented Nov 14, 2024

We have identified a problem, in hf transformers and in our gptqmodel code too, with the separation of temporary attributes used only during the quantization process from the persistent attributes of the quantized model.

For example, damp is an ephemeral attribute that exists only during the quantization stage and should not persist in the config post-quantization, or should exist only in a meta attribute, if at all. bits and group_size are persistent attributes that are both quantization-process attributes and quantized-model attributes (used for loading and dequant). The analogy is the batch size in model training: the batch attribute is not saved, nor should it be, when saving/loading trained models.

I plan to address this part in our PRs.
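The separation described above could be sketched roughly like this (a hypothetical illustration, not gptqmodel's actual config class; the field names simply follow the comment):

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch of separating ephemeral quantization-process
# attributes from the persistent attributes of a quantized model.

@dataclass
class QuantizeConfig:
    # Persistent: needed both to quantize and later to load/dequant.
    bits: int = 4
    group_size: int = 128
    # Ephemeral: used only during the quantization process itself.
    damp_percent: float = 0.01

    # Plain class attribute (not a dataclass field): keys to drop on save.
    EPHEMERAL = ("damp_percent",)

    def to_saved_dict(self) -> dict:
        """Config to persist alongside the model: ephemeral keys dropped."""
        return {k: v for k, v in asdict(self).items() if k not in self.EPHEMERAL}

cfg = QuantizeConfig(bits=4, group_size=128, damp_percent=0.05)
print(cfg.to_saved_dict())  # {'bits': 4, 'group_size': 128}
```

On load, only the persisted keys would be read back, so an ephemeral value like `damp_percent` can never leak into the loading path.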

> Actually, for IPEX, we definitely need to rewrite the quantization config so we can use our IPEX API. The IPEX API adopts the original GPTQ weight format even if you quantize the model on the cuda backend.

Can you give me a code example of where IPEX would need to alter the persistent quantized config attributes post-quantization (as they relate to the quantization_config that persists in config.json)? One example will help a lot to see where IPEX's use case is coming from. Thanks.

> If it is not easy to understand, we can discuss it in a Teams meeting if that's convenient for you; please give me your email and your available time slots.

You can email me at [email protected] and my time is pretty flexible.

@jiqing-feng
Contributor Author

I will take AWQ as an example because it's already integrated into transformers. Please install transformers and AutoAWQ from their main repos, and run the following script on an Intel Xeon CPU. If you don't have such a device, I can show this case in our meeting, maybe 2pm Beijing time tomorrow (11/15)?

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "PrunaAI/JackFram-llama-68m-AWQ-4bit-smashed"

text = ["I am happy because", "This is"]
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
input_ids = tokenizer(text, return_tensors="pt", padding=True)

# Select the IPEX backend at load time via the quantization config.
quantization_config = AwqConfig(version="ipex")

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cpu", quantization_config=quantization_config
)
model.generation_config.cache_implementation = "static"
model.generate(**input_ids)
```

@Qubitium
Collaborator

> `quantization_config = AwqConfig(version="ipex")`

And the GPTQModel equivalent, if we change the transformers code, would be GPTQConfig(backend="ipex"). This is what IPEX needs, right? A way to pass in a backend selector?

> run the following script on an Intel Xeon CPU

We only have consumer Intel 13th gen, plus an EPYC 7003 (Zen 3) and a 7950X (Zen 4 desktop); the latter two both have AVX512. Which Intel instruction sets does IPEX require?

> maybe 2pm Beijing time tomorrow (11/15)

Sure. Please email me your contact details and we can take it from there.

@jiqing-feng
Contributor Author

  1. Yes, we need to pass the backend when selecting the quant layer here: optimum/gptq/quantizer.py
  2. An Intel CPU with AVX512 should work.
  3. I have sent you the invitation; my email is [email protected]
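A minimal sketch of what backend-based quant-linear selection might look like in the quantizer (hypothetical class names and registry; gptqmodel's real select_quant_linear signature may differ):

```python
# Hypothetical sketch of selecting a QuantLinear implementation by backend
# name, roughly what optimum's quantizer would call into.

class TritonV2QuantLinear: ...
class ExllamaV2QuantLinear: ...
class IPEXQuantLinear: ...

_BACKENDS = {
    "triton": TritonV2QuantLinear,
    "exllama_v2": ExllamaV2QuantLinear,
    "ipex": IPEXQuantLinear,
}

def select_quant_linear(backend: str = "triton"):
    """Return the QuantLinear class registered for the requested backend."""
    try:
        return _BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; choose from {sorted(_BACKENDS)}"
        )

print(select_quant_linear("ipex").__name__)  # IPEXQuantLinear
```

A registry like this keeps the CPU/IPEX path a one-line addition rather than a special case threaded through the quantization logic.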

@Qubitium
Collaborator

Qubitium commented Nov 14, 2024

Going to list the issues/diffs that we have found here (will update as more are found):

REF: First PR that partially merged AutoGPTQ into optimum: huggingface/optimum#1216

  • Saving config.json injects {exllama_config: 1}.
  • Does not use triton; use_triton is always False. Optimum defaults to cuda + exllama v1. GPTQModel deprecated the cuda and exllama v1 kernels and uses only triton_v2 for the quantization stage.
  • Optimum's quantization imports quant_linear from auto_gptq, but the quant logic is customized/generic code that adapts to more modules (not model-specific like gptqmodel).
  • GPTQConfig is a mixed param-passing object for both quantization and loading/kernel selection.
  • Auto**.from_pretrained will auto-start quantization if a GPTQConfig is passed in.

Kernels:

AutoGPTQ has: Cuda/Packer, Triton v1/Packer, Triton v2/Packer, Exllama v1/Packer, Exllama v2/(no packer), Marlin/(Marlin packer)

GPTQModel has: Triton v2/Packer, Exllama v2, NM Marlin/(Marlin Packer)

Need to retest cuda vs triton v2 to see which is faster for quant and pack, including with torch.compile() in torch 2.5.1, since we need to re-add this kernel for hf/optimum compat. Unsure they will accept another triton dependency.

History:

  • When auto_gptq was merged into optimum, Triton v2 did not exist or was considered unstable. Time has shown Triton v2 is very stable and faster than v1.

Cuda kernel: with the torch 2.5.1 changes, it may be faster than or as fast as Triton v2. Again, we need to test now since optimum relies on the cuda kernel by default.
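The cuda-vs-triton-v2 retest could start from a tiny timing harness along these lines (a generic sketch only; real kernel benchmarks would additionally need CUDA synchronization and torch.compile warmup handling):

```python
import time

def bench(fn, warmup: int = 3, iters: int = 10) -> float:
    """Median wall-clock seconds of fn() over iters runs, after warmup.

    Warmup runs absorb one-time costs (caching, JIT/torch.compile
    compilation) so they don't skew the measured iterations.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

# Placeholder workloads standing in for the two kernels under test:
cuda_like = lambda: sum(i * i for i in range(10_000))
triton_like = lambda: sum(i * i for i in range(10_000))

print(f"kernel A: {bench(cuda_like):.6f}s, kernel B: {bench(triton_like):.6f}s")
```

Using the median rather than the mean makes the comparison less sensitive to scheduler noise on a shared box.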

@Qubitium
Collaborator

The cuda kernel has been re-added for pending optimum compat.

@Qubitium
Collaborator

@jiqing-feng Instead of optimum calling a gptqmodel internal API such as select_quant_linear to pick the quant_linear, we will expose an API specifically for optimum that accepts a GPTQConfig as input, something like optimum_quant_linear(GPTQConfig).

This separates optimum from internal APIs that we may change in the future, while improving the API stability of these separate projects.
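A sketch of what such a dedicated entry point could look like (hypothetical throughout: the optimum_quant_linear name comes from the comment above, but the config fields read, the stand-in GPTQConfig, and the internal selector are all assumptions):

```python
from dataclasses import dataclass

# Hypothetical sketch: a stable, optimum-facing wrapper that accepts a
# GPTQConfig-like object and translates it into the internal selection
# call, so optimum never touches gptqmodel internals directly.

@dataclass
class GPTQConfig:          # stand-in for transformers' GPTQConfig
    bits: int = 4
    group_size: int = 128
    backend: str = "triton"

class TritonV2QuantLinear: ...
class IPEXQuantLinear: ...

def _select_quant_linear(backend: str):
    # Internal API: free to change without breaking optimum.
    return {"triton": TritonV2QuantLinear, "ipex": IPEXQuantLinear}[backend]

def optimum_quant_linear(config: GPTQConfig):
    """Public entry point for optimum: maps a GPTQConfig to a QuantLinear class."""
    return _select_quant_linear(config.backend)

print(optimum_quant_linear(GPTQConfig(backend="ipex")).__name__)  # IPEXQuantLinear
```

The boundary is the function signature: as long as optimum_quant_linear keeps accepting a GPTQConfig, gptqmodel can refactor _select_quant_linear freely.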
