Add torchao to optimum as a pytorch backend configuration #297

jerryzh168 · 2024-11-22T00:38:06Z

Summary:
As the title says.
For now this only adds int4 weight-only quantization, using `TorchAoConfig` from https://huggingface.co/docs/transformers/main/en/quantization/torchao
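For readers unfamiliar with `TorchAoConfig`, here is a minimal sketch of int4 weight-only loading through transformers, following the pattern in the linked docs. It is not the code added by this PR: the helper name `load_int4_weight_only` is hypothetical, `group_size=128` is an illustrative value, and running it assumes `torch`, `transformers`, `torchao`, and `accelerate` are installed and a CUDA device is available.

```python
def load_int4_weight_only(model_id: str):
    """Load a causal LM with torchao int4 weight-only quantization.

    Hypothetical helper for illustration. The heavy imports are kept local
    so that defining the function does not require the libraries.
    """
    import torch
    from transformers import AutoModelForCausalLM, TorchAoConfig

    # group_size=128 is an assumed, illustrative setting, not taken from this PR.
    quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the int4 weight-only path is typically used with bfloat16
        device_map="auto",           # requires accelerate
        quantization_config=quantization_config,
    )
```

Calling `load_int4_weight_only("<model id>")` should return a model whose linear layers have been quantized at load time.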

Test Plan:
python examples/pytorch_llama.py

gpt2 model

Base:
prefill: 9486.25 decode: 83.75

AWQ:
prefill: 9496.02 decode: 83.62

GPTQ:
prefill: 9814.27 decode: 97.31

torchao int4wo:
prefill: 10007.93 decode: 84.66

llama2

Base:
prefill: 2275.32 decode: 18.92

AWQ:
prefill: 2344.19 decode: 18.21

GPTQ:
prefill: 2881.87 decode: 26.47

torchao int4wo:
prefill: 3035.82 decode: 24.51
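To make the comparison easier to read, the numbers above can be turned into ratios against the Base run. This is a small sketch using only the figures reported in this description; it assumes higher is better (throughput), since the units are not stated here, and `relative_to_base` is a hypothetical helper name.

```python
# (prefill, decode) per configuration, copied from the results above.
RESULTS = {
    "gpt2": {
        "Base": (9486.25, 83.75),
        "AWQ": (9496.02, 83.62),
        "GPTQ": (9814.27, 97.31),
        "torchao int4wo": (10007.93, 84.66),
    },
    "llama2": {
        "Base": (2275.32, 18.92),
        "AWQ": (2344.19, 18.21),
        "GPTQ": (2881.87, 26.47),
        "torchao int4wo": (3035.82, 24.51),
    },
}


def relative_to_base(results):
    """Return {model: {config: (prefill_ratio, decode_ratio)}} relative to 'Base'."""
    ratios = {}
    for model, configs in results.items():
        base_prefill, base_decode = configs["Base"]
        ratios[model] = {
            name: (prefill / base_prefill, decode / base_decode)
            for name, (prefill, decode) in configs.items()
        }
    return ratios


if __name__ == "__main__":
    for model, configs in relative_to_base(RESULTS).items():
        for name, (prefill_ratio, decode_ratio) in configs.items():
            print(f"{model:7s} {name:15s} prefill x{prefill_ratio:.2f}  decode x{decode_ratio:.2f}")
```

On llama2, torchao int4wo comes out at roughly 1.33x prefill and 1.30x decode relative to Base; on gpt2 the differences are much smaller.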


IlyasMoutawwakil · 2024-11-22T09:25:48Z

@jerryzh168 great prefill performance!
Can you also add torchao to the libraries installed in our Docker images?

jerryzh168 · 2024-11-22T18:54:37Z

@IlyasMoutawwakil I think we could get better performance if q, k, and v were fused in the llama model, which is what people typically do today. I'm not sure whether there is a Hugging Face model on the Hub that has this, though.

By adding torchao to docker, do you mean these: https://github.com/huggingface/optimum-benchmark/tree/main/docker ?
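For context on the fused-qkv idea mentioned above, here is a dependency-free sketch using plain Python lists. It only illustrates the general technique (stacking the three projection matrices so that one matrix-vector product replaces three); it is not how transformers, torchao, or optimum-benchmark implement anything, and all function names and toy weights are made up for the example. In PyTorch the equivalent step would be concatenating the `nn.Linear` weights along the output dimension.

```python
def linear(x, weight):
    """Compute y = weight @ x for one input vector; weight has one row per output."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weight]


def fuse_qkv(w_q, w_k, w_v):
    """Stack the q, k, and v projection matrices along the output dimension."""
    return w_q + w_k + w_v


def split_qkv(fused_out, q_dim, k_dim):
    """Split the fused projection output back into q, k, and v."""
    q = fused_out[:q_dim]
    k = fused_out[q_dim:q_dim + k_dim]
    v = fused_out[q_dim + k_dim:]
    return q, k, v


# Toy weights: 3 input features, 2 output features per projection.
w_q = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
w_k = [[0.0, 1.0, 1.0], [2.0, 0.0, -1.0]]
w_v = [[1.0, 1.0, 1.0], [0.0, 0.0, 3.0]]
x = [1.0, 2.0, 3.0]

# Three separate projections versus one fused projection followed by a split.
separate = (linear(x, w_q), linear(x, w_k), linear(x, w_v))
fused = split_qkv(linear(x, fuse_qkv(w_q, w_k, w_v)), len(w_q), len(w_k))
assert fused == separate
```

The results are identical; the potential benefit of fusion is that the input is read once and a single larger matrix multiply is launched instead of three smaller ones.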

IlyasMoutawwakil · 2024-11-25T13:42:15Z

> @IlyasMoutawwakil I feel we should be able to get better performance if qkv are fused in llama model, which is what people typically do today, not sure if there is a huggingface model in the hub that has it though

That would require changes to the transformers modeling code, and it would make a great PR there. Some frameworks (like IPEX in optimum-intel) do it manually.

> by adding torchao to docker, do you mean these: https://github.com/huggingface/optimum-benchmark/tree/main/docker ?

Yes, exactly. As with torchvision and torchaudio, I would rather have torchao installed there to avoid any CPU/CUDA version issues.

jerryzh168 · 2024-11-25T18:56:51Z

Yeah, we are doing that manually in our benchmarks as well; maybe we can worry about it a bit later. I'll make the Docker changes first.

jerryzh168 · 2024-11-25T22:31:42Z

@IlyasMoutawwakil I added torchao to the CPU and CUDA Docker images. It's not available for ROCm 5.7, but it is available for 6.1 and 6.2: https://download.pytorch.org/whl/nightly/torchao/

docker

cb00c0f

IlyasMoutawwakil merged commit 7f5d486 into huggingface:main Nov 26, 2024
