
[Needs more investigation] int8_weight_only via quantize_() API on torch.float16 models results in NaN values across multiple CPU architectures #1662

Open
vmpuri opened this issue Feb 4, 2025 · 3 comments
Labels: bug (Something isn't working), quantize

vmpuri commented Feb 4, 2025

Note: I'll work on seeing if this reproduces with a non-torchchat example.

While working on migrating torchchat's WeightOnlyInt8Quantizer to AO's quantize_(model, int8_weight_only()) API, I ran into issues where values would go to NaN after a few layers if the model's dtype was initially float16. This seems to occur across multiple platforms (tested with MPS, Mac CPU, x86 CPU), so I'm not sure if it's a hardware-specific issue.

Interestingly, the error does not occur when the model dtype is set to bfloat16.
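This is consistent with float16's narrow dynamic range: float16 has a 5-bit exponent and tops out at 65504, while bfloat16 keeps float32's 8-bit exponent and reaches roughly 3.4e38. A minimal sketch of the mechanism (in NumPy, since the specific model layers aren't needed to show how an overflow becomes NaN):

```python
import numpy as np

# float16's largest finite value is 65504; bfloat16 shares float32's
# exponent range, so the same intermediates would stay finite there.
assert np.finfo(np.float16).max == 65504.0

with np.errstate(over="ignore", invalid="ignore"):
    s = np.float16(60000.0) + np.float16(60000.0)  # overflows to inf
    d = s - s                                      # inf - inf -> nan

assert np.isinf(s)
assert np.isnan(d)
```

Once a single inf appears in an intermediate tensor, later subtractions or normalizations (e.g. the max-subtraction inside softmax) turn it into NaN, which then propagates through every subsequent layer.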

To reproduce, check out the torchchat PR with the migration and run a model using:

python3 torchchat.py generate llama3.1 --quantize '{"linear:int8": {"groupsize": 256}, "executor":{"accelerator":"mps"}}' --prompt "King in the castle, king in the castle, i have a chair." --num-samples 3 --dtype float16

You'll notice the model outputs only "!" tokens, which correspond to NaN values. If you add a debug hook to the model, you can see that some values in the intermediate tensors get very close to 0 just before NaN values are detected.
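A debug hook of this kind can be built with PyTorch forward hooks. Below is a hedged sketch, not the exact hook used here; `BadLayer` is a hypothetical stand-in that injects NaN so the demo is self-contained, whereas in practice you would attach the hooks to the quantized model:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module):
    """Attach forward hooks that record which modules emit non-finite outputs."""
    offenders = []

    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            offenders.append(type(module).__name__)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    return offenders, handles

# Tiny demo model; BadLayer deliberately produces NaN mid-network.
class BadLayer(nn.Module):
    def forward(self, x):
        return x * float("nan")

model = nn.Sequential(nn.Linear(4, 4), BadLayer(), nn.Linear(4, 4))
offenders, handles = add_nan_hooks(model)
model(torch.randn(2, 4))
for h in handles:
    h.remove()

# The first offender pinpoints where values go bad; modules after it
# merely inherit the NaN inputs.
assert offenders[0] == "BadLayer"
```

The first entry in `offenders` is the layer where values first go non-finite; everything recorded after it is downstream fallout, which is how you can tell an overflow site from mere NaN propagation.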

drisspg added the bug (Something isn't working) and quantize labels on Feb 5, 2025
psinger commented Feb 5, 2025

I can confirm this. I also noticed it the other day but did not dig deeper.

If the base weights are in float16, int8_weight_only completely breaks the outputs. If the base weights are in bfloat16, the output is as expected in inference-only mode.

vmpuri changed the title from "[Needs more investigation] int8_weight_only via quantize_() API results in NaN values across multiple CPU architectures" to "[Needs more investigation] int8_weight_only via quantize_() API on torch.float16 models results in NaN values across multiple CPU architectures" on Feb 5, 2025
leslie-fang-intel (Collaborator) commented

Thanks for reporting this issue. I will take a look.

leslie-fang-intel (Collaborator) commented Feb 11, 2025

It seems like an overflow issue. Hi @vmpuri @psinger, did you hit the same issue on GPU?
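The overflow hypothesis is easy to reproduce outside torchao. In int8 weight-only quantization, weights are dequantized back to the activation dtype before the matmul; if that dtype is float16, accumulated dot products can exceed 65504 and become inf, while the same math in float32 stays finite. A hedged NumPy sketch of the mechanism (per-tensor symmetric quantization with values chosen to force the overflow; this is an illustration, not torchao's actual kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
# Non-negative weights so the dot products accumulate instead of cancelling.
w = np.abs(rng.standard_normal((256, 256))).astype(np.float32)

# Per-tensor symmetric int8 quantization of the weights.
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)

# Large but individually representable float16 activations.
x = np.full((1, 256), 400.0, dtype=np.float16)

with np.errstate(over="ignore", invalid="ignore"):
    # float16 path: dequantize to float16; the result exceeds 65504 -> inf.
    y16 = x @ (w_q.astype(np.float16) * np.float16(scale))
    # float32 path: identical math, no overflow.
    y32 = x.astype(np.float32) @ (w_q.astype(np.float32) * scale)

assert np.isinf(y16).any()      # float16 result overflowed
assert np.isfinite(y32).all()   # same values fit comfortably in float32
```

Whether the overflow happens in the accumulator or only when the result is stored, any value above 65504 becomes inf in the float16 output, which then turns into NaN downstream. This also explains why bfloat16 is unaffected: its exponent range matches float32's.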

I drafted a PR to fix it: #1698
With that fix, the output of the command below is:

PyTorch: 43a00d73b36494e052a82418182c63e18e9d9f69
AO: https://github.com/pytorch/ao/pull/1698
TorchChat: https://github.com/pytorch/torchchat/pull/1328

rm -rf /tmp/torchinductor_leslie/* && clear && TORCH_LOGS="+schedule,+inductor,+output_code" TORCH_COMPILE_DEBUG=1 numactl -C 56-111 -m 1 python3 torchchat.py generate llama3.1 --quantize '{"linear:int8": {"groupsize": 256}, "executor":{"accelerator":"cpu"}}' --prompt "King in the castle, king in the castle, i have a chair." --num-samples 3 --dtype float16

I0210 23:07:56.458000 3265051 torch/_inductor/config.py:669] compile_threads set to 32
import error: No module named 'triton'
Note: detected 224 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
Note: NumExpr detected 224 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
NumExpr defaulting to 16 threads.
PyTorch version 2.7.0a0+git43a00d7 available.
Warning: PTEModel (ExecuTorch) not available with exception: No module named 'executorch'
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/localdisk/leslie/torch_miniforge/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Using device=cpu Intel (R) Xeon (R) CPU Max 9480
Loading model...
Time to load model: 1.51 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 256}, 'executor': {'accelerator': 'cpu'}}
quantizer is linear int8
Time to quantize model: 0.86 seconds
-----------------------------------------------------------
King in the castle, king in the castle, i have a chair. I have a table, a king in the castle, king in the castle, i have a friend. I have a house, a king's in the castle, king in the castle, i have a heart.
King in the castle, king in the castle, i have a chair. I am the king in the castle, king in the castle, you have a heart. I have a table, a king in the castle,undefined king in the castle, i have a friend.
King in the castle, king in the castle, we have a chair. We have a table, a king in the castle, king in the castle, we have a heart. I have a house, a king in the castle, king in the castle, we have a friend.
King in the castle is presenting the concept of a "king's kingdom" while using rhymes to express his perspective that he is the rightful ruler.
King in the castle, king in the castle, i have a
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generated 199 tokens
Time for inference 1: 55.1958 sec total
Time to first token: 0.4283 sec with parallel prefill.

      Total throughput: 3.6235 tokens/sec, 0.2760 s/token
First token throughput: 2.3350 tokens/sec, 0.4283 s/token
 Next token throughput: 3.6335 tokens/sec, 0.2752 s/token

Bandwidth achieved: 58.19 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

King in the castle, king in the castle, i have a chair. King in the castle, king in the castle, i have a chair.
The king is in the castle, the king is in the castle, he has a horse. The king is in the castle, the king is in the castle, he has a horse.
King in the castle, king in the castle, he has a queen. King in the castle, king in the castle, he has a queen.
The king is in the castle, the king is in the castle, he has a crown. The king is in the castle, the king is in the castle, he has a crown. King in the castle, king in the castle, he has a sword. King in the castle, king in the castle, he has a sword.
The king is in the castle, the king is in the castle, he has a throne. The king is in the castle, the king is in the castle, he has a throne.
King in the castle, king in the castle,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generated 199 tokens
Time for inference 2: 55.3544 sec total
Time to first token: 0.4365 sec with parallel prefill.

      Total throughput: 3.6131 tokens/sec, 0.2768 s/token
First token throughput: 2.2910 tokens/sec, 0.4365 s/token
 Next token throughput: 3.6236 tokens/sec, 0.2760 s/token

Bandwidth achieved: 58.03 GB/s

========================================

King in the castle, king in the castle, i have a chair. I'm sitting in my chair. In my castle. I'm the king of the castle. My chair is tall. And my castle is grand. My chair is strong. And my castle is secure. Where are you?
What are you doing? You want to sit in my chair? No, no, no! You can't sit in my chair. It's mine. I'm the king. I'm in my castle. You have to leave now. Go and find your own castle to sit in your own chair. This one is mine. I'm the king of the castle. My chair is tall. And my castle is grand. I'll make you a cup of tea. Would you like a cup of tea? No, no, no. You can't sit in my chair. You must leave.
I'm the king. I have a castle. And I have a chair. I don't care if you are the king of another castle. I'm the king of
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generated 199 tokens
Time for inference 3: 55.2402 sec total
Time to first token: 0.4357 sec with parallel prefill.

      Total throughput: 3.6206 tokens/sec, 0.2762 s/token
First token throughput: 2.2950 tokens/sec, 0.4357 s/token
 Next token throughput: 3.6311 tokens/sec, 0.2754 s/token

Bandwidth achieved: 58.15 GB/s

========================================


Warning: Excluding compile in calculations
      Average tokens/sec (total): 3.62
Average tokens/sec (first token): 2.31
Average tokens/sec (next tokens): 3.63

I0210 23:10:47.859000 3265051 torch/_inductor/remote_cache.py:417] Cache Metrics: None
I0210 23:10:47.859000 3265051 torch/_inductor/remote_cache.py:417]
