Add INT4 quant/de-quant kernels #620

Open · wants to merge 8 commits into main_perf
Conversation

rahulbatra85 commented:

No description provided.
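Since the PR has no description, here is a minimal, illustrative sketch of an element-wise INT4 quant/de-quant Triton kernel of the kind this PR adds, assuming per-tensor symmetric scaling and an unpacked int8 output buffer (one INT4 code per byte). Apart from the two tl.store lines quoted in the review thread below, none of this is taken from the diff; the kernel, wrapper, and argument names are placeholders, and the real kernels may well split quantization and de-quantization into separate kernels and pack two nibbles per byte.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _int4_quant_dequant_kernel(x_ptr, x_out_quant_ptr, x_out_dequant_ptr, scale,
                               n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offset = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offset < n_elements
    x = tl.load(x_ptr + offset, mask=mask, other=0.0)

    # Symmetric quantization to the signed INT4 range [-8, 7]: scale, round to
    # nearest (by shifting half a step toward the sign before truncating), clamp.
    q = x / scale
    q = tl.where(q >= 0, q + 0.5, q - 0.5)
    q = tl.maximum(tl.minimum(q, 7.0), -8.0)
    x_quant = q.to(tl.int8)

    # De-quantize back to floating point for comparison / debugging.
    dequant = x_quant.to(tl.float32) * scale

    tl.store(x_out_quant_ptr + offset, x_quant, mask=mask)
    tl.store(x_out_dequant_ptr + offset, dequant, mask=mask)


def int4_quant_dequant(x: torch.Tensor):
    """Per-tensor symmetric INT4 round trip (illustrative wrapper only)."""
    scale = x.abs().max().item() / 7.0
    x_quant = torch.empty_like(x, dtype=torch.int8)   # unpacked: one INT4 value per byte
    x_dequant = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    _int4_quant_dequant_kernel[grid](x, x_quant, x_dequant, scale, n_elements, BLOCK_SIZE=1024)
    return x_quant, x_dequant
```

Usage would look like `x_q, x_dq = int4_quant_dequant(x)` for a CUDA/HIP tensor `x`.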

micmelesse and others added 8 commits July 17, 2024 05:04
Add Perf Kernels

This is a combination of 2 commits.

Add Perf Kernels

Add Perf Kernels

This is a combination of 6 commits.

add perf-kernels

fix formatting issues

fix unused variables and other bugs

fix other issues

remove scripts

save

check changes

format

save

save

try

pre-commit check

save

Change all block pointers to tensor pointers

Block pointers are intended for NVIDIA TMAs. They are useful for regular loads as well, but are not well supported.

Also cleaned up some code I came across along the way and updated the comment at the top.
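Illustrative only (not from this diff): the same bounds-checked 2-D tile copy written first with a block pointer and then with a plain tensor pointer, to show the shape of the rewrite this commit applies; the kernel names and tile sizes are placeholders.

```python
import triton
import triton.language as tl


@triton.jit
def copy_tile_block_ptr(x_ptr, y_ptr, M, N, stride_m, stride_n,
                        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Block-pointer form: descriptor-style loads that map onto NVIDIA TMA
    # hardware, but are less uniformly supported on other backends.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    x_blk = tl.make_block_ptr(base=x_ptr, shape=(M, N), strides=(stride_m, stride_n),
                              offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                              block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    y_blk = tl.make_block_ptr(base=y_ptr, shape=(M, N), strides=(stride_m, stride_n),
                              offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                              block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tile = tl.load(x_blk, boundary_check=(0, 1))
    tl.store(y_blk, tile, boundary_check=(0, 1))


@triton.jit
def copy_tile_tensor_ptr(x_ptr, y_ptr, M, N, stride_m, stride_n,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Tensor-pointer form: explicit offset arithmetic plus a mask; the portable
    # path the kernels are switched to in this commit.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs = offs_m[:, None] * stride_m + offs_n[None, :] * stride_n
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tile = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(y_ptr + offs, tile, mask=mask)
```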
Add support for layouts commonly used by users.

Add an option for the varlen / thd layout to specify equal context lengths for all batches, which is also a common use case.
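A hedged sketch of what the equal-length varlen / thd case looks like on the host side; the flag or argument this commit actually adds may be named differently, and `equal_len_cu_seqlens` is a made-up helper.

```python
import torch


def equal_len_cu_seqlens(batch: int, seqlen: int, device: str = "cuda") -> torch.Tensor:
    # cu_seqlens = [0, L, 2L, ..., B*L]; sequence i occupies rows
    # cu_seqlens[i]:cu_seqlens[i+1] of the packed (total_tokens, heads, head_dim) tensor.
    return torch.arange(0, (batch + 1) * seqlen, seqlen, dtype=torch.int32, device=device)


# Example: 4 sequences of 128 tokens packed into one (512, H, D) tensor.
cu_seqlens = equal_len_cu_seqlens(batch=4, seqlen=128)  # tensor([0, 128, 256, 384, 512])
```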
* remove on push for Integration Tests

* rename

* add post merge test

* save

* dtype params

* skip bad config

* fix more stuff

Increase CI timeout

Couple of FA optimizations

Set SM scale multiplication to a constexpr. Minor asm improvement.

Changed the acc scaling so that the softmax division becomes a multiplication by the reciprocal. ~10% perf improvement.

---------

Co-authored-by: Michael Melesse <[email protected]>
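A hedged illustration of the two changes described in this commit message, written as stand-alone Triton device functions with placeholder names (`acc`, `l_i`, and `SM_SCALE` follow common Flash Attention naming, not necessarily this kernel's).

```python
import triton
import triton.language as tl


@triton.jit
def _scale_qk(qk, SM_SCALE: tl.constexpr):
    # Passing the softmax scale as a constexpr lets the compiler fold the
    # multiplication into surrounding instructions (the "minor asm improvement").
    return qk * SM_SCALE


@triton.jit
def _epilogue_divide(acc, l_i):
    # Before: one floating-point division per output element.
    return acc / l_i[:, None]


@triton.jit
def _epilogue_reciprocal(acc, l_i):
    # After: one reciprocal per row of the softmax denominator, then multiplies
    # across the row (the change credited with the ~10% gain).
    l_recip = 1.0 / l_i
    return acc * l_recip[:, None]
```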
tl.store(x_out_dequant_ptr + offset, dequant, mask=mask)


if __name__ == '__main__':
Collaborator:
Can you add a few test cases instead of this? Check flash-attention.py to see how we use pytest.

tl.store(x_out_quant_ptr + offset, x_quant, mask=mask)


if __name__ == '__main__':
Collaborator:
Same for this (pytest)
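A hedged sketch of the kind of pytest coverage being asked for here, in the `@pytest.mark.parametrize` style used in flash-attention.py. The import path and the `int4_quant_dequant` entry point are placeholders (they reuse the illustrative wrapper sketched near the top of this page); the PR's real functions, shapes, and tolerances will differ.

```python
import pytest
import torch

# Hypothetical import: wherever the PR's quant/de-quant wrappers actually live.
from int4_kernels import int4_quant_dequant


@pytest.mark.parametrize('N', [128, 1024, 4096 + 3])   # include a size that is not a multiple of BLOCK_SIZE
@pytest.mark.parametrize('dtype', [torch.float16, torch.float32])
def test_int4_quant_dequant(N, dtype):
    torch.manual_seed(0)
    x = torch.randn(N, dtype=dtype, device='cuda')

    # Torch reference: per-tensor symmetric INT4, range [-8, 7].
    scale = x.abs().max().float() / 7.0
    q_ref = torch.clamp(torch.round(x.float() / scale), -8, 7).to(torch.int8)
    dq_ref = (q_ref.float() * scale).to(dtype)

    q_tri, dq_tri = int4_quant_dequant(x)

    # Quantized codes may differ by 1 at rounding ties; de-quantized values
    # should agree to within one quantization step.
    assert (q_tri - q_ref).abs().max() <= 1
    torch.testing.assert_close(dq_tri, dq_ref, atol=float(scale), rtol=0.0)
```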
