Couple of FA optimizations #608

vgokhale · 2024-06-27T19:33:03Z

Make the softmax (SM) scale multiplication a constexpr. This gives a minor improvement in the generated asm.

Change the acc scaling that accounts for the softmax division into a multiplication by the reciprocal. This gives a ~10% perf improvement.
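The second change can be sketched outside Triton. The following is a minimal pure-Python illustration, not the kernel code from this PR: it runs online-softmax attention for a single query row and normalizes the accumulator at the end either by dividing every element by the softmax denominator or by computing the reciprocal once and multiplying. The function name `attention_row`, its parameters, and the block size are hypothetical and chosen only for illustration.

```python
import math


def attention_row(scores, values, block=2, use_reciprocal=True):
    """Online-softmax attention for one query row, processed block by block.

    scores: list of float logits (assumed already scaled), one per key.
    values: list of equal-length float lists, one value vector per key.
    """
    head_dim = len(values[0])
    m_i = float("-inf")      # running row maximum
    l_i = 0.0                # running softmax denominator
    acc = [0.0] * head_dim   # unnormalized output accumulator

    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m_i, max(s_blk))
        alpha = math.exp(m_i - m_new)  # rescales the previously accumulated state
        p = [math.exp(s - m_new) for s in s_blk]
        l_i = l_i * alpha + sum(p)
        acc = [
            a * alpha + sum(p_j * v[d] for p_j, v in zip(p, v_blk))
            for d, a in enumerate(acc)
        ]
        m_i = m_new

    if use_reciprocal:
        l_recip = 1.0 / l_i                # a single division per row
        return [a * l_recip for a in acc]  # then one multiplication per element
    return [a / l_i for a in acc]          # one division per element
```

Both paths produce the same result up to floating-point rounding. The reciprocal form trades one division per output element for one division per row plus cheap multiplications, which is the kind of saving the description refers to; the actual speedup depends on the hardware and the kernel.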

Add Perf Kernels. This is a combination of 2 commits: Add Perf Kernels; Add Perf Kernels. This is a combination of 6 commits: add perf-kernels, fix formatting issues, fix unused variables and other bugs, fix other issues, remove scripts, save, check changes, format, save, save, try pre-commit check, save.

Change all block pointers to tensor pointers. Block pointers are for NVIDIA TMAs. They are useful for regular loads as well but are not well supported. Also cleaned up some code I came across along the way and updated the comment at the top.

Add support for layouts commonly used by users. Add an option for the varlen / thd layout to specify equal context lengths for all batches, which is also often used by users.
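As a rough illustration of what supporting several layouts involves, the sketch below computes element strides for contiguous tensors in the bhsd and bshd layouts, so that one indexing routine can address either, and builds cumulative sequence offsets for a varlen / thd tensor when every batch has the same context length. This is an assumption-laden sketch, not code from this PR: the names `strides_for`, `offset`, and `equal_cu_seqlens` are hypothetical.

```python
def strides_for(layout, batch, heads, seqlen, dim):
    """Return (stride_b, stride_h, stride_s, stride_d) for a contiguous tensor."""
    if layout == "bhsd":  # (batch, heads, seqlen, dim)
        return (heads * seqlen * dim, seqlen * dim, dim, 1)
    if layout == "bshd":  # (batch, seqlen, heads, dim)
        return (seqlen * heads * dim, dim, heads * dim, 1)
    raise ValueError(f"unsupported layout: {layout}")


def offset(strides, b, h, s, d):
    """Flat element offset of index (b, h, s, d) given per-axis strides."""
    stride_b, stride_h, stride_s, stride_d = strides
    return b * stride_b + h * stride_h + s * stride_s + d * stride_d


def equal_cu_seqlens(batch, seqlen):
    """Cumulative sequence offsets for a thd tensor with equal context lengths."""
    return [i * seqlen for i in range(batch + 1)]
```

Passing strides instead of hard-coding one layout lets the same addressing logic serve both cases; only the stride tuple changes.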

Co-authored-by: Michael Melesse <[email protected]>

micmelesse and others added 5 commits June 19, 2024 08:21

skip backward (#586)

cc535d3

Change all block pointers to tensor pointers (#585)

cfb231f

Add support for bshd layout (#587)

18930eb

Couple of FA optimizations

0d1c3e1

vgokhale requested a review from micmelesse June 27, 2024 19:33

vgokhale self-assigned this Jun 27, 2024

vgokhale added the perf improvement label Jun 27, 2024

Make linter happy

db3beaf

zhanglx13 approved these changes Jun 27, 2024

micmelesse approved these changes Jun 28, 2024

vgokhale added 2 commits July 8, 2024 22:25

Reduce some autotune configs and keys to reduce runtime

5e2ffc6

Linter

8dd5404

micmelesse force-pushed the main_perf branch from d26ef1d to dbe1173 Compare July 16, 2024 23:38

vgokhale and others added 3 commits July 18, 2024 17:08

Merge branch 'main_perf' into fa_optims

4787d45

Fix bug

bfbc3ef

Remove accidentally added file

3557466

vgokhale merged commit df4c4d3 into main_perf Jul 19, 2024
4 checks passed

vgokhale deleted the fa_optims branch July 19, 2024 22:50
