Implement u8 x i8 -> i32 GEMM kernel for x86_64 using AVX2 intrinsics #535

Merged: 3 commits into main from gemm-int8-x64 on Jan 13, 2025

Conversation

robertknight (Owner) commented on Jan 12, 2025

Add packing functions for kernels that use int8 x int8 -> i32 dot product instruction sequences, and a kernel for x86_64 that uses them with AVX2 intrinsics.

This doesn't include the GEMV kernel. I will add that separately.

This is the "baseline" kernel for x64 for systems that don't support VNNI ("DL Boost") instructions.


Performance numbers on an i5-1038NG7 (4 cores) comparing f32 matmul vs. int8, excluding gemv. The int8 matmul is now faster, but other quantization overheads mean that, e.g., int8 ModernBERT is slightly slower than f32.

```
f32 x f32 -> f32
Testing kernel fma
m 512 n 512 k 512 iters 512. Duration 419.527ms (0.819ms/iter). GFLOPS 327.6
m 1024 n 1024 k 1024 iters 64. Duration 383.096ms (5.986ms/iter). GFLOPS 358.8
m 128 n 2048 k 512 iters 512. Duration 453.880ms (0.886ms/iter). GFLOPS 302.8
m 2048 n 128 k 512 iters 512. Duration 415.957ms (0.812ms/iter). GFLOPS 330.4

u8 x i8 -> i32
Testing kernel avx2-int8
m 512 n 512 k 512 iters 512. Duration 330.732ms (0.646ms/iter). GFLOPS 415.6
m 1024 n 1024 k 1024 iters 64. Duration 291.978ms (4.562ms/iter). GFLOPS 470.7
m 128 n 2048 k 512 iters 512. Duration 332.322ms (0.649ms/iter). GFLOPS 413.6
m 2048 n 128 k 512 iters 512. Duration 328.687ms (0.642ms/iter). GFLOPS 418.1
```
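Assuming the GFLOPS column is computed with the standard GEMM operation count of 2·m·n·k per iteration (this helper is an illustrative sketch, not code from the PR), the reported figures can be sanity-checked:

```python
def gflops(m, n, k, ms_per_iter):
    # A GEMM performs m*n*k multiplies and m*n*k adds, i.e. 2*m*n*k ops.
    ops = 2 * m * n * k
    return ops / (ms_per_iter / 1000.0) / 1e9

# First fma row: m=n=k=512 at 0.819 ms/iter reports GFLOPS 327.6.
print(round(gflops(512, 512, 512, 0.819), 1))  # ~327.8, matching the table
```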

This uses the pre-VNNI instruction sequence for u8 x i8 -> i32 dot
products:

```
_mm256_maddubs_epi16
_mm256_madd_epi16
_mm256_add_epi32
```

The first instruction can saturate when adding pairs of intermediate signed 16-bit products. This can be avoided by limiting the range of the u8 LHS input.
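The saturation hazard can be illustrated with a scalar model of the sequence. The function names below are invented scalar stand-ins for the intrinsics, not a real API; they show why `_mm256_maddubs_epi16` loses precision for large u8 inputs and why clamping the LHS avoids it:

```python
I16_MIN, I16_MAX = -32768, 32767

def maddubs_epi16(a_u8, b_i8):
    """Scalar model of _mm256_maddubs_epi16: multiply unsigned bytes of a
    by signed bytes of b, add adjacent pairs, saturate the sum to i16."""
    out = []
    for i in range(0, len(a_u8), 2):
        pair = a_u8[i] * b_i8[i] + a_u8[i + 1] * b_i8[i + 1]
        out.append(max(I16_MIN, min(I16_MAX, pair)))
    return out

def madd_epi16_ones(x_i16):
    """Scalar model of _mm256_madd_epi16 against a vector of ones:
    widen i16 lanes to i32 and add adjacent pairs."""
    return [x_i16[i] + x_i16[i + 1] for i in range(0, len(x_i16), 2)]

# Four u8 x i8 products reduced to one i32 partial sum.
a = [255, 255, 255, 255]  # u8 LHS at its maximum
b = [127, 127, 127, 127]  # i8 RHS
# Each pair sum is 255*127 + 255*127 = 64770, which overflows i16 and
# saturates to 32767, so the result is inexact.
print(madd_epi16_ones(maddubs_epi16(a, b)))  # [65534], exact would be 129540

# Limiting the u8 LHS to <= 127 keeps every pair sum within i16 range:
# 127*127 + 127*127 = 32258 <= 32767, so no saturation occurs.
a_small = [127, 127, 127, 127]
print(madd_epi16_ones(maddubs_epi16(a_small, b)))  # [64516], exact
```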
robertknight marked this pull request as ready for review on Jan 13, 2025, 08:41
robertknight merged commit 8fe2402 into main on Jan 13, 2025 (2 checks passed)
robertknight deleted the gemm-int8-x64 branch on Jan 13, 2025, 08:42
robertknight mentioned this pull request on Jan 15, 2025