Implement u8 x i8 -> i32 GEMM kernel for x86_64 using AVX2 intrinsics #535

Merged: 3 commits into main from gemm-int8-x64 on Jan 13, 2025

Conversation

robertknight (Owner) commented on Jan 12, 2025

Add packing functions for kernels that use int8 x int8 -> i32 dot product instruction sequences, and a kernel for x86_64 that uses them with AVX2 intrinsics.

This doesn't include the GEMV kernel. I will add that separately.

This is the "baseline" kernel for x64 for systems that don't support VNNI ("DL Boost") instructions.


Performance numbers on an i5-1038NG7 (4 cores) comparing f32 matmul vs. int8, excluding gemv. The int8 matmul is now faster, but other quantization overheads mean that, e.g., int8 ModernBERT is slightly slower than f32.

```
f32 x f32 -> f32
Testing kernel fma
m 512 n 512 k 512 iters 512. Duration 419.527ms (0.819ms/iter). GFLOPS 327.6
m 1024 n 1024 k 1024 iters 64. Duration 383.096ms (5.986ms/iter). GFLOPS 358.8
m 128 n 2048 k 512 iters 512. Duration 453.880ms (0.886ms/iter). GFLOPS 302.8
m 2048 n 128 k 512 iters 512. Duration 415.957ms (0.812ms/iter). GFLOPS 330.4

u8 x i8 -> i32
Testing kernel avx2-int8
m 512 n 512 k 512 iters 512. Duration 330.732ms (0.646ms/iter). GFLOPS 415.6
m 1024 n 1024 k 1024 iters 64. Duration 291.978ms (4.562ms/iter). GFLOPS 470.7
m 128 n 2048 k 512 iters 512. Duration 332.322ms (0.649ms/iter). GFLOPS 413.6
m 2048 n 128 k 512 iters 512. Duration 328.687ms (0.642ms/iter). GFLOPS 418.1
```
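Assuming the GFLOPS column is computed with the standard GEMM operation count of 2·m·n·k per iteration (this helper is an illustrative sketch, not code from the PR), the reported figures can be sanity-checked:

```python
def gflops(m, n, k, ms_per_iter):
    # A GEMM performs m*n*k multiplies and m*n*k adds, i.e. 2*m*n*k ops.
    ops = 2 * m * n * k
    return ops / (ms_per_iter / 1000.0) / 1e9

# First fma row: m=n=k=512 at 0.819 ms/iter reports GFLOPS 327.6.
print(round(gflops(512, 512, 512, 0.819), 1))  # ~327.8, matching the table
```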

This uses the pre-VNNI instruction sequence for u8 x i8 -> i32 dot
products:

```
_mm256_maddubs_epi16
_mm256_madd_epi16
_mm256_add_epi32
```

The first instruction can saturate when adding pairs of intermediate signed 16-bit products. This can be avoided by limiting the range of the u8 LHS input.
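The saturation hazard can be illustrated with a scalar model of the sequence. The function names below are invented scalar stand-ins for the intrinsics, not a real API; they show why `_mm256_maddubs_epi16` loses precision for large u8 inputs and why clamping the LHS avoids it:

```python
I16_MIN, I16_MAX = -32768, 32767

def maddubs_epi16(a_u8, b_i8):
    """Scalar model of _mm256_maddubs_epi16: multiply unsigned bytes of a
    by signed bytes of b, add adjacent pairs, saturate the sum to i16."""
    out = []
    for i in range(0, len(a_u8), 2):
        pair = a_u8[i] * b_i8[i] + a_u8[i + 1] * b_i8[i + 1]
        out.append(max(I16_MIN, min(I16_MAX, pair)))
    return out

def madd_epi16_ones(x_i16):
    """Scalar model of _mm256_madd_epi16 against a vector of ones:
    widen i16 lanes to i32 and add adjacent pairs."""
    return [x_i16[i] + x_i16[i + 1] for i in range(0, len(x_i16), 2)]

# Four u8 x i8 products reduced to one i32 partial sum.
a = [255, 255, 255, 255]  # u8 LHS at its maximum
b = [127, 127, 127, 127]  # i8 RHS
# Each pair sum is 255*127 + 255*127 = 64770, which overflows i16 and
# saturates to 32767, so the result is inexact.
print(madd_epi16_ones(maddubs_epi16(a, b)))  # [65534], exact would be 129540

# Limiting the u8 LHS to <= 127 keeps every pair sum within i16 range:
# 127*127 + 127*127 = 32258 <= 32767, so no saturation occurs.
a_small = [127, 127, 127, 127]
print(madd_epi16_ones(maddubs_epi16(a_small, b)))  # [64516], exact
```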
robertknight marked this pull request as ready for review on Jan 13, 2025, 08:41
robertknight merged commit 8fe2402 into main on Jan 13, 2025 (2 checks passed)
robertknight deleted the gemm-int8-x64 branch on Jan 13, 2025, 08:42
robertknight mentioned this pull request on Jan 15, 2025