[Help Wanted] Optimize binary matmul kernel #5

Closed
chromecast56 opened this issue Mar 3, 2024 · 1 comment

chromecast56 (Collaborator) commented Mar 3, 2024

The current kernel is fairly slow compared to the theoretical optimum, considering the small memory footprint of the weight deltas. Right now it functions more as a proof of concept (e.g., it outperforms naive simultaneous inference). With further optimization we can expect an additional 4-8x latency improvement.

I don't have much kernel optimization experience yet, though - if anyone in the OSS community is interested, would love some help!

Afterwards, it'd be super interesting to run some benchmarks against LoRA-based multi-tenant systems like Punica/S-LoRA.
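
For anyone picking this up, here's a minimal PyTorch reference of the computation the kernel fuses, assuming a BitDelta-style decomposition W_i ≈ W_base + α_i · sign(Δ_i); the function and tensor names below are illustrative, not this repo's API:

```python
import torch

def binary_delta_matmul(x, w_base, delta_signs, scales):
    """Reference (unfused) multi-tenant matmul with 1-bit weight deltas.

    x:           (B, K)    float16 activations, one row per request
    w_base:      (N, K)    float16 base weight shared by all requests
    delta_signs: (B, N, K) int8 in {-1, +1}, one binary delta per request
    scales:      (B,)      float16 per-delta scale alpha_i
    """
    # Shared base contribution: every request reads the same base weight.
    base_out = x @ w_base.t()                        # (B, N)
    # Per-request delta contribution: x_i @ sign(Delta_i)^T.
    delta_out = torch.einsum("bk,bnk->bn", x, delta_signs.to(x.dtype))
    return base_out + scales[:, None] * delta_out    # (B, N)
```

A fused kernel only has to stream K·N/8 bytes of packed sign bits per model on top of the shared base weight, which is the small memory footprint mentioned above.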

@chromecast56 chromecast56 added the enhancement New feature or request label Mar 3, 2024
@chromecast56 chromecast56 self-assigned this Mar 3, 2024
@chromecast56 chromecast56 pinned this issue Mar 3, 2024
chromecast56 (Collaborator) commented

Ended up using the BitBLAS W1A16 kernel, which is pretty fast: when serving 16 models at once, it's 5x faster than the Triton kernel. Will update the repo soon.
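
For reference, a rough sketch of driving a W1A16 matmul through BitBLAS's Python API. The config fields and `transform_weight` call follow the public BitBLAS README; the shapes, the int1 sign encoding, and how this repo actually integrates it are assumptions:

```python
import torch
import bitblas

# W1A16: float16 activations, 1-bit weights (the packed sign matrix of a delta).
matmul_config = bitblas.MatmulConfig(
    M=1,                  # tokens per step (decode)
    N=4096,               # output features -- illustrative size
    K=4096,               # input features  -- illustrative size
    A_dtype="float16",
    W_dtype="int1",       # 1-bit signed weights
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_bias=False,
)
matmul = bitblas.Matmul(config=matmul_config)

# Assumed: int1 weights are handed over as a +/-1 int8 matrix and repacked
# into BitBLAS's internal 1-bit layout by transform_weight.
signs = torch.where(torch.randn(4096, 4096) >= 0, 1, -1).to(torch.int8).cuda()
w_packed = matmul.transform_weight(signs)

x = torch.randn(1, 4096, dtype=torch.float16).cuda()
y = matmul(x, w_packed)  # (1, 4096) float16
```

Since decode-time matmuls are memory-bound, streaming 1-bit instead of 16-bit weights is presumably where most of that 5x comes from.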
