[Help Wanted] Optimize binary matmul kernel #5

Closed
chromecast56 opened this issue Mar 3, 2024 · 1 comment

chromecast56 (Collaborator) commented Mar 3, 2024

The current kernel is fairly slow compared to the theoretical optimum, considering the small memory footprint of the weight deltas. Right now it functions more as a proof of concept (e.g., it outperforms naive simultaneous inference). With further optimization we can expect an additional 4-8x latency improvement.

I don't have much kernel optimization experience yet, though - if anyone in the OSS community is interested, would love some help!

Afterwards, it'd be super interesting to run some benchmarks against LoRA-based multi-tenant systems like Punica/S-LoRA.
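
For anyone picking this up, here's a minimal PyTorch reference of the computation the kernel fuses, assuming a BitDelta-style decomposition W_i ≈ W_base + α_i · sign(Δ_i); the function and tensor names below are illustrative, not this repo's API:

```python
import torch

def binary_delta_matmul(x, w_base, delta_signs, scales):
    """Reference (unfused) multi-tenant matmul with 1-bit weight deltas.

    x:           (B, K)    float16 activations, one row per request
    w_base:      (N, K)    float16 base weight shared by all requests
    delta_signs: (B, N, K) int8 in {-1, +1}, one binary delta per request
    scales:      (B,)      float16 per-delta scale alpha_i
    """
    # Shared base contribution: every request reads the same base weight.
    base_out = x @ w_base.t()                        # (B, N)
    # Per-request delta contribution: x_i @ sign(Delta_i)^T.
    delta_out = torch.einsum("bk,bnk->bn", x, delta_signs.to(x.dtype))
    return base_out + scales[:, None] * delta_out    # (B, N)
```

A fused kernel only has to stream K·N/8 bytes of packed sign bits per model on top of the shared base weight, which is the small memory footprint mentioned above.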

@chromecast56 chromecast56 added the enhancement New feature or request label Mar 3, 2024
@chromecast56 chromecast56 self-assigned this Mar 3, 2024
@chromecast56 chromecast56 pinned this issue Mar 3, 2024
chromecast56 (Collaborator) commented

Ended up using the BitBLAS W1A16 kernel, which is pretty fast: when serving 16 models at once, it's 5x faster than the Triton kernel. Will update the repo soon.
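
For reference, a rough sketch of driving a W1A16 matmul through BitBLAS's Python API. The config fields and `transform_weight` call follow the public BitBLAS README; the shapes, the int1 sign encoding, and how this repo actually integrates it are assumptions:

```python
import torch
import bitblas

# W1A16: float16 activations, 1-bit weights (the packed sign matrix of a delta).
matmul_config = bitblas.MatmulConfig(
    M=1,                  # tokens per step (decode)
    N=4096,               # output features -- illustrative size
    K=4096,               # input features  -- illustrative size
    A_dtype="float16",
    W_dtype="int1",       # 1-bit signed weights
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_bias=False,
)
matmul = bitblas.Matmul(config=matmul_config)

# Assumed: int1 weights are handed over as a +/-1 int8 matrix and repacked
# into BitBLAS's internal 1-bit layout by transform_weight.
signs = torch.where(torch.randn(4096, 4096) >= 0, 1, -1).to(torch.int8).cuda()
w_packed = matmul.transform_weight(signs)

x = torch.randn(1, 4096, dtype=torch.float16).cuda()
y = matmul(x, w_packed)  # (1, 4096) float16
```

Since decode-time matmuls are memory-bound, streaming 1-bit instead of 16-bit weights is presumably where most of that 5x comes from.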
