The current kernel is fairly slow compared to the theoretical optimum, considering the small memory footprint of weight deltas. So right now it functions more as a proof of concept (e.g., it outperforms naive simultaneous inference). I'd expect an additional 4-8x latency improvement if it's further optimized.
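For anyone new to the idea, here's a minimal plain-PyTorch sketch of the computation the kernel accelerates; the names and shapes are hypothetical, and a real implementation would keep the deltas in a compressed low-bit format rather than dense fp16:

```python
import torch

def delta_batched_forward(x, base_w, deltas, tenant_ids):
    """Sketch of multi-tenant inference with a shared base + per-tenant deltas.

    x:          [B, K] batch of activations, one row per request
    base_w:     [N, K] shared base model weight
    deltas:     [T, N, K] per-tenant weight deltas (dense here; low-bit in practice)
    tenant_ids: [B] which tenant each request belongs to
    """
    # One big GEMM against the shared base weight, amortized across tenants.
    shared = x @ base_w.t()                               # [B, N]
    # Per-request delta term; this is the part a fused kernel makes cheap.
    per_tenant = torch.einsum("bk,bnk->bn", x, deltas[tenant_ids])
    return shared + per_tenant
```

Naive simultaneous inference would instead materialize a full [N, K] weight per tenant; even this unoptimized version shares the expensive base GEMM, and the kernel's job is to make the small delta term nearly free by exploiting the low-bit encoding.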
I don't have much kernel optimization experience yet, though - if anyone in the OSS community is interested, would love some help!
Afterwards, it'd be super interesting to run some benchmarks against LoRA-based multi-tenant systems like Punica/S-LoRA.
Ended up using the BitBLAS W1A16 kernel, which is pretty fast -- when serving 16 models at once, it's 5x faster than the Triton kernel. Will update the repo soon.
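For reference, a sketch of how a BitBLAS W1A16 operator can be set up, following BitBLAS's public `Matmul`/`MatmulConfig` API; the shapes, the `"int1"` dtype string, and the {0, 1} weight encoding here are assumptions for illustration, not the repo's actual configuration:

```python
import torch
import bitblas

# Hypothetical single-layer shapes; M is the row count of the activation batch.
M, N, K = 1, 4096, 4096

config = bitblas.MatmulConfig(
    M=M, N=N, K=K,
    A_dtype="float16",     # 16-bit activations ("A16")
    W_dtype="int1",        # 1-bit weights ("W1"); dtype string assumed
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",           # row-major activations, transposed weights
)
matmul = bitblas.Matmul(config=config)

# transform_weight packs already-quantized integer weights into the
# bit-packed layout the kernel expects; the {0, 1} value range is assumed.
weight = torch.randint(0, 2, (N, K), dtype=torch.int8).cuda()
packed = matmul.transform_weight(weight)

x = torch.rand((M, K), dtype=torch.float16).cuda()
y = matmul(x, packed)  # fp16 output of the W1A16 GEMM
```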