First, thanks very much for creating this cool technology.
On one A100 GPU with 80 GB of VRAM, I benchmarked sq-vicuna-7b-v1.3-w3-s0 against its base model. It is a bit strange that the median running time is not reduced by much, which differs from the speed-up results reported in your paper. Could you help me trace a possible reason? Could it be related to my experiment being run on a more powerful GPU?
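For context, something like the following minimal sketch reproduces the kind of median-latency measurement I mean (the model ID, prompt, and token count are placeholders rather than my exact settings; swap in the quantized checkpoint to compare against the base model):

```python
# Minimal latency-benchmark sketch: median wall-clock time over repeated
# generate() calls. Model ID, prompt, and token counts are placeholders.
import time
import statistics
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "lmsys/vicuna-7b-v1.3"  # replace with the quantized checkpoint to compare

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Tell me about quantization.", return_tensors="pt").to("cuda")

latencies = []
for _ in range(10):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    latencies.append(time.perf_counter() - start)

print(f"median latency: {statistics.median(latencies):.3f} s")
```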
Also keep in mind that dequantization at inference time depends much more on the CPU than native bf16/fp16 models do. We have seen a 2.5x improvement when running the same quantized models on the same GPU but with different CPU/memory.
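One quick way to check whether CPU-side work is the bottleneck is to profile a short generation and compare CPU time against GPU kernel time (a rough sketch, not from our setup; it assumes the `model` and `inputs` objects from the benchmark above):

```python
# Rough profiling sketch: records both CPU and CUDA activity for a short
# decode so you can see whether dequant/dispatch overhead on the CPU or
# GPU matmuls dominate. Assumes `model` and `inputs` are already defined.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Sort by total CUDA time; compare against CPU time per op to spot
# launch/dispatch overhead versus actual GPU compute.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```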