First, thanks very much for creating this cool technology.
On one A100 GPU with 80 GB of VRAM, I benchmarked sq-vicuna-7b-v1.3-w3-s0 against its base model. It is a bit strange that the median running time is not reduced by much, which differs from the speed-up results reported in your paper. Could you help me trace a possible reason? Could it be related to my experiment being run on a more powerful GPU?
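For context, something like the following minimal sketch reproduces the kind of median-latency measurement I mean (the model ID, prompt, and token count are placeholders rather than my exact settings; swap in the quantized checkpoint to compare against the base model):

```python
# Minimal latency-benchmark sketch: median wall-clock time over repeated
# generate() calls. Model ID, prompt, and token counts are placeholders.
import time
import statistics
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "lmsys/vicuna-7b-v1.3"  # replace with the quantized checkpoint to compare

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Tell me about quantization.", return_tensors="pt").to("cuda")

latencies = []
for _ in range(10):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    latencies.append(time.perf_counter() - start)

print(f"median latency: {statistics.median(latencies):.3f} s")
```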
Also keep in mind that dequantization at inference time depends much more on the CPU than native bf16/fp16 models do. We have seen a 2.5x improvement when running the same quantized models on the same GPU but with different CPU/memory.
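One quick way to check whether CPU-side work is the bottleneck is to profile a short generation and compare CPU time against GPU kernel time (a rough sketch, not from our setup; it assumes the `model` and `inputs` objects from the benchmark above):

```python
# Rough profiling sketch: records both CPU and CUDA activity for a short
# decode so you can see whether dequant/dispatch overhead on the CPU or
# GPU matmuls dominate. Assumes `model` and `inputs` are already defined.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Sort by total CUDA time; compare against CPU time per op to spot
# launch/dispatch overhead versus actual GPU compute.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```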