Mul_mat Speedup?? #31
We are definitely just at the beginning. Especially with smaller VRAM you should see a great improvement with the upcoming commit (hopefully within an hour). I am getting up to 60 tokens/sec on 7B, and I've seen more than 22 tokens/sec on 40B with generations of 2000 tokens working fine.
I'm quite sure there are optimizations remaining on the mul_mat side, but the next big step is offloading more operations into CUDA.
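For context on what "offloading mul_mat" means in practice, here is a minimal, illustrative CUDA kernel for a plain float matrix multiplication. It is not the project's actual kernel (the real ones operate on quantized weight blocks and use tiling); the kernel name and launch configuration are assumptions for illustration only.

```cuda
// Illustrative only: naive C = A * B where A is (M x K), B is (K x N),
// all row-major float. Real ggml CUDA kernels work on quantized blocks
// and use shared-memory tiling, but the offloaded operation is the same.
__global__ void naive_mul_mat(const float* A, const float* B, float* C,
                              int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // index into M
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // index into N
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Hypothetical launch: one thread per output element.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// naive_mul_mat<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
```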
Very nice. In my case the inference (same seed and prompt) was reduced by 10 seconds, and I got 2 more tokens per second using the commit that has the wizard-type finetuning (what is that, by the way?).
And I've noticed that Falcon Instruct 7B is 5 seconds faster than Wizard Falcon 7B (again, same prompt and seed).
Here are my speeds using Wizard Falcon 40B. I had to increase the threads to 8 to reduce the total time by 15 seconds.
This is fantastic!! The optimizations you've released today have cut the total time in half using Wizard Falcon 40B: it went down from 11.58 minutes to 4.98 minutes on my rig, using -t 8 and -b 512. This was the time it took to generate the story in the Windows install video I did.
And this is the result after today's release
Glad to see it. Give it a try with -t 4 and -b 1. To squeeze out the last bit of performance from your GPU, use --gpu-reserve-mb-main combined with the latest GPU drivers.
Your suggested settings have helped reduce the inference time on my rig a bit more. Now I get 1.69 t/s using -t 4 -b 1 --gpu-reserve-mb-main 300. The story did change, though, as the GPU usage is different now.
The change most likely comes from the removal of "-b 512", which removes batched processing of the prompt and switches from cuBLAS to the integer multiplication kernels. When you aim for quality, you will use a higher-precision model: 4K, 5K, or even 6K.
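To make the batching remark concrete, below is a hedged sketch of the kind of dispatch being described: a large prompt batch dequantizes the weights and goes through cuBLAS SGEMM, while single-token (-b 1) generation stays on the quantized integer kernels. The function names, threshold, and layout are hypothetical, not the project's actual code path.

```cuda
#include <cublas_v2.h>

// Stubs standing in for the real dequantizer / quantized kernels (omitted).
static void dequantize_weights_to_f32(const void*, float*, int, int) {}
static void quantized_mul_mat(const void*, const float*, float*, int, int, int) {}

// y = W * x for a quantized weight matrix W (rows x cols) and an activation
// batch x (cols x n_batch). Large batches (prompt processing) dequantize and
// use cuBLAS; -b 1 style generation uses the integer dot-product path.
void mul_mat_dispatch(cublasHandle_t handle, const void* q_weights,
                      float* f32_scratch, const float* x, float* y,
                      int rows, int cols, int n_batch) {
    const int batch_threshold = 32;  // assumed cutoff, for illustration only
    if (n_batch >= batch_threshold) {
        dequantize_weights_to_f32(q_weights, f32_scratch, rows, cols);
        const float alpha = 1.0f, beta = 0.0f;
        // Row-major W (rows x cols) is read as its transpose by column-major cuBLAS.
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    rows, n_batch, cols,
                    &alpha, f32_scratch, cols,
                    x, cols,
                    &beta, y, rows);
    } else {
        // Small batches skip dequantization and use the quantized kernels directly.
        quantized_mul_mat(q_weights, x, y, rows, cols, n_batch);
    }
}
```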
Tried the 40B 2K and I get 3 tokens/s now (2 minutes total for the inference below).
You can leave out -ngl 100 and -b 1 (both are default options now). If you want to squeeze your card further for more performance, you can lower --gpu-reserve-mb-main in steps of 50-100 MB per test. Also make sure you update to the latest version. I would consider testing OpenAssistant.
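As a rough illustration of what a VRAM reserve like --gpu-reserve-mb-main amounts to, a loader can query free device memory and subtract the reserve before deciding how much to allocate for offloaded layers. This is a hedged sketch under that assumption, not the project's actual implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hedged sketch: compute the VRAM budget left for model layers after keeping
// a user-specified reserve (in MB) free. Not the project's actual logic.
size_t usable_vram_bytes(size_t reserve_mb) {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        return 0;  // treat a failed query as "no budget"
    }
    size_t reserve_bytes = reserve_mb * 1024ull * 1024ull;
    return (free_bytes > reserve_bytes) ? free_bytes - reserve_bytes : 0;
}

int main() {
    // Lowering the reserve in steps (e.g. 300 -> 200 -> 100 MB) leaves more
    // room for offloaded layers, at the risk of out-of-memory errors.
    size_t budget = usable_vram_bytes(300);
    printf("usable VRAM: %.1f MB\n", budget / (1024.0 * 1024.0));
    return 0;
}
```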
Here's a quick result using the GGCC OpenAssistant 2K quant model with the same prompt (3 t/s).
I'm not too familiar with mul_mat, but it seems like it is the part of the process that takes the longest; can it be optimized even further?
The current speed is great for a Falcon model. I had tested the original GPTQ ones and those were so slow in ooba.