Vulkan Bugfixes and Improvements #7084
Conversation
…for single call batch operation
Thank you for the update! Unfortunately, I'm seeing gibberish on a 10.7B model at Q5_K_S (a frankenmerge, sure, but it works just fine on CPU and CLBlast) with a long initial prompt (~2700 tokens). Parameters are

Additionally, there's a very noticeable delay after detecting the Vulkan device on Win 10 (new system, still a 1060 3GB), which was hardly noticeable on Win 8.1. That, however, might or might not be caused by having a single graphics output device on my new system (the previous CPU had an iGPU, which wasn't used, though).

Finally, there's still a huge difference in memory consumption. It seems like the difference for VRAM is even larger now: on that 10.7B model, 9 layers with CLBlast occupy 1792 MB, while 7 layers with Vulkan occupy 2524 MB. Also, it uses ~300 MB of shared VRAM with any number of layers. With

At the same time, the difference in speed between this and CLBlast is even bigger; Vulkan is really fast both in prompt processing and token generation.
Can you give me a link to the model that's not working? If I can reproduce the source of the incoherence I can hopefully fix it.
This is most likely shader compilation happening. The GPU driver should cache the shaders, so it should only be slow once with each update and fast on subsequent launches.
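(For reference only, and not necessarily how this backend handles it: below is a minimal sketch of how an application could persist compiled pipelines itself via VkPipelineCache, so only the first launch pays the compile cost. The Vulkan calls are real API; the file path handling and helper names are made up for illustration.)

```cpp
// Hypothetical illustration: persist a VkPipelineCache to disk so shader
// compilation only happens on the first launch. Error handling omitted.
#include <vulkan/vulkan.h>
#include <fstream>
#include <iterator>
#include <vector>

VkPipelineCache load_pipeline_cache(VkDevice device, const char * path) {
    std::vector<char> blob;
    std::ifstream in(path, std::ios::binary);
    if (in) {
        blob.assign(std::istreambuf_iterator<char>(in), {});
    }

    VkPipelineCacheCreateInfo info = {};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();        // empty on the very first run
    info.pInitialData    = blob.empty() ? nullptr : blob.data();

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache; // pass this cache to vkCreateComputePipelines(...)
}

void save_pipeline_cache(VkDevice device, VkPipelineCache cache, const char * path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    std::ofstream(path, std::ios::binary).write(blob.data(), blob.size());
}
```

In practice the GPU driver usually maintains its own on-disk cache as well, which is why only the first launch after a driver or shader update tends to be slow.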
This shouldn't have changed compared to without this PR. It's expected that Vulkan uses more VRAM for layers since much more of the model is offloaded. The CLBlast backend basically only runs the matrix multiplication on the GPU and nothing else.
Shared VRAM is most likely the staging buffers for copying data to and from the GPU. Disabling KV offload means that the KV cache resides in RAM (shared VRAM is RAM), so that's expected behavior.
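(Illustration only, not the backend's actual code: a staging buffer is just a buffer bound to host-visible memory, i.e. system RAM, which the Windows task manager reports as shared GPU memory. A rough sketch; the Vulkan calls are real API, the function names are made up.)

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Pick a memory type that satisfies the buffer's requirements and the
// requested property flags (e.g. host-visible).
static uint32_t find_memory_type(VkPhysicalDevice phys, uint32_t type_bits, VkMemoryPropertyFlags props) {
    VkPhysicalDeviceMemoryProperties mem_props;
    vkGetPhysicalDeviceMemoryProperties(phys, &mem_props);
    for (uint32_t i = 0; i < mem_props.memoryTypeCount; i++) {
        if ((type_bits & (1u << i)) && (mem_props.memoryTypes[i].propertyFlags & props) == props) {
            return i;
        }
    }
    return UINT32_MAX; // no suitable type found
}

// Hypothetical staging buffer: data is written into host-visible (system RAM)
// memory here, then copied into device-local VRAM with vkCmdCopyBuffer.
VkBuffer create_staging_buffer(VkPhysicalDevice phys, VkDevice device, VkDeviceSize size, VkDeviceMemory * out_mem) {
    VkBufferCreateInfo buf_info = {};
    buf_info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    buf_info.size        = size;
    buf_info.usage       = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
    buf_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer buffer = VK_NULL_HANDLE;
    vkCreateBuffer(device, &buf_info, nullptr, &buffer);

    VkMemoryRequirements req;
    vkGetBufferMemoryRequirements(device, buffer, &req);

    VkMemoryAllocateInfo alloc = {};
    alloc.sType          = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.allocationSize = req.size;
    // Host-visible memory lives in system RAM; this is what shows up as
    // "shared GPU memory" in the task manager.
    alloc.memoryTypeIndex = find_memory_type(phys, req.memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);

    vkAllocateMemory(device, &alloc, nullptr, out_mem);
    vkBindBufferMemory(device, buffer, *out_mem, 0);
    return buffer;
}
```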
Thanks for testing! Did the speed improve for you compared to without this PR?
For me, on a Radeon 5700 XT, the performance is almost the same as the main branch; it's just a little bit slower: 264 t/s on main vs. 260 t/s on this PR for prompt processing, and 33 t/s on main vs. 30 t/s on this PR for generation on Llama 3 Q5_K_M.
Yes, but it's only noticeable at the start. The average is not so impressive for generation: for example, on a 7B Q6_K model it goes from 5.033 t/s to 5.064 t/s. Still, that was a 611-token result, so the usual slowdown diminishes the improvement. Processing of that ~2700-token instruct on the 10.7B model at Q5_K_S (10 layers offloaded) was 35.895 t/s; with this PR it's 37.755 t/s.
Ok, I see now: it happened all the time because I was alternating between the CLBlast and Vulkan versions. Also, maybe it's because it uses all available memory (even though it doesn't look like it on the graphs). It seems to cache each version's shaders separately, because launching one doesn't speed up launching the other. Also, the mainline compiles faster, but it's not a big deal.
https://huggingface.co/mradermacher/Fimbulvetr-10.7B-v1-i1-GGUF - I'm still testing messages reloading in my program, and for some reason this model became a good benchmark for that. I'm not sure of the quality or the changes imatrix brought.
No, the same gibberish happened. Interestingly, while trying to test it, I struggled to even run the model with that large instruct. I had to increase the number of layers from 9 to 10 to make it work. It's like a sweet spot: not higher, not lower, exactly 10.
I downloaded the q5_k_s version of that model you linked and it runs fine for me across AMD and Nvidia GPUs. Not sure what's going on on your end. Which GPU are you using?
Same 1060 3GB, and the issue happens on a large initial instruct only. It works just fine with a typical Alpaca instruct or similar.
Update: gibberish just happened on a ~1100-token prompt. I wanted to try setting n_ubatch to 2048, but it's too much memory for my setup (16 GB RAM). Same on mainline and this PR.
I have a question, probably not related, but @0cc4m is the only one that can really answer it. When I use

Also, as an aside, the initial implementations allowed me to offload most of the layers to the GPU without any hiccups, but now it crashes if I allocate too many layers to the GPU. This has led me to switch between the CPU and GPU for different tasks. Does this have anything to do with your previous PR where you modified how the layers were handled? I have narrowed down a general bug to the GLFW backend that is unrelated to llama.cpp, so I'm not sure if it's related or not. Still haven't pinned it down yet.

This shouldn't be an issue in the near future because I plan on replacing my RX 580 with either an RTX 4060 Ti or a 7900 XT, haven't decided yet. Regardless, just curious if you're able/willing to provide any insights?
I did some quick tests with my W8100 and didn't really see any improvements or regressions. Honestly, after getting my CPU server I've been using Vulkan less and less, since my GPU is really only good for 7B models, and Command R 30B and Llama 70B completely blow away the small ones.

PR:

Master:
I can't seem to reproduce that issue. But n_batch > 512 is definitely broken; I'll take a look at what's going on there.
To be honest, I have no idea what train-text-from-scratch does and I'd be surprised if it can use Vulkan (assuming that's what you meant)
Do you mean for running a model with main? VRAM use might have changed in later versions. Can you give me more details on what worked, what didn't/doesn't and on what hardware?
What GLFW backend?
That is what I meant. There is a definite 3x speedup: a 3-hour training session is dramatically reduced to a ~40-50 minute training session, every time. I didn't think it would work, but tried it out to see. I use

Any insights into why this might be the case, @ggerganov? I don't know/understand enough about the implementation details.
I've been avoiding using the GPU as much as I can lately because it keeps crashing my entire system. I would use

I have plenty of CPU RAM, but it's not ideal for backprop.
I feel this is out of scope, but it does affect the AMDGPU DRM for the RX 5xx series, and the Vulkan backend has crashed in a similar fashion while using llama.cpp. That's part of the reason it's been difficult to trace and isolate. I'll have to do some thorough tests when I'm not so deep into my work. Too many projects open at once ATM to risk it. GLFW#2493. There are a few other identified bugs related to the RX 5xx series with the Mesa graphics drivers. They wanted me to test an AUR package, but I haven't had time to test yet.
I tried reproducing the results since I had to do an upgrade, but I can't with the latest commit. I'm so shuffled at the moment, I think I might have lost track. The Vulkan backend does not seem to affect train-text-from-scratch. I must've been mistaken. Sorry for wasting any time. I'll post if I figure it out.
I just wanted to add some relevant input for this branch. I've been experimenting with it and it cleared up a few issues.
I'm sure this is a mixture of things, but these differences are extremely noticeable when compared to the

I think you're right about train-text-from-scratch not supporting this. I'll have to dig into this some more. Would be nice if it did work.
@ggerganov @slaren If I run inference on Mistral 7B Q4_K_S with batch size 2048 and context size 2048 (and the default ubatch size, which should be 512), I get GET_ROWS calls with

The call I used to run into this is

Is that correct behavior that I need to work around, or is that a bug in the model code?
That's normal behavior since #6122; the backend should skip zero-size tensors. The Vulkan backend was also updated in that PR, but maybe there are other cases.
Alright, thank you. I'll make sure they are skipped properly. After that fix, this PR should be ready to merge.
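(For context on the fix: a minimal sketch of the kind of guard a ggml backend can use when walking the compute graph. `ggml_nelements` and the graph fields are real ggml API; the surrounding loop is simplified and is not the actual Vulkan backend code.)

```cpp
// Simplified sketch of a backend's graph-compute loop: tensors with zero
// elements (e.g. GET_ROWS on an empty slice) are skipped instead of dispatched.
#include "ggml.h"

static void backend_graph_compute(struct ggml_cgraph * graph) {
    for (int i = 0; i < graph->n_nodes; i++) {
        struct ggml_tensor * node = graph->nodes[i];

        // Nothing to do for no-ops or empty tensors.
        if (node->op == GGML_OP_NONE || ggml_nelements(node) == 0) {
            continue;
        }

        // ... dispatch the actual Vulkan compute work for this node ...
    }
}
```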
I found and fixed the issue. I'll wait for the CI to finish and then merge this PR.
Gibberish is fixed! Thank you for the update, @0cc4m!
Apologies for the wait since the last update, I was rather busy.
Here are a number of bugfixes that should hopefully fix the incoherence that the Vulkan backend has shown for a while now.
I also modified the MMV shaders to run batches in a single call instead of multiple calls; this might improve performance on devices with higher shader invocation overhead.
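(As a rough illustration of the single-call batching idea, not the actual shader or dispatch code from this PR: the batch index can be carried in the Z dimension of a single dispatch instead of recording a separate dispatch per batch element. Only vkCmdDispatch is real Vulkan API here; the function names and parameters are made up.)

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Before: one dispatch per batch element, each paying shader invocation overhead.
void mul_mat_vec_looped(VkCommandBuffer cmd, uint32_t groups_x, uint32_t n_batch) {
    for (uint32_t b = 0; b < n_batch; b++) {
        // (push constants / descriptor offsets would select batch b here)
        vkCmdDispatch(cmd, groups_x, 1, 1);
    }
}

// After: one dispatch covers the whole batch; inside the shader the batch index
// comes from gl_WorkGroupID.z and is used to offset the input/output buffers.
void mul_mat_vec_single_call(VkCommandBuffer cmd, uint32_t groups_x, uint32_t n_batch) {
    vkCmdDispatch(cmd, groups_x, 1, n_batch);
}
```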
Finally, further work towards running MoE models with Vulkan is included, but the mul_mat_id code is not ready yet.