
vulkan: bugfixes for small subgroup size systems + llvmpipe test #10809

Merged: 5 commits merged into ggerganov:master on Dec 17, 2024

Conversation

netrunnereve (Collaborator)

So I tried running our Vulkan implementation with llvmpipe for fun and discovered that some shaders don't work properly with a smaller-than-usual subgroup size. llvmpipe has a subgroup size of 8 on AVX systems, as it treats each 256-bit AVX unit (computing eight 32-bit floats at once) as a GPU core.
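If you want to check what subgroup size your own driver reports, here's a minimal sketch (not part of this PR, just the standard Vulkan 1.1 properties query) that reads subgroupSize from VkPhysicalDeviceSubgroupProperties; this appears to be the value the backend prints as "warp size":

```cpp
// Minimal standalone query of the reported subgroup size (Vulkan 1.1).
// Build (assuming the Vulkan SDK/loader is installed): g++ query_subgroup.cpp -lvulkan
#include <vulkan/vulkan.h>
#include <cstdio>

int main() {
    VkApplicationInfo app = {};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1; // subgroup properties require Vulkan 1.1

    VkInstanceCreateInfo ici = {};
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ici.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) {
        fprintf(stderr, "failed to create Vulkan instance\n");
        return 1;
    }

    // Grab the first physical device (llvmpipe, if it's the only ICD present).
    uint32_t count = 1;
    VkPhysicalDevice phys = VK_NULL_HANDLE;
    vkEnumeratePhysicalDevices(instance, &count, &phys);
    if (count == 0) {
        fprintf(stderr, "no Vulkan devices found\n");
        return 1;
    }

    // Chain VkPhysicalDeviceSubgroupProperties into the properties2 query.
    VkPhysicalDeviceSubgroupProperties subgroup = {};
    subgroup.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props = {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &subgroup;
    vkGetPhysicalDeviceProperties2(phys, &props);

    // llvmpipe reports 8 on a 256-bit AVX host, 16 with AVX-512.
    printf("%s: subgroupSize = %u\n",
           props.properties.deviceName, subgroup.subgroupSize);

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```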

Having llvmpipe support also means that we can run the Vulkan tests on the regular GitHub CI machines. It's super slow, considering it's simulating a GPU on the CPU, but at least it's faster than the CUDA and HIP builds!

netrunnereve requested a review from 0cc4m on December 12, 2024 at 21:29
github-actions bot added the Vulkan (Issues specific to the Vulkan backend), devops (improvements to build systems and github actions), and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 12, 2024
jeffbolznv (Collaborator) left a comment

I can't review the workflow change, but the Vulkan change LGTM.

slaren (Collaborator) left a comment

The workflow change looks good to me; it doesn't add an unreasonable amount of time to the CI.

0cc4m (Collaborator) commented on Dec 16, 2024

I tried running it on llvmpipe, but it got stuck on MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]) and didn't continue. Not sure what's going on. I'll try again tomorrow.
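(In case the test name is cryptic: in ggml's test-backend-ops naming, as far as I understand it, type_a/type_b are the operand types, m/n/k the matrix dimensions, bs the batch sizes, nr the repeat counts along the batch dimensions, and per the axis permutation.)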

netrunnereve (Collaborator, Author)

Not sure if this helps, but here's the Vulkan printout from my computer, which passes all tests. The flags are the same on the GitHub machines, but they use LLVM 15.0.7.

ggml_vulkan: 0 = llvmpipe (LLVM 17.0.6, 256 bits) (llvmpipe) | uma: 0 | fp16: 1 | warp size: 8 | matrix cores: none

I think if you have a CPU with AVX-512 you'll get a 512-bit llvmpipe with a warp size of 16.
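For reference, that warp size seems to fall straight out of the SIMD width: a 256-bit AVX register holds 256 / 32 = 8 single-precision floats, hence a subgroup size of 8, and a 512-bit AVX-512 register would hold 512 / 32 = 16.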

0cc4m (Collaborator) commented on Dec 17, 2024

I don't have an AVX-512 CPU. I get exactly the same printout as you do. With Mesa 24.3.1 from the kisak-mesa PPA (LLVM 17.0.6) it gets stuck on q4_1; with Mesa 24.0.9-0ubuntu0.3 (LLVM 17.0.6) it doesn't. Not sure what's going on, but it shouldn't affect us.

0cc4m merged commit 7b1ec53 into ggerganov:master on Dec 17, 2024
48 checks passed