
Performance Tuning for Q4_K matmul CUDA kernel #8136

Open · wants to merge 1 commit into master

Conversation

contentis (Author)

Changes:

  • Use float4 to force aligned reads of 128 bits to increase memory bandwidth (see the sketch below)
  • Increase the block dimension to 32x4 to submit enough threads per SM to avoid stalls
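For illustration, a minimal sketch contrasting the two access patterns (toy kernels, not the actual kernel from this PR; it assumes each thread's offset is 16-byte aligned):

```cuda
// Baseline: four scalar 32-bit loads per thread.
__global__ void sum4_scalar(const float* __restrict__ y, float* __restrict__ out) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int i   = 4 * tid;
    out[tid] = y[i + 0] + y[i + 1] + y[i + 2] + y[i + 3];
}

// With float4: a single 128-bit load per thread (y + i must be 16-byte aligned).
__global__ void sum4_vec(const float* __restrict__ y, float* __restrict__ out) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int i   = 4 * tid;
    const float4 v = *reinterpret_cast<const float4*>(y + i);
    out[tid] = v.x + v.y + v.z + v.w;
}
```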

Performance
Using llama-bench I measured the end-to-end speedup:
Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes

BASELINE

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | pp128 | 6476.13 ± 66.28 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | tg128 | 140.66 ± 0.65 |

FLOAT4

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | pp128 | 6479.48 ± 107.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | tg128 | 142.87 ± 0.39 |

128 Threads per Block

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | pp128 | 6489.03 ± 118.45 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | tg128 | 145.76 ± 0.88 |

FLOAT4 + 128 Threads per Block

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | pp128 | 6522.17 ± 98.75 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 100 | tg128 | 149.49 ± 0.28 |

@github-actions bot added the Nvidia GPU and ggml labels on Jun 26, 2024
@slaren (Collaborator) commented Jun 26, 2024

With LLAMA_CUDA_FORCE_DMMV I see a significant improvement from these changes:

| GPU | Model | Test | t/s master | t/s q4_k_performance | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | llama 7B Q4_K_M | tg128 | 104.36 | 127.25 | 1.22 |

However, without LLAMA_CUDA_FORCE_DMMV mmvq is used instead, which is significantly faster:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg128 | 146.10 ± 1.12 |

For this format I think mmvq should be automatically used for compute capability >= 6.1.

@ggerganov (Owner)

Similar results for RTX 2060:

LLAMA_CUDA_FORCE_DMMV=1 LLAMA_CUDA=1 ./scripts/compare-commits.sh master pr/8136 -m ./models/llama-7b-v2/ggml-model-q4_k.gguf -p 0 -ngl 99
| GPU | Model | Test | t/s master | t/s pr/8136 | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 2060 SUPER | llama 7B Q4_K_M | tg128 | 29.47 | 40.27 | 1.37 |
LLAMA_CUDA=1 make -j && ./llama-bench -m ./models/llama-7b-v2/ggml-model-q4_k.gguf -p 0 -ngl 99

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg128 | 69.08 ± 0.24 |

build: 50732c0 (3236)

@JohannesGaessler (Collaborator)

Use float4 to force aligned reads of 128 bits to increase memory bandwidth

I don't think that's the reason. __builtin_assume_aligned does not achieve any speedup. And unlike with e.g. half2 there is no hardware support for float4; the compiler will simply convert it to float instructions. I think the reason for the performance difference is rather that with the code on master the compiler fails to group or schedule the memory accesses to y1 and y2 in the optimal way.
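(For context, the alignment hint I tried looks roughly like the following; an illustrative sketch assuming the builtin is accepted in device code, not the exact code that was benchmarked:)

```cuda
// Alignment hint only: the loads stay scalar; the compiler is merely told that
// y1 is 16-byte aligned and may group/vectorize the accesses itself.
__device__ float sum4_assume_aligned(const float* y1) {
    const float* y1a = static_cast<const float*>(__builtin_assume_aligned(y1, 16));
    return y1a[0] + y1a[1] + y1a[2] + y1a[3];
}
```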

In any case, I can confirm that the performance increases with this change and that most of the difference comes from the float4 change.

Comment on lines +577 to +578
#define BLOCK_DIM_X 32
#define BLOCK_DIM_Y 4
Collaborator suggested change:

-#define BLOCK_DIM_X 32
-#define BLOCK_DIM_Y 4
+#define Q4_K_BLOCK_DIM_X 32
+#define Q4_K_BLOCK_DIM_Y 4

Collaborator: Also, if the optimal value for the x dimension is 32, should that parameter simply be removed?

Author (contentis):
Either is fine with me - I had them as defines so I could easily change them and also reuse them across other kernels in my testing.
I left them in this PR for readability, but I'm happy to hard-code or rename them.

Collaborator:
In that case I would just hard-code them, because as it is they get multiplied with magic numbers anyway.

@JohannesGaessler (Collaborator)

And unlike with e.g. half2 there is no hardware support for float4; the compiler will simply convert it to float instructions.

Unless I am completely misremembering, the CUDA documentation definitely said that the float4 datatype is effectively just 4 floats. However, looking at the PTX code, it seems that there is a dedicated instruction for loading a chunk of 16 bytes:

ld.global.nc.v4.f32 	{%f9, %f10, %f11, %f12}, [%rd31+-512];
ld.global.nc.v4.f32 	{%f17, %f18, %f19, %f20}, [%rd31+-384];
ld.global.nc.v4.f32 	{%f25, %f26, %f27, %f28}, [%rd31];
ld.global.nc.v4.f32 	{%f33, %f34, %f35, %f36}, [%rd31+128];

Assuming this can also be applied to other kernels this is definitely a very good find.
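For context, this kind of vectorized read-only load can typically be obtained from CUDA C++ in a couple of ways (a sketch assuming 16-byte-aligned pointers; not code from this PR):

```cuda
// Two common ways to obtain a 16-byte ld.global.nc (read-only) load.
__device__ float4 load_v4_restrict(const float4* __restrict__ p) {
    return *p;       // const + __restrict__ allows the compiler to use the non-coherent path
}

__device__ float4 load_v4_ldg(const float4* p) {
    return __ldg(p); // explicit read-only data cache load (float4 overload, cc >= 3.5)
}
```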

@contentis (Author)

@JohannesGaessler float4 issues a single instruction to read 128 bits. In theory, the compiler should be capable of understanding this from the original code (without having to use "unsafe" methods such as reinterpret_cast), but the compiler is often imperfect.

@slaren My plan was to look at dequantize_mul_mat_vec (Q4_0) next and see if I can find opportunities for optimization.
Would you say mmvq is used more frequently? Trying to understand what is the most common format.

@mtavenrath (Contributor) commented Jun 27, 2024

Unless I am completely misremembering, the CUDA documentation definitely said that the float4 datatype is effectively just 4 floats. However, looking at the PTX code, it seems that there is a dedicated instruction for loading a chunk of 16 bytes:

You are correct. A float4 is 4 floats in 4 registers, and there is support for aligned vectorized loads (LD*.(8|16|32|64|128), the numbers being bits).

Loads/stores are split into transactions of sectors (32 bytes), with one sector being processed each clock. If a warp accesses multiple sectors, they get serialized, which increases MIO utilization. Simplifying a bit, memory efficiency can be considered as number_of_bytes_accessed / (number_of_sectors_accessed * 32).

In this case, 4 contiguous floats were read with 4 LD instructions. Two contiguous threads have a byte offset of 16 bytes, resulting in reading bytes 0-3 and 16-19 within a sector. Thus memory efficiency was 8/32 = 0.25. Using float4 instead of float increases this metric to 1.0.
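Spelling that arithmetic out as a quick back-of-the-envelope check (host-side sketch, illustrative only):

```cuda
#include <cstdio>

int main() {
    // Efficiency = bytes_actually_used / (sectors_touched * 32 bytes per sector).
    // Scalar float loads with a 16-byte stride between threads: each 32-byte
    // sector only contributes bytes 0-3 and 16-19, i.e. 8 useful bytes.
    printf("scalar float: %.2f\n", 8.0 / 32.0);   // 0.25
    // float4 loads: every byte of each touched sector is consumed.
    printf("float4      : %.2f\n", 32.0 / 32.0);  // 1.00
    return 0;
}
```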

Assuming this can also be applied to other kernels this is definitely a very good find.

A full sweep with Nsight Compute through all kernels / kernel dimensions of the most important networks will reveal candidates where the optimization might help. FB utilization is the metric to look for most of the time for the llama.cpp kernels. A utilization >95% is perfect, >90% is pretty good; everything below might be using memory inefficiently and could thus benefit from vectorized loads and/or reordering of memory accesses.

@slaren (Collaborator) commented Jun 27, 2024

@slaren My plan was to look at dequantize_mul_mat_vec (Q4_0) next and see if I can find opportunities for optimization.
Would you say mmvq is used more frequently? Trying to understand what is the most common format.

Yes, mmvq is used more frequently. At this point, the dmmv kernels are only used with GPUs that don't support dp4a, i.e. cc < 6.1.

@JohannesGaessler (Collaborator)

Would you say mmvq is used more frequently? Trying to understand what is the most common format.

All NVIDIA GPUs starting with compute capability 6.1 have the __dp4a instruction, which does per-byte integer dot products. This is what is used in the mul_mat_vec_q kernels, and unless you are compiling with LLAMA_CUDA_FORCE_DMMV the MMVQ kernels are always used if __dp4a is available. The most modern GPU that would benefit from optimizations to the dequantize_mul_mat_vec kernels is the P100, because it only has compute capability 6.0. With MMVQ the activations are quantized to 8 bit, so you can load 4 values with a single 32-bit load, and those values align more nicely with the quantized data in the weights.
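For illustration, a minimal sketch of the __dp4a building block (not the actual mul_mat_vec_q kernel; wq/aq are hypothetical names for packed int8 buffers; requires cc >= 6.1):

```cuda
// Each int packs four int8 values; __dp4a computes a per-byte dot product
// and accumulates it into the third argument.
__global__ void dp4a_dot(const int* wq, const int* aq, int* out, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int acc = 0;
        for (int i = 0; i < n; ++i) {
            acc = __dp4a(wq[i], aq[i], acc);  // 4 int8*int8 products + add, one instruction
        }
        *out = acc;  // integer dot product of 4*n int8 pairs
    }
}
```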

A full sweep with Nsight Compute through all kernels / kernel dimensions of the most important networks will reveal candidates where the optimization might help.

Basically the only kernels that are performance relevant are mul_mat_vec_q (matrix multiplication with batch size 1-8) and mul_mat_q (matrix multiplication with batch size > 8) as well as the ggml FlashAttention kernels. The former two deal with quantized weights and right now the main issue in terms of bandwidth is that the data layout is not very CUDA friendly. By design the same data layout (array of structs) is used for all backends and as a consequence all of the quantized data blocks are aligned to only 2 or 4 bytes. The blocks are also very small with only 16 or 32 quantized values per scale (for GPTQ it's I think 128).

For MMVQ I've been thinking it would maybe be possible to just load the data contiguously and apply a bit mask instead of loading the data from 2/4 byte values that are not 100% contiguous.

FB utilization is the metric to look for most of the time for the llama.cpp kernels. A utilization >95% is perfect, >90% is pretty good

I currently don't have access to an instance of Nsight Compute; MMVQ had ~90% "speed of light" memory utilization, but I don't know about FB specifically.

@slaren (Collaborator) commented Jun 27, 2024

By design the same data layout (array of structs) is used for all backends

This is not really by design; backends have the ability to change the layout of the tensor data. I can go into more detail if you think that could improve performance significantly, but essentially the backends can convert the data to whatever layout they want during the set_tensor function, although they should also provide a way to convert it back to the ggml layout in get_tensor.

@JohannesGaessler (Collaborator)

I previously made prototypes where I converted the data to struct of arrays layout but I was not able to get better performance for MMVQ (though it's always possible that I just did it wrong). For MMQ it would maybe be useful because for efficient use of asynchronous data loading you need 16 byte alignment. But right now the int8 tensor core utilization is only 35% so there are probably still other problems that would need to be fixed. And unless you drop support for partial offloading completely you would need to implement and compile two separate instances for loading data per quantization format so I'm thinking that changing the data layout is comparatively a lot of work for the potential benefits.

@slaren (Collaborator) commented Jun 27, 2024

And unless you drop support for partial offloading completely you would need to implement and compile two separate instances for loading data per quantization format

I don't think this needs to be the case: during partial offloading the weights are also copied to VRAM using the backend interface, and they can be converted to a different layout during the copy. As long as it is only a change in layout and does not require any expensive computations, I don't think it would affect performance significantly.

@JohannesGaessler (Collaborator)

In my prototype I did the conversion via a host->device cudaMemcpy2D and the performance was terrible, but maybe device->device is acceptable; I'd have to test it.

@mofosyne added the Review Complexity : Low label on Jun 27, 2024
@mtavenrath (Contributor)

Given that SM load is low for the kernels of interest, instead of adding complexity to the codebase we can also add support for unaligned loads in CUDA:


__device__ int load_unaligned_int(void const* address)
{
    ptrdiff_t address_int = reinterpret_cast<ptrdiff_t>(address);

    // Round down to the previous 4-byte boundary and remember the byte offset.
    uint32_t const* address_int32 = reinterpret_cast<const uint32_t*>(address_int & ~size_t(3));
    uint32_t offset = address_int & 3;

    uint32_t dword1 = *address_int32;
    if (offset) {
        // Unaligned case: load the next dword as well and extract the 4 bytes starting at `offset`.
        auto dword2 = *(address_int32 + 1);
        asm volatile("prmt.b32.f4e %0, %1, %2, %3;" : "=r"(dword1) : "r"(uint32_t(dword1)), "r"(uint32_t(dword2)), "r"(offset));
    }
    return dword1;
}

This is about as good as it gets for loading 32 bits from an unaligned address. Given that the offset is 2-byte aligned, there is a 50% chance that only a single load is required; if that alignment is not given, 2 loads have to be done anyway and the results have to be combined as well.

https://godbolt.org/z/6zdT9osb3 has the code and SASS. The unaligned load logic essentially adds only 2 LOP3 instructions and that's it. When streaming contiguous data only a single additional load would be required for the whole stream.

 LOP3.LUT P0, R0, R4.reuse, 0x3, RZ, 0xc0, !PT ; required
 IMAD.MOV.U32 R7, RZ, RZ, R5 ; might be in the SASS or not depending on the register allocator strategy
 LOP3.LUT R6, R4, 0xfffffffc, RZ, 0xc0, !PT ; required
 ULDC.64 UR4, c[0x0][0x118] ; won't be in the SASS if inlined
 LD.E R4, [R6.64]  ; load 1
 @P0 LD.E R3, [R6.64+0x4]  ; load 2
 @P0 PRMT.F4E R4, R4, R0, R3  ; combine the two ints
 RET.ABS.NODEC R20 0x0 ; just for the function
.L_x_0:
 BRA `(.L_x_0)

The same can be done with 64-bit and 128-bit loads as long as there are free cycles in the ALU or FP unit to move around registers.

@JohannesGaessler (Collaborator) commented Jun 28, 2024

Relative to the current master code that loads the data as two 16-bit values, I'm measuring a speedup of ~1%. Is there a reason why you're using inline PTX assembly instead of __byte_perm (which I think does the same thing)?
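For reference, the __byte_perm variant I had in mind would look roughly like this (untested sketch; it assumes the selector 0x3210 + offset * 0x1111 reproduces the prmt f4e byte extraction):

```cuda
#include <cstdint>

__device__ int load_unaligned_int_byteperm(const void* address) {
    const size_t addr       = reinterpret_cast<size_t>(address);
    const uint32_t* aligned = reinterpret_cast<const uint32_t*>(addr & ~size_t(3));
    const uint32_t offset   = addr & 3;

    uint32_t lo = aligned[0];
    if (offset) {
        const uint32_t hi = aligned[1];
        // Select bytes offset..offset+3 from the 8-byte value {hi, lo}.
        lo = __byte_perm(lo, hi, 0x3210 + offset * 0x1111);
    }
    return static_cast<int>(lo);
}
```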


Comment on lines +221 to +224
float4 y11 = *reinterpret_cast<const float4*>(y1+0);
float4 y12 = *reinterpret_cast<const float4*>(y1+32);
float4 y21 = *reinterpret_cast<const float4*>(y2+0);
float4 y22 = *reinterpret_cast<const float4*>(y2+32);
Collaborator: I think these could be made const, but it doesn't really matter.
