ggml : fix iq4_nl dot product with odd number of blocks #8549

slaren · 2024-07-17T21:14:00Z

Runs the last block on the pure C implementation if there is an odd number of blocks. Only AVX and AVX2 tested, likely affects NEON and others as well.
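
The approach described above can be sketched in plain C. This is a minimal illustration under stated assumptions, not the code from this PR: `toy_block`, `dot_one_block`, and `toy_vec_dot` are hypothetical names, the block layout is simplified to one float scale plus 32 signed bytes, and the two-blocks-per-iteration loop merely stands in for a SIMD path such as AVX2.

```c
#include <assert.h>

#define QK 32 /* values per block; simplified stand-in for a 32-wide quant block */

/* Hypothetical simplified block: one float scale plus QK signed 8-bit quants. */
typedef struct {
    float       d;
    signed char qs[QK];
} toy_block;

/* Scalar ("pure C") dot product of one pair of blocks. */
static float dot_one_block(const toy_block *x, const toy_block *y) {
    int sum = 0;
    for (int j = 0; j < QK; ++j) {
        sum += x->qs[j] * y->qs[j];
    }
    return x->d * y->d * (float) sum;
}

/* Dot product over nb blocks. The first loop consumes blocks in pairs
 * (standing in for a vectorized path); the second loop finishes whatever
 * is left with the scalar code, which is one block when nb is odd. */
static float toy_vec_dot(int nb, const toy_block *x, const toy_block *y) {
    assert(nb >= 0);
    float sumf = 0.0f;
    int ib = 0;
    for (; ib + 1 < nb; ib += 2) {
        sumf += dot_one_block(&x[ib], &y[ib]);
        sumf += dot_one_block(&x[ib + 1], &y[ib + 1]);
    }
    for (; ib < nb; ++ib) {
        sumf += dot_one_block(&x[ib], &y[ib]);
    }
    return sumf;
}
```

With three blocks, the paired loop handles blocks 0 and 1 and the scalar tail handles block 2. The loop condition `ib + 1 < nb` is what keeps the paired path from touching a block that does not exist.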

oldgithubman · 2024-07-17T22:16:22Z

Ref: #8495

Runs the last block on the pure C implementation if there is an odd number of blocks. Only AVX and AVX2 tested, likely affects NEON and others as well.

Nice solution

ggerganov · 2024-07-18T06:50:23Z

On M2 Ultra, test-backend-ops runs successfully.

Enabling the random tests causes many failures:

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
  MUL_MAT(type_a=q4_1,type_b=f32,m=43,n=92,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.105420734 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=12,n=24,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 11773.473165934 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=4,n=38,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 148 (Metal=-5.410255 CPU=nan) FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=69,n=86,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 12530281.861510953 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=75,n=113,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 65.765954396 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=74,n=114,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 3563.260783020 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=7,n=126,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 43.655039701 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=20,n=115,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 19 (Metal=11.000481 CPU=nan) FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=56,n=38,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 55 (Metal=-16.780495 CPU=nan) FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=15,n=16,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 10617966.006262200 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=103,n=126,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 10811661558.555959702 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=122,n=67,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.668788946 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=21,n=19,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 7835.887396502 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=77,n=120,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 137.825738287 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=31,n=102,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.192240393 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=127,n=42,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.022682184 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=121,n=70,k=416,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.436010703 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=90,n=120,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 153120430276.364410400 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=12,n=47,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 552 (Metal=13.171962 CPU=nan) FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=51,n=127,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 162.877185397 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=97,n=24,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 20659224025105313792.000000000 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=45,n=26,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.159211832 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=68,n=58,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 59558404564.251739502 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=4,n=112,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 35656903797.003738403 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=70,n=124,k=416,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.864959230 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=108,n=49,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 21854.549477808 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=70,n=71,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 44287746.743597277 > 0.000500000 FAIL
GGML_ASSERT: ggml/src/ggml-metal.m:1790: ne00 >= nth0*nth1

Will look into those now
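
One pattern worth noting in the log above: every failing case uses a `k` of 96, 160, 224, 288, 352, 416, or 480. q4_1, q5_0, and q5_1 all use 32-value blocks in ggml, so each of these rows holds an odd number of blocks. The small check below makes that arithmetic explicit; `blocks_per_row`, `has_odd_blocks`, and `count_odd` are hypothetical helper names, not ggml functions.

```c
#include <assert.h>
#include <stddef.h>

#define QK 32 /* block size shared by q4_1, q5_0 and q5_1 */

/* Number of quantized blocks in a row of k values. */
static int blocks_per_row(int k) {
    assert(k % QK == 0); /* rows are assumed to be a whole number of blocks */
    return k / QK;
}

/* Nonzero when the row holds an odd number of blocks. */
static int has_odd_blocks(int k) {
    return blocks_per_row(k) % 2 != 0;
}

/* Distinct k values taken from the failing MUL_MAT lines in the log. */
static const int failing_k[] = {96, 160, 224, 288, 352, 416, 480};
static const size_t n_failing_k = sizeof(failing_k) / sizeof(failing_k[0]);

/* Counts how many of the given k values produce an odd block count. */
static size_t count_odd(const int *ks, size_t n) {
    size_t odd = 0;
    for (size_t i = 0; i < n; ++i) {
        if (has_odd_blocks(ks[i])) {
            ++odd;
        }
    }
    return odd;
}
```

The block counts work out to 3, 5, 7, 9, 11, 13, and 15, so `count_odd(failing_k, n_failing_k)` equals the length of the list.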

ggml/src/ggml-quants.c

JohannesGaessler · 2024-07-18T09:22:16Z

Does the intended scope of this PR include the random tests?

slaren · 2024-07-18T12:25:17Z

No, this PR does not add random tests. It adds some commented code to test-backend-ops that I thought other people may find useful to test this or other PRs, so I decided to keep it there disabled by default, but it can be removed otherwise.

* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix q4_1
* ggml : fix q5_0
* ggml : fix q5_1
* ggml : fix iq4_nl metal

  ggml-ci
* ggml : fix q4_0
* ggml : fix q8_0

  ggml-ci
* ggml : remove special Q4_0 code for first 2 blocks
* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <[email protected]>

* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix odd blocks for ARM_NEON (ggerganov#8556)
  * ggml : fix iq4_nl dot product with odd number of blocks
  * ggml : fix q4_1
  * ggml : fix q5_0
  * ggml : fix q5_1
  * ggml : fix iq4_nl metal

    ggml-ci
  * ggml : fix q4_0
  * ggml : fix q8_0

    ggml-ci
  * ggml : remove special Q4_0 code for first 2 blocks
  * ggml : fix sumf redefinition

  ---------

  Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>

* squashed

  readd my iq4_nl sgemm PR #8049

  have ggml_vec_dot_q4_0 do two blocks per loop for avx

  try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per #8549 we can calculate several blocks at a time with no issue
* shuffle
* remove f16c iq4_nl as i cant make it faster than before

github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Jul 17, 2024

slaren mentioned this pull request Jul 17, 2024

CUDA: MMQ code deduplication + iquant support #8495

Merged

slaren force-pushed the sl/fix-iqnl-odd-blocks branch from cc70bdb to 90e8f81 Compare July 18, 2024 00:54

ggerganov mentioned this pull request Jul 18, 2024

ggml : fix odd blocks for ARM_NEON #8556

Merged

JohannesGaessler reviewed Jul 18, 2024

ggml/src/ggml-quants.c (outdated)

ggml : fix iq4_nl dot product with odd number of blocks

cc6a0f5

slaren force-pushed the sl/fix-iqnl-odd-blocks branch from 90e8f81 to cc6a0f5 Compare July 18, 2024 20:11

ggerganov approved these changes Jul 19, 2024

mofosyne added the Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs) label Jul 19, 2024

slaren merged commit 87e397d into master Jul 19, 2024
58 checks passed

slaren deleted the sl/fix-iqnl-odd-blocks branch July 19, 2024 15:17
