ggml : fix iq4_nl dot product with odd number of blocks #8549

Merged
slaren merged 2 commits into master from sl/fix-iqnl-odd-blocks on Jul 19, 2024

Conversation

slaren
Collaborator

@slaren slaren commented Jul 17, 2024

Ref: #8495

Runs the last block on the pure C implementation if there is an odd number of blocks. Only AVX and AVX2 tested, likely affects NEON and others as well.
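
A minimal sketch of the approach described above, assuming a simplified, hypothetical block layout (`block_sketch`, one float scale plus 32 signed 8-bit values) rather than the real iq4_nl format. The first loop stands in for a SIMD kernel that consumes two blocks per iteration; the second loop is the plain C fallback that picks up the last block when the count is odd. All names here are illustrative and are not taken from ggml.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32  // values per block (hypothetical simplified layout)

// Hypothetical simplified block: one float scale and QK signed 8-bit quants.
// This is NOT the real iq4_nl layout; it only illustrates the loop structure.
typedef struct {
    float  d;
    int8_t qs[QK];
} block_sketch;

// Plain C reference for the dot product of one block of x with one block of y.
static float dot_one_block(const block_sketch *x, const block_sketch *y) {
    int32_t sum = 0;
    for (int j = 0; j < QK; ++j) {
        sum += (int32_t)x->qs[j] * (int32_t)y->qs[j];
    }
    return x->d * y->d * (float)sum;
}

// Dot product over nb blocks. The first loop stands in for a SIMD kernel
// that consumes two blocks per iteration; the tail loop handles an odd nb.
static float vec_dot_sketch(size_t nb, const block_sketch *x, const block_sketch *y) {
    assert(x != NULL && y != NULL);

    float  sumf = 0.0f;
    size_t ib   = 0;

    // "Fast path": only runs while a full pair of blocks remains.
    for (; ib + 1 < nb; ib += 2) {
        sumf += dot_one_block(&x[ib],     &y[ib]);
        sumf += dot_one_block(&x[ib + 1], &y[ib + 1]);
    }

    // Scalar tail: picks up the last block when nb is odd.
    for (; ib < nb; ++ib) {
        sumf += dot_one_block(&x[ib], &y[ib]);
    }
    return sumf;
}
```

The loop condition `ib + 1 < nb` is the important part: a paired loop written as `ib < nb` would read one block past the end of the row whenever `nb` is odd.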

@github-actions github-actions bot added the testing and ggml labels Jul 17, 2024
@oldgithubman

Ref: #8495

Runs the last block on the pure C implementation if there is an odd number of blocks. Only AVX and AVX2 tested, likely affects NEON and others as well.

Nice solution

@slaren slaren force-pushed the sl/fix-iqnl-odd-blocks branch from cc70bdb to 90e8f81 Compare July 18, 2024 00:54
@ggerganov
Owner

On M2 Ultra, test-backend-ops runs successfully.

Enabling the random tests causes many failures:

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
  MUL_MAT(type_a=q4_1,type_b=f32,m=43,n=92,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.105420734 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=12,n=24,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 11773.473165934 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=4,n=38,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 148 (Metal=-5.410255 CPU=nan) FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=69,n=86,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 12530281.861510953 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=75,n=113,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 65.765954396 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=74,n=114,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 3563.260783020 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=7,n=126,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 43.655039701 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=20,n=115,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 19 (Metal=11.000481 CPU=nan) FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=56,n=38,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 55 (Metal=-16.780495 CPU=nan) FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=15,n=16,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 10617966.006262200 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=103,n=126,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 10811661558.555959702 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=122,n=67,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.668788946 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=21,n=19,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 7835.887396502 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=77,n=120,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 137.825738287 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=31,n=102,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.192240393 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=127,n=42,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.022682184 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=121,n=70,k=416,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.436010703 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=90,n=120,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 153120430276.364410400 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=12,n=47,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 552 (Metal=13.171962 CPU=nan) FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=51,n=127,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 162.877185397 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=97,n=24,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 20659224025105313792.000000000 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=45,n=26,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.159211832 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=68,n=58,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 59558404564.251739502 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=4,n=112,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 35656903797.003738403 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=70,n=124,k=416,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.864959230 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=108,n=49,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 21854.549477808 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=70,n=71,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 44287746.743597277 > 0.000500000 FAIL
GGML_ASSERT: ggml/src/ggml-metal.m:1790: ne00 >= nth0*nth1

Will look into those now
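
The `k` values in the failing cases above are worth a quick arithmetic check. q4_1, q5_0 and q5_1 all use 32 values per block, so the number of blocks per row is `k / 32`. The sketch below (helper names are made up for illustration) computes that count for each distinct `k` in the log. Every one of them comes out odd, which is consistent with the odd-block problem this PR targets, though it does not prove that this is the only cause.

```c
#include <assert.h>
#include <stdio.h>

#define QK 32  // values per block for q4_1, q5_0 and q5_1

// Number of quantized blocks in a row of k values.
static int blocks_per_row(int k) {
    assert(k % QK == 0);
    return k / QK;
}

// Returns 1 when a row of k values has an odd number of blocks.
static int has_odd_blocks(int k) {
    return blocks_per_row(k) % 2 == 1;
}

// Distinct k values that appear in the failing MUL_MAT cases in the log.
static const int failing_k[] = { 96, 160, 224, 288, 352, 416, 480 };

// Prints the block count for each failing k value.
static void print_block_counts(void) {
    const int n = (int)(sizeof(failing_k) / sizeof(failing_k[0]));
    for (int i = 0; i < n; ++i) {
        printf("k=%d -> %d blocks (%s)\n",
               failing_k[i],
               blocks_per_row(failing_k[i]),
               has_odd_blocks(failing_k[i]) ? "odd" : "even");
    }
}
```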

@JohannesGaessler
Collaborator

Does the intended scope of this PR include the random tests?

@slaren
Collaborator Author

slaren commented Jul 18, 2024

No, this PR does not add random tests. It adds some commented-out code to test-backend-ops that I thought other people might find useful for testing this or other PRs, so I kept it there, disabled by default. It can be removed if that is preferred.
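
For readers unfamiliar with this kind of randomized testing, here is a rough sketch of how random MUL_MAT shapes could be drawn. It is not the code in test-backend-ops: the struct, the helper names, and the ranges (picked to resemble the shapes in the log earlier in this thread) are all assumptions. The one property it is meant to show is that `k` is always a whole number of blocks, so odd block counts come up regularly.

```c
#include <assert.h>
#include <stdlib.h>

#define QK 32  // block size assumed for the quantized type under test

// Hypothetical description of one randomized MUL_MAT case.
typedef struct {
    int m;  // output dimension taken from the first operand
    int n;  // output dimension taken from the second operand
    int k;  // shared (reduction) dimension, always a whole number of blocks
} mul_mat_shape;

// Integer in [lo, hi] using rand(); slightly biased, fine for a test sketch.
static int rand_range(int lo, int hi) {
    assert(lo <= hi);
    return lo + rand() % (hi - lo + 1);
}

// Draws a random shape. k is a random multiple of QK, so roughly half of
// the generated cases have an odd number of blocks per row.
static mul_mat_shape random_shape(void) {
    mul_mat_shape s;
    s.m = rand_range(1, 128);
    s.n = rand_range(1, 128);
    s.k = rand_range(1, 16) * QK;
    return s;
}
```

Seeding with `srand()` using a fixed value makes a failing shape reproducible, which matters when a failure only shows up for particular dimensions.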

@slaren slaren force-pushed the sl/fix-iqnl-odd-blocks branch from 90e8f81 to cc6a0f5 Compare July 18, 2024 20:11
@mofosyne mofosyne added the Review Complexity : High label Jul 19, 2024
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <[email protected]>
@slaren slaren merged commit 87e397d into master Jul 19, 2024
58 checks passed
@slaren slaren deleted the sl/fix-iqnl-odd-blocks branch July 19, 2024 15:17
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix odd blocks for ARM_NEON (ggerganov#8556)

* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
ggerganov pushed a commit that referenced this pull request Sep 16, 2024
* squashed

readd my iq4_nl sgemm PR #8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per #8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as i cant make it faster than before
ggerganov pushed a commit to ggerganov/ggml that referenced this pull request Sep 20, 2024
ggerganov pushed a commit to ggerganov/ggml that referenced this pull request Sep 20, 2024
ggerganov pushed a commit to ggerganov/whisper.cpp that referenced this pull request Sep 24, 2024
ggerganov pushed a commit to ggerganov/whisper.cpp that referenced this pull request Sep 24, 2024
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
lyapple2008 pushed a commit to lyapple2008/whisper.cpp.mars that referenced this pull request Nov 2, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
github-actions bot pushed a commit to martin-steinegger/ProstT5-llama that referenced this pull request Dec 30, 2024
Labels
ggml (changes relating to the ggml tensor library for machine learning)
Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs)
testing (Everything test related)