metal : fix fa kernel #9187

Closed
wants to merge 1 commit into from

ggerganov
Owner

cont #9159

For some reason, test-backend-ops fails on master on M2 Ultra with the latest changes from #9159:

$ ▶ make -j tests/test-backend-ops && ./tests/test-backend-ops -o FLASH_ATTN_EXT
I ccache found, compilation results will be cached. Disable with GGML_NO_CCACHE.
I llama.cpp build info: 
I UNAME_S:   Darwin
I UNAME_P:   arm
I UNAME_M:   arm64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Werror -pthread -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Werror -pthread   -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY  -DLLAMA_USE_CURL
I NVCCFLAGS: -std=c++11 -O3 -g 
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit  -lcurl
I CC:        Apple clang version 15.0.0 (clang-1500.3.9.4)
I CXX:       Apple clang version 15.0.0 (clang-1500.3.9.4)

Embedding Metal library
/opt/homebrew/bin/ccache cc -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Werror -pthread -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion  -c /var/folders/ly/vg6q1qhj1s56t_9_gvd7blgc0000gn/T/tmp.AkYdgocN2G/ggml-metal-embed.s -o ggml/src/ggml-metal-embed.o
/opt/homebrew/bin/ccache c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Werror -pthread   -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY  -DLLAMA_USE_CURL -c tests/test-backend-ops.cpp -o tests/test-backend-ops.o
/opt/homebrew/bin/ccache c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -Werror -pthread   -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY  -DLLAMA_USE_CURL ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o tests/test-backend-ops.o -o tests/test-backend-ops -framework Accelerate -framework Foundation -framework Metal -framework MetalKit  -lcurl
Testing 2 backends

Backend 1/2 (CPU)
  Skipping CPU backend
Backend 2/2 (Metal)
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
  Backend name: Metal
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000604905 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.001518410 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000743344 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000722942 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.001181968 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000782221 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000688070 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000690445 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.001651777 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000826782 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000778284 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000567809 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.001188310 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000716152 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000686001 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000798424 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.001515117 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000909834 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000806680 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.001511847 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000689971 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000661116 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=64,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000513066 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.000619641 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=80,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.008018188 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007280940 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007653298 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007133281 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.008177775 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006881671 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.008105932 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006895132 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007553170 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006566270 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007686164 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007350518 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007649443 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007418936 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006870192 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006712699 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007743271 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006541790 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007529393 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006782106 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.006798391 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007439705 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): OK
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007285962 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=f16): [FLASH_ATTN_EXT] NMSE = 0.007053095 > 0.000500000 FAIL
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=128,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=10.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=1,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=1,max_bias=8.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=512,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=1,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=2,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=4,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=f16): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0): not supported [Metal] 
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0): not supported [Metal] 
  1366/1414 tests passed
  Backend Metal: FAIL

ggml_metal_free: deallocating
1/2 backends passed
FAIL
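
For reference when reading the `NMSE = ... > 0.000500000 FAIL` lines above: a minimal sketch of a normalized mean squared error check, assuming the common definition sum((ref - out)^2) / sum(ref^2). The helper names `nmse` and `passes` are hypothetical, and the exact formula used by test-backend-ops should be checked against its source. The 5e-4 threshold is taken from the log.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: normalized mean squared error of `out` against `ref`.
// Assumes `ref` is not all zeros (otherwise the denominator is 0).
double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err  = 0.0; // sum of squared differences
    double norm = 0.0; // sum of squared reference values
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) ref[i] - (double) out[i];
        err  += d*d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return err / norm;
}

// A result passes when its error does not exceed the threshold
// (the log reports FAIL when NMSE > 0.0005).
bool passes(const std::vector<float> & ref, const std::vector<float> & out, double max_nmse = 5e-4) {
    return nmse(ref, out) <= max_nmse;
}
```

With this definition, the failing cases above are roughly 13x to 15x over the threshold, which points to a real miscalculation rather than rounding noise.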

Maybe a race condition or running out of registers? Not sure.

The proposed change in this PR fixes the tests and the performance remains the same:

./scripts/compare-commits.sh master gg/metal-fix-fa -m ./models/tinyllama-1b/ggml-model-f16.gguf -m ./models/tinyllama-1b/ggml-model-q8_0.gguf -m ./models/tinyllama-1b/ggml-model-q4_0.gguf -m ./models/llama-8b-v3/ggml-model-f16.gguf -r 10 -fa 1
| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
|  | llama 1B F16 | 2.05 | 1100048384 | pp512 | 7732.45 | 7714.03 | 1.00 |
|  | llama 1B F16 | 2.05 | 1100048384 | tg128 | 146.94 | 147.20 | 1.00 |
|  | llama 1B Q4_0 | 0.59 | 1100048384 | pp512 | 7064.83 | 7053.91 | 1.00 |
|  | llama 1B Q4_0 | 0.59 | 1100048384 | tg128 | 236.93 | 236.28 | 1.00 |
|  | llama 1B Q8_0 | 1.09 | 1100048384 | pp512 | 6953.21 | 6934.44 | 1.00 |
|  | llama 1B Q8_0 | 1.09 | 1100048384 | tg128 | 202.32 | 202.45 | 1.00 |
|  | llama 8B F16 | 14.96 | 8030261248 | pp512 | 1398.25 | 1396.37 | 1.00 |
|  | llama 8B F16 | 14.96 | 8030261248 | tg128 | 38.53 | 38.51 | 1.00 |

@ggerganov ggerganov requested a review from slaren August 26, 2024 10:20
@slaren
Collaborator

slaren commented Aug 26, 2024

I cannot reproduce this on M3 Max, but a race condition seems like it would be the most likely cause. If that's the case, it is likely to create more issues in the future, so it would be good to fix it completely rather than making some changes that make the issue less likely, but still possible.

@ggerganov
Owner Author

I found another fix with #9188, but I still don't understand why master fails. It also fails on M1 Pro. Additionally, every generation with llama-cli and a fixed random seed results in a different output, which does indicate a race condition.

Another observation is that the race is likely inside this block:

simdgroup_store(mqk, ss + 8*cc, TF, 0, false);
const short tx = tiisg%4;
const short ty = tiisg/4;
// mqk = mqk*scale
ss[8*cc + ty*TF + 2*tx + 0] *= scale;
ss[8*cc + ty*TF + 2*tx + 1] *= scale;
if (logit_softcap != 0.0f) {
    ss[8*cc + ty*TF + 2*tx + 0] = logit_softcap*precise::tanh(ss[8*cc + ty*TF + 2*tx + 0]);
    ss[8*cc + ty*TF + 2*tx + 1] = logit_softcap*precise::tanh(ss[8*cc + ty*TF + 2*tx + 1]);
}
if (mask != q) {
    // mqk = mqk + mask*slope
    ss[8*cc + ty*TF + 2*tx + 0] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 0];
    ss[8*cc + ty*TF + 2*tx + 1] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 1];
}

The reason is that if I replace the per-thread tx and ty values with loops over [0..4) and [0..8) respectively, executed only by the thread with tiisg == 0, the results are correct again:

diff --git a/ggml/src/ggml-metal.metal b/ggml/src/ggml-metal.metal
index aba0b9a0..df1be087 100644
--- a/ggml/src/ggml-metal.metal
+++ b/ggml/src/ggml-metal.metal
@@ -2141,8 +2141,9 @@ kernel void kernel_flash_attn_ext_f16(
 
                     simdgroup_store(mqk, ss + 8*cc, TF, 0, false);
 
-                    const short tx = tiisg%4;
-                    const short ty = tiisg/4;
+                    if (tiisg == 0) {
+                        for (short ty = 0; ty < 8; ++ty) {
+                            for (short tx = 0; tx < 4; ++tx) {
 
                     // mqk = mqk*scale
                     ss[8*cc + ty*TF + 2*tx + 0] *= scale;
@@ -2157,6 +2158,10 @@ kernel void kernel_flash_attn_ext_f16(
                         // mqk = mqk + mask*slope
                         ss[8*cc + ty*TF + 2*tx + 0] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 0];
                         ss[8*cc + ty*TF + 2*tx + 1] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 1];
+                    }
+
+                            }
+                        }
                     }
                 }
             }
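
As a CPU-side sanity check of the indexing, here is a minimal sketch that compares the per-lane mapping (`tx = tiisg%4`, `ty = tiisg/4`) with the serial loops from the diff. The values `N_LANES = 32` and `TF = 8` and all function names are assumptions for illustration only; in the kernel, `TF` is the row stride of `ss` and may differ, and `cc` is fixed to 0 here.

```cpp
#include <array>
#include <cassert>

constexpr int N_LANES = 32; // assumed number of threads in a simdgroup
constexpr int TF      = 8;  // hypothetical row stride, just wide enough for one 8x8 tile

// Record a write to both elements handled for a given (tx, ty) pair.
static void touch(std::array<int, 64> & hits, int tx, int ty) {
    const int idx = ty*TF + 2*tx;
    assert(idx >= 0 && idx + 1 < 64);
    hits[idx + 0]++;
    hits[idx + 1]++;
}

// Original mapping: every lane derives its own (tx, ty) from its index.
std::array<int, 64> lane_mapping_hits() {
    std::array<int, 64> hits{};
    for (int tiisg = 0; tiisg < N_LANES; ++tiisg) {
        touch(hits, tiisg % 4, tiisg / 4);
    }
    return hits;
}

// Variant from the diff: a single lane walks all (tx, ty) pairs.
std::array<int, 64> serial_loop_hits() {
    std::array<int, 64> hits{};
    for (int ty = 0; ty < 8; ++ty) {
        for (int tx = 0; tx < 4; ++tx) {
            touch(hits, tx, ty);
        }
    }
    return hits;
}

// True when every element of the 8x8 tile is written exactly once.
bool covers_tile_once(const std::array<int, 64> & hits) {
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

Under these assumptions both variants write each of the 64 tile elements exactly once, so the two versions differ in which thread performs the writes and when they become visible, not in which elements are updated.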

Any ideas what could be the issue? I'm not sure how to debug this further.

@slaren
Collaborator

slaren commented Aug 26, 2024

I don't know what threads use what data, and I can't test it either. Intuitively I would assume that a barrier is necessary before the softmax to synchronize the shared memory in ss.

@ggerganov
Owner Author

It does look like a barrier is necessary, though adding one did not fix the problem.

However, I rewrote the code to make the threads access the data more coherently and this resolved the problem: #9189

@ggerganov
Owner Author

Superseded by #9189

@ggerganov ggerganov closed this Aug 26, 2024