
CUDA error: the function failed to launch on the GPU #579

Open
sinand99 opened this issue Jan 23, 2025 · 4 comments

Comments

@sinand99

Just trying the example on the readme page with JuggernautXL, but it fails:

sd.exe -m "G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors" --cfg-scale 7.5 --steps 35 --sampling-method euler -H 1024 -W 1024 --seed 42 --diffusion-fa -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
[INFO ] stable-diffusion.cpp:195 - loading model from 'G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors'
[INFO ] model.cpp:888 - load G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:242 - Version: SDXL
[INFO ] stable-diffusion.cpp:275 - Weight type: f16
[INFO ] stable-diffusion.cpp:276 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:277 - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:278 - VAE weight type: f32
[WARN ] stable-diffusion.cpp:289 - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[INFO ] stable-diffusion.cpp:326 - Using flash attention in the diffusion model
|==================================================| 2641/2641 - 333.33it/s
[INFO ] stable-diffusion.cpp:516 - total params memory size = 8113.89MB (VRAM 8113.89MB, RAM 0.00MB): clip 3119.36MB(VRAM), unet 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:520 - loading model from 'G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors' completed, taking 6.22s
[INFO ] stable-diffusion.cpp:554 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:688 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 1138 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances../fattn-wmma-f16.cuh:422: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 600
(the error line above is repeated 31 more times)
CUDA error: the function failed to launch on the GPU
current device: 0, in function ggml_cuda_op_mul_mat_cublas at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:1151
cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error

@stduhpf
Contributor

stduhpf commented Jan 23, 2025

Are you using a custom build, or the release?

I suspect the CUDA build in the latest release doesn't support the Ampere (RTX 30) architecture, for reasons unknown to me (no Ada (RTX 40) or Hopper support either). So you would have to build it from source on your end to get it to work, or use something else like the Vulkan build.

https://github.com/leejet/stable-diffusion.cpp/blob/master/.github/workflows/build.yml#L166
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
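For reference, a from-source build targeting Ampere might look like the following. This is a sketch, not the project's documented procedure: the `SD_CUDA` option name and the architecture value `86` (compute capability 8.6, e.g. RTX 3070 Ti) are assumptions, so check the repository README for the exact flags.

```shell
# Sketch: build stable-diffusion.cpp with device code for Ampere (sm_86).
# SD_CUDA and the architecture value are assumptions; verify against the README.
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
cmake -B build -DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release
```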

@sinand99
Author

> Are you using a custom build, or the release?
>
> I suspect the CUDA build in the latest release doesn't support the Ampere (RTX 30) architecture, for reasons unknown to me (no Ada (RTX 40) or Hopper support either). So you would have to build it from source on your end to get it to work, or use something else like the Vulkan build.
>
> https://github.com/leejet/stable-diffusion.cpp/blob/master/.github/workflows/build.yml#L166
> https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list

I am using the release. It doesn't support Ampere? That's very strange. So the released stable-diffusion.cpp binaries only work with old GPUs? That makes them useless for most people, who just download the released exe rather than building it from source every time. I think that support should be added ASAP.

@stduhpf
Contributor

stduhpf commented Jan 23, 2025

Yes, that's not something I expected. Vulkan should work though, but performance might not be as good.

@ag2s20150909 is there anything preventing us from adding the more recent cuda architectures to the CI build?

@ag2s20150909
Contributor

ag2s20150909 commented Jan 24, 2025

Maybe `all-major` or `all` should be used, but the binary size will increase.
https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES
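Concretely, the CI fix being suggested could be sketched as below. `all-major` and `all` are documented `CMAKE_CUDA_ARCHITECTURES` values (CMake ≥ 3.23); the `SD_CUDA` option name in the command-line form is an assumption carried over from the project's build options.

```cmake
# Sketch: make the release build embed real device code for every major
# CUDA architecture the toolkit supports, instead of a single arch.
set(CMAKE_CUDA_ARCHITECTURES all-major)

# Equivalent at configure time (SD_CUDA is assumed; check the README):
#   cmake -B build -DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=all-major
```

The trade-off noted above is real: each additional architecture adds its own copy of the device code to the fat binary, so `all` produces a noticeably larger executable than a single-arch build.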
