
CUDA error: the function failed to launch on the GPU #579

Open
sinand99 opened this issue Jan 23, 2025 · 4 comments

Comments

@sinand99

Just trying the example on the readme page with JuggernautXL, but it fails:

sd.exe -m "G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors" --cfg-scale 7.5 --steps 35 --sampling-method euler -H 1024 -W 1024 --seed 42 --diffusion-fa -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
[INFO ] stable-diffusion.cpp:195 - loading model from 'G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors'
[INFO ] model.cpp:888 - load G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:242 - Version: SDXL
[INFO ] stable-diffusion.cpp:275 - Weight type: f16
[INFO ] stable-diffusion.cpp:276 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:277 - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:278 - VAE weight type: f32
[WARN ] stable-diffusion.cpp:289 - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[INFO ] stable-diffusion.cpp:326 - Using flash attention in the diffusion model
|==================================================| 2641/2641 - 333.33it/s
[INFO ] stable-diffusion.cpp:516 - total params memory size = 8113.89MB (VRAM 8113.89MB, RAM 0.00MB): clip 3119.36MB(VRAM), unet 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:520 - loading model from 'G:\AI\Image\stable-diffusion-webui\models\Stable-diffusion\juggernautXL_juggXIByRundiffusion.safetensors' completed, taking 6.22s
[INFO ] stable-diffusion.cpp:554 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:688 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 1138 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances../fattn-wmma-f16.cuh:422: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 600
(the error line above is repeated 31 more times)
CUDA error: the function failed to launch on the GPU
current device: 0, in function ggml_cuda_op_mul_mat_cublas at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:1151
cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error

@stduhpf
Contributor

stduhpf commented Jan 23, 2025

Are you using a custom build, or the release?

I suspect the CUDA build in the latest release doesn't support the Ampere (RTX 30) architecture, for reasons unknown to me (no Ada (RTX 40) or Hopper support either). So you would have to build it from source on your end to get it to work, or use something else like the Vulkan build.

https://github.com/leejet/stable-diffusion.cpp/blob/master/.github/workflows/build.yml#L166
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
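For reference, a from-source build targeting Ampere might look like the following. This is a sketch, not the project's documented procedure: the `SD_CUDA` option name and the architecture value `86` (compute capability 8.6, e.g. RTX 3070 Ti) are assumptions, so check the repository README for the exact flags.

```shell
# Sketch: build stable-diffusion.cpp with device code for Ampere (sm_86).
# SD_CUDA and the architecture value are assumptions; verify against the README.
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
cmake -B build -DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release
```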

@sinand99
Author

> Are you using a custom build, or the release?
>
> I suspect the CUDA build in the latest release doesn't support the Ampere (RTX 30) architecture, for reasons unknown to me (no Ada (RTX 40) or Hopper support either). So you would have to build it from source on your end to get it to work, or use something else like the Vulkan build.
>
> https://github.com/leejet/stable-diffusion.cpp/blob/master/.github/workflows/build.yml#L166
> https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list

I am using the release. It doesn't support Ampere? That's very strange. So the released stable-diffusion.cpp binaries only work with old GPUs? That makes them useless for most people, who just download the released exe rather than building it from source every time. I think that support should be added ASAP.

@stduhpf
Contributor

stduhpf commented Jan 23, 2025

Yes, that's not something I expected. Vulkan should work though, but performance might not be as good.

@ag2s20150909 is there anything preventing us from adding the more recent cuda architectures to the CI build?

@ag2s20150909
Contributor

ag2s20150909 commented Jan 24, 2025

Maybe `all-major` or `all` should be used, but the binary size will increase.
https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES
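Concretely, the CI fix being suggested could be sketched as below. `all-major` and `all` are documented `CMAKE_CUDA_ARCHITECTURES` values (CMake ≥ 3.23); the `SD_CUDA` option name in the command-line form is an assumption carried over from the project's build options.

```cmake
# Sketch: make the release build embed real device code for every major
# CUDA architecture the toolkit supports, instead of a single arch.
set(CMAKE_CUDA_ARCHITECTURES all-major)

# Equivalent at configure time (SD_CUDA is assumed; check the README):
#   cmake -B build -DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=all-major
```

The trade-off noted above is real: each additional architecture adds its own copy of the device code to the fat binary, so `all` produces a noticeably larger executable than a single-arch build.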
