Huge slowdown when performing fft over the second dimension of a 3D array #2641

Open
marcsgil opened this issue Feb 5, 2025 · 12 comments
Labels
cuda libraries Stuff about CUDA library wrappers. performance How fast can we go?

Comments

@marcsgil

marcsgil commented Feb 5, 2025

I'm getting very slow performance when performing an FFT over the second dimension of a 3D array.

The Minimal Working Example (MWE) for this bug:

using CUDA, CUDA.CUFFT, BenchmarkTools
CUDA.allowscalar(false)

x = CUDA.randn(ComplexF32, 128, 128, 128)

for dim in 1:ndims(x)
    @info "FFT along dimension $dim"
    display(@benchmark CUDA.@sync fft($x, $dim))
    println()
end

gives me

[ Info: FFT along dimension 1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  20.100 μs …  2.264 ms  ┊ GC (min … max):  0.00% … 96.17%
 Time  (median):     23.840 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   29.895 μs ± 42.071 μs  ┊ GC (mean ± σ):  16.20% ± 12.13%

  ▇█▂                                                      ▁▁ ▁
  ███▇▇▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄██ █
  20.1 μs      Histogram: log(frequency) by time       212 μs <

 Memory estimate: 1.12 KiB, allocs estimate: 30.

[ Info: FFT along dimension 2
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  402.482 μs …   1.119 ms  ┊ GC (min … max): 0.00% … 47.31%
 Time  (median):     405.602 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   426.066 μs ± 112.028 μs  ┊ GC (mean ± σ):  3.37% ±  7.67%

  █▁                                                          ▂ ▁
  ██▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃██ █
  402 μs        Histogram: log(frequency) by time       1.06 ms <

 Memory estimate: 64.89 KiB, allocs estimate: 3221.

[ Info: FFT along dimension 3
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  21.350 μs … 476.332 μs  ┊ GC (min … max):  0.00% … 77.35%
 Time  (median):     24.470 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   31.359 μs ±  38.881 μs  ┊ GC (mean ± σ):  17.95% ± 12.71%

  █▆                                                         ▂ ▁
  ████▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇█ █
  21.4 μs       Histogram: log(frequency) by time       249 μs <

 Memory estimate: 1.12 KiB, allocs estimate: 30.
Manifest.toml

# This file is machine-generated - editing it directly is not advised

julia_version = "1.11.3"
manifest_format = "2.0"
project_hash = "61d5c7e6e585098ec1d5968623bed79894027344"

[[deps.AbstractFFTs]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "d92ad398961a3ed262d8bf04a1a2b8340f915fef"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.5.0"

    [deps.AbstractFFTs.extensions]
    AbstractFFTsChainRulesCoreExt = "ChainRulesCore"
    AbstractFFTsTestExt = "Test"

    [deps.AbstractFFTs.weakdeps]
    ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
    Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[deps.Adapt]]
deps = ["LinearAlgebra", "Requires"]
git-tree-sha1 = "50c3c56a52972d78e8be9fd135bfb91c9574c140"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "4.1.1"
weakdeps = ["StaticArrays"]

    [deps.Adapt.extensions]
    AdaptStaticArraysExt = "StaticArrays"

[[deps.ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"
version = "1.1.2"

[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"
version = "1.11.0"

[[deps.Atomix]]
deps = ["UnsafeAtomics"]
git-tree-sha1 = "93da6c8228993b0052e358ad592ee7c1eccaa639"
uuid = "a9b6321e-bd34-4604-b9c9-b65b8de01458"
version = "1.1.0"

    [deps.Atomix.extensions]
    AtomixCUDAExt = "CUDA"
    AtomixMetalExt = "Metal"
    AtomixOpenCLExt = "OpenCL"
    AtomixoneAPIExt = "oneAPI"

    [deps.Atomix.weakdeps]
    CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
    Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
    OpenCL = "08131aa3-fb12-5dee-8b74-c09406e224a2"
    oneAPI = "8f75cd03-7ff8-4ecb-9b8f-daf728133b1b"

[[deps.BFloat16s]]
deps = ["LinearAlgebra", "Printf", "Random", "Test"]
git-tree-sha1 = "2c7cc21e8678eff479978a0a2ef5ce2f51b63dff"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.5.0"

[[deps.Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
version = "1.11.0"

[[deps.BenchmarkTools]]
deps = ["Compat", "JSON", "Logging", "Printf", "Profile", "Statistics", "UUIDs"]
git-tree-sha1 = "e38fbc49a620f5d0b660d7f543db1009fe0f8336"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.6.0"

[[deps.CEnum]]
git-tree-sha1 = "389ad5c84de1ae7cf0e28e381131c98ea87d54fc"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.5.0"

[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "Crayons", "DataFrames", "ExprTools", "GPUArrays", "GPUCompiler", "KernelAbstractions", "LLVM", "LLVMLoopInfo", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "NVTX", "Preferences", "PrettyTables", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "StaticArrays", "Statistics", "demumble_jll"]
git-tree-sha1 = "7be665c420b5d16059b1ba00b1dbb4e85012fa65"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "5.6.1"

    [deps.CUDA.extensions]
    ChainRulesCoreExt = "ChainRulesCore"
    EnzymeCoreExt = "EnzymeCore"
    SpecialFunctionsExt = "SpecialFunctions"

    [deps.CUDA.weakdeps]
    ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
    EnzymeCore = "f151be2c-9106-41f4-ab19-57ee4f262869"
    SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"

[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "14996d716a2eaaeccfc8d7bc854dd87fde720ac1"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.10.4+0"

[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "33576c7c1b2500f8e7e6baa082e04563203b3a45"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.3.5"

[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"]
git-tree-sha1 = "17f1536c600133f7c4113bae0a2d98dbf27c7ebc"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.15.5+0"

[[deps.ColorTypes]]
deps = ["FixedPointNumbers", "Random"]
git-tree-sha1 = "c7acce7a7e1078a20a285211dd73cd3941a871d6"
uuid = "3da002f7-5984-5a60-b8a6-cbb66c0b333f"
version = "0.12.0"

    [deps.ColorTypes.extensions]
    StyledStringsExt = "StyledStrings"

    [deps.ColorTypes.weakdeps]
    StyledStrings = "f489334b-da3d-4c2e-b8f0-e476e12c162b"

[[deps.Colors]]
deps = ["ColorTypes", "FixedPointNumbers", "Reexport"]
git-tree-sha1 = "64e15186f0aa277e174aa81798f7eb8598e0157e"
uuid = "5ae59095-9a9b-59fe-a467-6f913c188581"
version = "0.13.0"

[[deps.Compat]]
deps = ["TOML", "UUIDs"]
git-tree-sha1 = "8ae8d32e09f0dcf42a36b90d4e17f5dd2e4c4215"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "4.16.0"
weakdeps = ["Dates", "LinearAlgebra"]

    [deps.Compat.extensions]
    CompatLinearAlgebraExt = "LinearAlgebra"

[[deps.CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"
version = "1.1.1+0"

[[deps.Crayons]]
git-tree-sha1 = "249fe38abf76d48563e2f4556bebd215aa317e15"
uuid = "a8cc5b0e-0ffa-5ad4-8c14-923d3ee1735f"
version = "4.1.1"

[[deps.DataAPI]]
git-tree-sha1 = "abe83f3a2f1b857aac70ef8b269080af17764bbe"
uuid = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
version = "1.16.0"

[[deps.DataFrames]]
deps = ["Compat", "DataAPI", "DataStructures", "Future", "InlineStrings", "InvertedIndices", "IteratorInterfaceExtensions", "LinearAlgebra", "Markdown", "Missings", "PooledArrays", "PrecompileTools", "PrettyTables", "Printf", "Random", "Reexport", "SentinelArrays", "SortingAlgorithms", "Statistics", "TableTraits", "Tables", "Unicode"]
git-tree-sha1 = "fb61b4812c49343d7ef0b533ba982c46021938a6"
uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
version = "1.7.0"

[[deps.DataStructures]]
deps = ["Compat", "InteractiveUtils", "OrderedCollections"]
git-tree-sha1 = "1d0a14036acb104d9e89698bd408f63ab58cdc82"
uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
version = "0.18.20"

[[deps.DataValueInterfaces]]
git-tree-sha1 = "bfc1187b79289637fa0ef6d4436ebdfe6905cbd6"
uuid = "e2d170a0-9d28-54be-80f0-106bbe20a464"
version = "1.0.0"

[[deps.Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"
version = "1.11.0"

[[deps.Downloads]]
deps = ["ArgTools", "FileWatching", "LibCURL", "NetworkOptions"]
uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
version = "1.6.0"

[[deps.ExprTools]]
git-tree-sha1 = "27415f162e6028e81c72b82ef756bf321213b6ec"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.10"

[[deps.FileWatching]]
uuid = "7b1f6079-737a-58dc-b8bc-7a2ca5c1b5ee"
version = "1.11.0"

[[deps.FixedPointNumbers]]
deps = ["Statistics"]
git-tree-sha1 = "05882d6995ae5c12bb5f36dd2ed3f61c98cbb172"
uuid = "53c48c17-4a7d-5ca2-90c5-79b7896eea93"
version = "0.8.5"

[[deps.Future]]
deps = ["Random"]
uuid = "9fa8497b-333b-5362-9e8d-4d0656e87820"
version = "1.11.0"

[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "KernelAbstractions", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "ScopedValues", "Serialization", "Statistics"]
git-tree-sha1 = "0ef97e93edced3d0e713f4cfd031cc9020e022b0"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "11.2.1"

[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "83cf05ab16a73219e5f6bd1bdfa9848fa24ac627"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.2.0"

[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "PrecompileTools", "Preferences", "Scratch", "Serialization", "TOML", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "8e30cd0b1934f03dd925416970061c1014c6686f"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "1.1.0"

[[deps.HashArrayMappedTries]]
git-tree-sha1 = "2eaa69a7cab70a52b9687c8bf950a5a93ec895ae"
uuid = "076d061b-32b6-4027-95e0-9a2c6f6d7e74"
version = "0.2.0"

[[deps.InlineStrings]]
git-tree-sha1 = "45521d31238e87ee9f9732561bfee12d4eebd52d"
uuid = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48"
version = "1.4.2"

    [deps.InlineStrings.extensions]
    ArrowTypesExt = "ArrowTypes"
    ParsersExt = "Parsers"

    [deps.InlineStrings.weakdeps]
    ArrowTypes = "31f734f8-188a-4ce0-8406-c8a06bd891cd"
    Parsers = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"

[[deps.InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
version = "1.11.0"

[[deps.InvertedIndices]]
git-tree-sha1 = "6da3c4316095de0f5ee2ebd875df8721e7e0bdbe"
uuid = "41ab1584-1d38-5bbf-9106-f11c6c58b48f"
version = "1.3.1"

[[deps.IteratorInterfaceExtensions]]
git-tree-sha1 = "a3f24677c21f5bbe9d2a714f95dcd58337fb2856"
uuid = "82899510-4779-5014-852e-03e436cf321d"
version = "1.0.0"

[[deps.JLLWrappers]]
deps = ["Artifacts", "Preferences"]
git-tree-sha1 = "a007feb38b422fbdab534406aeca1b86823cb4d6"
uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210"
version = "1.7.0"

[[deps.JSON]]
deps = ["Dates", "Mmap", "Parsers", "Unicode"]
git-tree-sha1 = "31e996f0a15c7b280ba9f76636b3ff9e2ae58c9a"
uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
version = "0.21.4"

[[deps.JuliaNVTXCallbacks_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "af433a10f3942e882d3c671aacb203e006a5808f"
uuid = "9c1d0b0a-7046-5b2e-a33f-ea22f176ac7e"
version = "0.2.1+0"

[[deps.KernelAbstractions]]
deps = ["Adapt", "Atomix", "InteractiveUtils", "MacroTools", "PrecompileTools", "Requires", "StaticArrays", "UUIDs"]
git-tree-sha1 = "d5bc0b079382e89bfa91433639bc74b9f9e17ae7"
uuid = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
version = "0.9.33"

    [deps.KernelAbstractions.extensions]
    EnzymeExt = "EnzymeCore"
    LinearAlgebraExt = "LinearAlgebra"
    SparseArraysExt = "SparseArrays"

    [deps.KernelAbstractions.weakdeps]
    EnzymeCore = "f151be2c-9106-41f4-ab19-57ee4f262869"
    LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
    SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Preferences", "Printf", "Unicode"]
git-tree-sha1 = "5fcfea6df2ff3e4da708a40c969c3812162346df"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "9.2.0"
weakdeps = ["BFloat16s"]

    [deps.LLVM.extensions]
    BFloat16sExt = "BFloat16s"

[[deps.LLVMExtra_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"]
git-tree-sha1 = "4b5ad6a4ffa91a00050a964492bc4f86bb48cea0"
uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab"
version = "0.0.35+0"

[[deps.LLVMLoopInfo]]
git-tree-sha1 = "2e5c102cfc41f48ae4740c7eca7743cc7e7b75ea"
uuid = "8b046642-f1f6-4319-8d3c-209ddc03c586"
version = "1.0.0"

[[deps.LaTeXStrings]]
git-tree-sha1 = "dda21b8cbd6a6c40d9d02a73230f9d70fed6918c"
uuid = "b964fa9f-0449-5b57-a5c2-d3ea65f4040f"
version = "1.4.0"

[[deps.LazyArtifacts]]
deps = ["Artifacts", "Pkg"]
uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3"
version = "1.11.0"

[[deps.LibCURL]]
deps = ["LibCURL_jll", "MozillaCACerts_jll"]
uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21"
version = "0.6.4"

[[deps.LibCURL_jll]]
deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"]
uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0"
version = "8.6.0+0"

[[deps.LibGit2]]
deps = ["Base64", "LibGit2_jll", "NetworkOptions", "Printf", "SHA"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"
version = "1.11.0"

[[deps.LibGit2_jll]]
deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll"]
uuid = "e37daf67-58a4-590a-8e99-b0245dd2ffc5"
version = "1.7.2+0"

[[deps.LibSSH2_jll]]
deps = ["Artifacts", "Libdl", "MbedTLS_jll"]
uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8"
version = "1.11.0+1"

[[deps.Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"
version = "1.11.0"

[[deps.LinearAlgebra]]
deps = ["Libdl", "OpenBLAS_jll", "libblastrampoline_jll"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
version = "1.11.0"

[[deps.Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"
version = "1.11.0"

[[deps.MacroTools]]
git-tree-sha1 = "72aebe0b5051e5143a079a4685a46da330a40472"
uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
version = "0.5.15"

[[deps.Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"
version = "1.11.0"

[[deps.MbedTLS_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1"
version = "2.28.6+0"

[[deps.Missings]]
deps = ["DataAPI"]
git-tree-sha1 = "ec4f7fbeab05d7747bdf98eb74d130a2a2ed298d"
uuid = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
version = "1.2.0"

[[deps.Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"
version = "1.11.0"

[[deps.MozillaCACerts_jll]]
uuid = "14a3606d-f60d-562e-9121-12d972cd8159"
version = "2023.12.12"

[[deps.NVTX]]
deps = ["Colors", "JuliaNVTXCallbacks_jll", "Libdl", "NVTX_jll"]
git-tree-sha1 = "6a6f8bfaa91bb2e40ff562ab9f30dc827741daef"
uuid = "5da4648a-3479-48b8-97b9-01cb529c0a1f"
version = "0.3.5"

[[deps.NVTX_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "ce3269ed42816bf18d500c9f63418d4b0d9f5a3b"
uuid = "e98f9f5b-d649-5603-91fd-7774390e6439"
version = "3.1.0+2"

[[deps.NetworkOptions]]
uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908"
version = "1.2.0"

[[deps.OpenBLAS_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"]
uuid = "4536629a-c528-5b80-bd46-f80d51c5b363"
version = "0.3.27+1"

[[deps.OrderedCollections]]
git-tree-sha1 = "cc4054e898b852042d7b503313f7ad03de99c3dd"
uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
version = "1.8.0"

[[deps.Parsers]]
deps = ["Dates", "PrecompileTools", "UUIDs"]
git-tree-sha1 = "8489905bcdbcfac64d1daa51ca07c0d8f0283821"
uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
version = "2.8.1"

[[deps.Pkg]]
deps = ["Artifacts", "Dates", "Downloads", "FileWatching", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "Random", "SHA", "TOML", "Tar", "UUIDs", "p7zip_jll"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
version = "1.11.0"

    [deps.Pkg.extensions]
    REPLExt = "REPL"

    [deps.Pkg.weakdeps]
    REPL = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[deps.PooledArrays]]
deps = ["DataAPI", "Future"]
git-tree-sha1 = "36d8b4b899628fb92c2749eb488d884a926614d3"
uuid = "2dfb63ee-cc39-5dd5-95bd-886bf059d720"
version = "1.4.3"

[[deps.PrecompileTools]]
deps = ["Preferences"]
git-tree-sha1 = "5aa36f7049a63a1528fe8f7c3f2113413ffd4e1f"
uuid = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
version = "1.2.1"

[[deps.Preferences]]
deps = ["TOML"]
git-tree-sha1 = "9306f6085165d270f7e3db02af26a400d580f5c6"
uuid = "21216c6a-2e73-6563-6e65-726566657250"
version = "1.4.3"

[[deps.PrettyTables]]
deps = ["Crayons", "LaTeXStrings", "Markdown", "PrecompileTools", "Printf", "Reexport", "StringManipulation", "Tables"]
git-tree-sha1 = "1101cd475833706e4d0e7b122218257178f48f34"
uuid = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
version = "2.4.0"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"
version = "1.11.0"

[[deps.Profile]]
uuid = "9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"
version = "1.11.0"

[[deps.Random]]
deps = ["SHA"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
version = "1.11.0"

[[deps.Random123]]
deps = ["Random", "RandomNumbers"]
git-tree-sha1 = "4743b43e5a9c4a2ede372de7061eed81795b12e7"
uuid = "74087812-796a-5b5d-8853-05524746bad3"
version = "1.7.0"

[[deps.RandomNumbers]]
deps = ["Random"]
git-tree-sha1 = "c6ec94d2aaba1ab2ff983052cf6a606ca5985902"
uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143"
version = "1.6.0"

[[deps.Reexport]]
git-tree-sha1 = "45e428421666073eab6f2da5c9d310d99bb12f9b"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "1.2.2"

[[deps.Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "838a3a4188e2ded87a4f9f184b4b0d78a1e91cb7"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.3.0"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.ScopedValues]]
deps = ["HashArrayMappedTries", "Logging"]
git-tree-sha1 = "1147f140b4c8ddab224c94efa9569fc23d63ab44"
uuid = "7e506255-f358-4e82-b7e4-beb19740aa63"
version = "1.3.0"

[[deps.Scratch]]
deps = ["Dates"]
git-tree-sha1 = "3bac05bc7e74a75fd9cba4295cde4045d9fe2386"
uuid = "6c6a2e73-6563-6170-7368-637461726353"
version = "1.2.1"

[[deps.SentinelArrays]]
deps = ["Dates", "Random"]
git-tree-sha1 = "712fb0231ee6f9120e005ccd56297abbc053e7e0"
uuid = "91c51154-3ec4-41a3-a24f-3f23e20d615c"
version = "1.4.8"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
version = "1.11.0"

[[deps.SortingAlgorithms]]
deps = ["DataStructures"]
git-tree-sha1 = "66e0a8e672a0bdfca2c3f5937efb8538b9ddc085"
uuid = "a2af1166-a08f-5f64-846c-94a0d3cef48c"
version = "1.2.1"

[[deps.SparseArrays]]
deps = ["Libdl", "LinearAlgebra", "Random", "Serialization", "SuiteSparse_jll"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
version = "1.11.0"

[[deps.StaticArrays]]
deps = ["LinearAlgebra", "PrecompileTools", "Random", "StaticArraysCore"]
git-tree-sha1 = "02c8bd479d26dbeff8a7eb1d77edfc10dacabc01"
uuid = "90137ffa-7385-5640-81b9-e52037218182"
version = "1.9.11"

    [deps.StaticArrays.extensions]
    StaticArraysChainRulesCoreExt = "ChainRulesCore"
    StaticArraysStatisticsExt = "Statistics"

    [deps.StaticArrays.weakdeps]
    ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
    Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[deps.StaticArraysCore]]
git-tree-sha1 = "192954ef1208c7019899fbf8049e717f92959682"
uuid = "1e83bf80-4336-4d27-bf5d-d5a4f845583c"
version = "1.4.3"

[[deps.Statistics]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "ae3bb1eb3bba077cd276bc5cfc337cc65c3075c0"
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
version = "1.11.1"
weakdeps = ["SparseArrays"]

    [deps.Statistics.extensions]
    SparseArraysExt = ["SparseArrays"]

[[deps.StringManipulation]]
deps = ["PrecompileTools"]
git-tree-sha1 = "a6b1675a536c5ad1a60e5a5153e1fee12eb146e3"
uuid = "892a3eda-7b42-436c-8928-eab12a02cf0e"
version = "0.4.0"

[[deps.SuiteSparse_jll]]
deps = ["Artifacts", "Libdl", "libblastrampoline_jll"]
uuid = "bea87d4a-7f5b-5778-9afe-8cc45184846c"
version = "7.7.0+0"

[[deps.TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"

[[deps.TableTraits]]
deps = ["IteratorInterfaceExtensions"]
git-tree-sha1 = "c06b2f539df1c6efa794486abfb6ed2022561a39"
uuid = "3783bdb8-4a98-5b6b-af9a-565f29a5fe9c"
version = "1.0.1"

[[deps.Tables]]
deps = ["DataAPI", "DataValueInterfaces", "IteratorInterfaceExtensions", "OrderedCollections", "TableTraits"]
git-tree-sha1 = "598cd7c1f68d1e205689b1c2fe65a9f85846f297"
uuid = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
version = "1.12.0"

[[deps.Tar]]
deps = ["ArgTools", "SHA"]
uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e"
version = "1.10.0"

[[deps.Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
version = "1.11.0"

[[deps.TimerOutputs]]
deps = ["ExprTools", "Printf"]
git-tree-sha1 = "d7298ebdfa1654583468a487e8e83fae9d72dac3"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.26"

[[deps.UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"
version = "1.11.0"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
version = "1.11.0"

[[deps.UnsafeAtomics]]
git-tree-sha1 = "b13c4edda90890e5b04ba24e20a310fbe6f249ff"
uuid = "013be700-e6cd-48c3-b4a1-df204f14c38f"
version = "0.3.0"
weakdeps = ["LLVM"]

    [deps.UnsafeAtomics.extensions]
    UnsafeAtomicsLLVM = ["LLVM"]

[[deps.Zlib_jll]]
deps = ["Libdl"]
uuid = "83775a58-1f1d-513f-b197-d71354ab007a"
version = "1.2.13+1"

[[deps.demumble_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl"]
git-tree-sha1 = "6498e3581023f8e530f34760d18f75a69e3a4ea8"
uuid = "1e29f10c-031c-5a83-9565-69cddfc27673"
version = "1.3.0+0"

[[deps.libblastrampoline_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
version = "5.11.0+0"

[[deps.nghttp2_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d"
version = "1.59.0+0"

[[deps.p7zip_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0"
version = "17.4.0+2"

Expected behavior

I know some slowdown is expected from a noncontiguous memory access pattern, but not by this much. Furthermore, I see no slowdown when performing the FFT over the third dimension, which is also noncontiguous, and the slowdown is not present in cupy either. One can check this by running the following code:

import cupy as cp
import cupyx.scipy.fft as cufft
from cupyx.profiler import benchmark

x = cp.random.random((128, 128, 128)).astype(cp.complex64)

for axis in range(3):
    print('FFT along axis ', axis, ':')
    print(benchmark(cufft.fft, (x,), {'axis':axis}))
    print('\n')

which gives me

FFT along axis  0 :
fft                 :    CPU:    15.023 us   +/-  0.923 (min:    13.440 / max:    35.750) us     GPU-0:    48.207 us   +/-  1.777 (min:    45.056 / max:    58.368) us


FFT along axis  1 :
fft                 :    CPU:    14.247 us   +/-  0.645 (min:    13.680 / max:    34.480) us     GPU-0:    44.632 us   +/-  1.122 (min:    39.936 / max:    53.888) us


FFT along axis  2 :
fft                 :    CPU:     6.780 us   +/-  0.301 (min:     6.470 / max:    17.380) us     GPU-0:    16.641 us   +/-  1.159 (min:    12.992 / max:    26.784) us
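
When reading the two benchmarks side by side, it may help to keep in mind that Julia arrays are column-major while cupy arrays are row-major by default, so the contiguous direction is dim 1 in Julia and axis 2 in cupy. The sketch below is plain Python and only computes element strides for a dense array; the helper name `element_strides` is made up for illustration and is not part of either library.

```python
from math import prod

def element_strides(shape, order):
    """Element strides of a dense array.

    order "F" = column-major (Julia), order "C" = row-major (cupy/NumPy default).
    """
    n = len(shape)
    if order == "F":
        return tuple(prod(shape[:i]) for i in range(n))
    if order == "C":
        return tuple(prod(shape[i + 1:]) for i in range(n))
    raise ValueError("order must be 'F' or 'C'")

shape = (128, 128, 128)
julia_strides = element_strides(shape, "F")  # Julia dims 1, 2, 3
cupy_strides = element_strides(shape, "C")   # cupy axes 0, 1, 2

for dim, s in enumerate(julia_strides, start=1):
    print(f"Julia dim {dim}: stride {s}")
for axis, s in enumerate(cupy_strides):
    print(f"cupy axis {axis}: stride {s}")
```

Under this mapping, Julia's dim 2 and cupy's axis 1 both have a stride of 128 elements, so those two are the comparable cases.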

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.4
NVIDIA driver 550.144.3

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.144.3

Julia packages: 
- CUDA: 5.6.1
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.3
- LLVM: 16.0.6

2 devices:
  0: NVIDIA GeForce RTX 4090 (sm_89, 5.983 GiB / 23.988 GiB available)
  1: NVIDIA GeForce RTX 3090 (sm_86, 23.665 GiB / 24.000 GiB available)
@marcsgil marcsgil added the bug Something isn't working label Feb 5, 2025
@maleadt
Member

maleadt commented Feb 10, 2025

It'd be useful to profile the code here, using either CUDA.@profile or Nsight, to determine what cupy is doing better.

@maleadt maleadt added cuda libraries Stuff about CUDA library wrappers. performance How fast can we go? and removed bug Something isn't working labels Feb 10, 2025
@marcsgil
Author

marcsgil commented Feb 10, 2025

From the Julia side, when I run

using CUDA, CUDA.CUFFT, LinearAlgebra
CUDA.allowscalar(false)

x = CUDA.randn(ComplexF32, 128, 128, 128)
buffer = similar(x)

for dim in 1:ndims(x)
    @info "FFT along dimension $dim"
    plan = plan_fft(buffer, dim)
    display(CUDA.@profile mul!(buffer, plan, x))
    println()
end

I get the result

[ Info: FFT along dimension 1
Profiler ran for 44.35 µs, capturing 24 events.

Host-side activity: calling CUDA APIs took 19.31 µs (43.55% of the trace)
┌──────────┬────────────┬───────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                │
├──────────┼────────────┼───────┼─────────────────────┤
│   42.47% │   18.84 µs │     1 │ cuLaunchKernel      │
│    0.00% │     0.0 ns │     1 │ cuStreamIsCapturing │
└──────────┴────────────┴───────┴─────────────────────┘

Device-side activity: GPU was busy for 11.68 µs (26.34% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                                                                                                                                                              │
├──────────┼────────────┼───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   26.34% │   11.68 µs │     1 │ void vector_fft<128u, EPT<8u>, 2u, 32u, (padding_t)70, (twiddle_t)0, (loadstore_modifier_t)2, (layout_t)0, unsigned int, float>(kernel_arguments_t<unsigned int>) │
└──────────┴────────────┴───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


[ Info: FFT along dimension 2
Profiler ran for 744.1 µs, capturing 1802 events.

Host-side activity: calling CUDA APIs took 326.87 µs (43.93% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────┤
│   37.07% │  275.85 µs │   128 │   2.16 µs ± 1.72   (  1.43 ‥ 20.98)  │ cuLaunchKernel      │
│    0.80% │    5.96 µs │   128 │  46.57 ns ± 103.9  (   0.0 ‥ 476.84) │ cuStreamIsCapturing │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────┘

Device-side activity: GPU was busy for 369.79 µs (49.70% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                                                                               │
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   49.70% │  369.79 µs │   128 │   2.89 µs ± 0.14   (  2.62 ‥ 3.34) │ void regular_fft<128u, EPT<16u>, 32u, 4u, (padding_t)6, (twiddle_t)0, (loadstore_modifier_t)2, (layout_t)1, unsigned int, float>(kernel_arguments_t<unsigned int>) │
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


[ Info: FFT along dimension 3
Profiler ran for 45.3 µs, capturing 24 events.

Host-side activity: calling CUDA APIs took 20.98 µs (46.32% of the trace)
┌──────────┬────────────┬───────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                │
├──────────┼────────────┼───────┼─────────────────────┤
│   43.68% │   19.79 µs │     1 │ cuLaunchKernel      │
│    0.00% │     0.0 ns │     1 │ cuStreamIsCapturing │
└──────────┴────────────┴───────┴─────────────────────┘

Device-side activity: GPU was busy for 12.16 µs (26.84% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                                                                                                                                                               │
├──────────┼────────────┼───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   26.84% │   12.16 µs │     1 │ void regular_fft<128u, EPT<16u>, 32u, 4u, (padding_t)6, (twiddle_t)0, (loadstore_modifier_t)2, (layout_t)1, unsigned int, float>(kernel_arguments_t<unsigned int>) │
└──────────┴────────────┴───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The major difference seems to be that, when dim=2, there are 128 calls to the regular_fft kernel. For the dim=3 case, we get, instead, just 1 call.
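
As a quick sanity check on that reading, the per-call means in the dim=2 tables can be multiplied by the call count and compared with the reported totals. The numbers below are copied from the profiler output above; nothing else is assumed.

```python
# Figures copied from the CUDA.@profile output for dimension 2.
calls = 128
mean_launch_us = 2.16  # mean host time per cuLaunchKernel call
mean_kernel_us = 2.89  # mean device time per regular_fft kernel

host_total_us = calls * mean_launch_us
device_total_us = calls * mean_kernel_us

print(f"host launches:  {host_total_us:.2f} µs (profiler reports 275.85 µs)")
print(f"device kernels: {device_total_us:.2f} µs (profiler reports 369.79 µs)")

# The single regular_fft launch for dim=3 took 12.16 µs on the device.
print(f"device time relative to dim=3: {device_total_us / 12.16:.1f}x")
```

Both products land within a microsecond of the reported totals, which is consistent with the extra time coming from the 128 separate small launches rather than from any single slow kernel.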

As for the cupy code, the cluster I'm using does not have a profiler installed. As soon as I can get one working I'll post the results here.

@maleadt
Member

maleadt commented Feb 11, 2025

The loop comes from

CUDA.jl/lib/cufft/fft.jl

Lines 304 to 322 in 5461475

# a version of unsafe_execute which applies the plan to each element of trailing dimensions not covered by the plan.
# Note that for plans, with trailing non-transform dimensions views are created for each of such elements.
# Such views each have lower dimensions and are then transformed by the lower dimension low-level Cuda plan.
function unsafe_execute_trailing!(p, x, y)
    N = plan_max_dims(p.region, p.output_size)
    M = ndims(x)
    d = p.region[end]
    if M == N
        unsafe_execute!(p,x,y)
    else
        front_ids = ntuple((dd)->Colon(), d)
        for c in CartesianIndices(size(x)[d+1:end])
            ids = ntuple((dd)->c[dd], M-N)
            vx = @view x[front_ids..., ids...]
            vy = @view y[front_ids..., ids...]
            unsafe_execute!(p,vx,vy)
        end
    end
end
, which was implemented in #1903, so cc @RainerHeintzmann.
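
To make the effect of that loop concrete, here is a rough Python model that only counts how many times the plan would be executed. It is a sketch, not a translation of the CUDA.jl internals: `trailing_executions` is a hypothetical helper, and `n_plan_dims` stands in for the value of `plan_max_dims`, which is not shown in the snippet. The values passed for it below are assumptions chosen to be consistent with the profile output (1 launch for dims 1 and 3, 128 for dim 2).

```python
from itertools import product

def trailing_executions(shape, d, n_plan_dims):
    """Count plan executions when the last transformed dimension is d (1-based).

    n_plan_dims is a stand-in for plan_max_dims(...); its values are assumed here.
    """
    m = len(shape)
    if m == n_plan_dims:
        return 1  # the plan covers the whole array: a single execution
    trailing = shape[d:]  # corresponds to Julia's size(x)[d+1:end]
    count = 0
    # One execution per Cartesian index over the trailing dimensions.
    for _ in product(*(range(n) for n in trailing)):
        count += 1
    return count

shape = (128, 128, 128)
# (d, assumed n_plan_dims) pairs; the second entry is inferred from the profile.
for d, n in [(1, 3), (2, 2), (3, 3)]:
    print(f"dim {d}: {trailing_executions(shape, d, n)} execution(s)")
```

With these assumptions the model reproduces the counts seen in the profile, and it also shows that the number of launches grows with the product of all trailing dimensions, so a 4D array would be affected even more.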

@marcsgil
Author

So, here is a screenshot of the profile in Python:

[Image: profiling screenshot]
The profile was obtained by running this code:

import cupy as cp
import cupyx.scipy.fft as cufft
from cupyx.profiler import benchmark, profile, time_range

x = cp.random.random((128, 128, 128)).astype(cp.complex64)

for axis in range(3):
    cufft.fft(x, axis=axis)

with profile():
    for axis in range(3):
        with time_range('FFT along axis ' + str(axis)):
            cufft.fft(x, axis=axis)

I tried to annotate the sections corresponding to each axis. I'm not sure if there is a better way to share these results. If there is, please tell me how.

Anyway, it seems clear that cupy is launching a single vector_fft kernel for each call. The calls in the noncontiguous case (axis = 0, 1) also contain a copy kernel.

@RainerHeintzmann
Contributor

Thanks for finding out. I think the issue is the following:
The function cufftXtMakePlanMany supports an input stride (istride) and an output stride (ostride), plus an input distance (idist) and an output distance (odist) that indicate the step between successive individual transforms.
For a Y-only transform, you can only use an istride of size(data,1) and an idist of EITHER 1 (to perform the first row of transforms) OR size(data,1)*size(data,2) (to perform a series of Y-transforms for different Z coordinates).
So I do not see a way of simultaneously transforming along Y for an entire XZ slice, since this is not covered by a single stride.
However, you can use permutedims, which may make this faster, but this really depends on the machine and data sizes:
y_fft(dat) = permutedims(fft(permutedims(dat,(1,3,2)),3),(1,3,2))
but on my computer this was only a bit faster.
I have no idea how Python does this, but maybe they have access to another CUFFT library, or they rearrange the data?
But maybe we should also really compare properly planned (FFT_MEASURE) ffts and not the coarse interface without plans?
@marcsgil Or did you mean by vector_fft some other function in the CUFFT toolbox which I missed?
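The stride/distance argument above can be checked with a small sketch. It lists, for a column-major array, the linear offset at which each 1D transform along the second dimension starts, and tests whether those starts are separated by one constant distance. The helper names and the tiny array size are illustrative assumptions; nothing here calls CUFFT.

```python
def y_transform_starts(nx, ny, nz):
    """Linear offsets (0-based, column-major) of the first element of every
    1D transform along the second dimension of an (nx, ny, nz) array."""
    return [x + nx * ny * z for z in range(nz) for x in range(nx)]


def is_single_dist(starts):
    """True if consecutive starts are separated by one constant distance."""
    gaps = {b - a for a, b in zip(starts, starts[1:])}
    return len(gaps) <= 1


nx, ny, nz = 4, 3, 2
starts = y_transform_starts(nx, ny, nz)
print(starts)                         # [0, 1, 2, 3, 12, 13, 14, 15]
print(is_single_dist(starts))         # False: the gap jumps from 1 to 9
print(is_single_dist(starts[:nx]))    # True: one z slice, distance 1
print(is_single_dist(starts[::nx]))   # True: one x, distance nx*ny
```

Within each transform the element step is nx in every case; it is only the batch starts that fail to form a single arithmetic progression once both x and z vary.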

@marcsgil
Author

marcsgil commented Feb 12, 2025

I haven't benchmarked the planned cupy fft just because I'm not very familiar with its API. Nonetheless, the planned CUDA.jl fft is already slower than the unplanned cupy one.

I'm not sure either what vector_fft means, it is just the kernel name that appears in the profilers.

About how the cupy code may accomplish this, I did some investigation and this section of their code seems relevant: cupy/fft/_fft.py.

They appear to always use swapaxes so that the transformed dimension is the last one. But according to the documentation, this only creates a view into the array. Unfortunately, CUFFT in Julia does not allow me to operate on views.
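As a rough illustration of what "only creates a view" means, the sketch below models an array purely by its shape and byte strides and swaps an axis with the last one by permuting that metadata. The helper names are hypothetical, and whether cupy's copy kernel arises exactly this way is an assumption on my part.

```python
def c_strides(shape, itemsize):
    """Byte strides of a C-ordered (row-major) array."""
    strides, step = [], itemsize
    for n in reversed(shape):
        strides.append(step)
        step *= n
    return tuple(reversed(strides))


def swap_last(shape, strides, axis):
    """Swap `axis` with the last axis by permuting metadata only (no data copy)."""
    shape, strides = list(shape), list(strides)
    shape[axis], shape[-1] = shape[-1], shape[axis]
    strides[axis], strides[-1] = strides[-1], strides[axis]
    return tuple(shape), tuple(strides)


shape, itemsize = (128, 128, 128), 8   # complex64 occupies 8 bytes
strides = c_strides(shape, itemsize)
print(strides)                         # (131072, 1024, 8)

for axis in range(3):
    _, s = swap_last(shape, strides, axis)
    # The last axis is contiguous only if its stride equals the item size.
    print(axis, s, s[-1] == itemsize)
```

Only axis 2 leaves the last axis contiguous, which is at least consistent with the extra copy kernel seen for axis = 0, 1 in the profile.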

Finally, I had tried the idea of using permutedims. It is better than the current situation in Julia but still worse than cupy:

using CUDA, CUDA.CUFFT, LinearAlgebra, BenchmarkTools

function my_fft!(dest, buffer, src, perm, plan)
    iperm = invperm(perm)
    permutedims!(buffer, src, perm)
    mul!(dest, plan, buffer)
    permutedims!(dest, dest, iperm)
end

x = CUDA.randn(ComplexF32, 128, 128, 128)
perm = (2, 1, 3)
buffer = permutedims(x, perm)
dest = similar(buffer)
plan = plan_fft(buffer, 1)

@benchmark CUDA.@sync my_fft!($dest, $buffer, $x, $perm, $plan)

gives me

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  230.581 μs … 547.383 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     234.201 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   234.614 μs ±   4.656 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▃▄▃▆██▃▂▃▂▂ ▁                                  
  ▁▁▁▂▂▃▄▅▅▅▅▄▄▄▅▇█████████████▇▇▆▆▄▄▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁ ▃
  231 μs           Histogram: frequency by time          241 μs <

 Memory estimate: 5.31 KiB, allocs estimate: 168.

@RainerHeintzmann
Contributor

RainerHeintzmann commented Feb 12, 2025

OK. I have not looked into the Python code yet. However I had another, pretty odd idea, simply going back and forth in the first dimension:

x = CUDA.randn(ComplexF32, 128, 128, 128)
dest = similar(x)
planA = plan_fft(x, (1,2))
planB = plan_ifft!(x, 1)
function fft_trick(dest, src, plan_A, plan_B)
    mul!(dest, plan_A, src)
    mul!(dest, plan_B, dest)
end
fft_trick(dest, x, planA, planB)

maximum(abs.(fft(x, 2) .- dest))

@benchmark CUDA.@sync fft_trick($dest, $x, $planA, $planB)

For some odd reason, I get pretty good performance this way, even faster than transforming along only the first dimension. A drawback is the loss in numerical precision (in my case about 1e-5 for the Float32 cases). Can you post your measurements here for comparison?
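The reason the trick is exact up to rounding is that the forward and inverse transforms along the first dimension cancel, leaving only the transform along the second. A minimal pure-Python sketch of that identity on a tiny 2D array, using a naive DFT (the helper names are hypothetical and this does not use CUFFT):

```python
import cmath


def dft(v, inverse=False):
    """Naive 1D DFT; the inverse is scaled by 1/n, like ifft."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
           for j in range(n)]
    return [c / n for c in out] if inverse else out


def along_rows(a, f):
    """Apply f to each row (transform along the second index)."""
    return [f(row) for row in a]


def along_cols(a, f):
    """Apply f to each column (transform along the first index)."""
    cols = [f(list(col)) for col in zip(*a)]
    return [list(row) for row in zip(*cols)]


a = [[1 + 2j, 0.5, -1j], [2, -3 + 1j, 4]]   # 2 x 3 test array

direct = along_rows(a, dft)                                # like fft(x, 2)
both = along_cols(along_rows(a, dft), dft)                 # like fft(x, (1, 2))
trick = along_cols(both, lambda v: dft(v, inverse=True))   # then ifft along dim 1

err = max(abs(p - q) for rp, rq in zip(trick, direct) for p, q in zip(rp, rq))
print(err < 1e-12)
```

The extra forward and inverse pass is also where the additional rounding error comes from.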

@RainerHeintzmann
Contributor

Btw., regarding your code permutedims!(dest, dest, iperm): strictly speaking, an in-place operation is not allowed for permutedims! or even transpose!. But it seemed to work anyhow for these cases.

@marcsgil
Author

marcsgil commented Feb 12, 2025

About my code snippet, I indeed got the logic wrong. I meant something along the lines of

function my_fft!(buffer1, buffer2, x, perm, plan)
    iperm = invperm(perm)
    permutedims!(buffer1, x, perm)
    mul!(buffer2, plan, buffer1)
    permutedims!(x, buffer2, iperm)
end

This requires more buffers and actually overwrites x. In the end the timing is more or less the same, because the same type of operations are performed.

But your proposal seems to be the fastest: I get

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  47.190 μs … 116.550 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.170 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.205 μs ± 780.807 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        ▃▃▃▃▃▇▅▄▅▄█▄▂▃▄▁                        
  ▂▂▁▂▂▂▂▂▂▂▂▃▃▃▄▄▅▅▇▇███████████████████▇▆▇▅▅▅▅▄▄▃▃▃▃▃▂▃▃▂▂▂▂ ▄
  47.2 μs         Histogram: frequency by time         49.1 μs <

 Memory estimate: 2.75 KiB, allocs estimate: 103.

Nonetheless, this is still (a bit) slower than cupy.

@RainerHeintzmann
Contributor

OK. This more or less makes sense. The cost is roughly that of two single-direction FFTs and, interestingly, less than that of three single-direction FFTs, which tells us that there must still be some significant overhead in the call itself. I assume you have a pretty beefy GPU. Maybe the whole dataset is rather small for it. Even so, I would have expected the for loop over z to launch the kernels in parallel, and as long as no sync is called, all should be fine.

@marcsgil
Author

I'm testing this on a 4090. But I discovered the issue when trying to perform the FFT over an array of size (2, 1024, 10^5), for which I believe that the overhead of the calls was not that relevant.

@RainerHeintzmann
Contributor

I see. Then the loop runs over 10^5 entries. I guess the fft_trick should work well for this.
