Huge slowdown when performing fft over the second dimension of a 3D array #2641

Comments
It'd be useful to profile the code here.
From the Julia side, when I run

```julia
using CUDA, CUDA.CUFFT, LinearAlgebra

CUDA.allowscalar(false)

x = CUDA.randn(ComplexF32, 128, 128, 128)
buffer = similar(x)

for dim ∈ 1:ndims(x)
    @info "FFT along dimension $dim"
    plan = plan_fft(buffer, dim)
    display(CUDA.@profile mul!(buffer, plan, x))
    println()
end
```

I get the result

The major difference seems to be that, when transforming along the second dimension, the work shows up as a long sequence of small kernel launches instead of a single batched call.

As for the cupy code, the cluster I'm using does not have a profiler installed. As soon as I can get one working I'll post the results here.
The loop comes from Lines 304 to 322 in 5461475.
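For context, the pattern being pointed to is roughly the following (a schematic sketch of my own, not the actual CUDA.jl source): when the transformed dimension cannot be expressed as a single batched cuFFT plan, the transform is applied slice by slice, so the number of library calls grows with the size of the remaining dimension.

```julia
# Schematic sketch of a per-slice fallback for an FFT along dimension 2 of a
# 3D array (not the actual CUDA.jl implementation): each iteration issues its
# own cuFFT execution, so 128 small launches replace one batched call.
using CUDA, CUDA.CUFFT, LinearAlgebra

x = CUDA.randn(ComplexF32, 128, 128, 128)
dest = similar(x)

slice = similar(x, 128, 128)   # contiguous staging buffer for one x-y slice
p = plan_fft(slice, 2)         # 1D FFT along the second dimension of a slice

for k in axes(x, 3)            # one plan execution per slice -> high call overhead
    slice .= @view x[:, :, k]
    dest[:, :, k] .= p * slice
end
```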
Thanks for finding out. I think the issue is the following:
I haven't benchmarked the planned cupy fft just because I'm not very familiar with its API. Nonetheless, the planned CUDA.jl fft is already slower than the unplanned cupy one. I'm not sure either what that is.

About how the cupy code may accomplish this, I did some investigation and this section of their code seems relevant: cupy/fft/_fft.py. They appear to always use swapaxes so that the transformed dimension is the last one. But according to the documentation, this only creates a view into the array. Unfortunately, CUFFT in Julia does not allow me to operate on views.

Finally, I had tried the idea of using permutedims. It is better than the current situation in Julia, but still worse than cupy:

```julia
using CUDA, CUDA.CUFFT, LinearAlgebra, BenchmarkTools

function my_fft!(dest, buffer, src, perm, plan)
    iperm = invperm(perm)
    permutedims!(buffer, src, perm)
    mul!(dest, plan, buffer)
    permutedims!(dest, dest, iperm)
end

x = CUDA.randn(ComplexF32, 128, 128, 128)
perm = (2, 1, 3)
buffer = permutedims(x, perm)
dest = similar(buffer)
plan = plan_fft(buffer, 1)

@benchmark CUDA.@sync my_fft!($dest, $buffer, $x, $perm, $plan)
```

gives me
OK. I have not looked into the Python code yet. However, I had another, pretty odd idea: simply going back and forth in the first dimension.

```julia
using CUDA, CUDA.CUFFT, LinearAlgebra, BenchmarkTools

x = CUDA.randn(ComplexF32, 128, 128, 128)
dest = similar(x)
planA = plan_fft(x, (1, 2))
planB = plan_ifft!(x, 1)

function fft_trick(dest, src, plan_A, plan_B)
    mul!(dest, plan_A, src)
    mul!(dest, plan_B, dest)
end

fft_trick(dest, x, planA, planB)
maximum(abs.(fft(x, 2) .- dest))

@benchmark CUDA.@sync fft_trick($dest, $x, $planA, $planB)
```

For some odd reason, I get pretty nice performance this way, even faster than transforming along only the first dimension. But a drawback is the loss in numerical precision (in my case about 1e-5 for the Float32 case). Can you post your measurement for comparison here?
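For reference, here is why the trick reproduces the dim-2 transform (a short derivation of my own, not from the thread): writing $\mathcal{F}_d$ for the DFT along dimension $d$, the 1D transforms along different dimensions commute, so

$$
\mathcal{F}_1^{-1}\bigl(\mathcal{F}_{1,2}\,x\bigr) = \mathcal{F}_1^{-1}\,\mathcal{F}_1\,\mathcal{F}_2\,x = \mathcal{F}_2\,x .
$$

The extra forward/inverse round trip along dimension 1 is also what introduces the roughly 1e-5 Float32 round-off mentioned above.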
Btw. regarding your code: the final permutedims! call uses dest as both source and destination, which will not do what you want.
About my code snippet, I indeed got the logic wrong. I meant something along the lines of

```julia
function my_fft!(buffer1, buffer2, x, perm, plan)
    iperm = invperm(perm)
    permutedims!(buffer1, x, perm)
    mul!(buffer2, plan, buffer1)
    permutedims!(x, buffer2, iperm)
end
```

This requires more buffers and actually overwrites x. In the end the timing is more or less the same, because the same type of operations are performed. But your proposal seems to be the fastest: I get

Nonetheless, this is still (a bit) slower than cupy.
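For completeness, a usage sketch for the corrected helper above (the buffer setup and the correctness check are my own additions, not from the thread):

```julia
# Usage sketch for my_fft! as defined above. Note that x itself ends up
# holding the transform, so the reference result is computed first.
using CUDA, CUDA.CUFFT, LinearAlgebra

x = CUDA.randn(ComplexF32, 128, 128, 128)
perm = (2, 1, 3)
buffer1 = permutedims(x, perm)    # will hold the permuted input
buffer2 = similar(buffer1)        # will hold the FFT of the permuted input
plan = plan_fft(buffer1, 1)

reference = fft(x, 2)             # reference, computed before x is overwritten
my_fft!(buffer1, buffer2, x, perm, plan)
maximum(abs.(reference .- x))     # small (Float32 round-off): x now holds the dim-2 FFT
```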
OK. This roughly makes sense. The cost is about that of two single-direction FFTs and, interestingly, less than three single-direction FFTs, which tells us that there must still be some significant overhead in the call itself. I assume you have a pretty beefy GPU; maybe the whole dataset is sort of small for it. Even so, I would have expected that the for loop over z launches the kernels in parallel and, as long as no sync is called, all should be fine.
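One way to probe the call-overhead point (an illustrative sketch of my own, not from the thread): compare the host time needed just to issue the per-slice work with the time including the GPU execution.

```julia
# Illustrative sketch: separate the cost of issuing the cuFFT calls for the
# dim-2 transform (the case that degenerates into many small launches) from
# the cost of the GPU work itself.
using CUDA, CUDA.CUFFT, LinearAlgebra

x = CUDA.randn(ComplexF32, 128, 128, 128)
dest = similar(x)
p = plan_fft(x, 2)

mul!(dest, p, x); CUDA.synchronize()              # warm-up
t_queue = @elapsed mul!(dest, p, x)               # host time: mostly launch/call overhead
CUDA.synchronize()                                # drain the queued work before the next timing
t_total = @elapsed CUDA.@sync mul!(dest, p, x)    # includes the actual GPU execution
@show t_queue t_total
```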
I'm testing this on a 4090. But I discovered the issue when trying to perform the FFT over an array of size (2, 1024, 10^5), for which I believe that the overhead of the calls was not that relevant.
I see. Then the loop runs over 10^5 entries. I guess the
I'm getting super slow speeds when performing fft over the second dimension of a 3D array. The Minimal Working Example (MWE) for this bug:
gives me
Manifest.toml
Expected behavior
I know a slowdown is expected due to a noncontiguous memory access pattern, but not by this much. Furthermore, I actually see no slowdown when performing the fft over the third dimension, which is also noncontiguous, and it is also not present in cupy. One can check it by running the code, which gives me
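A sketch of the kind of comparison being described (not the original code; the array size is the 128-cube used elsewhere in the thread):

```julia
# Sketch: benchmark an unplanned fft along each dimension of a 3D CuArray to
# compare the contiguous (dim 1) and noncontiguous (dims 2 and 3) cases.
using CUDA, CUDA.CUFFT, BenchmarkTools

x = CUDA.randn(ComplexF32, 128, 128, 128)

for dim in 1:ndims(x)
    t = @belapsed CUDA.@sync fft($x, $dim)
    println("fft along dimension $dim: ", round(t * 1e3; digits = 3), " ms")
end
```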
Version info
Details on Julia:
Details on CUDA: