CUDA compilers insert extraneous FMAs, breaking MultiFloats.jl algorithms #23
Thanks for the find @haampie! This is very mysterious to me. I haven't played with Julia/CUDA interop before, so I'm unsure how Julia code gets compiled for GPU execution. The fact that most of the limbs are accurate is especially puzzling; if it were simply the case that CUDA's arithmetic/fma operations are improperly rounded, then you would expect all of limbs 2-8 to be garbage. But the fact that limbs 1-6 are right, while limbs 7-8 are wrong, rules out all of my easy hypotheses for what could be going wrong. I'll look into this the next time I work on MultiFloats.jl (which honestly might take a while... grad student life has me swamped these days).
Just to add another data point: I tried exactly the same code on my GPU (a 1080 Ti) and I am getting an accuracy of about 1e-114 on a 100×100 Matrix{MultiFloat{Float64, 8}}. Is this the expected accuracy level?
Hey @kunyuan, thanks for your interest in MultiFloats.jl! No, this is not the expected accuracy, and I'm afraid to report that I still don't understand what's going on on the GPU.
Hi @dzhang314, do you know CAMPARY?
Is this still an issue in v2?
@rguerrab I'm not sure about the current status of this issue, since I don't do any Julia GPU development. I originally wanted to investigate this myself, but after a few years, I've never found the opportunity or resources. What I can tell you is that MultiFloats.jl does not contain any GPU-specific code -- it is written purely in terms of ordinary Float64 arithmetic.

I'm going to close this issue until someone can demonstrate that this is a fault in MultiFloats.jl as opposed to, say, an overly aggressive optimization pass in CUDA.jl. Note that the algorithms in MultiFloats.jl are all sensitive to rounding mode (must be round-to-nearest, ties-to-even) and contraction (rounding must occur exactly in the places where IEEE 754 specifies it, no more and no less). If you have an aggressive optimizing compiler that doesn't strictly preserve IEEE 754 semantics, then MultiFloats.jl will silently and catastrophically break.
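For readers unfamiliar with why exact IEEE 754 semantics matter here, below is the standard textbook formulation of the 2Sum error-free transform (a sketch for illustration, not necessarily the exact code in MultiFloats.jl). Its correctness proof assumes every line is a single correctly rounded double-precision operation in round-to-nearest, ties-to-even; under those assumptions, s + e reproduces a + b exactly.

```julia
# Knuth's 2Sum: s is the correctly rounded sum of a and b, and e is the exact
# rounding error, so that s + e == a + b holds exactly -- but only if every
# operation below is performed as one correctly rounded IEEE 754 operation.
function two_sum(a::Float64, b::Float64)
    s  = a + b
    a1 = s - b                # recover the contribution of a to s
    b1 = s - a1               # recover the contribution of b to s
    e  = (a - a1) + (b - b1)  # exact rounding error of s
    return s, e
end
```

If a compiler reorders, contracts, or re-rounds any of these operations, the identity s + e == a + b silently fails, and every higher limb built on top of it inherits the damage.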
I've found this to originate from the fused multiply-add optimizations done by CUDA.jl. After working around them, the GPU and CPU computations coincide exactly.
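For anyone wanting to reproduce this observation, one way to check for contraction is to dump the PTX that CUDA.jl generates for a kernel whose source contains a separate multiply and add. A minimal sketch (assumed setup, not the commenter's original code); if the backend is fusing, the dump will contain fma.rn.f64 instructions:

```julia
using CUDA

# A trivial kernel whose source has a separate multiply and a separate add.
function muladd_kernel!(out, a, b, c)
    i = threadIdx().x
    @inbounds out[i] = a[i] * b[i] + c[i]
    return nothing
end

n = 32
a, b, c = CuArray(rand(n)), CuArray(rand(n)), CuArray(rand(n))
out = CUDA.zeros(Float64, n)

# Print the generated PTX; an fma.rn.f64 instruction here means the separate
# mul and add were contracted into a fused multiply-add.
CUDA.@device_code_ptx @cuda threads=n muladd_kernel!(out, a, b, c)
```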
That's interesting! Presumably the single FMA operation is very slightly more accurate than two rounded operations, so why is the CPU computation with extra steps more accurate? Is the compiled code swapping in FMAs for non-fused operations while the algorithm assumes the error-free transform of the non-fused operations? For the sake of a little extra speed, and since CPUs have supported SIMD FMA for a while too, maybe writing things explicitly with FMA and its rounding error (https://ieeexplore.ieee.org/document/5487502/) might be nice. -R
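The simplest member of that family is the exact product, which uses a single FMA to recover the rounding error of a double-precision multiplication. A minimal sketch in the standard formulation (shown for illustration; the linked paper addresses the harder problem of the error of the FMA itself):

```julia
# Exact product via FMA: hi is the correctly rounded product of a and b, and
# lo is its exact rounding error, so that hi + lo == a * b holds exactly.
function two_prod(a::Float64, b::Float64)
    hi = a * b
    lo = fma(a, b, -hi)   # single rounding recovers the residual exactly
    return hi, lo
end
```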
Hey @n-khera, thanks for making this discovery! Wow, it's extremely surprising to me that the LLVM CUDA backend automatically fuses separate adds and multiplies into FMA operations. But this answers a long-standing open question and confirms that the issue indeed has nothing to do with MultiFloats.jl but with a faulty compiler backend. I see that the LLVM behavior matches the behavior of NVIDIA's own nvcc compiler, which contracts separate multiplies and adds into FMAs by default.

@rguerrab: The error-free transformations used in MultiFloats.jl (mainly 2Sum and Fast2Mult) do not include any add/mul sequences that I could manually rewrite with FMAs. The CUDA compiler is reaching across function call boundaries to find nonlocal add/mul sequences to fuse. This is not something I can fix without implementing my own compiler infrastructure. However, I very much appreciate the suggestion, and the pointer to the ErrFma algorithm, which I had not seen before.
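To make the cross-call fusion hazard concrete, here is an illustrative sketch (hypothetical function names, not MultiFloats.jl internals) of how a multiply in an inlined helper and an add in its caller can be contracted, invalidating a downstream error term:

```julia
# The algorithm below assumes p is the correctly rounded product and s is the
# correctly rounded sum p + c; the Fast2Sum error term e is exact only under
# those assumptions (and only when |p| >= |c|).
@inline rounded_prod(a, b) = a * b

function prod_then_sum(a::Float64, b::Float64, c::Float64)
    p = rounded_prod(a, b)   # after inlining, this is just a * b
    s = p + c                # an aggressive backend may rewrite this as fma(a, b, c)
    e = (p - s) + c          # no longer the exact error of s if s was fused
    return s, e
end
```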
Maybe it is possible to call PTX intrinsics directly, as in JuliaGPU/CUDA.jl#2576. Do you know the work of Valentina Popescu et al. at Lyon? They formally verified many multifloat algorithms. This is her Ph.D. thesis... if you're interested, we can work on porting their implementation to MultiFloats.jl.
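One possible shape for that approach, assuming the round-to-nearest NVVM intrinsics llvm.nvvm.add.rn.d and llvm.nvvm.mul.rn.d are reachable from device code (the wrapper names below are hypothetical, and this is not the code from the linked issue). Because these intrinsics carry an explicit rounding mode, they are not candidates for FMA contraction:

```julia
# Hypothetical device-side wrappers; only meaningful inside GPU kernels
# compiled for the NVPTX target by CUDA.jl.
add_rn(a::Float64, b::Float64) =
    ccall("llvm.nvvm.add.rn.d", llvmcall, Float64, (Float64, Float64), a, b)

mul_rn(a::Float64, b::Float64) =
    ccall("llvm.nvvm.mul.rn.d", llvmcall, Float64, (Float64, Float64), a, b)
```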
@orkolorko Thanks for the pointer! I recently became aware of Popescu's work in the course of working on issue #42. Actually, I have now developed my own novel techniques for computer verification of MultiFloats.jl algorithms, and I have been actively developing new algorithms for the next major release, MultiFloats.jl v3.

Directly calling PTX intrinsics is an interesting approach, but I do not want to introduce a dependency on CUDA.jl for users who are not using NVIDIA GPUs. Perhaps in the future we can have an extension package that loads automatically if MultiFloats.jl and CUDA.jl are simultaneously loaded. In light of this discussion, I'll reopen this issue to track work on CUDA compatibility. I have other priorities (finishing my PhD!) at the moment, but I'd like to get back to this in the future.
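For reference, such an extension might look like the sketch below (module and file names are hypothetical); it would be declared under the [weakdeps] and [extensions] sections of MultiFloats.jl's Project.toml and would load only when CUDA.jl is also in the environment (Julia 1.9+):

```julia
# Hypothetical ext/MultiFloatsCUDAExt.jl -- loaded automatically only when
# both MultiFloats.jl and CUDA.jl are present, keeping CUDA.jl out of the
# core package's hard dependencies.
module MultiFloatsCUDAExt

using MultiFloats, CUDA

# GPU-specific definitions (e.g. contraction-safe kernels or PTX-intrinsic
# based arithmetic) would live here.

end # module
```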
I've tried MultiFloats.jl on the GPU, but I'm getting a loss of precision compared to the CPU: the same computation gives me much less accurate results on the GPU, even though eps(Float64x8) is 5.9091063153828709e-126. What explains this? The order of iteration?
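A minimal sketch of this kind of CPU-vs-GPU comparison (assumed setup, not the reporter's original code): run the same elementwise expression on both devices and compare the discrepancy against eps(Float64x8).

```julia
using CUDA, MultiFloats

x  = Float64x8.(rand(10_000))   # exact Float64 values promoted to Float64x8
dx = CuArray(x)

cpu = sum(x .* x .+ x)          # host computation
gpu = sum(dx .* dx .+ dx)       # same expression compiled for the GPU

# Relative discrepancy between the two results; if the GPU arithmetic were
# faithful, this should be comparable to eps(Float64x8).
rel = Float64(abs(cpu - gpu) / abs(cpu))
@show rel eps(Float64x8)
```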