CUDA compilers insert extraneous FMAs, breaking MultiFloats.jl algorithms #23

haampie opened this issue Apr 25, 2021 · 11 comments

@haampie
Contributor

haampie commented Apr 25, 2021

I've tried MultiFloats.jl on the GPU, but I'm getting a loss of precision compared to the CPU:

using CUDA, MultiFloats
A = rand(Float64x8, 100, 100)
B = rand(Float64x8, 100, 100)
A * B - Array(CuArray(A) * CuArray(B))

Gives me

100×100 Matrix{MultiFloat{Float64, 8}}:
 -1.23827e-98    9.35263e-99  -8.83181e-99  …  -4.70324e-99   -1.3348e-98
 -1.98421e-99    8.20389e-99   1.67043e-98      1.45499e-98    2.32225e-98
 -2.77264e-99   -3.30951e-99   1.32426e-98     -1.09181e-98    7.84157e-100
  1.92544e-98    6.35776e-99  -8.85547e-99      1.29435e-98   -4.89252e-99
 -5.52038e-99    5.35901e-99  -3.705e-98        1.53947e-99    7.38954e-99
 -2.16904e-98    1.64505e-98  -1.16536e-98  …  -3.19036e-98    7.5397e-99
  6.72487e-98    6.07349e-99  -2.87359e-98      ...

but eps(Float64x8) is 5.9091063153828709e-126.

What explains this? The order of iteration?

@dzhang314
Owner

Thanks for the find, @haampie! This is very mysterious to me. I haven't played with Julia/CUDA interop before, so I'm unsure how Julia code gets compiled for GPU execution. The fact that most of the limbs are accurate is especially puzzling; if it were simply the case that CUDA's arithmetic/FMA operations are improperly rounded, then you would expect all of limbs 2-8 to be garbage. But the fact that limbs 1-6 are right, while limbs 7-8 are wrong, rules out all of my easy hypotheses for what could be going wrong.

I'll look into this the next time I work on MultiFloats.jl (which honestly might take a while... grad student life has me swamped these days)

@kunyuan

kunyuan commented Mar 7, 2022

Just to add another data point: I tried exactly the same code on my GPU (a 1080 Ti), and I am getting an accuracy of about 1e-114. Is this the expected accuracy level?

100×100 Matrix{MultiFloat{Float64, 8}}:
-1.14468e-115 2.54748e-115 -6.94563e-115 -4.96936e-115 2.13842e-115 8.64325e-115 1.6073e-115 … 3.4182e-115 6.56281e-115 3.55884e-115 4.6424e-115 -1.1596e-115 2.23816e-115
...

@dzhang314
Owner

Hey @kunyuan, thanks for your interest in MultiFloats.jl! No, this is not the expected accuracy, and I'm afraid to report I still don't understand what's going on with GPU MultiFloat calculations. I'll report back when I have time to take a look at this in detail.

@orkolorko

Hi @dzhang314, do you know CAMPARY?
It is a library whose idea is similar to your MultiFloats library, but made to work on the GPU.
I did a reimplementation in Julia some time ago, on the CPU, so it is quite similar to your library.
If you're interested, I can try to dust it off and compare it with MultiFloats...

@rguerrab

Is this still an issue in v2?

@dzhang314
Owner

dzhang314 commented Sep 26, 2024

@rguerrab I'm not sure about the current status of this issue, since I don't do any Julia GPU development. I originally wanted to investigate this myself, but after a few years I still haven't found the opportunity or the resources.

What I can tell you is that MultiFloats.jl does not contain any GPU-specific code -- it is written purely in terms of Float64 arithmetic operations (including FMA) that are specified by IEEE 754 to return identical results on any standards-conforming CPU or GPU. Therefore, any discrepancy between CPU and GPU indicates a fault in either the hardware or the compiler toolchain, not MultiFloats.jl itself. This was always true in MultiFloats.jl v1.0 and remains true in v2.0.

I'm going to close this issue until someone can demonstrate that this is a fault in MultiFloats.jl as opposed to, say, an overly-aggressive optimization pass in CUDA.jl. Note that the algorithms in MultiFloats.jl are all sensitive to rounding mode (must be round-to-nearest, ties-to-even) and contraction (rounding must occur exactly in the places where IEEE 754 specifies it, no more and no less). If you have an aggressive optimizing compiler that doesn't strictly preserve IEEE 754 semantics, then MultiFloats.jl will silently and catastrophically break.
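
As a concrete illustration, here is the textbook 2Sum error-free transform, shown in its standard formulation rather than as an excerpt from the package. Every intermediate rounding is load-bearing: under exact real arithmetic a1 == a and b1 == b, so an optimizer that reassociates or algebraically simplifies these operations folds the error term to zero.

# 2Sum: returns (s, e) with s + e == a + b exactly, where s == round(a + b).
# Correct only under strict IEEE 754 round-to-nearest semantics with no
# reassociation or contraction of the intermediate operations.
function two_sum(a::Float64, b::Float64)
    s  = a + b
    a1 = s - b                 # approximate recovery of a
    b1 = s - a1                # approximate recovery of b
    e  = (a - a1) + (b - b1)   # exact rounding error of s
    return s, e
end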

@n-khera

n-khera commented Dec 15, 2024

I've found this to originate from the fused multiply add optimizations done by CUDA.jl.
This can currently be disabled by running the following hack:

import LLVM
LLVM.clopts("-nvptx-fma-level=0")

After this the GPU and CPU computations coincide exactly.
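
For reference, a minimal end-to-end sketch of the workaround (the assumption that the flag must be set before CUDA.jl compiles any kernels in the session is mine):

import LLVM
LLVM.clopts("-nvptx-fma-level=0")  # ask the NVPTX backend not to form FMAs on its own

using CUDA, MultiFloats
A = rand(Float64x8, 100, 100)
B = rand(Float64x8, 100, 100)
A * B == Array(CuArray(A) * CuArray(B))  # expected: true, i.e. GPU and CPU coincide exactly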

@rguerrab

I've found this to originate from the fused multiply add optimizations done by CUDA.jl. This can currently be disabled by running the following hack:

import LLVM
LLVM.clopts("-nvptx-fma-level=0")

After this the GPU and CPU computations coincide exactly.

That's interesting! Presumably a single FMA is very slightly more accurate than a separately rounded multiply and add, so why is the CPU computation, with its extra rounding steps, more accurate? Is the compiler swapping in FMAs for non-fused operations while the algorithm assumes the error-free transform of the non-fused operations? For the sake of a little extra speed, and since CPUs have supported SIMD FMA for a while too, maybe writing things explicitly in terms of FMA and its rounding error (https://ieeexplore.ieee.org/document/5487502/) would be nice.

-R

@dzhang314
Owner

Hey @n-khera, thanks for making this discovery! Wow, it's extremely surprising to me that the LLVM CUDA backend automatically fuses separate adds and multiplies into FMA operations. But this answers a long-standing open question and confirms that the issue indeed has nothing to do with MultiFloats.jl but with a faulty compiler backend.

I see that the LLVM behavior matches the behavior of NVIDIA's own nvcc compiler, so it is unlikely that it will be changed. This leaves us in a real pickle. Setting LLVM.clopts("-nvptx-fma-level=0") by default would be an unacceptably intrusive global change that would propagate to other code unrelated to MultiFloats.jl. What we need is the ability to selectively disable automatic add/mul fusion in certain parts of the code but not others. This does not seem to exist.

@rguerrab: The error-free transformations used in MultiFloats.jl (mainly 2Sum and Fast2Mult) do not include any add/mul sequences that I could manually rewrite with FMAs. The CUDA compiler is reaching across function call boundaries to find nonlocal add/mul sequences to fuse. This is not something I can fix without implementing my own compiler infrastructure.
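
To illustrate the failure mode, here is a sketch in the standard Fast2Mult formulation (not an excerpt from the package), followed by a hypothetical caller-side pattern of the kind the backend can contract:

# Fast2Mult: p + e == a*b exactly, provided p is rounded exactly once as a plain multiply.
function fast2mult(a::Float64, b::Float64)
    p = a * b          # one rounding, by assumption
    e = fma(a, b, -p)  # this explicit FMA recovers exactly the rounding error of p
    return p, e
end

# After inlining, a caller-side sequence like
#   p, e = fast2mult(a, b)
#   s = p + c
# can be contracted by the NVPTX backend into s = fma(a, b, c). Then s is no longer
# round(p + c), so the error term produced by the subsequent 2Sum describes an
# addition that never happened, and the low limbs silently accumulate untracked error.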

However, I very much appreciate the suggestion, and the pointer to the ErrFma algorithm, which I had not seen before.

@orkolorko

orkolorko commented Dec 16, 2024

Maybe it is possible to call PTX intrinsics directly, as in JuliaGPU/CUDA.jl#2576.

Do you know the work of Valentina Popescu et al. at Lyon? They formally verified many multifloat algorithms.

This is her Ph.D. thesis... if you're interested, we can work on porting their implementation to MultiFloats.jl.

@dzhang314
Owner

@orkolorko Thanks for the pointer! I recently became aware of Popescu's work in the course of working on issue #42. Actually, I have now developed my own novel techniques for computer verification of MultiFloats.jl algorithms, and I have been actively developing new algorithms for the next major release of MultiFloats.jl v3.

Directly calling PTX intrinsics is an interesting approach, but I do not want to introduce a dependency on CUDA.jl for users who are not using NVIDIA GPUs. Perhaps in the future we can have an extension package that loads automatically when both MultiFloats.jl and CUDA.jl are loaded.

In light of this discussion, I'll reopen this issue to track work on CUDA compatibility. I have other priorities (finishing my PhD!) at the moment, but I'd like to get back to this in the future.

@dzhang314 dzhang314 reopened this Dec 16, 2024
@dzhang314 dzhang314 changed the title Some rounding error issue between CPU & GPU CUDA compilers insert extraneous FMAs, breaking MultiFloats.jl algorithms Dec 16, 2024