CUDA compilers insert extraneous FMAs, breaking MultiFloats.jl algorithms #23

haampie opened this issue Apr 25, 2021 · 11 comments

@haampie
Contributor

haampie commented Apr 25, 2021

I've tried MultiFloats.jl on the GPU, but I'm getting a loss of precision compared to the CPU:

using CUDA, MultiFloats
A = rand(Float64x8, 100, 100)
B = rand(Float64x8, 100, 100)
A * B - Array(CuArray(A) * CuArray(B))

Gives me

100×100 Matrix{MultiFloat{Float64, 8}}:
 -1.23827e-98    9.35263e-99  -8.83181e-99  …  -4.70324e-99   -1.3348e-98
 -1.98421e-99    8.20389e-99   1.67043e-98      1.45499e-98    2.32225e-98
 -2.77264e-99   -3.30951e-99   1.32426e-98     -1.09181e-98    7.84157e-100
  1.92544e-98    6.35776e-99  -8.85547e-99      1.29435e-98   -4.89252e-99
 -5.52038e-99    5.35901e-99  -3.705e-98        1.53947e-99    7.38954e-99
 -2.16904e-98    1.64505e-98  -1.16536e-98  …  -3.19036e-98    7.5397e-99
  6.72487e-98    6.07349e-99  -2.87359e-98      ...

but eps(Float64x8) is 5.9091063153828709e-126.

What explains this? The order of iteration?

@dzhang314
Owner

Thanks for the find, @haampie! This is very mysterious to me. I haven't played with Julia/CUDA interop before, so I'm unsure how Julia code gets compiled for GPU execution. The fact that most of the limbs are accurate is especially puzzling; if it were simply the case that CUDA's arithmetic/FMA operations are improperly rounded, then you would expect all of limbs 2-8 to be garbage. But the fact that limbs 1-6 are right, while limbs 7-8 are wrong, rules out all of my easy hypotheses for what could be going wrong.

I'll look into this the next time I work on MultiFloats.jl (which honestly might take a while... grad student life has me swamped these days)

@kunyuan

kunyuan commented Mar 7, 2022

Just to add another data point: I tried exactly the same code on my GPU (a 1080 Ti), and I am getting an accuracy of about 1e-114. Is this the expected accuracy level?

100×100 Matrix{MultiFloat{Float64, 8}}:
-1.14468e-115 2.54748e-115 -6.94563e-115 -4.96936e-115 2.13842e-115 8.64325e-115 1.6073e-115 … 3.4182e-115 6.56281e-115 3.55884e-115 4.6424e-115 -1.1596e-115 2.23816e-115
...

@dzhang314
Owner

Hey @kunyuan, thanks for your interest in MultiFloats.jl! No, this is not the expected accuracy, and I'm afraid to report I still don't understand what's going on with GPU MultiFloat calculations. I'll report back when I have time to take a look at this in detail.

@orkolorko

Hi @dzhang314, do you know CAMPARY?
It is a library whose idea is similar to your MultiFloats library, but made to work on the GPU.
I did a reimplementation in Julia some time ago, on the CPU, so it is quite similar to your library.
If you're interested, I can try to dust it off and compare it with MultiFloats...

@rguerrab

Is this still an issue in v2?

@dzhang314
Owner

dzhang314 commented Sep 26, 2024

@rguerrab I'm not sure about the current status of this issue, since I don't do any Julia GPU development. I originally wanted to investigate this myself, but after a few years I still haven't found the opportunity or the resources.

What I can tell you is that MultiFloats.jl does not contain any GPU-specific code -- it is written purely in terms of Float64 arithmetic operations (including FMA) that are specified by IEEE 754 to return identical results on any standards-conforming CPU or GPU. Therefore, any discrepancy between CPU and GPU indicates a fault in either the hardware or the compiler toolchain, not MultiFloats.jl itself. This was always true in MultiFloats.jl v1.0 and remains true in v2.0.

I'm going to close this issue until someone can demonstrate that this is a fault in MultiFloats.jl as opposed to, say, an overly-aggressive optimization pass in CUDA.jl. Note that the algorithms in MultiFloats.jl are all sensitive to rounding mode (must be round-to-nearest, ties-to-even) and contraction (rounding must occur exactly in the places where IEEE 754 specifies it, no more and no less). If you have an aggressive optimizing compiler that doesn't strictly preserve IEEE 754 semantics, then MultiFloats.jl will silently and catastrophically break.
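
As a concrete illustration, here is the textbook 2Sum error-free transform, shown in its standard formulation rather than as an excerpt from the package. Every intermediate rounding is load-bearing: under exact real arithmetic a1 == a and b1 == b, so an optimizer that reassociates or algebraically simplifies these operations folds the error term to zero.

# 2Sum: returns (s, e) with s + e == a + b exactly, where s == round(a + b).
# Correct only under strict IEEE 754 round-to-nearest semantics with no
# reassociation or contraction of the intermediate operations.
function two_sum(a::Float64, b::Float64)
    s  = a + b
    a1 = s - b                 # approximate recovery of a
    b1 = s - a1                # approximate recovery of b
    e  = (a - a1) + (b - b1)   # exact rounding error of s
    return s, e
end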

@n-khera

n-khera commented Dec 15, 2024

I've found this to originate from the fused multiply add optimizations done by CUDA.jl.
This can currently be disabled by running the following hack:

import LLVM
LLVM.clopts("-nvptx-fma-level=0")

After this the GPU and CPU computations coincide exactly.
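
For reference, a minimal end-to-end sketch of the workaround (the assumption that the flag must be set before CUDA.jl compiles any kernels in the session is mine):

import LLVM
LLVM.clopts("-nvptx-fma-level=0")  # ask the NVPTX backend not to form FMAs on its own

using CUDA, MultiFloats
A = rand(Float64x8, 100, 100)
B = rand(Float64x8, 100, 100)
A * B == Array(CuArray(A) * CuArray(B))  # expected: true, i.e. GPU and CPU coincide exactly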

@rguerrab

I've found this to originate from the fused multiply add optimizations done by CUDA.jl. This can currently be disabled by running the following hack:

import LLVM
LLVM.clopts("-nvptx-fma-level=0")

After this the GPU and CPU computations coincide exactly.

That's interesting! Presumably a single FMA is very slightly more accurate than a separately rounded multiply and add, so why is the CPU computation, with its extra rounding steps, more accurate? Is the compiler swapping in FMAs for non-fused operations while the algorithm assumes the error-free transform of the non-fused operations? For the sake of a little extra speed, and since CPUs have supported SIMD FMA for a while too, maybe writing things explicitly in terms of FMA and its rounding error (https://ieeexplore.ieee.org/document/5487502/) would be nice.

-R

@dzhang314
Owner

Hey @n-khera, thanks for making this discovery! Wow, it's extremely surprising to me that the LLVM CUDA backend automatically fuses separate adds and multiplies into FMA operations. But this answers a long-standing open question and confirms that the issue indeed has nothing to do with MultiFloats.jl but with a faulty compiler backend.

I see that the LLVM behavior matches the behavior of NVIDIA's own nvcc compiler, so it is unlikely that it will be changed. This leaves us in a real pickle. Setting LLVM.clopts("-nvptx-fma-level=0") by default would be an unacceptably intrusive global change that would propagate to other code unrelated to MultiFloats.jl. What we need is the ability to selectively disable automatic add/mul fusion in certain parts of the code but not others. This does not seem to exist.

@rguerrab: The error-free transformations used in MultiFloats.jl (mainly 2Sum and Fast2Mult) do not include any add/mul sequences that I could manually rewrite with FMAs. The CUDA compiler is reaching across function call boundaries to find nonlocal add/mul sequences to fuse. This is not something I can fix without implementing my own compiler infrastructure.
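
To illustrate the failure mode, here is a sketch in the standard Fast2Mult formulation (not an excerpt from the package), followed by a hypothetical caller-side pattern of the kind the backend can contract:

# Fast2Mult: p + e == a*b exactly, provided p is rounded exactly once as a plain multiply.
function fast2mult(a::Float64, b::Float64)
    p = a * b          # one rounding, by assumption
    e = fma(a, b, -p)  # this explicit FMA recovers exactly the rounding error of p
    return p, e
end

# After inlining, a caller-side sequence like
#   p, e = fast2mult(a, b)
#   s = p + c
# can be contracted by the NVPTX backend into s = fma(a, b, c). Then s is no longer
# round(p + c), so the error term produced by the subsequent 2Sum describes an
# addition that never happened, and the low limbs silently accumulate untracked error.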

However, I very much appreciate the suggestion, and the pointer to the ErrFma algorithm, which I had not seen before.

@orkolorko

orkolorko commented Dec 16, 2024

Maybe it is possible to call PTX intrinsics directly, as in JuliaGPU/CUDA.jl#2576.

Do you know the work of Valentina Popescu et al. at Lyon? They formally verified many multifloat algorithms.

This is her Ph.D. thesis... if you're interested, we can work on porting their implementation to MultiFloats.jl.

@dzhang314
Owner

@orkolorko Thanks for the pointer! I recently became aware of Popescu's work in the course of working on issue #42. Actually, I have now developed my own novel techniques for computer verification of MultiFloats.jl algorithms, and I have been actively developing new algorithms for the next major release of MultiFloats.jl v3.

Directly calling PTX intrinsics is an interesting approach, but I do not want to introduce a dependency on CUDA.jl for users who are not using NVIDIA GPUs. Perhaps in the future we can have an extension package that loads automatically when both MultiFloats.jl and CUDA.jl are loaded.

In light of this discussion, I'll reopen this issue to track work on CUDA compatibility. I have other priorities (finishing my PhD!) at the moment, but I'd like to get back to this in the future.

@dzhang314 dzhang314 reopened this Dec 16, 2024
@dzhang314 dzhang314 changed the title Some rounding error issue between CPU & GPU CUDA compilers insert extraneous FMAs, breaking MultiFloats.jl algorithms Dec 16, 2024