
replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST #3591

Merged — 31 commits merged into AMReX-Codes:development on Nov 7, 2023

Conversation

BenWibking
Contributor

@BenWibking BenWibking commented Oct 10, 2023

Summary

This adds the macros AMREX_IF_ON_DEVICE((code_for_device)) and AMREX_IF_ON_HOST((code_for_host)) that are compatible with single-pass host/device compilation (as used by nvc++ -cuda), as well as backward compatible with all other compilers.

This also replaces all uses of AMREX_DEVICE_COMPILE with these macros.

Fixes #3586.

Additional background

Single-pass compilation evaluates the preprocessor macros only once per source file. This means that preprocessor conditionals cannot be used to choose between host and device code. In particular, NVHPC with -cuda does not support __CUDA_ARCH__, instead requiring the use of the if target construct. This PR creates portable macros that work for either single-pass or two-pass compilation, but it requires restructuring any code that uses AMREX_DEVICE_COMPILE so that the code appears as a macro argument.

This PR will allow using NVHPC with -cuda as the unified host/device compiler for AMReX. In the future, single-pass compilers for other backends may be available, e.g., SYCL (https://dl.acm.org/doi/abs/10.1145/3585341.3585351).

AMReX can be configured to build with nvc++ -cuda using CMake:

cmake .. -DAMReX_GPU_BACKEND=CUDA -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER_ID=NVCXX -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_COMPILER_FORCED=ON -DCMAKE_CUDA_COMPILE_FEATURES=cuda_std_17 -DAMReX_GPU_RDC=OFF -DCMAKE_CXX_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DCMAKE_CUDA_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DAMReX_ENABLE_TESTS=ON -DCMAKE_CUDA_HOST_LINK_LAUNCHER=nvc++ -DCMAKE_CUDA_LINK_EXECUTABLE="<CMAKE_CUDA_HOST_LINK_LAUNCHER> <FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> <LINK_LIBRARIES>"

CMake hacks (https://github.com/NVIDIA/cub/blob/0fc3c3701632a4be906765b73be20a9ad0da603d/cmake/CubCompilerHacks.cmake) are tested with CMake 3.22.1 and NVHPC 23.5, 23.7, and 23.9 (earlier versions do not work). However, it currently fails to link the executables for the tests due to a compiler/linker bug (https://forums.developer.nvidia.com/t/nvc-cuda-fails-to-link-code-when-using-device-curand-functions/270401/5).

(Note that by default, nvcc preserves denormals, whereas nvc++ does not. Also, nvc++ generates relocatable device code by default, whereas nvcc does not.)

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@BenWibking BenWibking marked this pull request as draft October 10, 2023 01:33
@BenWibking BenWibking changed the title add macro AMREX_IF_ON_DEVICE [WIP] add macro AMREX_IF_ON_DEVICE Oct 10, 2023
@BenWibking
Contributor Author

BenWibking commented Oct 14, 2023

The OpenMP builds fail due to a quirk of how #pragma omp atomic is specified: the pragma must be followed by a single expression statement, not a compound statement. So

#pragma omp atomic update
        *sum += value;

is considered valid by GCC, whereas

#pragma omp atomic update
        { *sum += value; }

is considered invalid.

The above is the preprocessed output for the host compile from this code in AMReX_GpuAtomic.H:

#ifdef AMREX_USE_OMP
#pragma omp atomic update
#endif
        AMREX_IF_ON_HOST((
            *sum += value;
        ))

Strangely, replacing the above with:

#ifdef AMREX_USE_OMP
        AMREX_IF_ON_HOST((
#pragma omp atomic update
            *sum += value;
        ))
#else
        AMREX_IF_ON_HOST((*sum += value;))
#endif

still expands to:

#pragma omp atomic update
        { *sum += value; }

I'm not sure how to work around this.

@BenWibking
Contributor Author

This is the only portable workaround I could find:

namespace HostDevice::Atomic {

    template <class T>
    AMREX_FORCE_INLINE
    void Add_Host (T* const sum, T const value) noexcept
    {
#ifdef AMREX_USE_OMP
#pragma omp atomic update
#endif
        *sum += value;
    }

    template <class T>
    AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
    void Add (T* const sum, T const value) noexcept
    {
        AMREX_IF_ON_DEVICE((Gpu::Atomic::AddNoRet(sum,value);))
        AMREX_IF_ON_HOST((Add_Host(sum,value);))
    }

}

@BenWibking
Contributor Author

In order to work around the macro limitations, I had to replace the CLZ functions with, e.g.:

AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
int clz (std::uint64_t x) noexcept
{
#if defined(AMREX_USE_CUDA) // all supported cuda versions have __clz
    AMREX_IF_ON_DEVICE((return detail::clz_wrapper(detail::clz_tag{}, x);))
    AMREX_IF_ON_HOST((return clz_generic(x);))
#elif AMREX_HAS_BUILTIN_CLZ
    AMREX_IF_ON_HOST((return detail::builtin_clz_wrapper(detail::clz_tag{}, x);))
    AMREX_IF_ON_DEVICE((return clz_generic(x);))
#else
    return clz_generic(x);
#endif
}

@BenWibking BenWibking changed the title [WIP] add macro AMREX_IF_ON_DEVICE [WIP] replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST Oct 14, 2023
@BenWibking BenWibking changed the title [WIP] replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST Oct 14, 2023
@BenWibking BenWibking marked this pull request as ready for review October 14, 2023 23:20
@BenWibking BenWibking marked this pull request as draft October 15, 2023 16:20
@BenWibking
Contributor Author

BenWibking commented Oct 15, 2023

When building with NVHPC, I get several errors about missing return statements:

"amrex/Src/Base/AMReX_GpuRange.H", line 40: warning: missing return statement at end of non-void function "amrex::Gpu::range_detail::size(const amrex::Box &) noexcept" [implicit_return_from_non_void_function]
  }
  ^

However, these appear to be spurious, due to conditional constructs like this:

AMREX_GPU_HOST_DEVICE
AMREX_FORCE_INLINE Long size (Box const& b) noexcept
{
    AMREX_IF_ON_DEVICE((return b.numPts();))
    AMREX_IF_ON_HOST((
        amrex::ignore_unused(b);
        return 1;
    ))
} // AMReX_GpuRange.H:40

@BenWibking
Contributor Author

BenWibking commented Oct 15, 2023

When building the tests with NVHPC, every binary fails to link with multiple-definition errors for some cuRAND symbols (the errors are identical for every binary):

[ 66%] Linking CUDA executable 3d/Test_Amr_Advection_AmrCore_3d
nvlink error   : Multiple definition of 'precalc_xorwow_matrix' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'precalc_xorwow_offset_matrix' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM1' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM2' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM1SubSeq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM2SubSeq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM1Seq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM2Seq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of '__cr_lgamma_table' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink fatal   : merge_elf failed
pgacclnk: child process exit status 2: /mnt/ufs18/home-208/wibkingb/spack/opt/spack/linux-centos7-zen/gcc-6.4.0/nvhpc-23.9-xysfmihvcig2jao55qhodg36ybrsxkde/Linux_x86_64/23.9/compilers/bin/tools/nvdd
make[2]: *** [Tests/Amr/Advection_AmrCore/3d/Test_Amr_Advection_AmrCore_3d] Error 2
make[1]: *** [Tests/Amr/Advection_AmrCore/CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/all] Error 2

@ax3l do you have any idea how this is happening?

EDIT: For device cuRAND functions, it might be necessary to compile with RDC enabled: https://gitlab.com/nvidia/headers/cuda-individual/curand/-/blob/main/curand_poisson.h?ref_type=heads#L182. Otherwise, we run into this issue: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#device-code-in-libraries

@BenWibking BenWibking marked this pull request as ready for review October 15, 2023 17:33
@WeiqunZhang
Member

@BenWibking Thanks for the PR! I have done a first pass. Looks great overall. But I want to read the changes more carefully and do more testing on various architectures. Maybe we can target merging this at the beginning of next month.

@WeiqunZhang
Member

The multiple definition issue is an NVHPC compiler issue. We can ignore it for now. I will take one more pass within a couple of days before merging this.

Src/Base/AMReX_Math.H Outdated Show resolved Hide resolved
@@ -252,7 +252,7 @@ int clz (T x) noexcept;
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 int clz_generic (std::uint8_t x) noexcept
 {
-    constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
+    static constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
Contributor Author


For some reason, nvc++ won't compile it with static.

Member


Maybe you can file a bug report. I suspect it should be pretty easy to write a small reproducer.

Since we need to wait for the inline bug to be fixed anyway, we probably don't need to worry about this issue for now. If we need to work around this for nvc++, we could use `#if !defined(__NVHPC)`. If we don't use static, the compiler might waste stack space unnecessarily.

@WeiqunZhang WeiqunZhang merged commit d364631 into AMReX-Codes:development Nov 7, 2023
69 checks passed
guj pushed a commit to guj/amrex that referenced this pull request Dec 13, 2023
@BenWibking
Contributor Author

FYI- The nvhpc linker bug should be fixed in NVHPC 24.3: https://forums.developer.nvidia.com/t/nvc-cuda-fails-to-link-code-when-using-device-curand-functions/270401/7.

@BenWibking BenWibking deleted the replace-cuda-arch branch February 4, 2024 21:14
@BenWibking
Contributor Author

With the following patch:

diff --git a/Src/Base/AMReX_Algorithm.H b/Src/Base/AMReX_Algorithm.H
index 4718606b7..0fc26df05 100644
--- a/Src/Base/AMReX_Algorithm.H
+++ b/Src/Base/AMReX_Algorithm.H
@@ -289,7 +289,11 @@ int clz (T x) noexcept;
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 int clz_generic (std::uint8_t x) noexcept
 {
+#if !defined(__NVCOMPILER)
     static constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
+#else
+    constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
+#endif
     auto upper = x >> 4;
     auto lower = x & 0xF;
     return upper ? clz_lookup[upper] : 4 + clz_lookup[lower];
diff --git a/Src/Base/AMReX_INT.H b/Src/Base/AMReX_INT.H
index f54852cb6..4356c70f1 100644
--- a/Src/Base/AMReX_INT.H
+++ b/Src/Base/AMReX_INT.H
@@ -31,7 +31,7 @@ namespace amrex {
 }
 #endif

-#if (defined(__x86_64) || defined (__aarch64__)) && !defined(_WIN32) && (defined(__GNUC__) || defined(__clang__))
+#if (defined(__x86_64) || defined (__aarch64__)) && !defined(_WIN32) && (defined(__GNUC__) || defined(__clang__)) && !defined(__NVCOMPILER)

 #define AMREX_INT128_SUPPORTED 1

CMake configures successfully:

cmake .. -DAMReX_GPU_BACKEND=CUDA -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER_ID=NVCXX -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_COMPILER_FORCED=ON -DCMAKE_CUDA_COMPILE_FEATURES=cuda_std_17 -DAMReX_GPU_RDC=OFF -DCMAKE_CXX_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DCMAKE_CUDA_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DAMReX_ENABLE_TESTS=ON -DCMAKE_CUDA_HOST_LINK_LAUNCHER=nvc++ -DCMAKE_CUDA_LINK_EXECUTABLE="<CMAKE_CUDA_HOST_LINK_LAUNCHER> <FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> <LINK_LIBRARIES>"

Then NVHPC 24.3 crashes with the error:

[bwibking@dt-login02 Src]$ /u/bwibking/hpc_sdk/Linux_x86_64/24.3/compilers/bin/nvc++  -DAMREX_SPACEDIM=3 -I/u/bwibking/amrex/Src/Base -I/u/bwibking/amrex/Src/Base/Parser -I/u/bwibking/amrex/Src/Boundary -I/u/bwibking/amrex/Src/AmrCore -I/u/bwibking/amrex/Src/Amr -I/u/bwibking/amrex/Src/LinearSolvers -I/u/bwibking/amrex/Src/LinearSolvers/MLMG -I/u/bwibking/amrex/Src/LinearSolvers/OpenBC -I/u/bwibking/amrex/Src/Particle -I/u/bwibking/amrex/build -I/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/gcc-11.4.0/openmpi-4.1.5-ejhvyny/include -I/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/gcc-11.4.0/cuda-11.8.0-vfixfmc/include -cuda --gcc-toolchain=/sw/spack/deltas11-2023-03/apps/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.4.0-yycklku/bin/gcc -pthread  -c /u/bwibking/amrex/Src/Base/AMReX_BlockMutex.cpp -o CMakeFiles/amrex_3d.dir/Base/AMReX_BlockMutex.cpp.o
nvdd-Fatal-/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/gcc-11.4.0/cuda-11.8.0-vfixfmc/bin/ptxas TERMINATED by signal 11
NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (/u/bwibking/amrex/Src/Base/AMReX_BlockMutex.cpp: 14)
NVC++/x86-64 Linux 24.3-0: compilation aborted

WeiqunZhang pushed a commit that referenced this pull request Mar 14, 2024
## Summary
This works around limitations in the NVHPC device compiler by disabling
`int128` support and avoiding a static local device variable. This
enables experimental use of NVHPC 24.3+ as the unified host and device
compiler for CUDA.

## Additional background
#3591
Successfully merging this pull request may close these issues.

[CUDA] replace __CUDA_ARCH__ with portable NV_IF_TARGET macros