
replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST #3591

Merged — 31 commits merged into AMReX-Codes:development on Nov 7, 2023

Conversation

BenWibking
Contributor

@BenWibking BenWibking commented Oct 10, 2023

Summary

This adds the macros AMREX_IF_ON_DEVICE((code_for_device)) and AMREX_IF_ON_HOST((code_for_host)) that are compatible with single-pass host/device compilation (as used by nvc++ -cuda), as well as backward compatible with all other compilers.

This also replaces all uses of AMREX_DEVICE_COMPILE with these macros.

Fixes #3586.

Additional background

Single-pass compilation evaluates the preprocessor macros only once per source file. This means that preprocessor conditionals cannot be used to choose between host and device code. In particular, NVHPC with -cuda does not support __CUDA_ARCH__, instead requiring the use of the if target construct. This PR creates portable macros that work for either single-pass or two-pass compilation, but it requires restructuring any code that uses AMREX_DEVICE_COMPILE so that the code appears as a macro argument.

This PR will allow using NVHPC with -cuda as the unified host/device compiler for AMReX. In the future, single-pass compilers for other backends may be available, e.g., SYCL (https://dl.acm.org/doi/abs/10.1145/3585341.3585351).

AMReX can be configured to build with nvc++ -cuda using CMake:

cmake .. -DAMReX_GPU_BACKEND=CUDA -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER_ID=NVCXX -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_COMPILER_FORCED=ON -DCMAKE_CUDA_COMPILE_FEATURES=cuda_std_17 -DAMReX_GPU_RDC=OFF -DCMAKE_CXX_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DCMAKE_CUDA_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DAMReX_ENABLE_TESTS=ON -DCMAKE_CUDA_HOST_LINK_LAUNCHER=nvc++ -DCMAKE_CUDA_LINK_EXECUTABLE="<CMAKE_CUDA_HOST_LINK_LAUNCHER> <FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> <LINK_LIBRARIES>"

CMake hacks (https://github.com/NVIDIA/cub/blob/0fc3c3701632a4be906765b73be20a9ad0da603d/cmake/CubCompilerHacks.cmake) are tested with CMake 3.22.1 and NVHPC 23.5, 23.7, and 23.9 (earlier versions do not work). However, it currently fails to link the executables for the tests due to a compiler/linker bug (https://forums.developer.nvidia.com/t/nvc-cuda-fails-to-link-code-when-using-device-curand-functions/270401/5).

(Note that by default, nvcc preserves denormals, whereas nvc++ does not. Also, nvc++ generates relocatable device code by default, whereas nvcc does not.)

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@BenWibking BenWibking marked this pull request as draft October 10, 2023 01:33
@BenWibking BenWibking changed the title add macro AMREX_IF_ON_DEVICE [WIP] add macro AMREX_IF_ON_DEVICE Oct 10, 2023
@BenWibking
Contributor Author

BenWibking commented Oct 14, 2023

The OpenMP builds fail due to a quirk of how #pragma omp atomic is specified: the pragma must be followed by a single expression statement, not a compound statement. So

#pragma omp atomic update
        *sum += value;

is considered valid by GCC, whereas

#pragma omp atomic update
        { *sum += value; }

is considered invalid.

The above is the preprocessed output for the host compile from this code in AMReX_GpuAtomic.H:

#ifdef AMREX_USE_OMP
#pragma omp atomic update
#endif
        AMREX_IF_ON_HOST((
            *sum += value;
        ))

Strangely, replacing the above with:

#ifdef AMREX_USE_OMP
        AMREX_IF_ON_HOST((
#pragma omp atomic update
            *sum += value;
        ))
#else
        AMREX_IF_ON_HOST((*sum += value;))
#endif

still expands to:

#pragma omp atomic update
        { *sum += value; }

I'm not sure how to work around this.

@BenWibking
Contributor Author

This is the only portable workaround I could find:

namespace HostDevice::Atomic {

    template <class T>
    AMREX_FORCE_INLINE
    void Add_Host (T* const sum, T const value) noexcept
    {
#ifdef AMREX_USE_OMP
#pragma omp atomic update
#endif
        *sum += value;
    }

    template <class T>
    AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
    void Add (T* const sum, T const value) noexcept
    {
        AMREX_IF_ON_DEVICE((Gpu::Atomic::AddNoRet(sum,value);))
        AMREX_IF_ON_HOST((Add_Host(sum,value);))
    }

}

@BenWibking
Contributor Author

In order to work around the macro limitations, I had to replace the CLZ functions with, e.g.:

AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
int clz (std::uint64_t x) noexcept
{
#if defined(AMREX_USE_CUDA) // all supported cuda versions have __clz
    AMREX_IF_ON_DEVICE((return detail::clz_wrapper(detail::clz_tag{}, x);))
    AMREX_IF_ON_HOST((return clz_generic(x);))
#elif AMREX_HAS_BUILTIN_CLZ
    AMREX_IF_ON_HOST((return detail::builtin_clz_wrapper(detail::clz_tag{}, x);))
    AMREX_IF_ON_DEVICE((return clz_generic(x);))
#else
    return clz_generic(x);
#endif
}

@BenWibking BenWibking changed the title [WIP] add macro AMREX_IF_ON_DEVICE [WIP] replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST Oct 14, 2023
@BenWibking BenWibking changed the title [WIP] replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST Oct 14, 2023
@BenWibking BenWibking marked this pull request as ready for review October 14, 2023 23:20
@BenWibking BenWibking marked this pull request as draft October 15, 2023 16:20
@BenWibking
Contributor Author

BenWibking commented Oct 15, 2023

When building with NVHPC, I get several errors about missing return statements:

"amrex/Src/Base/AMReX_GpuRange.H", line 40: warning: missing return statement at end of non-void function "amrex::Gpu::range_detail::size(const amrex::Box &) noexcept" [implicit_return_from_non_void_function]
  }
  ^

However, these appear to be spurious, due to conditional constructs like this:

AMREX_GPU_HOST_DEVICE
AMREX_FORCE_INLINE Long size (Box const& b) noexcept
{
    AMREX_IF_ON_DEVICE((return b.numPts();))
    AMREX_IF_ON_HOST((
        amrex::ignore_unused(b);
        return 1;
    ))
} // AMReX_GpuRange.H:40

@BenWibking
Contributor Author

BenWibking commented Oct 15, 2023

When building the tests with NVHPC, every binary fails to link with multiple-definition errors for some cuRAND symbols (the errors are identical for every binary):

[ 66%] Linking CUDA executable 3d/Test_Amr_Advection_AmrCore_3d
nvlink error   : Multiple definition of 'precalc_xorwow_matrix' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'precalc_xorwow_offset_matrix' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM1' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM2' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM1SubSeq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM2SubSeq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM1Seq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of 'mrg32k3aM2Seq' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink error   : Multiple definition of '__cr_lgamma_table' in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAtLevel.cpp.o', first defined in 'CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/Source/AdvancePhiAllLevels.cpp.o'
nvlink fatal   : merge_elf failed
pgacclnk: child process exit status 2: /mnt/ufs18/home-208/wibkingb/spack/opt/spack/linux-centos7-zen/gcc-6.4.0/nvhpc-23.9-xysfmihvcig2jao55qhodg36ybrsxkde/Linux_x86_64/23.9/compilers/bin/tools/nvdd
make[2]: *** [Tests/Amr/Advection_AmrCore/3d/Test_Amr_Advection_AmrCore_3d] Error 2
make[1]: *** [Tests/Amr/Advection_AmrCore/CMakeFiles/Test_Amr_Advection_AmrCore_3d.dir/all] Error 2

@ax3l do you have any idea how this is happening?

EDIT: For device cuRAND functions, it might be necessary to compile with RDC enabled: https://gitlab.com/nvidia/headers/cuda-individual/curand/-/blob/main/curand_poisson.h?ref_type=heads#L182. Otherwise, we run into this issue: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#device-code-in-libraries

@BenWibking BenWibking marked this pull request as ready for review October 15, 2023 17:33
@WeiqunZhang
Member

@BenWibking Thanks for the PR! I have done a first pass. Looks great overall. But I want to read the changes more carefully and do more testing on various architectures. Maybe we can target merging this at the beginning of next month.

@WeiqunZhang
Member

The multiple definition issue is an NVHPC compiler issue. We can ignore it for now. I will take one more pass within a couple of days before merging this.

Src/Base/AMReX_Math.H Outdated Show resolved Hide resolved
@@ -252,7 +252,7 @@ int clz (T x) noexcept;
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 int clz_generic (std::uint8_t x) noexcept
 {
-    constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
+    static constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
Contributor Author


For some reason, nvc++ won't compile it with static.

Member


Maybe you can file a bug report. I suspect it should be pretty easy to write a small reproducer.

Since we need to wait for the inline bug to be fixed anyway, we probably don't need to worry about this issue for now. If we need to work around this for nvc++, we could use `#if !defined(__NVHPC)`. If we don't use static, the compiler might waste stack space unnecessarily.

@WeiqunZhang WeiqunZhang merged commit d364631 into AMReX-Codes:development Nov 7, 2023
69 checks passed
guj pushed a commit to guj/amrex that referenced this pull request Dec 13, 2023
@BenWibking
Contributor Author

FYI- The nvhpc linker bug should be fixed in NVHPC 24.3: https://forums.developer.nvidia.com/t/nvc-cuda-fails-to-link-code-when-using-device-curand-functions/270401/7.

@BenWibking BenWibking deleted the replace-cuda-arch branch February 4, 2024 21:14
@BenWibking
Contributor Author

With the following patch:

diff --git a/Src/Base/AMReX_Algorithm.H b/Src/Base/AMReX_Algorithm.H
index 4718606b7..0fc26df05 100644
--- a/Src/Base/AMReX_Algorithm.H
+++ b/Src/Base/AMReX_Algorithm.H
@@ -289,7 +289,11 @@ int clz (T x) noexcept;
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 int clz_generic (std::uint8_t x) noexcept
 {
+#if !defined(__NVCOMPILER)
     static constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
+#else
+    constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
+#endif
     auto upper = x >> 4;
     auto lower = x & 0xF;
     return upper ? clz_lookup[upper] : 4 + clz_lookup[lower];
diff --git a/Src/Base/AMReX_INT.H b/Src/Base/AMReX_INT.H
index f54852cb6..4356c70f1 100644
--- a/Src/Base/AMReX_INT.H
+++ b/Src/Base/AMReX_INT.H
@@ -31,7 +31,7 @@ namespace amrex {
 }
 #endif

-#if (defined(__x86_64) || defined (__aarch64__)) && !defined(_WIN32) && (defined(__GNUC__) || defined(__clang__))
+#if (defined(__x86_64) || defined (__aarch64__)) && !defined(_WIN32) && (defined(__GNUC__) || defined(__clang__)) && !defined(__NVCOMPILER)

 #define AMREX_INT128_SUPPORTED 1

CMake configures successfully:

cmake .. -DAMReX_GPU_BACKEND=CUDA -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER_ID=NVCXX -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_COMPILER_FORCED=ON -DCMAKE_CUDA_COMPILE_FEATURES=cuda_std_17 -DAMReX_GPU_RDC=OFF -DCMAKE_CXX_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DCMAKE_CUDA_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DAMReX_ENABLE_TESTS=ON -DCMAKE_CUDA_HOST_LINK_LAUNCHER=nvc++ -DCMAKE_CUDA_LINK_EXECUTABLE="<CMAKE_CUDA_HOST_LINK_LAUNCHER> <FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> <LINK_LIBRARIES>"

Then NVHPC 24.3 crashes with the error:

[bwibking@dt-login02 Src]$ /u/bwibking/hpc_sdk/Linux_x86_64/24.3/compilers/bin/nvc++  -DAMREX_SPACEDIM=3 -I/u/bwibking/amrex/Src/Base -I/u/bwibking/amrex/Src/Base/Parser -I/u/bwibking/amrex/Src/Boundary -I/u/bwibking/amrex/Src/AmrCore -I/u/bwibking/amrex/Src/Amr -I/u/bwibking/amrex/Src/LinearSolvers -I/u/bwibking/amrex/Src/LinearSolvers/MLMG -I/u/bwibking/amrex/Src/LinearSolvers/OpenBC -I/u/bwibking/amrex/Src/Particle -I/u/bwibking/amrex/build -I/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/gcc-11.4.0/openmpi-4.1.5-ejhvyny/include -I/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/gcc-11.4.0/cuda-11.8.0-vfixfmc/include -cuda --gcc-toolchain=/sw/spack/deltas11-2023-03/apps/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.4.0-yycklku/bin/gcc -pthread  -c /u/bwibking/amrex/Src/Base/AMReX_BlockMutex.cpp -o CMakeFiles/amrex_3d.dir/Base/AMReX_BlockMutex.cpp.o
nvdd-Fatal-/sw/spack/deltas11-2023-03/apps/linux-rhel8-zen3/gcc-11.4.0/cuda-11.8.0-vfixfmc/bin/ptxas TERMINATED by signal 11
NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (/u/bwibking/amrex/Src/Base/AMReX_BlockMutex.cpp: 14)
NVC++/x86-64 Linux 24.3-0: compilation aborted

WeiqunZhang pushed a commit that referenced this pull request Mar 14, 2024
## Summary
This works around limitations in the NVHPC device compiler by disabling
`int128` support and avoiding a static local device variable. This
enables experimental use of NVHPC 24.3+ as the unified host and device
compiler for CUDA.

## Additional background
#3591
Successfully merging this pull request may close these issues.

[CUDA] replace __CUDA_ARCH__ with portable NV_IF_TARGET macros