replace AMREX_DEVICE_COMPILE with AMREX_IF_ON_DEVICE and AMREX_IF_ON_HOST #3591
Conversation
The OpenMP builds fail due to a strange preprocessor quirk: one expansion is considered valid by GCC, whereas the other is considered invalid. The above is the preprocessed output for the host compile from this code in AMReX_GpuAtomic.H:

Strangely, replacing the above with:

still expands to:

I'm not sure how to work around this.
This is the only portable workaround I could find:
In order to work around the macro limitations, I had to replace the CLZ functions with, e.g.:
When building with NVHPC, I get several errors about missing return statements:
However, these appear to be spurious, due to conditional constructs like this:
When building the tests with NVHPC, there are multiple definition linker errors for every binary for some cuRand symbols (they are the same for every binary):
@ax3l do you have any idea how this is happening? EDIT: For device cuRAND functions, it might be necessary to compile with RDC enabled: https://gitlab.com/nvidia/headers/cuda-individual/curand/-/blob/main/curand_poisson.h?ref_type=heads#L182. Otherwise, we run into this issue: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#device-code-in-libraries
@BenWibking Thanks for the PR! I have done a first pass. Looks great overall. But I want to read the changes more carefully and do more testing on various architectures. Maybe we can target merging this at the beginning of next month.
The multiple definition issue is an NVHPC compiler issue. We can ignore it for now. I will take one more pass within a couple of days before merging this.
```diff
@@ -252,7 +252,7 @@ int clz (T x) noexcept;
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 int clz_generic (std::uint8_t x) noexcept
 {
-    constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
+    static constexpr int clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
```
For some reason, nvc++ won't compile it with `static`.
Maybe you can file a bug report. I suspect it should be pretty easy to write a small reproducer.

Since we need to wait for the inline bug to be fixed anyway, we probably don't need to worry about this issue for now. If we need to work around this for nvc++, we could use `#if !defined(__NVHPC)`. If we don't use `static`, the compiler might waste stack space unnecessarily.
…HOST (AMReX-Codes#3591)

## Summary

This adds the macros `AMREX_IF_ON_DEVICE((code_for_device))` and `AMREX_IF_ON_HOST((code_for_host))` that are compatible with single-pass host/device compilation (as used by `nvc++ -cuda`), as well as backward compatible with all other compilers.

This also replaces all uses of `AMREX_DEVICE_COMPILE` with these macros. Fixes AMReX-Codes#3586.

## Additional background

Single-pass compilation evaluates the preprocessor macros once for each source file. This means that preprocessor conditionals cannot be used to choose between host and device code. In particular, NVHPC with `-cuda` does not support `__CUDA_ARCH__`, instead requiring the use of the `if target` construct. This creates portable macros that work for either single-pass or two-pass compilation, but requires restructuring of any code that uses `AMREX_DEVICE_COMPILE` so that the code appears as a macro argument.

This PR will allow using NVHPC with `-cuda` as the unified host/device compiler for AMReX. In the future, single-pass compilers for other backends may be available, e.g., SYCL (https://dl.acm.org/doi/abs/10.1145/3585341.3585351).

AMReX can be configured to build with `nvc++ -cuda` using CMake:

```
cmake .. -DAMReX_GPU_BACKEND=CUDA -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER_ID=NVCXX -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CUDA_COMPILER_FORCED=ON -DCMAKE_CUDA_COMPILE_FEATURES=cuda_std_17 -DAMReX_GPU_RDC=OFF -DCMAKE_CXX_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DCMAKE_CUDA_FLAGS="-cuda --gcc-toolchain=$(which gcc)" -DAMReX_ENABLE_TESTS=ON -DCMAKE_CUDA_HOST_LINK_LAUNCHER=nvc++ -DCMAKE_CUDA_LINK_EXECUTABLE="<CMAKE_CUDA_HOST_LINK_LAUNCHER> <FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> <LINK_LIBRARIES>"
```

CMake hacks (https://github.com/NVIDIA/cub/blob/0fc3c3701632a4be906765b73be20a9ad0da603d/cmake/CubCompilerHacks.cmake) are tested with CMake 3.22.1 and NVHPC 23.5, 23.7, and 23.9 (earlier versions do not work).

However, it currently fails to link the executables for the tests due to a [compiler/linker bug](https://forums.developer.nvidia.com/t/nvc-cuda-fails-to-link-code-when-using-device-curand-functions/270401/5). (Note that by default, `nvcc` preserves denormals, whereas `nvc++` does not. Also, `nvc++` generates relocatable device code by default, whereas `nvcc` does not.)

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate

---

Co-authored-by: Weiqun Zhang <[email protected]>
FYI: the NVHPC linker bug should be fixed in NVHPC 24.3: https://forums.developer.nvidia.com/t/nvc-cuda-fails-to-link-code-when-using-device-curand-functions/270401/7.
With the following patch:

CMake configures successfully:

Then NVHPC 24.3 crashes with the error:
## Summary

This works around limitations in the NVHPC device compiler by disabling `int128` support and avoiding a static local device variable. This enables experimental use of NVHPC 24.3+ as the unified host and device compiler for CUDA.

## Additional background

#3591