Use long integer in GPU kernels #3742

WeiqunZhang · 2024-02-01T18:48:21Z

In the current implementation of ParallelFor, we use int for the linear cell index in the flattened one-dimensional view. This limits the size of a box to fewer than 2^30 cells (about half of INT_MAX); the factor of one half comes from the grid-stride loop. This limitation was not a serious issue until recently because GPUs did not have that much memory. However, the total memory on the latest GPUs has grown considerably, and users have now reported issues.
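The description does not spell out the factor of one half, so the following is only a plausible reading. In a grid-stride loop of the form `for (int i = start; i < n; i += stride)`, the increment `i += stride` is evaluated before the next `i < n` test, so `i + stride` must still fit in an int. If the stride (the total number of launched threads) can be about as large as `n`, then `n` has to stay near 2^30. The helper below is a hypothetical illustration, not an AMReX function.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of AMReX): checks, using 64-bit arithmetic,
// whether a grid-stride loop with an `int` index stays in range for `ncells`
// cells and a stride of `stride` threads.
inline bool int_grid_stride_is_safe(std::int64_t ncells, std::int64_t stride) {
    const std::int64_t int_max = std::numeric_limits<int>::max();
    // The largest index that passes `i < ncells` is ncells - 1, and the loop
    // still computes `i += stride` before the next comparison.
    return ncells <= int_max && (ncells - 1) + stride <= int_max;
}
```

Under this reading, 2^30 cells with a stride of 2^30 threads is the largest case that stays in range, and one more cell and thread overflows. A 64-bit index removes the constraint for any box that fits in GPU memory.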

In this PR, we start using std::uint64_t as the linear cell index. One drawback of std::uint64_t is that 64-bit integer division is very expensive. Fortunately, we are able to "steal" the fast division code from https://github.com/NVIDIA/cutlass. Streaming tests show very good performance. On A100, the FArrayBox version of the triad test achieves 1.76 TB/s, the same rate as the much simpler 1D vector version. In fact, it is slightly faster than the 1.72 TB/s achieved by the current version in the development branch.
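To illustrate what fast division by a fixed divisor can look like, here is a minimal sketch of the classic multiply-and-shift scheme for division by an invariant integer (Granlund and Montgomery). It is not the CUTLASS code and not necessarily what AMReX adopted. The names `FastDivU64Sketch` and `u128_sketch` are hypothetical, and the sketch assumes a compiler that provides `unsigned __int128` (a GCC/Clang extension).

```cpp
#include <cassert>
#include <cstdint>

// 128-bit unsigned type: a compiler extension, assumed to be available here.
__extension__ typedef unsigned __int128 u128_sketch;

// Hypothetical helper: precomputes a multiplier and two shifts for a fixed
// divisor so that each later division costs one high multiply plus shifts.
struct FastDivU64Sketch {
    std::uint64_t divisor;
    std::uint64_t magic;
    unsigned sh1;
    unsigned sh2;

    explicit FastDivU64Sketch(std::uint64_t d)
        : divisor(d), magic(0), sh1(0), sh2(0) {
        assert(d != 0);
        unsigned l = 0;  // l = ceil(log2(d))
        while (l < 64 && (u128_sketch(1) << l) < d) { ++l; }
        const u128_sketch two_l = u128_sketch(1) << l;
        // magic = floor(2^64 * (2^l - d) / d) + 1, which fits in 64 bits.
        magic = static_cast<std::uint64_t>(((two_l - d) << 64) / d + 1);
        sh1 = (l < 1) ? l : 1;      // min(l, 1)
        sh2 = (l > 1) ? l - 1 : 0;  // max(l - 1, 0)
    }

    // Quotient n / divisor without a hardware divide.
    std::uint64_t div(std::uint64_t n) const {
        // High 64 bits of the 128-bit product magic * n.
        const std::uint64_t t =
            static_cast<std::uint64_t>((u128_sketch(magic) * n) >> 64);
        return (t + ((n - t) >> sh1)) >> sh2;
    }

    // Remainder n % divisor, derived from the quotient.
    std::uint64_t mod(std::uint64_t n) const { return n - div(n) * divisor; }
};
```

The expensive division happens once, in the constructor on the host; a kernel that divides many indices by the same box length would then call only `div` and `mod`.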

We have not made all kernel launches in AMReX safe for large sizes. Nevertheless, this PR is the first step, and it will be followed up by more PRs.

ax3l · 2024-02-08T03:41:11Z

Src/Base/AMReX_INT.H

+typedef unsigned __int128 amrex_uint128_t; // NOLINT(modernize-use-using)
+typedef          __int128 amrex_int128_t;  // NOLINT(modernize-use-using)


NVHPC seems unhappy about long ints:
ECP-WarpX/WarpX#4678

-- The C compiler identification is NVHPC 21.11.0
-- The CXX compiler identification is NVHPC 21.11.0
-- Check for working CUDA compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/21.11/compilers/bin/nvcc - skipped

"/home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_INT.H", line 43: error: expected a ";"
  typedef unsigned __int128 amrex_uint128_t;
  ^
"/home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_INT.H", line 48: error: identifier "amrex_uint128_t" is undefined
  using UInt128_t = amrex_uint128_t;
  ^
...
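The errors suggest that this NVHPC version does not accept the `__int128` keyword. A minimal sketch of one possible guard follows. It assumes the compiler defines `__SIZEOF_INT128__` when the type exists (GCC and Clang do; whether a given NVHPC release does has not been checked here). The names `SKETCH_HAS_INT128`, `sketch_uint128_t`, `mulhi64_portable`, and `mulhi64_sketch` are hypothetical, and this is not presented as the fix adopted in AMReX.

```cpp
#include <cstdint>

// Only declare the 128-bit typedef when the compiler advertises the type.
#if defined(__SIZEOF_INT128__)
#  define SKETCH_HAS_INT128 1
__extension__ typedef unsigned __int128 sketch_uint128_t;
#else
#  define SKETCH_HAS_INT128 0
#endif

// Portable fallback: high 64 bits of a * b built from 32-bit halves.
inline std::uint64_t mulhi64_portable(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask = 0xFFFFFFFFu;
    const std::uint64_t a_lo = a & mask, a_hi = a >> 32;
    const std::uint64_t b_lo = b & mask, b_hi = b >> 32;
    const std::uint64_t p0 = a_lo * b_lo;
    const std::uint64_t p1 = a_lo * b_hi;
    const std::uint64_t p2 = a_hi * b_lo;
    const std::uint64_t p3 = a_hi * b_hi;
    // Carry out of the middle 32-bit column of the schoolbook product.
    const std::uint64_t carry = ((p0 >> 32) + (p1 & mask) + (p2 & mask)) >> 32;
    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
}

// Uses the native 128-bit type when present, the fallback otherwise.
inline std::uint64_t mulhi64_sketch(std::uint64_t a, std::uint64_t b) {
#if SKETCH_HAS_INT128
    return static_cast<std::uint64_t>((sketch_uint128_t(a) * b) >> 64);
#else
    return mulhi64_portable(a, b);
#endif
}
```

With a guard like this, code that only needs the high half of a 64-bit product can still compile where `__int128` is missing, at the cost of a few extra multiplies.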

This is a continuation of the changes in #3742 that make AMReX ready for big kernels. We also store the number of points in BoxIndexer now because we always need that number in GPU kernels.
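As a rough illustration of an indexer that carries the point count, here is a minimal sketch that maps a 64-bit linear cell index back to (i, j, k). `BoxIndexerSketch` and `CellSketch` are hypothetical names and this is not the AMReX `BoxIndexer`. It uses plain `/` for clarity, which is the expensive 64-bit division mentioned in #3742.

```cpp
#include <cstdint>

// Hypothetical cell coordinates.
struct CellSketch {
    int i, j, k;
};

// Hypothetical indexer for a box with lower corner (ilo, jlo, klo) and
// side lengths (lx, ly, lz), all lengths assumed positive.
struct BoxIndexerSketch {
    std::uint64_t npts;  // stored so that kernels do not recompute it
    std::uint64_t nx;    // cells per row
    std::uint64_t nxy;   // cells per k-plane
    int ilo, jlo, klo;

    BoxIndexerSketch(int ilo_, int jlo_, int klo_, int lx, int ly, int lz)
        : npts(static_cast<std::uint64_t>(lx) * static_cast<std::uint64_t>(ly)
               * static_cast<std::uint64_t>(lz)),
          nx(static_cast<std::uint64_t>(lx)),
          nxy(static_cast<std::uint64_t>(lx) * static_cast<std::uint64_t>(ly)),
          ilo(ilo_), jlo(jlo_), klo(klo_) {}

    // Converts a linear index in [0, npts) to cell coordinates, with i
    // varying fastest.
    CellSketch operator()(std::uint64_t icell) const {
        const std::uint64_t k = icell / nxy;
        const std::uint64_t rem = icell - k * nxy;
        const std::uint64_t j = rem / nx;
        const std::uint64_t i = rem - j * nx;
        return CellSketch{static_cast<int>(i) + ilo,
                          static_cast<int>(j) + jlo,
                          static_cast<int>(k) + klo};
    }
};
```

A kernel would compare its linear index against `npts` and then call the indexer to recover the cell, so keeping `npts` inside the object avoids passing it separately.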

It is reported in #3759 that ROCm 5.3.0-5.7.1 fails at the link stage since #3742. Replacing `-g` with `-gline-tables-only -fdebug-info-for-profiling` solves the issue. Note that we also use these two arguments for Intel SYCL compilers. No changes are made to CMake because no debug info is added in the Release build type.

WeiqunZhang force-pushed the gpu_kernel_longint branch 4 times, most recently from 147c745 to d4bb76a, February 3, 2024 05:55

Use long integer in GPU kernels

d87bfbb

This can avoid integer overflow when box size is very big (e.g., more than 2^30 cells).

WeiqunZhang force-pushed the gpu_kernel_longint branch from d4bb76a to d87bfbb, February 3, 2024 17:51

WeiqunZhang marked this pull request as ready for review February 3, 2024 19:49

WeiqunZhang requested a review from atmyers February 3, 2024 19:50

atmyers approved these changes Feb 5, 2024

WeiqunZhang merged commit e46bd91 into AMReX-Codes:development Feb 5, 2024
69 checks passed

WeiqunZhang deleted the gpu_kernel_longint branch February 5, 2024 22:57

WeiqunZhang mentioned this pull request Feb 6, 2024

Box::numPts() returns 0 for empty boxes #3747

Merged

WeiqunZhang added a commit that referenced this pull request Feb 7, 2024

Box::numPts() returns 0 for empty boxes (#3747)

928a485

This is now necessary because the returned long int is converted to std::uint64_t in ParallelFor. This is a follow-up on #3742.
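To see why the conversion matters, consider a minimal sketch that assumes the point count is an unchecked product of side lengths (which may not be how AMReX computed it). `BoxSketch`, `num_pts_unchecked`, and `num_pts_checked` are hypothetical stand-ins, not the AMReX Box API. For an empty box with `hi < lo` in one direction, the unchecked product can be negative, and converting it to std::uint64_t yields a huge count.

```cpp
#include <cstdint>

// Hypothetical box: inclusive lower and upper corners.
struct BoxSketch {
    int lo[3];
    int hi[3];
};

// Product of side lengths with no emptiness check; can be negative.
inline long long num_pts_unchecked(const BoxSketch& b) {
    long long n = 1;
    for (int d = 0; d < 3; ++d) {
        n *= static_cast<long long>(b.hi[d]) - b.lo[d] + 1;
    }
    return n;
}

// Returns 0 for an empty box, so that a later conversion to
// std::uint64_t cannot produce a huge iteration count.
inline long long num_pts_checked(const BoxSketch& b) {
    for (int d = 0; d < 3; ++d) {
        if (b.hi[d] < b.lo[d]) { return 0; }
    }
    return num_pts_unchecked(b);
}
```

For a box with corners (0,0,0) and (-2,4,4), the unchecked product is -25, which becomes 2^64 - 25 after conversion to std::uint64_t, while the checked version returns 0.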

ax3l reviewed Feb 8, 2024

ax3l mentioned this pull request Feb 8, 2024

CI: NVHPC 24.1 ECP-WarpX/WarpX#4679

Merged

baperry2 mentioned this pull request Feb 14, 2024

#3742 leads to HIP compile issues #3759

Closed

WeiqunZhang mentioned this pull request Feb 14, 2024

Adjust debug info argument for HIP compiler #3761

Merged

WeiqunZhang mentioned this pull request Feb 15, 2024

Make MFParallelFor safer from int overflow #3768

Merged
