Use long integer in GPU kernels #3742

Merged

Conversation

WeiqunZhang
@WeiqunZhang WeiqunZhang commented Feb 1, 2024

In the current implementation of ParallelFor, we use int for the linear cell index in the flattened one-dimensional view. This limits the size of a box to fewer than 2^30 cells (about half of INT_MAX). The factor of one half comes from the grid-stride loop. This limitation was not a serious issue until recently because GPUs did not have that much memory. However, memory capacity on the latest GPUs has grown considerably, and users have reported issues.
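
To make the factor of one half concrete, here is a minimal host-side sketch. It is not AMReX code, and the function name `int_index_can_overflow` is hypothetical. It assumes a grid-stride loop of the form `for (int i = tid; i < n; i += stride)`, in which the increment is evaluated once more after the last valid cell, so the index can reach nearly `n + stride` before the loop test fails.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: reports whether a signed 32-bit loop index could
// overflow in a grid-stride loop over `ncells` cells with the given stride
// (stride = threads per block * number of blocks).
inline bool int_index_can_overflow(std::int64_t ncells, std::int64_t stride)
{
    assert(ncells >= 0 && stride > 0);
    // The largest valid index is ncells - 1. A thread that processed it still
    // executes `i += stride` once before the `i < ncells` test ends the loop.
    const std::int64_t largest_value_reached = (ncells - 1) + stride;
    return largest_value_reached > std::numeric_limits<int>::max();
}
```

If the launch uses about as many threads as there are cells, the stride is close to `ncells`, so the index can approach `2 * ncells`. That is where a limit near 2^30 cells, rather than 2^31, would come from under these assumptions.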

In this PR, we start using std::uint64_t as the linear cell index. One drawback of std::uint64_t is that 64-bit integer division is very expensive. Fortunately, we are able to "steal" the fast division code from https://github.com/NVIDIA/cutlass. Streaming tests show very good performance. On A100, the FArrayBox version of the triad test achieves 1.76 TB/s, which is the same as the rate of the much simpler 1D vector version. In fact, it is slightly faster than the 1.72 TB/s achieved by the current version in the development branch.
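
As an illustration of the general idea behind fast division by a loop-invariant divisor, here is a minimal sketch. It is not the CUTLASS or AMReX code; it uses the classic multiply-and-shift scheme of Granlund and Montgomery, and the names `FastDivU64`, `mulhi64`, and `frac_times_2_64` are hypothetical. The expensive work is done once on the host when the divisor is known, and each later division costs one high multiply plus a few shifts and adds.

```cpp
#include <cassert>
#include <cstdint>

// High 64 bits of the 128-bit product a*b, built from 32-bit pieces.
inline std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;
    const std::uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    const std::uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    const std::uint64_t mid = (p0 >> 32) + (p1 & 0xffffffffu) + (p2 & 0xffffffffu);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

// floor(x * 2^64 / d) for x < d, by shift-and-subtract long division.
inline std::uint64_t frac_times_2_64(std::uint64_t x, std::uint64_t d)
{
    std::uint64_t rem = x, q = 0;
    for (int bit = 0; bit < 64; ++bit) {
        const bool carry = (rem >> 63) != 0; // bit lost by the shift below
        rem <<= 1;
        q <<= 1;
        if (carry || rem >= d) { rem -= d; q |= 1; }
    }
    return q;
}

// Hypothetical precomputed divider for a fixed, nonzero 64-bit divisor.
struct FastDivU64
{
    std::uint64_t multiplier;
    int sh1, sh2;

    explicit FastDivU64(std::uint64_t d)
    {
        assert(d != 0);
        int l = 0; // l = ceil(log2(d))
        while (l < 64 && (std::uint64_t(1) << l) < d) { ++l; }
        // 2^l - d, relying on unsigned wraparound when l == 64.
        const std::uint64_t x = (l == 64) ? (std::uint64_t(0) - d)
                                          : ((std::uint64_t(1) << l) - d);
        multiplier = frac_times_2_64(x, d) + 1;
        sh1 = (l < 1) ? l : 1;
        sh2 = (l > 1) ? (l - 1) : 0;
    }

    // Quotient n / d without a hardware divide.
    std::uint64_t divide(std::uint64_t n) const
    {
        const std::uint64_t t = mulhi64(multiplier, n);
        return (t + ((n - t) >> sh1)) >> sh2;
    }
};
```

On a GPU the high multiply would normally map to a single intrinsic; the portable `mulhi64` above is only there so the sketch compiles on any host compiler.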

We have not made all kernel launches in AMReX safe for large sizes. Nevertheless, this PR is the first step, and it will be followed up by more PRs.

@WeiqunZhang WeiqunZhang force-pushed the gpu_kernel_longint branch 4 times, most recently from 147c745 to d4bb76a Compare February 3, 2024 05:55
This can avoid integer overflow when box size is very big (e.g., more than
2^30 cells).
@WeiqunZhang WeiqunZhang marked this pull request as ready for review February 3, 2024 19:49
@WeiqunZhang WeiqunZhang requested a review from atmyers February 3, 2024 19:50
@WeiqunZhang WeiqunZhang merged commit e46bd91 into AMReX-Codes:development Feb 5, 2024
69 checks passed
@WeiqunZhang WeiqunZhang deleted the gpu_kernel_longint branch February 5, 2024 22:57
WeiqunZhang added a commit that referenced this pull request Feb 7, 2024
This is now necessary because the returned long int is converted to
std::uint64_t in ParallelFor.

This is a follow-up on #3742.
Comment on lines +43 to +44
typedef unsigned __int128 amrex_uint128_t; // NOLINT(modernize-use-using)
typedef __int128 amrex_int128_t; // NOLINT(modernize-use-using)
NVHPC seems unhappy about long ints:
ECP-WarpX/WarpX#4678

-- The C compiler identification is NVHPC 21.11.0
-- The CXX compiler identification is NVHPC 21.11.0
-- Check for working CUDA compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/21.11/compilers/bin/nvcc - skipped
"/home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_INT.H", line 43: error: expected a ";"
  typedef unsigned __int128 amrex_uint128_t; 
                            ^

"/home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_INT.H", line 48: error: identifier "amrex_uint128_t" is undefined
  using UInt128_t = amrex_uint128_t; 
                    ^
...
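
One common way to keep such typedefs from breaking compilers without 128-bit integer support is to guard them with a feature macro. The sketch below is an assumption for illustration, not the fix adopted in AMReX: it relies on `__SIZEOF_INT128__`, which GCC and Clang define when `__int128` is available, and the names `example_uint128_t`, `example_has_int128`, and `example_mulhi` are hypothetical. Whether a particular NVHPC release defines that macro is not stated in this thread.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SIZEOF_INT128__)
// 128-bit integers are available as a compiler extension.
__extension__ typedef unsigned __int128 example_uint128_t;
constexpr bool example_has_int128 = true;
#else
constexpr bool example_has_int128 = false;
#endif

// High 64 bits of a*b, using the 128-bit type only when it exists.
inline std::uint64_t example_mulhi(std::uint64_t a, std::uint64_t b)
{
#if defined(__SIZEOF_INT128__)
    return static_cast<std::uint64_t>((static_cast<example_uint128_t>(a) * b) >> 64);
#else
    // Portable fallback built from 32-bit pieces.
    const std::uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;
    const std::uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    const std::uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    const std::uint64_t mid = (p0 >> 32) + (p1 & 0xffffffffu) + (p2 & 0xffffffffu);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
#endif
}
```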

WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request Feb 14, 2024
It is reported in AMReX-Codes#3742 that ROCm 5.3.0-5.7.1 fail at the link stage.
Replacing `-g` with `-gline-tables-only -fdebug-info-for-profiling` solves
the issue. Note that for Intel SYCL compilers, we use these two arguments too.

No changes are made to CMake because in the Release build type, no debug info
is added.
WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request Feb 15, 2024
This is a continuation of the changes in AMReX-Codes#3742, making AMReX ready for
big kernels.

We also store the number of points in BoxIndexer now because we always need
that number in GPU kernels.
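
To illustrate why an indexer would carry the point count alongside its strides, here is a minimal sketch of a box indexer that maps a 64-bit linear index back to a cell. It is not the AMReX `BoxIndexer`; the names `SimpleBoxIndexer` and `Cell` are hypothetical, and it uses plain `/` where a real GPU kernel would likely use a fast divider.

```cpp
#include <cassert>
#include <cstdint>

struct Cell { int i, j, k; };

// Hypothetical indexer over the inclusive index range [lo, hi] in 3D.
struct SimpleBoxIndexer
{
    std::uint64_t npts;  // total number of cells, stored once for loop bounds
    std::uint64_t lenx;  // cells along x
    std::uint64_t lenxy; // cells in one x-y plane
    int lox, loy, loz;

    SimpleBoxIndexer(int lo_x, int lo_y, int lo_z, int hi_x, int hi_y, int hi_z)
        : lox(lo_x), loy(lo_y), loz(lo_z)
    {
        assert(hi_x >= lo_x && hi_y >= lo_y && hi_z >= lo_z);
        // Widen before subtracting so the lengths cannot overflow int.
        const std::uint64_t nx = std::uint64_t(std::int64_t(hi_x) - lo_x + 1);
        const std::uint64_t ny = std::uint64_t(std::int64_t(hi_y) - lo_y + 1);
        const std::uint64_t nz = std::uint64_t(std::int64_t(hi_z) - lo_z + 1);
        lenx = nx;
        lenxy = nx * ny;
        npts = lenxy * nz;
    }

    // Convert a linear cell index in [0, npts) to (i, j, k).
    Cell operator()(std::uint64_t icell) const
    {
        assert(icell < npts);
        const std::uint64_t k = icell / lenxy;
        const std::uint64_t j = (icell - k * lenxy) / lenx;
        const std::uint64_t i = icell - k * lenxy - j * lenx;
        return Cell{int(i) + lox, int(j) + loy, int(k) + loz};
    }
};
```

With `npts` stored in the indexer, a kernel can use it directly as the upper bound of its grid-stride loop instead of recomputing the product of the three lengths.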
WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request Feb 17, 2024
This is a continuation of the changes in AMReX-Codes#3742, making AMReX ready for
big kernels.

We also store the number of points in BoxIndexer now because we always need
that number in GPU kernels.
WeiqunZhang added a commit that referenced this pull request Feb 20, 2024
This is a continuation of the changes in #3742, making AMReX ready for big
kernels.

We also store the number of points in BoxIndexer now because we always
need that number in GPU kernels.
WeiqunZhang added a commit that referenced this pull request Feb 23, 2024
It is reported in #3759 that ROCm 5.3.0-5.7.1 fail at the link stage
since #3742. Replacing `-g` with `-gline-tables-only
-fdebug-info-for-profiling` solves the issue. Note that for Intel SYCL
compilers, we use these two arguments too.

No changes are made to CMake because in the Release build type, no debug
info is added.

3 participants