Use long integer in GPU kernels #3742
Merged: WeiqunZhang merged 1 commit into AMReX-Codes:development from WeiqunZhang:gpu_kernel_longint on Feb 5, 2024
WeiqunZhang force-pushed the gpu_kernel_longint branch 4 times, most recently from 147c745 to d4bb76a (February 3, 2024 05:55)
This can avoid integer overflow when box size is very big (e.g., more than 2^30 cells).
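The PR description attributes the 2^30 limit to the grid-stride loop. One plausible reading, sketched below on the host, is that the index update `i += stride` can exceed INT_MAX even though the loop test `i < n` has just passed. The function name and the assumption that the stride (total threads in the grid) is at most `n` are illustrative and not taken from AMReX.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of AMReX: reports whether a grid-stride loop
//     for (int i = first; i < n; i += stride) { ... }
// could overflow a 32-bit signed index. The check is done in 64-bit
// arithmetic so that the check itself is well defined.
bool int_index_can_overflow (std::int64_t n, std::int64_t stride)
{
    constexpr std::int64_t int_max = std::numeric_limits<int>::max();
    if (n <= 0) { return false; }          // empty loop, nothing to index
    if (n > int_max) { return true; }      // n itself does not fit in int
    const std::int64_t last = n - 1;       // largest index passing `i < n`
    return last + stride > int_max;        // `i += stride` would exceed INT_MAX
}
```

With a stride no larger than `n`, a box of 2^30 cells is still safe under this check, while anything larger can overflow, which matches the limit quoted in the PR.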
WeiqunZhang force-pushed the gpu_kernel_longint branch from d4bb76a to d87bfbb (February 3, 2024 17:51)
atmyers approved these changes on Feb 5, 2024
WeiqunZhang added a commit that referenced this pull request on Feb 7, 2024
This is now necessary because the returned long int is converted to std::uint64_t in ParallelFor. This is a follow-up on #3742.
ax3l reviewed on Feb 8, 2024
Comment on lines +43 to +44:
typedef unsigned __int128 amrex_uint128_t; // NOLINT(modernize-use-using)
typedef __int128 amrex_int128_t; // NOLINT(modernize-use-using)
NVHPC seems unhappy about long ints:
ECP-WarpX/WarpX#4678
-- The C compiler identification is NVHPC 21.11.0
-- The CXX compiler identification is NVHPC 21.11.0
-- Check for working CUDA compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/21.11/compilers/bin/nvcc - skipped
"/home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_INT.H", line 43: error: expected a ";"
typedef unsigned __int128 amrex_uint128_t;
^
"/home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_INT.H", line 48: error: identifier "amrex_uint128_t" is undefined
using UInt128_t = amrex_uint128_t;
^
...
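The errors above arise because the compiler does not accept `__int128`. A minimal sketch of one way to guard such typedefs follows. It assumes the feature macro `__SIZEOF_INT128__`, which GCC- and Clang-compatible compilers define when the type is available; whether NVHPC 21.11 defines it is not verified here, and this is not necessarily how AMReX resolved the report. The `sketch_` names and `mul_high` are hypothetical.

```cpp
#include <cstdint>

// Assumption: __SIZEOF_INT128__ is defined only when __int128 is usable.
#if defined(__SIZEOF_INT128__)
#  define SKETCH_HAS_INT128 1
__extension__ typedef unsigned __int128 sketch_uint128_t;
__extension__ typedef __int128 sketch_int128_t;
#else
#  define SKETCH_HAS_INT128 0
#endif

// High 64 bits of the 128-bit product a*b. Uses the 128-bit type when it is
// available and falls back to 32-bit partial products otherwise.
inline std::uint64_t mul_high (std::uint64_t a, std::uint64_t b)
{
#if SKETCH_HAS_INT128
    return std::uint64_t((sketch_uint128_t(a) * b) >> 64);
#else
    const std::uint64_t mask = 0xffffffffu;
    const std::uint64_t a_lo = a & mask, a_hi = a >> 32;
    const std::uint64_t b_lo = b & mask, b_hi = b >> 32;
    const std::uint64_t p0 = a_lo * b_lo;
    const std::uint64_t p1 = a_lo * b_hi;
    const std::uint64_t p2 = a_hi * b_lo;
    const std::uint64_t p3 = a_hi * b_hi;
    // Carry out of the low 64 bits of the product.
    const std::uint64_t mid = (p0 >> 32) + (p1 & mask) + (p2 & mask);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
#endif
}
```

Code that needs a 64-bit multiply-high can then call `mul_high` without naming `__int128` directly.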
WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request on Feb 14, 2024
It is reported that ROCm 5.3.0-5.7.1 fail at the link stage since AMReX-Codes#3742. Replacing `-g` with `-gline-tables-only -fdebug-info-for-profiling` solves the issue. Note that we also use these two arguments for Intel SYCL compilers. No changes are made to CMake because no debug info is added in the Release build type.
WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request on Feb 15, 2024
This is a continuation of the changes in AMReX-Codes#3742, making AMReX ready for big kernels. We also store the number of points in BoxIndexer now because we always need that number in GPU kernels.
WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request on Feb 17, 2024
This is a continuation of the changes in AMReX-Codes#3742, making AMReX ready for big kernels. We also store the number of points in BoxIndexer now because we always need that number in GPU kernels.
WeiqunZhang added a commit that referenced this pull request on Feb 20, 2024
This is a continuation of the changes in #3742, making AMReX ready for big kernels. We also store the number of points in BoxIndexer now because we always need that number in GPU kernels.
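To illustrate what storing the number of points alongside the index mapping might look like, here is a minimal host-side sketch. `BoxIndexerSketch` and `Cell3` are hypothetical stand-ins; the real `amrex::BoxIndexer` may differ in layout and interface, and this sketch uses plain 64-bit division rather than the fast division mentioned in the PR description.

```cpp
#include <cstdint>

struct Cell3 { int i, j, k; };

// Hypothetical stand-in for an indexer over a 3D box. It maps a 64-bit
// linear cell index to (i,j,k) and keeps the total cell count so that a
// kernel can bound its loop without recomputing it.
struct BoxIndexerSketch
{
    std::uint64_t npts;   // total number of cells, computed in 64 bits
    std::uint64_t nx;     // cells in x
    std::uint64_t nxy;    // cells in one xy plane
    int lox, loy, loz;    // lower corner of the box

    BoxIndexerSketch (int lox_, int loy_, int loz_, int nx_, int ny_, int nz_)
        : npts(std::uint64_t(nx_) * std::uint64_t(ny_) * std::uint64_t(nz_)),
          nx(std::uint64_t(nx_)),
          nxy(std::uint64_t(nx_) * std::uint64_t(ny_)),
          lox(lox_), loy(loy_), loz(loz_)
    {}

    std::uint64_t numPts () const { return npts; }

    // x varies fastest, then y, then z.
    Cell3 operator() (std::uint64_t icell) const
    {
        const std::uint64_t k = icell / nxy;
        const std::uint64_t rem = icell - k * nxy;
        const std::uint64_t j = rem / nx;
        const std::uint64_t i = rem - j * nx;
        return Cell3{int(i) + lox, int(j) + loy, int(k) + loz};
    }
};
```

For a 2048 x 2048 x 1024 box, `numPts()` is 2^32, which no longer fits in a 32-bit `int` but is handled directly by the 64-bit index.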
WeiqunZhang added a commit that referenced this pull request on Feb 23, 2024
It is reported in #3759 that ROCm 5.3.0-5.7.1 fail at the link stage since #3742. Replacing `-g` with `-gline-tables-only -fdebug-info-for-profiling` solves the issue. Note that we also use these two arguments for Intel SYCL compilers. No changes are made to CMake because no debug info is added in the Release build type.
In the current implementation of ParallelFor, we use `int` for the linear cell index in the flattened one-dimensional view. This limits the size of the box to less than 2^30 cells (about half of INT_MAX). The factor of one half comes from the grid-stride loop. This limitation was not a serious issue until recently because GPUs did not have that much memory. However, the total memory on the latest GPUs has increased considerably, and users have reported issues.

In this PR, we have started using `std::uint64_t` as the linear cell index. One issue with using `std::uint64_t` is that 64-bit integer division is very expensive. Fortunately, we are able to "steal" the fast division code from https://github.com/NVIDIA/cutlass. Streaming tests have shown very good performance. On A100, the FArrayBox version of the triad test achieves 1.76 TB/s, which is the same as the rate of the much simpler 1D vector version. In fact, it is slightly faster than the 1.72 TB/s achieved by the current version in the development branch.

We have not made all kernel launches in AMReX safe for large sizes. Nevertheless, this PR is the first step, and it will be followed up by more PRs.
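As an illustration of the general idea behind fast division by a fixed divisor, the sketch below replaces a 64-bit division with one multiply-high and a few shifts, using the round-up method of Granlund and Montgomery. It is not the CUTLASS or AMReX code; `FastDiv64` and its members are hypothetical names, and the sketch assumes a compiler that provides `unsigned __int128` (for example GCC or Clang).

```cpp
#include <cstdint>

// Assumption: the compiler supports unsigned __int128.
__extension__ typedef unsigned __int128 u128_sketch;

// Hypothetical helper: precomputes a magic multiplier for a fixed divisor
// d >= 1 so that later divisions need no hardware divide instruction.
struct FastDiv64
{
    std::uint64_t divisor;
    std::uint64_t magic = 0;
    unsigned sh1 = 0;
    unsigned sh2 = 0;

    explicit FastDiv64 (std::uint64_t d) : divisor(d)
    {
        unsigned l = 0;                          // l = ceil(log2(d))
        while ((u128_sketch(1) << l) < d) { ++l; }
        // magic = floor(2^64 * (2^l - d) / d) + 1, which fits in 64 bits
        magic = std::uint64_t((((u128_sketch(1) << l) - d) << 64) / d) + 1;
        sh1 = (l < 1) ? l : 1;
        sh2 = (l > 1) ? l - 1 : 0;
    }

    // Returns n / divisor for any 64-bit n.
    std::uint64_t divide (std::uint64_t n) const
    {
        // t = high 64 bits of magic * n; t <= n, so n - t cannot underflow
        const std::uint64_t t = std::uint64_t((u128_sketch(magic) * n) >> 64);
        return (t + ((n - t) >> sh1)) >> sh2;
    }

    // Returns n % divisor, derived from the quotient.
    std::uint64_t remainder (std::uint64_t n) const
    {
        return n - divide(n) * divisor;
    }
};
```

The constructor runs once on the host per box, so its own 128-bit division is amortized over every cell that the kernel later indexes.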