
Enable GPU-aware MPI by default #4318

Open · wants to merge 5 commits into development
Conversation

@BenWibking (Contributor) commented Jan 29, 2025

Summary

This turns on GPU-aware MPI by default.

Additional background

On all current machines, simulations run faster with GPU-aware MPI enabled. Two technical issues that previously prevented enabling it by default have been resolved: AMReX now has the communication arena, which does not use managed memory, and SLURM no longer uses cgroup isolation for GPU bindings by default.

Closes #2967.
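As a hedged usage sketch (assuming the existing amrex.use_gpu_aware_mpi ParmParse parameter that AMReX_ParallelDescriptor.cpp queries), downstream users who still need to opt out of the new default could add the following to their inputs file or command line:

    amrex.use_gpu_aware_mpi = 0

Setting it to 1 would likewise force GPU-aware MPI on, regardless of the compiled-in default.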

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@BenWibking (Contributor, Author)
It looks like the HIP EB failure is due to flaky CI.

@WeiqunZhang (Member)

I think maybe we want to do something like

diff --git a/Src/Base/AMReX_ParallelDescriptor.cpp b/Src/Base/AMReX_ParallelDescriptor.cpp
index 67848505f2..83d55bb273 100644
--- a/Src/Base/AMReX_ParallelDescriptor.cpp
+++ b/Src/Base/AMReX_ParallelDescriptor.cpp
@@ -10,6 +10,9 @@
 
 #ifdef BL_USE_MPI
 #include <AMReX_ccse-mpi.H>
+#if __has_include(<mpi-ext.h>) && defined(OPEN_MPI)
+#         include <mpi-ext.h>
+#endif
 #endif
 
 #ifdef AMREX_PMI
@@ -57,7 +60,7 @@ namespace amrex::ParallelDescriptor {
 #endif
 
 #ifdef AMREX_USE_GPU
-    bool use_gpu_aware_mpi = true;
+    bool use_gpu_aware_mpi = false;
 #else
     bool use_gpu_aware_mpi = false;
 #endif
@@ -1510,6 +1513,28 @@ ReadAndBcastFile (const std::string& filename, Vector<char>& charBuf,
 void
 Initialize ()
 {
+#if defined(AMREX_USE_CUDA)
+
+#if (defined(OMPI_HAVE_MPI_EXT_CUDA) && OMPI_HAVE_MPI_EXT_CUDA) || (defined(MPICH) && defined(MPIX_GPU_SUPPORT_CUDA))
+    use_gpu_aware_mpi = (bool) MPIX_Query_cuda_support();
+#endif
+
+#elif defined(AMREX_USE_HIP)
+
+#if defined(OMPI_HAVE_MPI_EXT_ROCM) && OMPI_HAVE_MPI_EXT_ROCM
+    use_gpu_aware_mpi = (bool) MPIX_Query_rocm_support();
+#elif defined(MPICH) && defined(MPIX_GPU_SUPPORT_HIP)
+    use_gpu_aware_mpi = (bool) MPIX_Query_hip_support();
+#endif
+
+#elif defined(AMREX_USE_SYCL)
+
+#if defined(MPICH) && defined(MPIX_GPU_SUPPORT_ZE)
+    use_gpu_aware_mpi = (bool) MPIX_Query_ze_support();
+#endif
+
+#endif
+
 #ifndef BL_AMRPROF
     ParmParse pp("amrex");
     pp.queryAdd("use_gpu_aware_mpi", use_gpu_aware_mpi);

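In other words, with this patch the default would reflect what the linked MPI library actually reports at Initialize time: use_gpu_aware_mpi stays false unless the Open MPI or MPICH extension query confirms CUDA, ROCm, or ZE support, and users can still override the result through the amrex.use_gpu_aware_mpi ParmParse parameter.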
@WeiqunZhang (Member) commented Feb 2, 2025

I tested this on Frontier. With export MPICH_GPU_SUPPORT_ENABLED=1, MPIX_GPU_query_support(MPIX_GPU_SUPPORT_HIP, &is_supported) sets is_supported to 1; with MPICH_GPU_SUPPORT_ENABLED unset, it sets it to 0.
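For reference, a minimal standalone probe along the same lines (a sketch only: it assumes an MPICH-based MPI such as cray-mpich whose headers provide MPIX_GPU_query_support and MPIX_GPU_SUPPORT_HIP, mirroring the guards in the diff above; the file name probe.cpp is just for illustration):

    // probe.cpp -- report whether the MPI library claims HIP-aware (GPU-aware) support.
    #include <mpi.h>
    #if __has_include(<mpi-ext.h>)
    #include <mpi-ext.h>
    #endif
    #include <cstdio>

    int main (int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int is_supported = 0;
    #if defined(MPICH) && defined(MPIX_GPU_SUPPORT_HIP)
        // MPICH extension: query runtime GPU support for the given GPU type.
        MPIX_GPU_query_support(MPIX_GPU_SUPPORT_HIP, &is_supported);
    #endif
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) { std::printf("HIP-aware MPI reported: %d\n", is_supported); }
        MPI_Finalize();
        return 0;
    }

On Frontier this should print 1 when MPICH_GPU_SUPPORT_ENABLED=1 is exported and 0 when it is unset, matching the behavior described above.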

@WeiqunZhang (Member)

Tests on Perlmutter using both cray-mpich and openmpi also passed.
