Bump Parthenon and Kokkos #114

Merged: 19 commits from pgrete/bump-parth-for-rr into main on Sep 9, 2024

Conversation

@pgrete (Contributor) commented Sep 4, 2024

Updates Parthenon to 24.08 and Kokkos to 4.4.0 (both released last month).
Changes to the interface are described in the Changelog.

@pgrete requested a review from BenWibking September 4, 2024 09:45
@pgrete (Contributor, Author) commented Sep 4, 2024

Looks like I still need to update the new hst file name in the tests.

@BenWibking (Contributor) previously approved these changes Sep 4, 2024 and left a comment:

LGTM

@BenWibking dismissed their stale review September 4, 2024 16:08

oops, missed test failure

@pgrete (Contributor, Author) commented Sep 4, 2024

@par-hermes format

@pgrete (Contributor, Author) commented Sep 6, 2024

I think I caught everything now and tests pass again.
Would you mind reviewing the changes again, @BenWibking?

@BenWibking (Contributor) left a comment

The code looks fine, but the MPI regression test looks like it's still failing.

CHANGELOG.md: review comment (resolved)
@BenWibking (Contributor) commented

The MPI regression error messages are very odd, and they all happen for the cluster_magnetic_tower test:

10/12 Test #22: regression_mpi_test:cluster_magnetic_tower .........***Failed  260.63 sec

OpenMPI errors:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x34dc78788
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[d4ca0f87c519:10106] [[41645,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[d4ca0f87c519:10106] [[41645,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[d4ca0f87c519:10106] [[41645,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 507

Actual error that causes the regression to fail:

Traceback (most recent call last):
  File "/__w/athenapk/athenapk/external/parthenon/scripts/python/packages/parthenon_tools/parthenon_tools/phdf.py", line 147, in __init__
    f = h.File(filename, "r")
  File "/usr/lib/python3/dist-packages/h5py/_debian_h5py_serial/_hl/files.py", line 507, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/usr/lib/python3/dist-packages/h5py/_debian_h5py_serial/_hl/files.py", line 220, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_debian_h5py_serial/_objects.pyx", line 54, in h5py._debian_h5py_serial._objects.with_phil.wrapper
  File "h5py/_debian_h5py_serial/_objects.pyx", line 55, in h5py._debian_h5py_serial._objects.with_phil.wrapper
  File "h5py/_debian_h5py_serial/h5f.pyx", line 106, in h5py._debian_h5py_serial.h5f.open
OSError: Unable to open file (file signature not found)

@pgrete (Contributor, Author) commented Sep 7, 2024

This is all so annoying...
I tried varying things (again "investing" hours) without success.
I was not able to get a CUDA-aware MPI version cleanly working with the updated (Ubuntu 22.04, CUDA 12.1) container.
I tried (with a small MPI ping-pong test):

  • OpenMPI 5.0.3 (couldn't convince it to report that CUDA-aware MPI is included, even though the CUDA extension is being built)
  • OpenMPI 5.0.3 with the CUDA-aware UCX transport layer -- again didn't work (segfault on sends from GPU buffers)
  • OpenMPI 4.0.4 (the version in the old container) - works for some cases but then fails for others with the cuIpcGetMemHandle error above (not sure where that's coming from, but I assume it's related to being executed in a Docker container)

So I went back to the CUDA 11.6 and Ubuntu 20.04 container and am now testing various combinations of scipy, h5py, and numpy, which by default have some incompatibilities due to the use of deprecated interfaces...

Such a mess...

@BenWibking (Contributor) commented Sep 7, 2024

I've gotten OpenMPI 5 + UCX with CUDA-awareness to work outside of a container, so I'm a bit surprised by that combination. Does it work outside the container? Do ompi_info and ucx_info show a CUDA entry?

@pgrete (Contributor, Author) commented Sep 9, 2024

Leaving this for posterity -- something about the IPC is odd (note the node=787462364b88):

root@787462364b88:/athenapk/build# /opt/openmpi/bin/mpirun -np 2 --mca opal_cuda_verbose 10 --mca btl_smcuda_cuda_ipc_verbose 100  /athenapk/build/bin/athenaPK -i /athenapk/inputs/cluster/hydro_agn_feedback.in parthenon/output2/id=kinetic_only_precessed_True parthenon/output2/dt=0.005 parthenon/time/tlim=0.005 hydro/gamma=1.6666666666666667 hydro/He_mass_fraction=0.25 units/code_length_cgs=3.085677580962325e+24 units/code_mass_cgs=1.98841586e+47 units/code_time_cgs=3.15576e+16 problem/cluster/uniform_gas/init_uniform_gas=true problem/cluster/uniform_gas/rho=147.7557589278723 problem/cluster/uniform_gas/ux=0.0 problem/cluster/uniform_gas/uy=0.0 problem/cluster/uniform_gas/uz=0.0 problem/cluster/uniform_gas/pres=1.5454368403867562 problem/cluster/precessing_jet/jet_phi0=1.2 problem/cluster/precessing_jet/jet_phi_dot=0 problem/cluster/precessing_jet/jet_theta=0.4 problem/cluster/agn_feedback/fixed_power=0.3319965633348792 problem/cluster/agn_feedback/efficiency=0.001 problem/cluster/agn_feedback/thermal_fraction=0.0 problem/cluster/agn_feedback/kinetic_fraction=1.0 problem/cluster/agn_feedback/magnetic_fraction=0 problem/cluster/agn_feedback/thermal_radius=0.1 problem/cluster/agn_feedback/kinetic_jet_temperature=10000000.0 problem/cluster/agn_feedback/kinetic_jet_radius=0.05 problem/cluster/agn_feedback/kinetic_jet_thickness=0.05 problem/cluster/agn_feedback/kinetic_jet_offset=0.01 --kokkos-map-device-id-by=mpi_rank
[787462364b88:10789] Sending CUDA IPC REQ (try=1): myrank=0, mydev=0, peerrank=1
[787462364b88:10790] Sending CUDA IPC REQ (try=1): myrank=1, mydev=1, peerrank=0
[787462364b88:10789] Not sending CUDA IPC ACK because request already initiated
[787462364b88:10790] Analyzed CUDA IPC request: myrank=1, mydev=1, peerrank=0, peerdev=0 --> ACCESS=1
[787462364b88:10790] BTL smcuda: rank=1 enabling CUDA IPC to rank=0 on node=787462364b88 
[787462364b88:10790] Sending CUDA IPC ACK:  myrank=1, mydev=1, peerrank=0, peerdev=0
[787462364b88:10789] Received CUDA IPC ACK, notifying PML: myrank=0, peerrank=1
[787462364b88:10789] BTL smcuda: rank=0 enabling CUDA IPC to rank=1 on node=787462364b88 
Starting up hydro driver
# Variables in use:
# Package: parthenon::resolved_state
# ---------------------------------------------------
# Variables:
# Name	Metadata flags
# ---------------------------------------------------
theta_sph                 Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
mach_sonic                Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
log10_cell_radius         Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
magnetic_tower_A          Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
prim                      Cell,Provides,Real,Derived,Hydro,parthenon::resolved_state
v_r                       Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
temperature               Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
entropy                   Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
cons                      Cell,Provides,Real,Independent,FillGhost,WithFluxes,Hydro,parthenon::resolved_state
bnd_flux::cons            Face,Provides,Real,Derived,OneCopy,Flux,parthenon::resolved_state
# ---------------------------------------------------
# Sparse Variables:
# Name	sparse id	Metadata flags
# ---------------------------------------------------
# ---------------------------------------------------
# Swarms:
# Swarm	Value	metadata
# ---------------------------------------------------


Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=5.0000000000000001e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=6.69e-01 wsec_step=2.72e+00
--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x63a34e300
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[787462364b88:10785] *** Process received signal ***
[787462364b88:10785] Signal: Segmentation fault (11)
[787462364b88:10785] Signal code: Address not mapped (1)
[787462364b88:10785] Failing at address: (nil)
[787462364b88:10785] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fc8cb6b2090]
[787462364b88:10785] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x183bf2)[0x7fc8cb7f2bf2]
[787462364b88:10785] [ 2] /opt/openmpi/lib/libopen-rte.so.40(+0x2d821)[0x7fc8cb9a9821]
[787462364b88:10785] [ 3] /opt/openmpi/lib/libopen-rte.so.40(orte_show_help_recv+0x177)[0x7fc8cb9a9cb7]
[787462364b88:10785] [ 4] /opt/openmpi/lib/libopen-rte.so.40(orte_rml_base_process_msg+0x3e1)[0x7fc8cba077a1]
[787462364b88:10785] [ 5] /opt/openmpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0x7b3)[0x7fc8cb8edf13]
[787462364b88:10785] [ 6] /opt/openmpi/bin/mpirun(+0x14a1)[0x561273e924a1]
[787462364b88:10785] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc8cb693083]
[787462364b88:10785] [ 8] /opt/openmpi/bin/mpirun(+0x11fe)[0x561273e921fe]
[787462364b88:10785] *** End of error message ***
Segmentation fault (core dumped)

@pgrete (Contributor, Author) commented Sep 9, 2024

I was getting around a couple of errors like

Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=5.0000000000000001e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=3.95e-01 wsec_step=2.46e+00
--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x638e733c0
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[787462364b88:10817] [[2900,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 507
[787462364b88:10817] [[2900,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 507
cycle=1 time=5.0000000000000001e-03 dt=1.0000000000000000e-02 zone-cycles/wsec_step=1.79e+05 wsec_total=4.41e+00 wsec_step=4.02e+00

Driver completed.
time=5.00e-03 cycle=1
tlim=5.00e-03 nlim=-1

walltime used = 4.41e+00
zone-cycles/wallsecond = 1.63e+05

by disabling CUDA IPC altogether:
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0
Let's see how this goes.

@pgrete (Contributor, Author) commented Sep 9, 2024

I cannot believe it. Victory!!!

btw, this unmerged doc (open-mpi/ompi#12137, https://github.com/open-mpi/ompi/blob/909168e501b7eb144d4a361a88938af99c1a4352/docs/tuning-apps/networking/cuda.rst) was quite helpful

@pgrete merged commit 7a50cca into main Sep 9, 2024
4 checks passed
@BenWibking (Contributor) commented Sep 17, 2024

For future reference, OLCF has example CUDA-aware MPI Dockerfiles here: https://code.ornl.gov/olcfcontainers/olcfbaseimages/-/blob/master/summit/mpiimage-centos-cuda/Dockerfile?ref_type=heads

It looks like they download the pre-built UCX and OpenMPI from NVIDIA/Mellanox:

# Accept mpi_root environment variable. Should come from host $MPI_ROOT. Should be pointing to GNU instead of XL, etc.
ARG mpi_root
 
# Set MPI environment variables
ENV PATH=$mpi_root/bin:$PATH
ENV LD_LIBRARY_PATH=$mpi_root/lib:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=$mpi_root/lib:$LIBRARY_PATH
ENV INCLUDE=$mpi_root/include:$INCLUDE
ENV C_INCLUDE_PATH=$mpi_root/include:$C_INCLUDE_PATH
ENV CPLUS_INCLUDE_PATH=$mpi_root/include:$CPLUS_INCLUDE_PATH
 
# MOFED is sufficient, but is it necessary?
# Set MOFED version, OS version and platform (updated to match Summit 1/30/2024)
ENV MOFED_VER 4.9-6.0.6.1
ENV OS_VER rhel8.6
ENV PLATFORM ppc64le
ENV MOFED_DIR /mlnx

# MLNX_OFED
RUN mkdir ${MOFED_DIR} 
RUN wget https://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VER}/MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz -P ${MOFED_DIR}

RUN rm -rf /var/cache/dnf \
    && fakeroot dnf install -y perl lsof numactl-libs pciutils tk libnl3 python36 tcsh gcc-gfortran tcl libmnl ethtool fuse-libs \
	&& fakeroot dnf -y install tar wget git openssh \
	&& dnf clean all
RUN cd ${MOFED_DIR} && \
    tar -xvf MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz --no-same-owner && \
    MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}/mlnxofedinstall --user-space-only --without-fw-update --distro ${OS_VER} -q && \
    cd / && \
    rm -rf ${MOFED_DIR}

@fglines-nv commented Sep 18, 2024

@pgrete @BenWibking I'm a little late to the party; it took LAMMPS hitting the same IPC errors when running GPUDirect Cray MPICH with a new version of Kokkos for me to run into this issue. I just figured it out yesterday.

The core issue is that Kokkos recently made cudaMallocAsync the default allocator for GPU memory, but memory allocated with cudaMallocAsync is incompatible with the old IPC API call cuIpcGetMemHandle. It seems that almost all the MPI implementations run into this problem, at least for UCX and libfabric; see openucx/ucx#7110 and ofiwg/libfabric#10162. HPC-X (NVIDIA's OpenMPI) is the only implementation that might work with cudaMallocAsync, albeit with performance hits (see https://docs.nvidia.com/hpc-sdk//hpc-sdk-release-notes/index.html#known-limitations). I've been told the new API is due to a performance issue with cuIpcGetMemHandle + cudaMallocAsync.
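
In case it helps anyone reproduce this outside of an MPI run, here is a minimal standalone sketch (my own, untested here, assuming CUDA 11.2+ with the stream-ordered allocator available) of the incompatibility: allocate with cudaMallocAsync, then request a legacy IPC handle for that pointer. On the affected setups cuIpcGetMemHandle is expected to fail with CUDA_ERROR_INVALID_VALUE, i.e. the "return value: 1" seen in the logs above.

// repro_ipc_async.cu -- hypothetical reproducer; compile with e.g.: nvcc repro_ipc_async.cu -lcuda
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);  // first runtime call also sets up the primary context used by the driver API call below

  // Stream-ordered allocation: the pointer is backed by the device's default memory pool.
  void *buf = nullptr;
  cudaMallocAsync(&buf, 1 << 20, stream);
  cudaStreamSynchronize(stream);

  // Legacy IPC API: does not support pool-backed allocations.
  CUipcMemHandle handle;
  CUresult res = cuIpcGetMemHandle(&handle, reinterpret_cast<CUdeviceptr>(buf));
  std::printf("cuIpcGetMemHandle returned %d\n", static_cast<int>(res));  // expected: 1 (CUDA_ERROR_INVALID_VALUE)

  cudaFreeAsync(buf, stream);
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}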

To switch to the new IPC, Kokkos would need to create a CUDA memory pool separate from the default pool. One would then need to get that memory pool object to create a file descriptor to pass to the MPI framework/other process to access that memory. So the total solution would involve changes to both Kokkos and the comm libraries.
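
As a rough sketch of what that path looks like in plain CUDA runtime calls (not actual Kokkos or MPI-library code; names and sizes here are just for illustration):

// pool_ipc_sketch.cu -- hypothetical sketch of the "new" IPC route: an explicitly created,
// shareable memory pool whose handle (a POSIX fd here) could be handed to the comm library
// or to a peer process.
#include <cuda_runtime.h>

int main() {
  int device = 0;
  cudaSetDevice(device);

  // A dedicated pool (separate from the device default pool) that allows export.
  cudaMemPoolProps props = {};
  props.allocType = cudaMemAllocationTypePinned;
  props.handleTypes = cudaMemHandleTypePosixFileDescriptor;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id = device;

  cudaMemPool_t pool;
  cudaMemPoolCreate(&pool, &props);

  // Export the pool once as a file descriptor; a peer process would import it
  // with cudaMemPoolImportFromShareableHandle.
  int fd = -1;
  cudaMemPoolExportToShareableHandle(&fd, pool, cudaMemHandleTypePosixFileDescriptor, 0);

  // Communication buffers would then be allocated from this pool instead of the default one;
  // individual pointers get shared via cudaMemPoolExportPointer / cudaMemPoolImportPointer.
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  void *buf = nullptr;
  cudaMallocFromPoolAsync(&buf, 1 << 20, pool, stream);

  cudaFreeAsync(buf, stream);
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  cudaMemPoolDestroy(pool);
  return 0;
}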

For now you could disable IPC in the MPI library, or you can disable cudaMallocAsync in Kokkos with Kokkos_ENABLE_CUDA_MALLOC_ASYNC=OFF.

Unless we're calling cudaMalloc very often in a very-active AMR simulation, I believe disabling cudaMallocAsync instead of IPC in the MPI layer will have a better performance outcome. IPC should be more beneficial for multi GPU-per-node systems and I believe for MPS as well.

@felker commented Sep 24, 2024

@fglines-nv I recently stumbled upon this issue with AthenaK on ALCF Polaris with Cray MPICH, and your links to the other Issues led me here. Do you know when/what version Kokkos made this change? I think we started having issues at 4.2.00, see here: kokkos/kokkos#7294

We have been recompiling with -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF -- is this not the correct flag? I see it was renamed from your suggestion in kokkos/kokkos@c3ec284.

The Kokkos team removed the comprehensive list of CMake build flags from BUILD.md in April 2023 (kokkos/kokkos@83873a6#diff-40f60e1037245d7b8a98a7325d53890a717da9979adeb54a61a795c4ba07f9c9R114), but their wiki page is also missing the flag: https://kokkos.org/kokkos-core-wiki/keywords.html

@fglines-nv commented

@felker Supposedly it was this PR in Kokkos: kokkos/kokkos#6402. The flag should be Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF. They're reverting this change in kokkos/kokkos#7353; looks like it might be merged soon.

@pgrete deleted the pgrete/bump-parth-for-rr branch November 29, 2024 21:07