Segmentation fault in slate::gesv when using CUDA-aware MPI #154

Open
liamscarlett opened this issue Dec 12, 2023 · 2 comments

Description
I have been successfully running gesv with GPU-aware MPI on an AMD machine with HIP (Setonix @ Pawsey Supercomputing Centre, Australia), but I get segmentation faults when trying to do the same on NVIDIA GPUs (both on Gadi @ NCI Australia using CUDA-aware Open MPI, and on Frontera @ TACC using CUDA-aware Intel MPI).

I am running SLATE's provided example code for gesv (examples/ex06_linear_system_lu.cc), modified only slightly to set the target to Devices in the gesv options. With SLATE_GPU_AWARE_MPI=0 it runs fine, but with SLATE_GPU_AWARE_MPI=1 I get a segmentation fault with the following backtrace (on Frontera):

rank 1: void test_lu() [with scalar_type = float]
rank 2: void test_lu() [with scalar_type = float]
rank 3: void test_lu() [with scalar_type = float]
mpi_size 4, grid_p 2, grid_q 2
rank 0: void test_lu() [with scalar_type = float]
[c197-101:24026:0:24026] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b1588ec2c10)
[c197-101:24025:0:24025] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b4360ec2c10)
==== backtrace (tid:  24026) ====
 0 0x000000000004cb95 ucs_debug_print_backtrace()  ???:0
 1 0x000000000089fa45 bdw_memcpy_write()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:128
 2 0x000000000089fa45 bdw_memcpy_write()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:123
 3 0x000000000089bce9 write_to_cell()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:326
 4 0x000000000089bce9 send_cell()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:890
 5 0x00000000008959a4 MPIDI_POSIX_eager_send()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1540
 6 0x00000000004a0989 MPIDI_POSIX_eager_send()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/posix_eager_impl.h:37
 7 0x00000000004a0989 MPIDI_POSIX_am_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_am.h:220
 8 0x00000000004a0989 MPIDI_SHM_am_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_am.h:49
 9 0x00000000004a0989 MPIDIG_isend_impl()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:116
10 0x00000000004a176d MPIDIG_am_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:172
11 0x00000000004a176d MPIDIG_mpi_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:233
12 0x00000000004a176d MPIDI_POSIX_mpi_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_send.h:59
13 0x00000000004a176d MPIDI_SHM_mpi_isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:187
14 0x00000000004a176d MPIDI_isend_unsafe()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:314
15 0x00000000004a176d MPIDI_isend_safe()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:609
16 0x00000000004a176d MPID_Isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:828
17 0x00000000004a176d PMPI_Isend()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/pt2pt/isend.c:132
18 0x000000000042fd7a slate::Tile<float>::isend()  ???:0
19 0x000000000043b6f4 slate::BaseMatrix<float>::tileIbcastToSet()  ???:0
20 0x00000000004c0855 slate::BaseMatrix<float>::listBcast<(slate::Target)68>()  ???:0
21 0x000000000063a0c6 slate::impl::getrf<(slate::Target)68, float>()  getrf.cc:0
22 0x00000000000163ec GOMP_taskwait()  /admin/build/admin/rpms/frontera/BUILD/gcc-9.1.0/x86_64-pc-linux-gnu/libgomp/../.././libgomp/task.c:1537
23 0x000000000061eac1 slate::impl::getrf<(slate::Target)68, float>()  getrf.cc:0
24 0x0000000000012a22 GOMP_parallel()  /admin/build/admin/rpms/frontera/BUILD/gcc-9.1.0/x86_64-pc-linux-gnu/libgomp/../.././libgomp/parallel.c:171
25 0x0000000000620256 slate::impl::getrf<(slate::Target)68, float>()  ???:0
26 0x00000000006204ff slate::getrf<float>()  ???:0
27 0x00000000006038d7 slate::gesv<float>()  ???:0
28 0x00000000004199de test_lu<float>()  ???:0
29 0x000000000041809d main()  ???:0
30 0x0000000000022555 __libc_start_main()  ???:0
31 0x0000000000411709 _start()  ???:0
=================================

and on Gadi:

mpi_size 4, grid_p 2, grid_q 2
rank 0: void test_lu() [with scalar_type = float]
rank 1: void test_lu() [with scalar_type = float]
rank 2: void test_lu() [with scalar_type = float]
rank 3: void test_lu() [with scalar_type = float]
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[1702299288.183060] [gadi-gpu-v100-0002:4184291:0]         cuda_md.c:162  UCX  ERROR cuMemGetAddressRange(0x15449a8c2e10) error: named symbol not found
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[1702299288.193415] [gadi-gpu-v100-0002:4184290:0]         cuda_md.c:162  UCX  ERROR cuMemGetAddressRange(0x1493128c2e10) error: named symbol not found
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[gadi-gpu-v100-0002:4184290:0:4184290] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1493128c2e20)
[gadi-gpu-v100-0002:4184291:0:4184291] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15449a8c2e20)
[1702299288.438796] [gadi-gpu-v100-0002:4184288:0]         cuda_md.c:162  UCX  ERROR cuMemGetAddressRange(0x147bf7cc2440) error: named symbol not found
[gadi-gpu-v100-0002:4184288:0:4184288] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x147bf7cc2440)
==== backtrace (tid:4184291) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000000cf006 __memmove_avx_unaligned_erms()  :0
 2 0x0000000000089039 ucp_eager_only_handler()  ???:0
 3 0x000000000001687d uct_mm_iface_progress()  :0
 4 0x000000000004951a ucp_worker_progress()  ???:0
 5 0x00000000000ab8cf mca_pml_ucx_recv()  /jobfs/53639599.gadi-pbs/0/openmpi/4.1.4/source/openmpi-4.1.4/ompi/mca/pml/ucx/pml_ucx.c:646
 6 0x0000000000208915 PMPI_Recv()  /jobfs/53639599.gadi-pbs/0/openmpi/4.1.4/build/gcc/ompi/precv.c:82
 7 0x000000000065a550 slate::Tile<float>::recv()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/Tile.hh:1094
 8 0x0000000000b0a1ca slate::BaseMatrix<float>::tileIbcastToSet()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:2403
 9 0x0000000000b0a1ca ???()  /half-root/usr/include/c++/8/bits/shared_ptr_base.h:1013
10 0x0000000000b0a1ca ???()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:492
11 0x0000000000b0a1ca slate::BaseMatrix<float>::tileIbcastToSet()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:2404
12 0x0000000000c5d690 slate::BaseMatrix<float>::listBcast<(slate::Target)68>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:1979
13 0x000000000101dcf3 slate::impl::getrf<(slate::Target)68, float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:106
14 0x0000000000109f09 _INTERNAL8bc508f1::__kmp_invoke_task()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1856
15 0x000000000011094b __kmp_omp_task()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1974
16 0x0000000000101007 __kmpc_omp_task_with_deps()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_taskdeps.cpp:734
17 0x000000000101d2e0 L__ZN5slate4impl5getrfILNS_6TargetE68EfEEvRNS_6MatrixIT0_EERSt6vectorIS7_INS_5PivotESaIS8_EESaISA_EERKSt3mapINS_6OptionENS_11OptionValueESt4lessISF_ESaISt4pairIKSF_SG_EEE_84__par_region0_2_615()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:93
18 0x0000000000163493 __kmp_invoke_microtask()  ???:0
19 0x00000000000d1ca4 _INTERNAL49d8b4ea::__kmp_serial_fork_call()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_runtime.cpp:2004
20 0x00000000000d1ca4 __kmp_fork_call()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_runtime.cpp:2329
21 0x0000000000089d23 __kmpc_fork_call()  /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_csupport.cpp:350
22 0x000000000101c2bb slate::impl::getrf<(slate::Target)68, float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:84
23 0x0000000001014fa5 slate::getrf<float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:341
24 0x0000000000f3597e slate::gesv<float>()  /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/gesv.cc:95
25 0x000000000041cb18 test_lu<float>()  /scratch/d35/lhs573/test_slate/ex06_linear_system_lu.cc:28
26 0x000000000041ab6c main()  /scratch/d35/lhs573/test_slate/ex06_linear_system_lu.cc:132
27 0x000000000003ad85 __libc_start_main()  ???:0
28 0x000000000041a98e _start()  ???:0
=================================

Steps To Reproduce

  1. Modify the ex06_linear_system_lu.cc test code to pass the option {{slate::Option::Target, slate::Target::Devices}} to the slate::gesv call (a sketch of this change follows the list)
  2. export SLATE_GPU_AWARE_MPI=1
  3. Run a multi-GPU job on an NVIDIA machine
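For reference, here is a minimal sketch of step 1, assuming the block-cyclic matrix setup used in the shipped example; the function signature, dimensions, and fill step are illustrative placeholders, not SLATE's exact example code:

    #include <mpi.h>
    #include <slate/slate.hh>

    // Hypothetical condensed version of ex06_linear_system_lu.cc's test_lu().
    template <typename scalar_type>
    void test_lu( int64_t n, int64_t nrhs, int64_t nb, int p, int q )
    {
        // Block-cyclic matrices distributed over a p x q MPI process grid.
        slate::Matrix<scalar_type> A( n, n,    nb, p, q, MPI_COMM_WORLD );
        slate::Matrix<scalar_type> B( n, nrhs, nb, p, q, MPI_COMM_WORLD );
        A.insertLocalTiles();
        B.insertLocalTiles();
        // ... fill A and B as in the original example ...

        slate::Pivots pivots;

        // The only change from the stock example: request GPU execution.
        slate::gesv( A, pivots, B,
                     {{ slate::Option::Target, slate::Target::Devices }} );
    }

With SLATE_GPU_AWARE_MPI=1, the tile broadcasts inside getrf (the MPI_Isend/MPI_Recv calls visible in the backtraces above) are then expected to hand device buffers directly to MPI, which is where both runs crash.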

Environment
The information below is given for the Frontera (TACC) machine.

  • SLATE version / commit ID (e.g., git log --oneline -n 1): d514136
  • How installed:
    • git clone
  • How compiled:
    • makefile (include your make.inc)
      CXX=mpicc
      FC=mpif90
      CXXFLAGS=-I/home1/apps/cuda/12.2/include -I/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/include
      LDFLAGS=-L/home1/apps/cuda/12.2/lib64 -L/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64 -lstdc++ -lcudadevrt -lcudart
      blas=mkl
      gpu_backend=cuda
      mpi=1
      openmp=1
  • Compiler & version (e.g., mpicxx --version): g++ (GCC) 9.1.0
  • BLAS library (e.g., MKL, ESSL, OpenBLAS) & version: mkl/19.1.1
  • CUDA / ROCm / oneMKL version (e.g., nvcc --version): 12.2
  • MPI library & version (MPICH, Open MPI, Intel MPI, IBM Spectrum, Cray MPI, etc. Sometimes mpicxx -v gives info.): Intel MPI 19.0.9
  • OS: Linux
  • Hardware (CPUs, GPUs, nodes): CPUs: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, GPUs: Quadro RTX 5000

@mgates3 (Collaborator) commented Dec 12, 2023

Thanks for the detailed bug report. We will need to investigate.

@lzjia-jia

What make.inc configuration file did you use when installing SLATE with HIP?
