Description
I have been running gesv successfully with GPU-aware MPI on an AMD machine with HIP (Setonix @ Pawsey Supercomputing Centre, Australia), but I am getting seg faults when trying to do the same on NVIDIA GPUs (both on Gadi @ NCI Australia with CUDA-aware OpenMPI, and on Frontera @ TACC with CUDA-aware Intel MPI).
I am running SLATE's provided example code for gesv (examples/ex06_linear_system_lu.cc), modified only to set the target to devices in the gesv options. With SLATE_GPU_AWARE_MPI=0 it runs fine, but with SLATE_GPU_AWARE_MPI=1 it seg faults on both machines. The backtrace below is from Gadi (note the gadi-gpu-v100 hostnames); Frontera fails the same way:
mpi_size 4, grid_p 2, grid_q 2
rank 0: void test_lu() [with scalar_type = float]
rank 1: void test_lu() [with scalar_type = float]
rank 2: void test_lu() [with scalar_type = float]
rank 3: void test_lu() [with scalar_type = float]
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[1702299288.183060] [gadi-gpu-v100-0002:4184291:0] cuda_md.c:162 UCX ERROR cuMemGetAddressRange(0x15449a8c2e10) error: named symbol not found
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[1702299288.193415] [gadi-gpu-v100-0002:4184290:0] cuda_md.c:162 UCX ERROR cuMemGetAddressRange(0x1493128c2e10) error: named symbol not found
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[gadi-gpu-v100-0002:4184290:0:4184290] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1493128c2e20)
[gadi-gpu-v100-0002:4184291:0:4184291] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15449a8c2e20)
[1702299288.438796] [gadi-gpu-v100-0002:4184288:0] cuda_md.c:162 UCX ERROR cuMemGetAddressRange(0x147bf7cc2440) error: named symbol not found
[gadi-gpu-v100-0002:4184288:0:4184288] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x147bf7cc2440)
==== backtrace (tid:4184291) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x00000000000cf006 __memmove_avx_unaligned_erms() :0
2 0x0000000000089039 ucp_eager_only_handler() ???:0
3 0x000000000001687d uct_mm_iface_progress() :0
4 0x000000000004951a ucp_worker_progress() ???:0
5 0x00000000000ab8cf mca_pml_ucx_recv() /jobfs/53639599.gadi-pbs/0/openmpi/4.1.4/source/openmpi-4.1.4/ompi/mca/pml/ucx/pml_ucx.c:646
6 0x0000000000208915 PMPI_Recv() /jobfs/53639599.gadi-pbs/0/openmpi/4.1.4/build/gcc/ompi/precv.c:82
7 0x000000000065a550 slate::Tile<float>::recv() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/Tile.hh:1094
8 0x0000000000b0a1ca slate::BaseMatrix<float>::tileIbcastToSet() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:2403
9 0x0000000000b0a1ca ???() /half-root/usr/include/c++/8/bits/shared_ptr_base.h:1013
10 0x0000000000b0a1ca ???() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:492
11 0x0000000000b0a1ca slate::BaseMatrix<float>::tileIbcastToSet() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:2404
12 0x0000000000c5d690 slate::BaseMatrix<float>::listBcast<(slate::Target)68>() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/./include/slate/BaseMatrix.hh:1979
13 0x000000000101dcf3 slate::impl::getrf<(slate::Target)68, float>() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:106
14 0x0000000000109f09 _INTERNAL8bc508f1::__kmp_invoke_task() /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1856
15 0x000000000011094b __kmp_omp_task() /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_tasking.cpp:1974
16 0x0000000000101007 __kmpc_omp_task_with_deps() /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_taskdeps.cpp:734
17 0x000000000101d2e0 L__ZN5slate4impl5getrfILNS_6TargetE68EfEEvRNS_6MatrixIT0_EERSt6vectorIS7_INS_5PivotESaIS8_EESaISA_EERKSt3mapINS_6OptionENS_11OptionValueESt4lessISF_ESaISt4pairIKSF_SG_EEE_84__par_region0_2_615() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:93
18 0x0000000000163493 __kmp_invoke_microtask() ???:0
19 0x00000000000d1ca4 _INTERNAL49d8b4ea::__kmp_serial_fork_call() /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_runtime.cpp:2004
20 0x00000000000d1ca4 __kmp_fork_call() /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_runtime.cpp:2329
21 0x0000000000089d23 __kmpc_fork_call() /nfs/site/proj/openmp/promo/20230428/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.tcm0.t1..h1.w1-anompbdwlin05/../../src/kmp_csupport.cpp:350
22 0x000000000101c2bb slate::impl::getrf<(slate::Target)68, float>() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:84
23 0x0000000001014fa5 slate::getrf<float>() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/getrf.cc:341
24 0x0000000000f3597e slate::gesv<float>() /jobfs/91791886.gadi-pbs/0/slate/2023.06.00/source/slate-2023.06.00/src/gesv.cc:95
25 0x000000000041cb18 test_lu<float>() /scratch/d35/lhs573/test_slate/ex06_linear_system_lu.cc:28
26 0x000000000041ab6c main() /scratch/d35/lhs573/test_slate/ex06_linear_system_lu.cc:132
27 0x000000000003ad85 __libc_start_main() ???:0
28 0x000000000041a98e _start() ???:0
=================================
Steps To Reproduce
Modify the ex06_linear_system_lu.cc example code to pass the option {{slate::Option::Target, slate::Target::Devices}} to the slate::gesv call (see the sketch after this list)
export SLATE_GPU_AWARE_MPI=1
Run a multi-GPU job on an NVIDIA machine
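For reference, the only source change is around the solve itself. Below is a minimal sketch of that change, modelled on ex06_linear_system_lu.cc; the problem sizes and variable names here are illustrative, and the rest of the example (MPI setup, random fill, main()) is unchanged from the stock code:

#include <mpi.h>
#include <slate/slate.hh>

template <typename scalar_type>
void test_lu()
{
    // Illustrative sizes; the stock example chooses its own defaults.
    int64_t n = 5000, nrhs = 100, nb = 256;
    int p = 2, q = 2;  // 2x2 MPI process grid, matching the run above

    slate::Matrix<scalar_type> A( n, n,    nb, p, q, MPI_COMM_WORLD );
    slate::Matrix<scalar_type> B( n, nrhs, nb, p, q, MPI_COMM_WORLD );
    A.insertLocalTiles();
    B.insertLocalTiles();
    // ... fill A and B with random data, as in the example ...

    slate::Pivots pivots;

    // The only modification to the stock example: request GPU execution.
    slate::gesv( A, pivots, B, {
        { slate::Option::Target, slate::Target::Devices },
    } );
}

With SLATE_GPU_AWARE_MPI=0 this runs to completion; with SLATE_GPU_AWARE_MPI=1 it crashes as shown above.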
Environment
The more information that you can provide about your environment, the simpler it is for us to understand and reproduce the issue.
The information below is for the Frontera (TACC) machine.
SLATE version / commit ID (e.g., git log --oneline -n 1): d514136
make.inc:
CXX=mpicc
FC=mpif90
CXXFLAGS=-I/home1/apps/cuda/12.2/include -I/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/include
LDFLAGS=-L/home1/apps/cuda/12.2/lib64 -L/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64 -lstdc++ -lcudadevrt -lcudart
blas=mkl
gpu_backend=cuda
mpi=1
openmp=1
Compiler (mpicxx --version): g++ (GCC) 9.1.0
CUDA (nvcc --version): 12.2
MPI (mpicxx -v gives info.): Intel MPI 19.0.9