BUG, MAINT: segfaults through libfabric->ucx #10001
Also, pretty sure that in this case I rebuilt NVSHMEM against
Probably the most useful thing is this--do I have a shot at debugging this/fixing this without basically needing to rebuild my whole dependency chain? If I could just adjust
Clean rebuild of dependency chain with libfabric at
Changing this from version
I've simplified the reproducer to remove GROMACS entirely, using only the cuFFTMp example at: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS.

Interactive run script for 2 nodes (4 A100 GPUs each):

#!/bin/bash -l
#
# setup the runtime environment
export NVSHMEM_DEBUG=TRACE
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.18.1-opv2jutclmudyzxdeud4xjggqrubip3u/lib:$LD_LIBRARY_PATH"
export PATH="$PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin"
export PATH="$PATH:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/ucx:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib"
export NVSHMEM_DISABLE_CUDA_VMM=1
export FI_CXI_OPTIMIZED_MRS=false
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
export CUFFT_LIB=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib
export CUFFT_INC=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp
export NVSHMEM_LIB=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib
export NVSHMEM_INC=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include
cd /lustre/scratch5/treddy/march_april_2024_testing/github_projects/CUDALibrarySamples/cuFFTMp/samples/r2c_c2r_slabs_GROMACS
make clean
make build
make run

Diff on the above Makefile:

diff --git a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
index 5d9fa3e..64e39be 100644
--- a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
+++ b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
@@ -15,4 +15,4 @@ $(exe): $(exe).cu
build: $(exe)
run: $(exe)
- LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" mpirun -oversubscribe -n 4 $(exe)
+ LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 $(exe)

Output:
@tylerjereddy We occasionally see a similar segfault at the finalization phase inside ucp_worker_destroy(); the exact reason has not been identified, with the most likely guess being some race condition inside ucx. I don't see any libfabric-related symbols in your trace. I would suggest running with debug builds of libfabric and ucx to help locate where the segfault happens.
@j-xiong I swapped in a debug version of
I'm not sure I see anything more helpful on the backtraces? The added debug log info may help though. So far, it looks like NVSHMEM tries to call From the extra fabric debug info, if I do
Maybe the 200 missing
Edit:
@tylerjereddy Those warnings about missing
Based on the log, the libfabric installation doesn't have the ucx provider at all. The top part of the stack trace indicates that the ucx error occurred in another thread (probably spawned by another part of NVSHMEM or OpenMPI).
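As a quick way to confirm that on any given node, a minimal standalone program along the following lines can list every provider the runtime libfabric actually exposes; if "ucx" never appears in the output, no fi_getinfo() query can ever select it. This is a sketch, not NVSHMEM or libfabric source; the FI_VERSION(1, 5) choice is just illustrative, and it assumes a C++ compiler and linking with -lfabric.

#include <rdma/fabric.h>
#include <cstdio>

int main() {
    struct fi_info *info = nullptr;
    // NULL hints: ask this libfabric build for every provider/fabric it can offer.
    int status = fi_getinfo(FI_VERSION(1, 5), nullptr, nullptr, 0, nullptr, &info);
    if (status != 0) {
        std::printf("fi_getinfo failed: %d: %s\n", status, fi_strerror(-status));
        return 1;
    }
    for (struct fi_info *cur = info; cur != nullptr; cur = cur->next) {
        // prov_name is the provider backing this entry (e.g. "tcp", "verbs", "ucx", "cxi").
        std::printf("provider: %s, fabric: %s\n",
                    cur->fabric_attr->prov_name, cur->fabric_attr->name);
    }
    fi_freeinfo(info);
    return 0;
}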
@j-xiong interesting... If I do
However, So, what's the conclusion here? The I'll take a shot at turning on even more
Turning on a ton more
The output of The fact that
The ucx provider has similar parameter definitions at the very beginning of the provider initialization code, and we expect to see similar output for the ucx provider. Could you try again with
Sure, see the attached file below (it is too large to paste in its entirety). I do see Are there some common causes of
Yes, that's the info I want to see. The ucx provider looks for devices under
See the code here: https://github.com/ofiwg/libfabric/blob/v1.18.x/prov/ucx/src/ucx_init.c#L207
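The gist of that linked check is a filesystem probe at provider-initialization time: if the expected device directory is missing, the provider reports "no data" and is simply excluded from all fi_getinfo() results. A rough sketch of the pattern follows; the path and function names below are placeholders, not the actual libfabric source.

#include <rdma/fi_errno.h>
#include <sys/stat.h>
#include <cstdio>

// Hypothetical stand-in for the sysfs directory the real provider checks.
static const char *kDeviceDir = "/sys/class/example_device_class";

static int hypothetical_device_probe(void) {
    struct stat st;
    if (stat(kDeviceDir, &st) != 0 || !S_ISDIR(st.st_mode)) {
        // No devices visible on this node: report "no data" so the provider is skipped.
        return -FI_ENODATA;
    }
    return 0;
}

int main() {
    std::printf("device probe result: %d\n", hypothetical_device_probe());
    return 0;
}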
The
Not surprising at all. That just confirms that libfabric not picking up the ucx provider is expected behavior. Back to the segfault, what is going on inside
Using good old "printf debugging". Final section of the C++ function source that gets printed before the crash:

1440 printf("** nvshmemt_libfabric_finalize checkpoint 7\n");
1441
1442 if (libfabric_state->addresses) {
1443 for (int i = 0; i < NVSHMEMT_LIBFABRIC_DEFAULT_NUM_EPS; i++) {
1444 printf("** nvshmemt_libfabric_finalize checkpoint 7b\n");
1445 status = fi_close(&libfabric_state->addresses[i]->fid);
1446 printf("** nvshmemt_libfabric_finalize checkpoint 7c\n");
1447 if (status) {
1448 NVSHMEMI_WARN_PRINT("Unable to close fabric address vector: %d: %s\n", status,
1449 fi_strerror(status * -1));
1450 }
1451 printf("** nvshmemt_libfabric_finalize checkpoint 7d\n");
1452 }
1453 }
1454 printf("** nvshmemt_libfabric_finalize checkpoint 8\n");

Based on grepping the output log with
So, crash at the fi_close() call.
So the failure happened when closing the endpoints. Question: since fi_getinfo() previously returned -61, is there another fi_getinfo() call that succeeded later? If so, which provider was actually being used? If not, how did it get to the point of closing an endpoint that was never opened?
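For reference, the -61 being discussed is -FI_ENODATA from fi_getinfo(). Below is a minimal standalone sketch of a hinted, version-gated query that fails exactly this way when no installed provider can satisfy the request; it is not the NVSHMEM transport code, the FI_RMA capability hint and the 1.5 API version are just illustrative choices, and it assumes linking with -lfabric.

#include <rdma/fabric.h>
#include <cstdio>

int main() {
    const int major = 1, minor = 5;  // hypothetical stand-ins for version variables

    struct fi_info *hints = fi_allocinfo();
    if (hints == nullptr) return 1;
    hints->caps = FI_RMA;  // illustrative capability requirement

    struct fi_info *info = nullptr;
    int status = fi_getinfo(FI_VERSION(major, minor), nullptr, nullptr, 0, hints, &info);
    if (status != 0) {
        // status is -61 (-FI_ENODATA) when no provider matches the version/capability hints.
        std::printf("No providers matched fi_getinfo query: %d: %s\n",
                    status, fi_strerror(-status));
    } else {
        std::printf("first matching provider: %s\n", info->fabric_attr->prov_name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return status == 0 ? 0 : 1;
}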
Hi all, the segfault is happening in an error path inside of nvshmemt_init. It seems we are missing a null check for endpoints in that error path (filing a bug for this internally). We should fail gracefully here, but do not. I see two issues before this one that lead to us entering this error path in the first place, though:
These errors will disable the HMEM CUDA support required by NVSHMEM to register device memory. This is a configure-time option in libfabric. If you pass --with-cuda[=dir] to configure, it will fail if it can't find the directory.
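Regarding the missing null check mentioned above, here is a minimal sketch of the kind of guard that lets a cleanup path survive an early init failure. The struct and member names are illustrative, not the actual NVSHMEM source.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <cstdio>

// Illustrative stand-in for the transport state; the real NVSHMEM struct differs.
struct hypothetical_transport_state {
    struct fid_av **addresses;  // stays NULL if init failed before these were created
    int num_eps;
};

static void safe_close_addresses(hypothetical_transport_state *state) {
    if (state == nullptr || state->addresses == nullptr) {
        return;  // nothing was ever opened (e.g. fi_getinfo failed early), so skip cleanup
    }
    for (int i = 0; i < state->num_eps; i++) {
        if (state->addresses[i] == nullptr) {
            continue;  // this slot was never created on the error path
        }
        int status = fi_close(&state->addresses[i]->fid);
        if (status) {
            std::fprintf(stderr, "Unable to close fabric address vector: %d: %s\n",
                         status, fi_strerror(-status));
        }
    }
}

int main() {
    safe_close_addresses(nullptr);  // demonstrates the guard: no crash on the bare error path
    return 0;
}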
CXI is closed source, so for now I switch to our HPC module
@tylerjereddy #9835 the CXI provider is open source, but it cannot be built on any machine we have come across without significant hacks.
Ok, I'm referring to the build system we usually use which indicates: https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/libfabric/package.py#L87
maybe they haven't updated yet, but the effect is the same I guess
On latest
I get:
The
@tylerjereddy the configure issues are fixed in #9793 if you want to try and cherry-pick those commits. Make sure to re-run |
@raffenet Cool, I ended up having to use https://github.com/thomasgillis/libfabric/tree/dev-cxi after following some more cross-linked issues/PRs that complain about LANL-specific problems, and this does seem to get me a new segfault at runtime at least. I think the problem is still related to the CUDA provider, based on the attached out_cxi_and_cuda_libfabric.txt. It looks like libfabric does correctly identify the number of CUDA devices based on the verbose output there, so some progress is being made. Just before the call to
Edit: here's my current configure line for
cc @hppritcha as well perhaps
I checked that addition of the mrail provider; log attached as out_cxi_and_cuda_and_mrail_libfabric.txt. I think I'm still misunderstanding something though, because I can get the same kinds of errors with
I wonder if this is relevant:
@tylerjereddy You don't need to enable the mrail provider. The line
There is no provider called
I get the same backtraces with
@seth-howell I built NVSHMEM with
Looks like a mixture of remote memory access (RMA), GDRCopy, CXI, and CUDA-related complaints, but I'm not sure it is clear what needs to be addressed.
Here's a shorter output log with different debug/verbosity settings and also
It looks like an NVIDIA engineer responded to the
When I build
In particular, I see the cuFFTMp reproducer code hang on two nodes instead of segfaulting. Not sure if this might be diagnostically useful; cc @pakmarkthub.
I guess the Slingshot 11 network was having problems (based on feedback from local HPC). When I re-run today with
I believe using the correct CXI provider build of
We're now seeing other crashes in
I'm seeing a segfault/backtrace for NVSHMEM -> libfabric -> ucx control flow for a 2-node test run of GROMACS on one of our supercomputers with OpenMPI 5.0.2 on Cray Slingshot 11. I think what I'm really looking for is clear runtime error messages that tell me what is wrong (API, ABI, whatever version mismatches, etc.) before I ever get to a segfault. I've labelled this a bug on the sole basis that I shouldn't be able to segfault, but it could be that the error resides with the use of fi_getinfo() "upstream" of the segfault happening (i.e., that NVSHMEM should handle their runtime check differently?).

I've talked to NVIDIA engineers about this, and the problem really isn't clear to them. I did some experiments with runtime swapping of libfabric versions. NVSHMEM was built from source against libfabric 1.20.1 from spack. Here is what happens if I use libfabric 1.18.1 at runtime instead:

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

and then a segfault at ucx again.

The local C++ code they're using looks like this:

where those two version variables are set to 1 and 5, respectively. I know I've had some success in the past using libfabric 1.18.1 if I build NVSHMEM against that directly and use it at runtime. Is there a good reason I wouldn't be able to use 1.20.1, and if so how should the NVSHMEM folks guard against it?