Aurora: Segfaults when message arrives via shm memory #7203
Comments
Thanks! @abagusetty
@abagusetty How did you get the backtrace? Do you have the location of the segfault?
@hzhou The backtrace was generated from a core dump on Aurora that segfaulted only at large node counts. I could run the app with a debug version of mpich and get a better backtrace.
@abagusetty Yeah, that will be helpful. I am curious which line segfaults.
Here's a full backtrace from a debug build of mpich:

(gdb) bt
#0 0x0000147172408960 in MPIDIG_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
at ./src/mpid/ch4/src/mpidig_recv.h:377
#1 0x00001471724097d5 in MPIDI_POSIX_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
at ./src/mpid/ch4/shm/src/../posix/posix_recv.h:80
#2 0x000014717240885b in MPIDI_SHM_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
at ./src/mpid/ch4/shm/src/shm_p2p.h:94
#3 0x0000147172407ae6 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
is_cancelled=0x7ffc4845098c) at ./src/mpid/ch4/src/mpidig_request.h:130
#4 0x0000147172407453 in MPIDI_OFI_recv_event (vci=0, wc=0x7ffc48450a80,
rreq=0x147172e84650 <MPIR_Request_direct+2480>, event_id=2)
at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:163
#5 0x000014717240719c in MPIDI_OFI_dispatch_optimized (vci=0, wc=0x7ffc48450a80,
req=0x147172e84650 <MPIR_Request_direct+2480>) at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:205
#6 0x0000147172403a9b in MPIDI_OFI_handle_cq_entries (vci=0, wc=0x7ffc48450a50, num=2)
at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:61
#7 0x0000147172403273 in MPIDI_NM_progress (vci=0, made_progress=0x7ffc48450c08)
at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:105
#8 0x0000147172403047 in MPIDI_OFI_progress_uninlined (vci=0) at src/mpid/ch4/netmod/ofi/ofi_progress.c:13
#9 0x0000147172344321 in MPIDI_NM_mpi_cancel_recv (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
is_blocking=true) at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:460
#10 0x0000147172343bd0 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e83e90 <MPIR_Request_direct+496>,
is_cancelled=0x7ffc484510e8) at ./src/mpid/ch4/src/mpidig_request.h:108
#11 0x0000147172336be2 in match_posted_rreq (rank=1, tag=0, context_id=0, vci=0, is_local=true,
req=0x7ffc48451158) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:225
#12 0x00001471723365f2 in MPIDIG_send_target_msg_cb (am_hdr=0x147157f168f0, data=0x147157f16920,
in_data_sz=16, attr=1, req=0x0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:384
#13 0x00001471721e8219 in MPIDI_POSIX_progress_recv (vci=0, made_progress=0x7ffc48451460)
at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:60
#14 0x00001471721e7eca in MPIDI_POSIX_progress (vci=0, made_progress=0x7ffc48451460)
at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:147
#15 0x00001471721e7a68 in MPIDI_SHM_progress (vci=0, made_progress=0x7ffc48451460)
at ./src/mpid/ch4/shm/src/shm_progress.h:18
#16 0x00001471721e6fbc in MPIDI_progress_test (state=0x7ffc48451568)
at ./src/mpid/ch4/src/ch4_progress.h:142
#17 0x00001471721deafa in MPID_Progress_test (state=0x7ffc48451568)
at ./src/mpid/ch4/src/ch4_progress.h:241
#18 0x00001471721e0525 in MPID_Progress_wait (state=0x7ffc48451568)
at ./src/mpid/ch4/src/ch4_progress.h:296
#19 0x00001471721e0446 in MPIR_Wait_state (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
status=0x7ffc4845175c, state=0x7ffc48451568) at src/mpi/request/request_impl.c:707
#20 0x00001471721e09ae in MPID_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
status=0x7ffc4845175c) at ./src/mpid/ch4/src/ch4_wait.h:100
#21 0x00001471721e0868 in MPIR_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
status=0x7ffc4845175c) at src/mpi/request/request_impl.c:750
#22 0x0000147171bbc58a in internal_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2,
tag=0, comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:117
#23 0x0000147171bbb953 in PMPI_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2, tag=0,
comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:169
#24 0x0000000000401d11 in main () at foo.c:20
I believe what was happening is: when shmem matches, it tries to call the netmod cancel-partner path, but the netmod can't cancel if it has already matched, so it will instead cancel the shmem part.
@raffenet Can you confirm that line 377 is ... ?
@raffenet If you try removing that branch altogether -- so it leaks -- will the test run? EDIT: I guess we need the shmem cancel to work. How about just setting the condition to ...
Yes |
I guess it is somewhat of a recursive situation. In mpich/src/mpid/ch4/netmod/ofi/ofi_recv.h (lines 456 to 467 in 6dc849e) ... anysrc_partner before we call the progress.
I think we have to do it inside ...
Give it a try? :)
I will. Lost my session, but this is my thought:

diff --git a/src/mpid/ch4/src/mpidig_request.h b/src/mpid/ch4/src/mpidig_request.h
index 8c2d374e..8e0f16fb 100644
--- a/src/mpid/ch4/src/mpidig_request.h
+++ b/src/mpid/ch4/src/mpidig_request.h
@@ -105,6 +105,8 @@ MPL_STATIC_INLINE_PREFIX int MPIDI_anysrc_try_cancel_partner(MPIR_Request * rreq
* ref count here to prevent free since here we will check
* the request status */
MPIR_Request_add_ref(anysrc_partner);
+ /* unset the partner request's partner to prevent recursive cancelation */
+    anysrc_partner->dev.anysrc_partner = NULL;
mpi_errno = MPIDI_NM_mpi_cancel_recv(anysrc_partner, true); /* blocking */
MPIR_ERR_CHECK(mpi_errno);
if (!MPIR_STATUS_GET_CANCEL_BIT(anysrc_partner->status)) {
This just causes a deadlock at the first anysrc partner cancel operation 😦. I'll try doing it in the netmod layer before calling it for the night. |
Avoid recursively canceling partner requests for MPI_ANY_SOURCE recv operations. Fixes pmodels#7203.
@raffenet I just ran the app on 4k nodes of Aurora using your PR (built today) and hit a segfault with a slightly different backtrace than the one posted above:
The complaining API from the app side is still the same, pointing to any_source usage.
Not sure if I was prematurely testing the PR |
@abagusetty thanks for trying it out. I will see if I can reproduce and update the PR if I find the issue. |
@abagusetty Give #7223 a try. |
Thanks to @raffenet and @hzhou for figuring it out. Including the reproducer that was created by @raffenet. Needs an Aurora label. Adding info from internal slack:
Also reproducible with upstream commits.
Backtrace from an app running with the commit: 204f8cd
Reproducer:
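The reproducer itself was not captured in this thread. Based on the `PMPI_Recv` arguments visible in the backtrace above (count=4, datatype 1275069445 which is MPI_INT in MPICH, source=-2 i.e. MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD), a hypothetical minimal pattern exercising the failing path might look like the following sketch. It is an assumption, not @raffenet's actual reproducer; the ANY_SOURCE recv is what causes ch4 to post partner requests to both the shm and netmod paths, and it should be run across at least two nodes so both paths are active.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* ANY_SOURCE recv: posted to both shm and netmod; an on-node
         * match on the shm side must cancel the netmod partner, which
         * is the code path that segfaulted. */
        int buf[4];
        for (int i = 1; i < size; i++) {
            MPI_Recv(buf, 4, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {
        int buf[4] = { rank, rank, rank, rank };
        MPI_Send(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```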