Bug report for UD connection when creq length > max_inline_size #10423

Open
LeDong98 opened this issue Jan 16, 2025 · 4 comments

LeDong98 commented Jan 16, 2025

Describe the bug

For the UD transport layer, if the creq in the connection packet is longer than max_inline_size, the connection is established in non-inline mode, and a bug appears in that mode. An example is described in:
#6040 Problem treating incoming creq as ack

Steps to Reproduce

  • Cmd: mpirun --allow-run-as-root --mca coll ^ucg --mca pml ucx -mca btl ^vader,tcp,openib,uct,ofi,usnic -np 800 -N 400 --hostfile ./hostfile -x PATH -x LD_LIBRARY_PATH -x UCX_LOG_LEVEL=0 -x VERBS_LOG_LEVEL=0 -x UCX_TLS=ud -x UCX_RNDV_THRESH=8k ./osu_ialltoallw

Setup and versions

  • OS version (Linux openEuler 22.03) + CPU architecture (aarch64)

Additional information (depending on the issue)

  • OpenMPI version: 4.1.4

UCX error log:

ud_ep.c:901 Assertion ep->dest_ep_id == ctl->conn_rep.src_ep_id' failed: ep=0x10c35670 [id=668 dest_ep_id=747 flags=0x0] crep [neth->dest=348 dst_ep_id=668 src_ep_id=690]
ud_ep.c:901 Assertion ep->dest_ep_id == ctl->conn_rep.src_ep_id' failed: ep=0x7f81a0 [id=449 dest_ep_id=515 flags=0x0] crep [neth->dest=348 dst_ep_id=449 src_ep_id=486]
ud_ep.c:901 Assertion ep->dest_ep_id == ctl->conn_rep.src_ep_id' failed: ep=0x3f4a32b0 [id=571 dest_ep_id=434 flags=0x0] crep [neth->dest=476 dst_ep_id=571 src_ep_id=645]
ud_ep.c:901 Assertion ep->dest_ep_id == ctl->conn_rep.src_ep_id' failed: ep=0x7a7bec0 [id=623 dest_ep_id=429 flags=0x0] crep [neth->dest=860 dst_ep_id=623 src_ep_id=538]

uct_iface.c:92 UCX WARN got active message id 0, but no handler installed
uct_iface.c:93 UCX WARN payload 57 of 57 bytes:
uct_iface.c:93 UCX WARN 5896dd29:00400000:630a0000:00000000
uct_iface.c:93 UCX WARN c0e4cc29:00400000:00000000:00000000
uct_iface.c:93 UCX WARN 607f9327:00000000:48000000:00000000
uct_iface.c:93 UCX WARN 000000ff:ffc04302:09
==== backtrace (tid: 608516) ====
0 0x000000000005c2e4 uct_ud_ep_process_rx()
1 0x000000000005e5f0 uct_ud_verbs_iface_progress()
2 0x0000000000044e10 ucp_worker_progress()
3 0x0000000000035cb8 opal_progress()
4 0x000000000004b5b4 ompi_request_default_wait()
5 0x000000000008952c MPI_Wait()
6 0x0000000000401ac4 main()
7 0x000000000002afc0 __libc_init_first()
8 0x000000000002b098 __libc_start_main()
9 0x00000000004025f0 _start()
=================================
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
[ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x400036ebf910]
[ 1] /usr/lib64/libc.so.6(+0x83dc0)[0x400037359dc0]
[ 2] /usr/lib64/libc.so.6(raise+0x1c)[0x400037312f7c]
[ 3] /usr/lib64/libc.so.6(abort+0xe4)[0x400037300d30]
[ 4] /ucx/lib/libucs.so.0(ucs_fatal_error_format+0x0)[0x400039ad25d4]
[ 5] /ucx/lib/libucs.so.0(+0x7e688)[0x400039ad2688]
[ 6] /ucx/lib/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0xdc4)[0x400039cf22e4]
[ 7] /ucx/lib/ucx/libuct_ib.so.0(+0x5e5f0)[0x400039cf45f0]
[ 8] /ucx/lib/libucp.so.0(ucp_worker_progress+0x30)[0x400039124e10]
[ 9] /mpi/lib/libopen-pal.so.40(opal_progress+0x38)[0x4000375b0cb8]
[10] /mpi/lib/libmpi.so.40(ompi_request_default_wait+0x104)[0x400036f275b4]
[11] /mpi/lib/libmpi.so.40(PMPI_Wait+0x5c)[0x400036f6552c]
[12] ./osu_ialltoallw[0x401ac4]
[13] /usr/lib64/libc.so.6(+0x2afc0)[0x400037300fc0]
[14] /usr/lib64/libc.so.6(__libc_start_main+0x94)[0x400037301098]
[15] ./osu_ialltoallw[0x4025f0]
*** End of error message ***
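
Each assertion line in the log above reports an endpoint whose stored dest_ep_id differs from the src_ep_id carried by the incoming crep, while dst_ep_id equals the endpoint's own id. The following is a small Python sketch, not part of UCX, that extracts those fields from one such line. The helper name `parse_assertion` and the regular expression are assumptions for illustration, keyed to the message format quoted in this report.

```python
import re

# Pattern keyed to the assertion lines quoted in this report (assumption:
# other UCX versions may format the message differently).
_ASSERT_RE = re.compile(
    r"\[id=(?P<ep_id>\d+) dest_ep_id=(?P<dest_ep_id>\d+) "
    r"flags=(?P<flags>0x[0-9a-fA-F]+)\]"
    r" crep \[neth->dest=(?P<neth_dest>\d+) dst_ep_id=(?P<dst_ep_id>\d+)"
    r" src_ep_id=(?P<src_ep_id>\d+)\]"
)


def parse_assertion(line):
    """Return the endpoint and crep ids from one assertion line, or None."""
    m = _ASSERT_RE.search(line)
    if m is None:
        return None
    fields = {k: (v if k == "flags" else int(v))
              for k, v in m.groupdict().items()}
    # The failed check is ep->dest_ep_id == ctl->conn_rep.src_ep_id.
    fields["ids_match"] = fields["dest_ep_id"] == fields["src_ep_id"]
    return fields


line = ("ud_ep.c:901 Assertion ep->dest_ep_id == ctl->conn_rep.src_ep_id' "
        "failed: ep=0x10c35670 [id=668 dest_ep_id=747 flags=0x0] "
        "crep [neth->dest=348 dst_ep_id=668 src_ep_id=690]")
info = parse_assertion(line)
print(info["ep_id"], info["dest_ep_id"], info["src_ep_id"], info["ids_match"])
```

Running the same helper over all four assertion lines makes it easy to confirm that every failure has the same shape: the crep is addressed to the right local endpoint but names a different remote endpoint than the one stored.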
@LeDong98 LeDong98 added the Bug label Jan 16, 2025
yosefe (Contributor) commented Jan 16, 2025

@LeDong98 maybe it's an issue with retransmissions of CREQ packet?
What is the supported inline size and what is the size of the CREQ packet in this case?

LeDong98 (Author) commented Jan 16, 2025

@yosefe
Thanks for your reply! In this case, the CREQ length is 66 B and the supported inline size is 64 B in the default IB config table.
Have you seen the example in #6040? That analysis is consistent with the situation I've encountered.
I think the problem may be in the following logic (maybe a bug?). In the 'uct_ud_ep_rx_creq' function, if the creq uses non-inline mode, an SKB buffer is used. If that SKB buffer is released too early, its address is accessed incorrectly during subsequent ep connections.
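
To make the size comparison concrete, here is a minimal Python sketch of the decision described above. It is not UCX code: `SendMode` and `choose_send_mode` are hypothetical names, and the rule "inline only if the packet fits" is an assumption based on the description in this thread. The numbers 66 and 64 come from the report.

```python
from enum import Enum


class SendMode(Enum):
    INLINE = "inline"  # payload is copied into the work request itself
    SKB = "skb"        # payload stays in a send buffer the HCA reads later


def choose_send_mode(packet_len, max_inline_size):
    """Hypothetical rule: send inline only if the packet fits."""
    if packet_len <= max_inline_size:
        return SendMode.INLINE
    return SendMode.SKB


CREQ_LEN = 66  # bytes, as reported in this thread

# Default inline size reported for the IB config: 64 bytes.
print(choose_send_mode(CREQ_LEN, 64))

# Assumption: a larger inline size (for example 128) covers the CREQ.
print(choose_send_mode(CREQ_LEN, 128))
```

With the reported values, the CREQ is 2 bytes too long for the inline path, so the send depends on the buffer staying valid until the hardware has read it.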

LeDong98 (Author) commented

I can add "-x UCX_UD_VERBS_TX_MIN_INLINE=128" to the command to circumvent this problem. However, I think this is a workaround rather than a solution. Could the way a creq longer than max_inline_size is handled be considered a bug?

LeDong98 (Author) commented

@yosefe I drew a flowchart. If you have time, could you take a look?
1. According to UCX's logic, duplicate ep connect requests are discarded, and conn_req or conn_rsp carries enough connection information;

Image

2. UCX releases the skb early, which ends the ep connect process on this side.

Image

3. By the time the resource is actually released, the RDMA doorbell may already have been rung while the RDMA engine has not yet read the skb, resulting in the error message.

Image
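
The hazard described in the three steps above can be modeled with a toy Python sketch. This illustrates the hypothesis in this thread and is not a confirmed diagnosis or UCX code: `BufferPool` and `ToyNic` are hypothetical stand-ins for skb handling and the RDMA engine, and the payloads are made up.

```python
class BufferPool:
    """Toy pool of reusable send buffers (stands in for skb handling)."""

    def __init__(self, count, size):
        self._bufs = [bytearray(size) for _ in range(count)]
        self._free = list(range(count))

    def get(self, payload):
        idx = self._free.pop()
        self._bufs[idx][:len(payload)] = payload
        return idx

    def put(self, idx):
        self._free.append(idx)

    def read(self, idx, length):
        return bytes(self._bufs[idx][:length])


class ToyNic:
    """Reads posted buffers only in process(), i.e. some time after posting."""

    def __init__(self, pool):
        self._pool = pool
        self._queue = []
        self.wire = []

    def post_send(self, idx, length):
        # Doorbell rung: the send is queued, but the data is not read yet.
        self._queue.append((idx, length))

    def process(self):
        for idx, length in self._queue:
            self.wire.append(self._pool.read(idx, length))
        self._queue.clear()


pool = BufferPool(count=1, size=8)
nic = ToyNic(pool)

creq = pool.get(b"CREQ")
nic.post_send(creq, 4)
pool.put(creq)             # released before the NIC has read it
other = pool.get(b"DATA")  # the same buffer is handed out again
nic.post_send(other, 4)
nic.process()
print(nic.wire)
```

In this model the first send goes out with the second packet's contents, because the buffer was recycled between the doorbell and the read. Holding the buffer until a send completion is reported would avoid the reuse in the toy model; whether that matches the right fix in UCX is a question for the maintainers.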
