Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race detected in libfabric shm provider with multi-threaded client-server setup #10528

Open
piotrchmiel opened this issue Nov 8, 2024 · 2 comments
Assignees

Comments

@piotrchmiel
Copy link
Contributor

piotrchmiel commented Nov 8, 2024

Describe the bug
When using the shm provider in a simple setup with one client thread and one server thread within a single process, a data race is detected when compiled with clang-19 and run with ThreadSanitizer. The client performs one fi_send, and the server performsone fi_recv, with a message size of 1000 bytes. The data race appears during the fi_cq_read operation on the server side and the fi_send operation on the client side.

To Reproduce

  1. Set up a process with two threads: one acting as a client and the other as a server.
  2. Use the shm provider with libfabric version 1.22.0.
  3. Compile the program with clang-19 and enable ThreadSanitizer (-fsanitize=thread).
  4. In the client thread, perform an fi_send operation with a message of 1000 bytes.
  5. In the server thread, perform an fi_recv operation to receive the message.
    Observe that ThreadSanitizer reports a data race during execution

Expected behavior
No data race should be detected when performing simple send and receive operations between a client and server thread within the same process using the shm provider.

Output
WARNING: ThreadSanitizer: data race (pid=543759)

Server thread:
Read of size 8 at 0x7febb28f3000 by thread T4 (mutexes: write M0, write M1, write M2):
#0 strncmp /home/piotrchmiel/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:487:3 (test+0x9d66d)
#1 smr_name_compare /home/piotrchmiel/test/third_party/libfabric/prov/shm/src/smr_util.c:351:9 (libfabric.so.1+0xe5a33)

  • during cq_read

Client thread:
Previous write of size 8 at 0x7febb28f3000 by main thread:
#0 memcpy /home/piotrchmiel/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115:5 (test+0x8e9de)
#1 smr_send_name /home/piotrchmiel/test/third_party/libfabric/prov/shm/src/smr_ep.c:206:2 (libfabric.so.1+0xdcfcd)

  • during fi_send

Environment:

  • OS (if not Linux), provider, endpoint type, etc.
  • Operating System: Ubuntu 22.04
  • Compiler: clang-19 with ThreadSanitizer
  • libfabric Version: 1.22.0
  • Provider: shm
  • Endpoint Type: FI_EP_RDM

Additional context
The data race occurs specifically on memory access in smr_name_compare (during fi_cq_read on the server side) and in smr_send_name (during fi_send on the client side)

@piotrchmiel piotrchmiel added the bug label Nov 8, 2024
@aingerson aingerson self-assigned this Nov 8, 2024
@aingerson
Copy link
Contributor

@piotrchmiel Thanks for reporting! Any chance you have an existing reproducer you could send so I don't have to try to implement it?

@piotrchmiel
Copy link
Contributor Author

@aingerson I prepared reproducer:
https://github.com/piotrchmiel/shm_fabric_reproducer
Steps to reproduce the problem:

  1. ./build.sh
  2. FI_LOG_LEVEL=Debug ./build/reproducer (issue reproduces more often with logs)

I'm using clang19.1.2 (https://github.com/llvm/llvm-project/tree/llvmorg-19.1.2)

Thread sanitizer logs from reproducer:

WARNING: ThreadSanitizer: data race (pid=607864)
  Read of size 8 at 0x7f16ccd75000 by thread T1 (mutexes: write M43, write M30, write M28, write M41):
    #0 strncmp ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:483 (libtsan.so.0+0x67576)
    #1 strncmp ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:479 (libtsan.so.0+0x67576)
    #2 smr_name_compare prov/shm/src/smr_util.c:351 (libfabric.so.1+0x155b6d)
    #3 operator() /home/piotrc/reproducer/main.cpp:176 (reproducer+0xa1e7)
    #4 __invoke_impl<void, main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:61 (reproducer+0xc0d8)
    #5 __invoke<main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:96 (reproducer+0xbfc6)
    #6 _M_invoke<0> /usr/include/c++/11/bits/std_thread.h:259 (reproducer+0xbe88)
    #7 operator() /usr/include/c++/11/bits/std_thread.h:266 (reproducer+0xbdd4)
    #8 _M_run /usr/include/c++/11/bits/std_thread.h:211 (reproducer+0xbd40)
    #9 <null> <null> (libstdc++.so.6+0xdc252)

  Previous write of size 8 at 0x7f16ccd75000 by thread T2 (mutexes: write M14):
    #0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 (libtsan.so.0+0x6243e)
    #1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:819 (libtsan.so.0+0x6243e)
    #2 smr_send_name prov/shm/src/smr_ep.c:206 (libfabric.so.1+0x14d58e)
    #3 operator() /home/piotrc/reproducer/main.cpp:250 (reproducer+0xaceb)
    #4 __invoke_impl<void, main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:61 (reproducer+0xc04f)
    #5 __invoke<main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:96 (reproducer+0xbf27)
    #6 _M_invoke<0> /usr/include/c++/11/bits/std_thread.h:259 (reproducer+0xbe2e)
    #7 operator() /usr/include/c++/11/bits/std_thread.h:266 (reproducer+0xbd8a)
    #8 _M_run /usr/include/c++/11/bits/std_thread.h:211 (reproducer+0xbcee)
    #9 <null> <null> (libstdc++.so.6+0xdc252)

Reproducer shows also another issue that appears more often:

WARNING: ThreadSanitizer: data race (pid=607864)
  Read of size 8 at 0x7f16ccd75018 by thread T1 (mutexes: write M43, write M30, write M28):
    #0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 (libtsan.so.0+0x6243e)
    #1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:819 (libtsan.so.0+0x6243e)
    #2 ofi_memcpy include/ofi_hmem.h:263 (libfabric.so.1+0x2dc82)
    #3 operator() /home/piotrc/reproducer/main.cpp:176 (reproducer+0xa1e7)
    #4 __invoke_impl<void, main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:61 (reproducer+0xc0d8)
    #5 __invoke<main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:96 (reproducer+0xbfc6)
    #6 _M_invoke<0> /usr/include/c++/11/bits/std_thread.h:259 (reproducer+0xbe88)
    #7 operator() /usr/include/c++/11/bits/std_thread.h:266 (reproducer+0xbdd4)
    #8 _M_run /usr/include/c++/11/bits/std_thread.h:211 (reproducer+0xbd40)
    #9 <null> <null> (libstdc++.so.6+0xdc252)

  Previous write of size 8 at 0x7f16ccd75018 by thread T2 (mutexes: write M14, write M13083449636225080):
    #0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 (libtsan.so.0+0x6243e)
    #1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:819 (libtsan.so.0+0x6243e)
    #2 ofi_memcpy include/ofi_hmem.h:263 (libfabric.so.1+0x2dc82)
    #3 operator() /home/piotrc/reproducer/main.cpp:250 (reproducer+0xaceb)
    #4 __invoke_impl<void, main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:61 (reproducer+0xc04f)
    #5 __invoke<main()::<lambda()> > /usr/include/c++/11/bits/invoke.h:96 (reproducer+0xbf27)
    #6 _M_invoke<0> /usr/include/c++/11/bits/std_thread.h:259 (reproducer+0xbe2e)
    #7 operator() /usr/include/c++/11/bits/std_thread.h:266 (reproducer+0xbd8a)
    #8 _M_run /usr/include/c++/11/bits/std_thread.h:211 (reproducer+0xbcee)
    #9 <null> <null> (libstdc++.so.6+0xdc252)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants