Circular lock dependency between js_maybeFetchMore, natsConn_processMsg, and _subscribeMulti #823

Open
joel22b opened this issue Nov 27, 2024 · 0 comments
Labels
defect Suspected defect such as a bug or regression


joel22b commented Nov 27, 2024

Observed behavior

The program hangs because of a deadlock, which appears to be caused by inconsistent lock ordering.

js_maybeFetchMore lock order:

  1. Executes nats_lockSubAndDispatcher(sub); to lock the subscription.
  2. Then calls _sendPullRequest, which calls natsConnection_PublishRequest, which tries to lock the connection's mutex with natsConn_Lock(nc); that mutex is held by _subscribeMulti.

natsConn_processMsg lock order:

  1. Executes natsMutex_Lock(nc->subsMu); to lock the connection's sub mutex, subsMu.
  2. Then tries to lock the subscription with nats_lockRetainSubAndDispatcher(sub); that lock is held by js_maybeFetchMore.

_subscribeMulti lock order:

  1. Calls natsConn_subscribeImpl with lock == true, so it locks the connection's mutex with natsConn_Lock(nc);
  2. Then tries to lock the connection's sub mutex with natsMutex_Lock(nc->subsMu); that mutex is held by natsConn_processMsg (see the sketch after this list).
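A minimal sketch of the cycle, using plain pthread mutexes rather than the actual nats.c types: subLock, connLock, and subsMuLock are stand-ins for the subscription/dispatcher lock, natsConn_Lock(nc), and natsMutex_Lock(nc->subsMu). Running the three threads concurrently can hang exactly as described above, since the acquisition orders form a cycle (sub -> conn -> subsMu -> sub).

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t subLock    = PTHREAD_MUTEX_INITIALIZER; // subscription + dispatcher lock
static pthread_mutex_t connLock   = PTHREAD_MUTEX_INITIALIZER; // connection mutex (natsConn_Lock)
static pthread_mutex_t subsMuLock = PTHREAD_MUTEX_INITIALIZER; // connection's nc->subsMu

static void *maybeFetchMore(void *arg)  // js_maybeFetchMore path
{
    pthread_mutex_lock(&subLock);       // 1. lock the subscription
    pthread_mutex_lock(&connLock);      // 2. PublishRequest needs the connection lock
    pthread_mutex_unlock(&connLock);
    pthread_mutex_unlock(&subLock);
    return arg;
}

static void *processMsg(void *arg)      // natsConn_processMsg path
{
    pthread_mutex_lock(&subsMuLock);    // 1. lock nc->subsMu
    pthread_mutex_lock(&subLock);       // 2. then the subscription lock
    pthread_mutex_unlock(&subLock);
    pthread_mutex_unlock(&subsMuLock);
    return arg;
}

static void *subscribeMulti(void *arg)  // _subscribeMulti path
{
    pthread_mutex_lock(&connLock);      // 1. lock the connection
    pthread_mutex_lock(&subsMuLock);    // 2. then nc->subsMu
    pthread_mutex_unlock(&subsMuLock);
    pthread_mutex_unlock(&connLock);
    return arg;
}

int main(void)
{
    pthread_t t[3];
    pthread_create(&t[0], NULL, maybeFetchMore, NULL);
    pthread_create(&t[1], NULL, processMsg, NULL);
    pthread_create(&t[2], NULL, subscribeMulti, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);       // may never return once the cycle hits
    printf("no deadlock on this run\n");
    return 0;
}
```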

Expected behavior

That it wouldn't deadlock in this scenario.

Server and client version

nats-server: 2.10.20
nats.c: 3.9.1
nats: 0.1.2 (not relevant here)

Host environment

OS: Rocky Linux 8.4
Arch: amd64
GCC: 13.2.0

Linked libraries:
linux-vdso.so.1 (0x00007ffd9effe000)
libcurl.so.4 => /lib64/libcurl.so.4 (0x00007f42cf147000)
libz.so.1 => /lib64/libz.so.1 (0x00007f42cef2f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f42ced2b000)
libprotobuf-c.so.1 => /lib64/libprotobuf-c.so.1 (0x00007f42ceb22000)
libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x00007f42ce637000)
libssl.so.1.1 => /lib64/libssl.so.1.1 (0x00007f42ce3a3000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f42ce183000)
libstdc++.so.6 => /work/shared/devtools/gccbin/gcc_13_2_0/lib64/libstdc++.so.6 (0x00007f42cdd21000)
libm.so.6 => /lib64/libm.so.6 (0x00007f42cd99f000)
libgcc_s.so.1 => /work/shared/devtools/gccbin/gcc_13_2_0/lib64/libgcc_s.so.1 (0x00007f42cd77b000)
libc.so.6 => /lib64/libc.so.6 (0x00007f42cd3b6000)
/lib64/ld-linux-x86-64.so.2 (0x00007f42cf3cc000)
libnghttp2.so.14 => /lib64/libnghttp2.so.14 (0x00007f42cd18f000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007f42ccf3a000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f42ccc4f000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007f42cca38000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f42cc834000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f42cc623000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007f42cc41f000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f42cc207000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f42cbfdc000)
libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f42cbd58000)

Steps to reproduce

This is hard to reproduce; the general steps are:

  1. Run nats-server with JetStream enabled and load thousands of messages into 2 subjects.
  2. In a separate client, create a PullSubscriberAsync on the first subject with a NextHandler that requests batches of 1000 messages. Because the subject already has data, this subscriber immediately starts requesting and receiving messages.
  3. Create a second PullSubscriberAsync on the second subject, also with a NextHandler that requests batches of 1000. This call hangs in _subscribeMulti as described in "Observed behavior": while the client tries to create the second subscriber, the first subscription is busy processing messages and fetching more, producing the deadlock.

My current workaround is to wait about 100 milliseconds before creating the second subscriber, giving the first subscriber time to sort itself out (see the sketch below).
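A sketch of that workaround, assuming the application wraps its subscriber setup in a helper; createPullSubscriberAsync below is a hypothetical placeholder, and only nats_Sleep() is a real nats.c call.

```c
#include <nats/nats.h>

// Hypothetical helper standing in for however the application creates each
// async pull subscriber; the real setup is application-specific.
static void createPullSubscriberAsync(const char *subject)
{
    (void)subject; // placeholder: PullSubscriberAsync setup with a 1000-message NextHandler
}

int main(void)
{
    createPullSubscriberAsync("subject.one"); // starts fetching immediately (subject has data)

    nats_Sleep(100); // workaround: ~100 ms pause so the first subscriber finishes
                     // its initial pull before the second subscription is created

    createPullSubscriberAsync("subject.two"); // no longer races the first subscriber's
                                              // fetch/processMsg path in _subscribeMulti
    return 0;
}
```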

joel22b added the defect (Suspected defect such as a bug or regression) label Nov 27, 2024