Describe the bug
I'm running two instances of KeyDB (replication). Each of them occasionally enters a total deadlock: neither the application I'm developing nor the keydb-cli binary can connect to keydb-server.
GDB attached to keydb-server shows all threads waiting on each other, on futexes inside readWriteLock and on the mutex AsyncWorkQueue::m_mutex.
My observations from the gdb investigation (a sketch of the suspected cycle follows this list):
-- bgsaveCommand is attempting to acquire the global WRITE lock via aeAcquireForkLock (g_forkLock::m_readCount tends to be around 1-3, which prevents new global READ locks from being granted)
-- AsyncWorkQueue::WorkerThreadMain (1) is stuck trying to acquire the global READ lock via aeProcessOnline while holding AsyncWorkQueue::m_mutex
-- AsyncWorkQueue::WorkerThreadMain (2) holds the global READ lock (taken via aeProcessOnline) and is stuck trying to lock AsyncWorkQueue::m_mutex (held by (1))
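To make the suspected cycle easier to follow, here is a minimal sketch (not KeyDB code) that models it with std::shared_mutex in place of KeyDB's readWriteLock and std::mutex in place of AsyncWorkQueue::m_mutex; whether it actually hangs depends on the rwlock implementation giving priority to the pending writer, which is what the blocked bgsave plus blocked new readers suggest:

```cpp
// Sketch only: three threads mirroring the states observed in gdb.
// If the rwlock prefers the pending writer, they end up waiting on each
// other and the program never exits.
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>

std::shared_mutex g_forkLock;   // stand-in for the global read/write lock
std::mutex        g_queueMutex; // stand-in for AsyncWorkQueue::m_mutex

int main() {
    // Worker (2): already holds the global READ lock, then wants m_mutex.
    std::thread worker2([] {
        std::shared_lock<std::shared_mutex> read(g_forkLock);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::lock_guard<std::mutex> q(g_queueMutex); // blocks: held by worker (1)
    });

    // bgsave: wants the global WRITE lock; with a writer-preferring rwlock
    // this also stalls any *new* readers until the write is granted.
    std::thread bgsave([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        std::unique_lock<std::shared_mutex> write(g_forkLock); // blocks: worker (2) holds a read lock
    });

    // Worker (1): holds m_mutex and then asks for a global READ lock.
    std::thread worker1([] {
        std::lock_guard<std::mutex> q(g_queueMutex);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::shared_lock<std::shared_mutex> read(g_forkLock); // may block behind the pending writer
    });

    worker1.join();
    worker2.join();
    bgsave.join();
}
```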
To reproduce
Run two KeyDB instances in replication mode. The binary was compiled from the "RELEASE_6_3_4" branch on GitHub.
Expected behavior
No deadlocks while running under low to moderate load.
Additional information
After the deadlock, CPU usage reported by 'top' is 0% and the CPU time of the process does not change.
While the server is working, the transaction load is constant (about 500 TPS) with about 300 keys in the DB; the DB size is about 2 MB.
KeyDB runs inside a Docker image (managed by Kubernetes) with up to 4 GB of RAM and 3 CPUs; 'top' shows the resources are more than sufficient.
I tried enabling, disabling, and reconfiguring features to narrow down the root cause and the scenario in which the deadlock shows up most frequently, but without much luck.
I tried turning background save off and on, switching to AOF, and tweaking server settings:
repl-ping-replica-period, repl-backlog-size, repl-timeout, server-threads, min-clients-per-thread, active-client-balancing, timeout.
Unfortunately, none of these changes fixed the issue.
Please advise on a possible root cause, a workaround, or the best code-level fix.
From your gdb observations above, the three threads you have shown seem to be behaving correctly. Note that there can be multiple readers, so they are not in a deadlock condition. You need to dig a bit deeper into the other thread ids in gdb; perhaps you have missed a deadlock condition caused by other threads.
On a side note, do you have modules enabled? I found a deadlock condition between g_forkLock and s_moduleGIL here: #766
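To make sure no waiter is missed, one way (assuming gdb can attach to the keydb-server process inside the container) is to dump a backtrace of every thread in a single batch run and look for the full wait cycle:

```sh
# <pid> is the keydb-server process id
gdb -p <pid> -batch -ex "set pagination off" -ex "info threads" -ex "thread apply all bt"
```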