Describe the bug
I'm running two instances of KeyDB (replication). Each of them occasionally enters a total deadlock: neither the application I'm developing nor the keydb-cli binary can connect to keydb-server.
GDB attached to keydb-server shows all threads waiting on each other, on futexes inside readWriteLock and on the mutex AsyncWorkQueue::m_mutex.
My observations from the gdb investigation (a sketch of the suspected cycle follows this list):
-- bgsaveCommand is attempting to acquire the global WRITE lock via aeAcquireForkLock (g_forkLock::m_readCount tends to be around 1-3, which prevents new global READ locks from being granted)
-- AsyncWorkQueue::WorkerThreadMain (1) is stuck trying to acquire the global READ lock via aeProcessOnline while holding AsyncWorkQueue::m_mutex
-- AsyncWorkQueue::WorkerThreadMain (2) holds the global READ lock (taken via aeProcessOnline) and is stuck trying to lock AsyncWorkQueue::m_mutex (held by (1))
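To make the suspected cycle easier to follow, here is a minimal sketch (not KeyDB code) that models it with std::shared_mutex in place of KeyDB's readWriteLock and std::mutex in place of AsyncWorkQueue::m_mutex; whether it actually hangs depends on the rwlock implementation giving priority to the pending writer, which is what the blocked bgsave plus blocked new readers suggest:

```cpp
// Sketch only: three threads mirroring the states observed in gdb.
// If the rwlock prefers the pending writer, they end up waiting on each
// other and the program never exits.
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>

std::shared_mutex g_forkLock;   // stand-in for the global read/write lock
std::mutex        g_queueMutex; // stand-in for AsyncWorkQueue::m_mutex

int main() {
    // Worker (2): already holds the global READ lock, then wants m_mutex.
    std::thread worker2([] {
        std::shared_lock<std::shared_mutex> read(g_forkLock);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::lock_guard<std::mutex> q(g_queueMutex); // blocks: held by worker (1)
    });

    // bgsave: wants the global WRITE lock; with a writer-preferring rwlock
    // this also stalls any *new* readers until the write is granted.
    std::thread bgsave([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        std::unique_lock<std::shared_mutex> write(g_forkLock); // blocks: worker (2) holds a read lock
    });

    // Worker (1): holds m_mutex and then asks for a global READ lock.
    std::thread worker1([] {
        std::lock_guard<std::mutex> q(g_queueMutex);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::shared_lock<std::shared_mutex> read(g_forkLock); // may block behind the pending writer
    });

    worker1.join();
    worker2.join();
    bgsave.join();
}
```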
To reproduce
Run two KeyDB instances in replication mode. The binary was compiled from the "RELEASE_6_3_4" branch on GitHub.
Expected behavior
No deadlocks while running under low to moderate load.
Additional information
After the deadlock, CPU usage reported by 'top' is 0% and the CPU time of the process does not change.
While the server is working, the transaction load is constant (about 500 TPS) with about 300 keys in the DB; the DB size is about 2 MB.
KeyDB runs inside a Docker image (managed by Kubernetes) with up to 4 GB of RAM and 3 CPUs; 'top' shows the resources are more than sufficient.
I tried enabling, disabling, and reconfiguring features to narrow down the root cause and the scenario in which the deadlock shows up most frequently, but without much luck.
I tried turning background save off and on, switching to AOF, and tweaking server settings:
repl-ping-replica-period, repl-backlog-size, repl-timeout, server-threads, min-clients-per-thread, active-client-balancing, timeout.
Unfortunately, none of these changes fixed the issue.
Please advise on a possible root cause, a workaround, or the best code-level fix.
From your gdb observations above, the three threads you have shown seem to be behaving correctly. Note that there can be multiple readers, so they are not in a deadlock condition. You need to dig a bit deeper into the other thread ids in gdb; perhaps you have missed a deadlock condition caused by other threads.
On a side note, do you have modules enabled? I found a deadlock condition between g_forkLock and s_moduleGIL here: #766
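To make sure no waiter is missed, one way (assuming gdb can attach to the keydb-server process inside the container) is to dump a backtrace of every thread in a single batch run and look for the full wait cycle:

```sh
# <pid> is the keydb-server process id
gdb -p <pid> -batch -ex "set pagination off" -ex "info threads" -ex "thread apply all bt"
```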