[BUG] KeyDB deadlock #883

Open
swdev128 opened this issue Nov 29, 2024 · 1 comment

@swdev128

Describe the bug

I'm running two instances of KeyDB (replication). Each of them occasionally ends up in a total deadlock: neither the application I'm developing nor the keydb-cli binary can connect to keydb-server.

GDB attached to the keydb-server process shows that all threads are waiting on each other, on futexes in readWriteLock and on the mutex AsyncWorkQueue::m_mutex.

My observations from the gdb investigation (a minimal sketch of this interleaving follows the list):
- bgsaveCommand is trying to acquire the global WRITE lock via aeAcquireForkLock, while g_forkLock::m_readCount stays around 1-3; the pending writer prevents new global READ locks from being granted
- AsyncWorkQueue::WorkerThreadMain (thread 1) is stuck trying to acquire the global READ lock in aeProcessOnline while holding AsyncWorkQueue::m_mutex
- AsyncWorkQueue::WorkerThreadMain (thread 2) holds the global READ lock after calling aeProcessOnline and is stuck trying to lock AsyncWorkQueue::m_mutex (held by thread 1)
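
To make the suspected cycle concrete, here is a minimal stand-alone sketch (not KeyDB code) of the interleaving above. `ForkLock` is a hypothetical stand-in for a writer-preferring g_forkLock, `queueMutex` stands in for AsyncWorkQueue::m_mutex, and the sleeps only exist to force the ordering. When the three threads line up as described, the program hangs:

```cpp
// Hypothetical stand-alone repro of the suspected cycle -- NOT KeyDB code.
// ForkLock models a writer-preferring read/write lock (my assumption about
// readWriteLock); queueMutex models AsyncWorkQueue::m_mutex.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

class ForkLock {
    std::mutex m;
    std::condition_variable cv;
    int readers = 0;
    bool writerWaiting = false;
    bool writerActive = false;
public:
    void lockRead() {
        std::unique_lock<std::mutex> lk(m);
        // Writer preference: a *pending* writer already blocks new readers.
        cv.wait(lk, [&] { return !writerWaiting && !writerActive; });
        ++readers;
    }
    void unlockRead() {
        std::lock_guard<std::mutex> lk(m);
        if (--readers == 0) cv.notify_all();
    }
    void lockWrite() {
        std::unique_lock<std::mutex> lk(m);
        writerWaiting = true;                       // block new readers from here on
        cv.wait(lk, [&] { return readers == 0 && !writerActive; });
        writerWaiting = false;
        writerActive = true;
    }
    void unlockWrite() {
        std::lock_guard<std::mutex> lk(m);
        writerActive = false;
        cv.notify_all();
    }
};

int main() {
    using namespace std::chrono_literals;
    ForkLock forkLock;       // stand-in for g_forkLock
    std::mutex queueMutex;   // stand-in for AsyncWorkQueue::m_mutex

    // "Worker thread 2": holds a READ lock, then needs the queue mutex.
    std::thread worker2([&] {
        forkLock.lockRead();
        std::this_thread::sleep_for(200ms);
        std::puts("worker2: holds READ lock, waiting for queueMutex");
        std::lock_guard<std::mutex> q(queueMutex);  // blocks on worker1
        forkLock.unlockRead();
    });

    // "Worker thread 1": holds the queue mutex, then needs a READ lock.
    std::thread worker1([&] {
        std::this_thread::sleep_for(50ms);
        std::lock_guard<std::mutex> q(queueMutex);
        std::this_thread::sleep_for(200ms);
        std::puts("worker1: holds queueMutex, waiting for READ lock");
        forkLock.lockRead();                        // blocks: a writer is pending
        forkLock.unlockRead();
    });

    // "bgsaveCommand": wants the WRITE lock while worker2 still holds a READ lock.
    std::thread bgsave([&] {
        std::this_thread::sleep_for(100ms);
        std::puts("bgsave: waiting for WRITE lock");
        forkLock.lockWrite();                       // blocks on worker2's READ lock
        forkLock.unlockWrite();
    });

    // Never returns: worker1 -> forkLock READ (blocked by bgsave's pending write)
    // -> bgsave -> worker2's held READ lock -> queueMutex -> worker1.
    worker1.join();
    worker2.join();
    bgsave.join();
}
```

The cycle only closes if a pending writer blocks new readers, which is the behaviour I assume readWriteLock has; if that assumption is wrong, this sketch does not apply.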

To reproduce

Run two KeyDB instances in replication mode. The binary was compiled from the "RELEASE_6_3_4" branch on GitHub.

Expected behavior

No deadlocks while running under low to moderate load.

Additional information

After the deadlock, CPU usage reported by `top` is 0% and the CPU time of the process stops increasing.
While the servers are working, the transaction load is constant (about 500 tps) with about 300 keys in the DB; the DB size is about 2 MB.
KeyDB runs in a Docker image (managed by Kubernetes) with up to 4 GB of RAM and 3 CPUs; `top` shows the resources are more than sufficient.
I tried enabling, disabling, and reconfiguring features to narrow down the root cause and the scenario where the deadlock shows up most frequently, but without much luck.
I tried turning background save off and on, switching to AOF, and tweaking server settings:
repl-ping-replica-period, repl-backlog-size, repl-timeout, server-threads, min-clients-per-thread, active-client-balancing, timeout.
Unfortunately, none of these changes fixed the issue.

Please advise on the possible root cause, a workaround, or the best code-level fix.

@keithchew

From your gdb observations above, the three threads you have shown seem to be behaving correctly. Note that there can be multiple readers, so they are not in a deadlock condition. You need to dig a bit deeper into the other thread IDs in gdb; perhaps you have missed a deadlock condition caused by other threads.
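
For example, assuming you can still attach gdb to the hung process, dumping every thread's backtrace in one go makes it easier to spot which thread owns which lock:

```
(gdb) set pagination off
(gdb) info threads
(gdb) thread apply all bt
```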

On a side note, do you have modules enabled? I found a deadlock condition between g_forkLock and s_moduleGIL here: #766
