Deadlocks when fetching messages from queue #160
@tadasauciunas A deadlock exception can always occur in a relational database system, I'm afraid. Having said that, we could think about solving this in the library: this query polls for a new queue entry, and if it fails we could simply retry. However, I should say we never have this issue in our production stack. @tadasauciunas What indexes did you place on this table?
The problem is far worse than this. We've expanded our background queue system and suddenly a lot more of these DeadlockExceptions are cropping up. By the way, we're not using SlmQueue anymore but a heavily modified variant. A worker popping a job can handle this exception fairly easily: just reset the entity manager (-> new entity manager reference! You have to update all injected dependencies) or restart the worker.

Example scenario A: Worker A with Job A is in execution (STATUS_RUNNING) and is now done. Worker B tries to get a new job and locks the table. Worker A now wants to write back the changes of Job A (now STATUS_BURIED?) and gets the DeadlockException. Job A is now stuck in STATUS_RUNNING forever.

Scenario B: The web frontend wants to add a new job, but Worker C tries to get a new job at the same time. C has the table locked and the frontend can't insert the new job! This job is now missing from the queue forever, and the entity manager of the web frontend is now closed -> nothing else can be done in this request.

The root problem is that MariaDB / MySQL locks ALL rows that are scanned for a "SELECT ... FOR UPDATE" statement, not just the single row we care about. See https://stackoverflow.com/questions/6066205/when-using-mysqls-for-update-locking-what-is-exactly-locked for more info. I don't know if PostgreSQL has the same problem...

Proposed change (in pseudo code):
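The idea can be sketched roughly like this, a minimal Python/SQLite stand-in for the real PHP/Doctrine code (the `jobs` schema, `token` column, and status strings are illustrative, not SlmQueue's actual table; on MySQL the self-referencing subquery would need to be wrapped in a derived table, or replaced by `UPDATE ... ORDER BY ... LIMIT 1`):

```python
import sqlite3
import uuid

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, "
    "status TEXT DEFAULT 'pending', token TEXT)"
)
conn.executemany("INSERT INTO jobs (payload) VALUES (?)", [("job-a",), ("job-b",)])

def pop_job(conn):
    # Step 1: a single UPDATE claims one pending row. No explicit
    # transaction or FOR UPDATE lock is held across statements, so
    # other workers and the web frontend are never blocked by a
    # table-wide lock. The UPDATE itself is still atomic, so two
    # workers can never claim the same row.
    token = str(uuid.uuid4())
    cur = conn.execute(
        "UPDATE jobs SET status = 'running', token = ? "
        "WHERE id = (SELECT id FROM jobs WHERE status = 'pending' "
        "ORDER BY id LIMIT 1)",
        (token,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return None  # queue is empty
    # Step 2: fetch the row we just claimed via our unique token.
    return conn.execute(
        "SELECT id, payload FROM jobs WHERE token = ?", (token,)
    ).fetchone()

first = pop_job(conn)
second = pop_job(conn)
third = pop_job(conn)
```

Each worker generates its own token, so even simultaneous UPDATEs hand each job to exactly one worker; a third pop on an empty queue simply returns nothing.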
The critical path here is the first UPDATE. Since no locking and no transaction is used, this UPDATE shouldn't block any other queries. AFAIK it is still atomic, so we only get one worker for a single job even if the UPDATE queries are fired simultaneously.
@MatthiasKuehneEllerhold At first I told you we did not suffer from this problem, but the cases you now mention we do encounter. So I too would gladly see this fixed. Good find. I did not know it would lock the whole table. Sounds silly, but that indeed seems to be the case from what I read. The solution you propose sounds legit, although you would think / hope there would be some other, more efficient solution using SQL .. PostgreSQL: Documentation: 9.0: SELECT
I also do not use PostgreSQL, but this seems to indicate Postgres does lock correctly. But I'm in favor of a PR that simply replaces the mechanism we now have with the one you propose. Would you like to take a stab at this @MatthiasKuehneEllerhold?
@MatthiasKuehneEllerhold Care to tell us what you added to the system 🥳? Maybe some of that juice could flow back to the main repo? I would be willing to help in that effort!
I do not fully understand the cause of this issue in depth. As a non-expert I would expect a DB to handle this for me. But this is what I do know: we are currently using Postgres with a modest amount of runners and jobs and have never seen these exceptions. We switched from

As for a solution: what I have been reading seems to suggest these errors are to be expected (again, as a non-DB-expert that surprises me) and the remedy would be to simply catch and retry them a few times... @roelvanduijnhoven also suggested something like that above.
@basz Yes, you can never 100% prevent deadlocks. But MySQL locking a whole table is kinda troublesome and makes this far more likely to occur! We have been seeing these errors using MySQL as well, more than we used to. So I guess it indeed depends on the size of the queueing system.
Catching and retrying works in the case where the query that atomically fetches a new row fails, but not in the other cases that @MatthiasKuehneEllerhold described. Those would all need to be handled in user-land code, and that sucks. The mechanism @MatthiasKuehneEllerhold describes will resolve the full table lock, but will decrease performance. The question is: do we want to keep a separate mechanism for Postgres? Maybe we should try to keep it?
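For the simple failing-pop case, the catch-and-retry remedy could be sketched like this (Python stand-ins for the PHP/Doctrine types: `DeadlockError`, the attempt count, and the backoff are all assumptions; in the real stack you would catch Doctrine's DeadlockException and also reset the entity manager as described above):

```python
import time

class DeadlockError(Exception):
    """Illustrative stand-in for the driver's deadlock exception."""

def with_deadlock_retry(operation, attempts=3, backoff=0.01):
    # Retry the queue poll a few times when the database reports a
    # deadlock; re-raise once the attempts are exhausted.
    for attempt in range(attempts):
        try:
            return operation()
        except DeadlockError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))

# Demo: an operation that deadlocks twice, then succeeds.
calls = {"n": 0}
def flaky_pop():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError()
    return "job"

result = with_deadlock_retry(flaky_pop)
```

As noted above, this only papers over the pop query itself; the frontend-insert and write-back scenarios still need the lock-free pop.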
If Postgres handles this correctly, the adapter could switch between both implementations: use the old "FOR UPDATE" statement for Postgres and another implementation for a more generic MySQL flavor like MariaDB. I can provide a PR if you want, but I'm unable to test it because, as I said, we don't use SlmQueue anymore. Our system added a lot of features that (we thought) were missing from SlmQueue. We needed locks (e.g. two different jobs of different queues may not run together if they're accessing the same entity), more granular canExecute() checks and doExecute() returns, direct FKs to business-logic entities for easier access, a messaging system to log more information per job (e.g. "did x, failed at y, restarted at 09:00, succeeded with y"), per-status timeouts (e.g. reschedule a job if it's in "executing" for more than 10 minutes), ...
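The per-platform switch could be as simple as this dispatch sketch (all names are hypothetical; SlmQueue's Doctrine adapter would branch on the actual platform object rather than a string):

```python
def choose_pop_strategy(platform: str) -> str:
    # Per the discussion above: Postgres's SELECT ... FOR UPDATE locks
    # only the row it returns, so the existing pop can stay there,
    # while MySQL/MariaDB lock every scanned row and therefore get the
    # UPDATE-then-SELECT pop instead.
    if platform == "postgresql":
        return "select_for_update"
    return "update_then_select"
```

Keeping the dispatch in one place means a future platform (or a MySQL version with better locking) only needs one extra branch.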
Added a WIP pull request.
Once per week, a cronjob puts approximately 2000 messages into the queue during one transaction in our system. We have 4 workers processing messages from this Doctrine queue. Unfortunately, several workers usually end up in deadlocks when fetching these messages from the queue. I tried looking for solutions but did not manage to come up with anything, and also did not find anything similar in the existing issues, so I'll leave the exception information below.