Hi, the current implementation picks jobs by partitioning the id space across workers (worker n of m total workers). If for whatever reason one of the workers gets stuck processing a job, all current and future jobs assigned to that worker will stall. Is there any plan (or are there ideas) to make this more fault tolerant, i.e. so that a worker can pick any job that is scheduled to run now, irrespective of the job's id? This would also make adding and removing workers much easier.
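For concreteness, the partitioning scheme being described can be sketched like this (a minimal illustration assuming jobs are assigned by id modulo the worker count; the actual implementation may differ in detail):

```python
def jobs_for_worker(job_ids, worker_index, total_workers):
    """Partition the id space: worker n of m only ever sees jobs whose
    id falls into its slice. If this worker stalls, every job in its
    slice stalls with it, since no other worker will pick them up."""
    return [job_id for job_id in job_ids
            if job_id % total_workers == worker_index]
```

With two workers, for example, one worker handles the even ids and the other the odd ids, which is why a single stuck worker blocks its whole slice.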
This is indeed a weakness of the current implementation. We do have some solutions in mind, but they will likely complicate the infrastructure needed to execute the queues.
There are a few things we do about this that help. Keep jobs small: if there is a lot of processing, use a group to manage the jobs, and progress can be tracked across the group. Jobs that are failing should just throw an exception and let the worker retry them later, i.e. don't have the job itself do any retries.
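That last point can be sketched as follows (hypothetical field names, not this project's actual schema): the job simply raises, and the worker, not the job, records the failure and decides when to run it again.

```python
import time


def run_job_once(job, now=time.time):
    """Execute a job exactly once. On failure the job just raises; the
    worker records the error and reschedules it, so no retry loop lives
    inside the job itself. `job` is a hypothetical dict with a callable
    under 'fn' plus status/scheduling fields."""
    try:
        job["fn"]()
        job["status"] = "done"
    except Exception as exc:
        job["status"] = "pending"        # back in the queue
        job["last_error"] = str(exc)
        job["run_after"] = now() + 60.0  # simple fixed back-off
    return job
```

Keeping the retry decision in the worker means a misbehaving job can't tie up its worker in an internal retry loop.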
If you are using multiple workers, it's really important to think through the implications of the different isolation levels; I'd strongly recommend using SERIALIZABLE isolation. If your system won't run cleanly with that turned on, it may mean you're actually getting data corruption at the lower isolation levels.
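One practical consequence of SERIALIZABLE is that the database will abort some concurrent transactions with a serialization failure, and the caller is expected to simply retry them. A minimal retry wrapper might look like this (with psycopg2 you would catch `psycopg2.errors.SerializationFailure`, SQLSTATE 40001, instead of the stand-in exception used here):

```python
class SerializationFailure(Exception):
    """Stand-in for the driver's serialization-failure error, e.g.
    psycopg2.errors.SerializationFailure (SQLSTATE 40001)."""


def run_serializable(do_txn, max_attempts=3):
    """Run a transaction function, retrying when the database aborts it
    to preserve serializability. The final failure is re-raised."""
    for attempt in range(max_attempts):
        try:
            return do_txn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise
```

If your code can't tolerate this kind of retry, that is usually a sign the transaction boundaries need rethinking rather than a reason to drop to a lower isolation level.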
What we're looking at in the longer term is to have a process that uses the Postgres LISTEN/NOTIFY system to see new jobs and changes to jobs, and then use that process to launch individual jobs, or batches of jobs. We run many microservices, so this would allow us to reduce latency, increase parallel execution of jobs, and do it with fewer workers overall.
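The receiving side of that idea can be sketched with psycopg2 (this is an illustration, not this project's code: it assumes a connection in autocommit mode on which something like `LISTEN job_channel;` has already been executed, and that each NOTIFY payload carries a job id):

```python
import select


def wait_for_jobs(conn, timeout=5.0):
    """Block until NOTIFY messages arrive on the connection, then
    return the job ids carried in their payloads. Returns an empty
    list on timeout, so the caller can loop and do housekeeping."""
    if not select.select([conn], [], [], timeout)[0]:
        return []                # timed out, no new jobs
    conn.poll()                  # ingest pending notifications
    ids = [int(n.payload) for n in conn.notifies]
    conn.notifies.clear()
    return ids
```

A dispatcher built around a loop like this only wakes up when there is actually something to run, which is where the latency and worker-count savings would come from.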
For other projects we've been developing a tool that would allow this: the wright-exec-helper. It multiplexes jobs using a fairly simple mechanism, printing to and reading from stdout/stdin. The downside, from this project's perspective at least, is that it is native code.
The protocol is pretty simple though and it should be possible to implement something that performs the same function (albeit a bit more slowly) in Python.
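As a rough illustration of the general stdin/stdout style (this is a toy JSON-lines scheme invented for the example, not the actual wright-exec-helper protocol), a Python worker loop could look like:

```python
import json
import sys


def serve(handler, stdin=sys.stdin, stdout=sys.stdout):
    """Toy line-oriented multiplexer: read one JSON job per line on
    stdin, write one JSON result per line on stdout. The 'id' field
    lets the parent process match responses to requests."""
    for line in stdin:
        job = json.loads(line)
        result = handler(job["args"])
        stdout.write(json.dumps({"id": job["id"], "result": result}) + "\n")
        stdout.flush()  # the parent is waiting on this pipe
```

The parent process would spawn one such worker per slot and feed jobs down the pipes as they become due, regardless of job id.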