Really fast jobs run one at a time #871
Comments
Btw, I'm not sure it has to do with the sheer volume of jobs; I had a colleague queue up ~30k very fast jobs and the same thing happened.
We have our query limit set to 1000, btw.
Hmmm. Can you explain a bit more about the jobs? Are they doing I/O (database, network, file operations) or are they strictly Ruby (rendering templates, doing math)? I don't think ~1 second is particularly short for a job, so I'd still expect to see some parallelism (assuming there is I/O). Also, how were you queuing the jobs?

FYI, GoodJob does not have a mechanism for additive increase of threads (e.g. when 1 thread finishes a job, that only leads to 1 thread working the next job, not 2 threads working 2 jobs, etc.). With LISTEN/NOTIFY and the job cache, the threads should be woken up directly. If LISTEN/NOTIFY or the cache is disabled, then the poll interval is the mechanism for additive increase of threads (i.e. every N seconds one additional thread is woken to execute jobs).
Sure, let me add more context. The job in question is just a couple of DB reads and writes, very simple, < 1 second each. I don't know exactly how fast, but probably less than half a second, basically instant. I used batch.enqueue, but the same thing happens with any method of enqueueing.

I'm a bit confused by your response. The way I would have intuitively thought this works is that all the available threads are fed a job as soon as they're free, so regardless of what any one job is doing, I should have roughly 10 * 25 = 250 jobs running at the same time. Are you saying that's not how it works? That a new thread is only populated after the first thread gets overwhelmed, then the second, then the third, etc.? I don't think there's a problem with LISTEN/NOTIFY or the job cache, since I've run this with longer-running jobs and had 250 threads running at the same time.
Let me focus just on the explanation, though it sounds like it's not relevant if you're using LISTEN/NOTIFY and the cache.

If LISTEN/NOTIFY or the cache is disabled, then there is only the Poller. Every N seconds, the Poller wakes 1 thread from the pool of available threads. That 1 woken thread queries for 1 job. If the query returns a job, the job is executed, and the thread stays awake and queries for another job. If no job is returned from the query, the thread is returned to the pool and put to sleep. I think the misconception may be that there is a query for X jobs, and then those X jobs are fed into X threads. It doesn't do that.

When LISTEN/NOTIFY is enabled, GoodJob will wake up a thread for every NOTIFY until all threads are awake and executing jobs. Those threads are then returned to the pool (asleep) when they query for a job and there isn't one.
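For anyone following along, here is a minimal, self-contained sketch of the behavior described in the last two comments. This is not GoodJob's actual code; the names and timings are illustrative. It only models "the poller wakes one idle thread per interval, and a woken thread keeps pulling single jobs until a fetch comes back empty, then sleeps again."

```ruby
# Illustrative simulation, NOT GoodJob's implementation.
jobs = Queue.new
100.times { |i| jobs << "job-#{i}" }

wake_signals = Queue.new   # each item means "wake one worker"
max_threads  = 5           # hypothetical pool size

workers = Array.new(max_threads) do
  Thread.new do
    loop do
      wake_signals.pop     # sleep until the poller (or a NOTIFY) wakes us
      loop do
        job = begin
          jobs.pop(true)   # fetch ONE job, non-blocking
        rescue ThreadError
          nil               # no job returned: go back to sleep
        end
        break unless job
        sleep 0.01          # pretend to perform the job
      end
    end
  end
end

# The "poller": without LISTEN/NOTIFY, this is the only source of wake-ups,
# so concurrency ramps up by at most one thread per poll interval.
poll_interval = 0.5         # hypothetical
5.times do
  wake_signals << :wake
  sleep poll_interval
end

# With LISTEN/NOTIFY enabled, each enqueue effectively pushes a wake signal
# immediately, so all workers can be busy as soon as enough jobs arrive.
workers.each(&:kill)
```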
Gotcha. So yeah, it should work like I'm expecting if LISTEN/NOTIFY is working properly; I'm going to have to devise a test to see if that's the case.
To think some more about your case, I think it could be some combination of:
Sure, the thread count could be changed and that's something we can tweak to get perfect, but I'd expect filling them to be fast regardless. Let's focus on the fetch-and-lock query, because we are seeing DB performance spikes: is there any solution to that problem, or is there just a built-in fixed cost here that we can't overcome? Like I said, we observed similar behavior with just 30k really fast jobs enqueued.
You might be hitting the limits of using Postgres as a jobs backend, as well as the limits of GoodJob's current Advisory Lock strategy. A couple of related links:
I'm still holding out hope that we've just configured something horribly wrong. To get concrete with these fast-running jobs: say 100k are queued up (btw, queuing up this many in bulk also takes ~3 minutes), I'm getting roughly 25 jobs/s finishing when I just poll GoodJob::Execution.finished.count. Does that sound off to you?
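For reference, a rough way to sample that rate from a Rails console, using the same GoodJob::Execution.finished scope mentioned above (the 60-second window is arbitrary):

```ruby
# Sample completed-job throughput over one minute.
before = GoodJob::Execution.finished.count
sleep 60
after = GoodJob::Execution.finished.count
puts "~#{((after - before) / 60.0).round(1)} jobs/second"
```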
Even more details: 10 processes, 25 max threads each, and I think 10 Puma threads per process configured in Rails. Also, the database pool size is set to 35. And finally, maybe it doesn't matter, but I call …
I think it's too many threads, but you should also increase your DB pool size. You'll need ~38 threads for all of GoodJob (there are ~2-4 additional threads that can be created that touch the database), but I'd set 50 just to be safe that you're not doing anything else in the background (e.g. …). Otherwise, that all seems fairly normal to me.

For batch enqueue, I think you should probably batch the inserts 1k or 10k at a time (I'd love to know if you find a more efficient batch size).
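A minimal sketch of chunking the enqueue, assuming GoodJob::Batch#add can be called repeatedly before #enqueue; the job classes, record_ids, and the 1,000 slice size are placeholders:

```ruby
# Enqueue a large batch in chunks instead of one giant insert.
batch = GoodJob::Batch.new

record_ids.each_slice(1_000) do |ids|
  batch.add do
    ids.each { |id| MyFastJob.perform_later(id) }
  end
end

batch.enqueue(on_finish: BatchDoneJob)
```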
I just got a huge throughput boost by enqueuing 1,000 at a time for the 100k-size batch: something like 4 to 6x, which doesn't make a whole lot of sense to me even knowing the fetch is slower with more upfront queued jobs. This is a really big deal, considering massive dumps of queued jobs happen infrequently. It also implies that just adding some middleware to queue up jobs when jobs.count < threshold would basically give a free, massive perf boost without worrying about how someone is queuing up their jobs.

But it also feels like this means something is fundamentally wrong: if it's true that you cache 10k jobs at any given time, wouldn't the cost of fetch & lock not matter as much until the 10k is depleted? I need to understand that part of the codebase better; slowly getting through it all.
I wanted to respond quickly on 1 item:
The "cache" is not caching jobs or dequeueing multiple jobs or "read ahead". It's caching only the I wonder if what you're seeing is either:
Hey @v2kovac - No idea if it's in any way related to your issues, but UUIDv4 primary keys are bad for performance after a certain point. I wonder if using some sort of sequential UUID as the primary key would help...
@bensheldon I can confirm I do get a bump in both throughput and parallel running jobs right as a bulk enqueue finishes. It does make me wonder about the LISTEN/NOTIFY as well.

I'm doing a workaround at this point for the large enqueue wait times that could be interrupted by deploys: I queue up a callback for the batch, pass all the arguments into the batch object itself, then queue up the jobs 1,000 at a time in the callback, and then update on_finish to whatever I actually want the on_finish callback to be. (My batch service assumes all jobs are the same type.) It's not ideal, but it works.
@reczy I don't think so, but good to know.
^ I haven't tested this thoroughly with really, really big batches, so it might not be performant because of the large memory usage, especially if you're using the batch within each job. It might be better to just enqueue a job with all those arguments stuffed in, i.e. have one job with a large number of arguments instead of the batch record carrying that burden.
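A sketch of that alternative, under the same assumptions as the earlier chunked-enqueue example (class names, the args_list shape, and the chunk size are all placeholders):

```ruby
# One "enqueuer" job carries the full argument list and fans out the real jobs
# in chunks, instead of storing the arguments on the batch record.
class BulkEnqueueJob < ApplicationJob
  def perform(args_list)
    batch = GoodJob::Batch.new
    args_list.each_slice(1_000) do |chunk|
      batch.add { chunk.each { |args| MyFastJob.perform_later(*args) } }
    end
    batch.enqueue(on_finish: BatchDoneJob)
  end
end

# The caller only enqueues a single job, so the multi-minute bulk insert
# happens inside a worker rather than in the enqueuing process.
BulkEnqueueJob.perform_later(args_list)
```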
This has been working fine for jobs that run kinda slow; the threads all get populated with a job (10 processes, 25 threads each).
But I queued up 700k jobs that run for less than a second each, and I noticed there was only ever one job running at a time. Total throughput was like 20 jobs a second, slower than running them serially.
Do you know what would cause this? The DB load was pretty high; I don't know if that's because of all the jobs enqueued. Do I need a shorter job re-fetch interval, like less than 5 seconds?
I thought the cache gets like 10k jobs, so you should always have enough ready to go to populate all the threads.