Uncommon/alarming increased latency for GoodJob query on production #813
Yikes! That's not good! I have a couple of ideas for what could be causing the spike; it sounds a little similar to #809. When I see slow job-locking queries like that, it's usually when there is a very deep queue. Are you able to run a report like this to count cumulative executions?

```sql
SELECT date_trunc('hour', created_at), count(*)
FROM good_jobs
WHERE created_at > NOW() - '7 days'::interval
GROUP BY date_trunc('hour', created_at)
```

I wonder if for some reason you have 100k+ executions (e.g. retries on the same job or something) leading to that performance degradation. Also, would you be able to pull tracing data for the slow Dashboard problem? I'm curious if there is a specific query going very slowly, or whether it's the Rails application blowing up trying to generate models or something for too much data.
It seems there are no retries listed in the GoodJob admin. However, we dealt with another incident over the weekend and resolved it by cleaning up all performed jobs older than 7 days; we had been using the default 14-day preservation. This is a major concern for our project, as the current data load is about 5% of what we expect by the end of the year. One possible solution could be improving the indexing of the good_jobs table. We'll try investigating that approach, but please let us know if we are missing something, or if there is a different approach or configuration improvement we could try. Below are the results of your query as run today. Note that the same incident started again on Saturday and ended just a few minutes ago when we cleared the old job records.
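For anyone else hitting this, shortening the retention window looks roughly like the following. This is a sketch that assumes the `cleanup_preserved_jobs_before_seconds_ago` setting exists in the installed GoodJob version; please verify the option name against the README for your version.

```ruby
# config/initializers/good_job.rb
# Sketch: preserve finished job records for 7 days instead of the default 14.
# Assumes this configuration option is available in the installed GoodJob version.
Rails.application.configure do
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 7.days.to_i
end
```

There is also `GoodJob.cleanup_preserved_jobs(older_than: 7.days)` for a one-off cleanup from a Rails console during an incident, if that helper is available in your version.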
@remoteweb hmmm. It looks like there are some spikes in executions; I dunno if they line up with the resource usage that you're seeing. You might also try the Queue Select Limit feature of GoodJob (https://github.com/bensheldon/good_job#queue-performance-with-queue-select-limit). I'd be curious if you find additional indexing opportunities; this was the most recent index added: #726
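Enabling that looks roughly like this; a sketch based on the linked README section, so please double-check the option name and pick a value that suits your setup:

```ruby
# config/initializers/good_job.rb
# Sketch: cap how many candidate rows the job-locking query considers at once,
# which the README suggests keeps that query cheap even with a very deep backlog.
# Assumes the `queue_select_limit` option described in the linked README section.
Rails.application.configure do
  config.good_job.queue_select_limit = 1_000
end
```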
Hey guys,
thanks for the work done on GoodJob.
Description
We had an issue in our production Kubernetes RoR stack that was caught by our Datadog APM monitor, related to the following query, and it lasted for almost 24 hours.
Problem
The query usually has around 30ms latency, but during this specific timeframe it increased to an average of 1.7s.
Facts
Possibly Related Issue
We also have huge page loading times on /good_job/jobs Admin page (~4min)
Possible cause
It could be that we need to add some queue limits, but we don't want to apply those blindly in a trial-and-error fashion; we need to fully understand what is causing this first.
Trigger
Every day at that time there is a scheduled job load of a similar size.
Incident
Started at 15:08 on Jan 21st.
On Jan 21st, over a 24h timespan: 1.4M executions of this select, p99: 43ms.
(Screenshots: p99 latency, Datadog resource usage for Postgres, GCloud CPU usage)
On a normal day, Jan 19th, over a 24h timespan: 2.4M executions of this select, p99: 43ms.
(Screenshots: p99 latency, Datadog resource usage for Postgres, GCloud CPU usage)
Any advice/help/information would be much appreciated.
App Configuration
Stack Configuration