
Uncommon/alarming increased latency for GoodJob query on production #813

remoteweb opened this issue Jan 27, 2023 · 4 comments

@remoteweb

Hey guys,

thanks for the work done on GoodJob.

Description

We had an issue in our production Kubernetes Ruby on Rails stack, caught by our Datadog APM monitor, related to the following query (parameters shown obfuscated by Datadog). The problem lasted for almost 24 hours.

SELECT good_jobs.*
FROM good_jobs
WHERE good_jobs.id IN (
  WITH rows AS (
    SELECT good_jobs.id, good_jobs.active_job_id
    FROM good_jobs
    WHERE good_jobs.finished_at IS ? AND (good_jobs.scheduled_at <= ? OR good_jobs.scheduled_at IS ?)
    ORDER BY priority DESC NULLS LAST, created_at ASC
  )
  SELECT rows.id
  FROM rows
  WHERE pg_try_advisory_lock((? || substr(md5(? || ? || rows.active_job_id::text), ?))::bit(?)::bigint)
  LIMIT ?
)
ORDER BY priority DESC NULLS LAST, created_at ASC

Problem

The query usually has around 30 ms of latency, but during this timeframe the average increased to 1.7 s.

Facts

  • The workload of the GoodJob queue was normal
  • Postgres resource usage showed only a small increase while the latency remained elevated
  • The total number of processed jobs is around 4.5M

Possibly Related Issue

We also see huge page load times on the /good_job/jobs admin page (~4 min).

Possible cause

It could be that we need to add some queue limits, but we don't want to apply them blindly in a trial-and-error fashion; we need to fully understand what could be causing this first.

Trigger

Every day at that time, a similar load of scheduled jobs is enqueued.

Incident

Started at 15:08 on Jan 21st.

On Jan 21st, over a 24 h timespan, there were 1.4M executions of this select with a p99 of 43 ms.

[Screenshots: p99 latency, Datadog resource usage for Postgres, GCloud CPU usage]

On a normal day (Jan 19th), over a 24 h timespan, there were 2.4M executions of this select with a p99 of 43 ms.

[Screenshots: p99 latency, Datadog resource usage for Postgres, GCloud CPU usage]

Any advice/help/information would be much appreciated.

App Configuration

application.config.active_job.queue_adapter = :good_job
application.config.good_job.preserve_job_records = true
application.config.good_job.retry_on_unhandled_error = false
application.config.good_job.enable_cron = true
application.config.good_job.max_threads = 10

Stack Configuration

GoodJob processes are decoupled from the web server ones
5 pods for Rails Puma
5 pods of GoodJob workers => 5 GoodJob processes
4 vCPUs per pod
10-thread limit
20-connection database.yml pool
4 GB RAM per pod

GCloud Postgres Flags
cloudsql.enable_pglogical on
cloudsql.logical_decoding on
max_replication_slots 22
max_wal_senders 22
max_worker_processes 16
@bensheldon (Owner)

Yikes! That's not good! I have a couple of ideas for what could be causing the spike; it sounds a little similar to #809.

When I see slow job-locking queries like that, it's usually when there is a very deep queue. Are you able to run a report like this to count cumulative executions:

SELECT date_trunc('hour', created_at), count(*)
FROM good_jobs
WHERE created_at > NOW() - '7 days'::interval
GROUP BY date_trunc('hour', created_at)
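
If it's easier from a Rails console, something roughly like this should return it as an array of hashes (just a sketch):

# Rails console sketch for the report above
sql = <<~SQL
  SELECT date_trunc('hour', created_at) AS hour, count(*) AS count
  FROM good_jobs
  WHERE created_at > NOW() - '7 days'::interval
  GROUP BY 1
  ORDER BY 1
SQL
ActiveRecord::Base.connection.select_all(sql).to_a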

I wonder if for some reason you have 100k+ executions (e.g. retries on the same job or something) leading to that performance degradation.

Also, would you be able to pull tracing data for the Slow Dashboard problem? I'm curious if there is a specific query going very slowly, or whether it's the Rails Application blowing up trying to generate models or something for too much data.

@remoteweb (Author) commented Jan 30, 2023

It seems that there are no retries listed in the GoodJob admin.

[Screenshot: GoodJob admin dashboard showing no retries]

However, we dealt with another incident over the weekend, and we managed to resolve it by cleaning up all finished jobs older than 7 days.

We were using the default preservation period (14 days).
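
In case it's useful to others hitting this, here's a sketch of how that cleanup could be automated (assuming the README's cleanup_preserved_jobs_before_seconds_ago setting and the GoodJob.cleanup_preserved_jobs helper are the right knobs; not yet verified on our side):

# config/initializers/good_job.rb — sketch: shrink the preservation window to 7 days
Rails.application.configure do
  config.good_job.preserve_job_records = true
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 7.days.to_i
end

# One-off cleanup from a Rails console (sketch)
GoodJob.cleanup_preserved_jobs(older_than: 7.days)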

This is a major concern for our project, as the current data load is about 5% of what we expect by the end of the year.

One possible solution could be to improve the indexing of the good_jobs table. We'll try investigating this approach, but please let us know if we are missing something, or whether we could try a different approach or configuration improvement.

Below are the results of the query you suggested

SELECT date_trunc('hour', created_at), count(*)
FROM good_jobs
WHERE created_at > NOW() - '7 days'::interval
GROUP BY date_trunc('hour', created_at)

as run today.

Do note that the same incident started again on Saturday and ended just a few minutes ago, when we cleared the old job records.

[{"date_trunc"=>"2023-01-27T10:00:00.000Z", "count"=>13186},
 {"date_trunc"=>"2023-01-27T11:00:00.000Z", "count"=>13385},
 {"date_trunc"=>"2023-01-27T12:00:00.000Z", "count"=>13603},
 {"date_trunc"=>"2023-01-27T13:00:00.000Z", "count"=>94565},
 {"date_trunc"=>"2023-01-27T14:00:00.000Z", "count"=>18039},
 {"date_trunc"=>"2023-01-27T15:00:00.000Z", "count"=>17452},
 {"date_trunc"=>"2023-01-27T16:00:00.000Z", "count"=>14841},
 {"date_trunc"=>"2023-01-27T17:00:00.000Z", "count"=>13566},
 {"date_trunc"=>"2023-01-27T18:00:00.000Z", "count"=>11319},
 {"date_trunc"=>"2023-01-27T19:00:00.000Z", "count"=>10310},
 {"date_trunc"=>"2023-01-27T20:00:00.000Z", "count"=>10597},
 {"date_trunc"=>"2023-01-27T21:00:00.000Z", "count"=>9916},
 {"date_trunc"=>"2023-01-27T22:00:00.000Z", "count"=>8234},
 {"date_trunc"=>"2023-01-27T23:00:00.000Z", "count"=>8545},
 {"date_trunc"=>"2023-01-28T00:00:00.000Z", "count"=>7355},
 {"date_trunc"=>"2023-01-28T01:00:00.000Z", "count"=>8558},
 {"date_trunc"=>"2023-01-28T02:00:00.000Z", "count"=>6452},
 {"date_trunc"=>"2023-01-28T03:00:00.000Z", "count"=>9651},
 {"date_trunc"=>"2023-01-28T04:00:00.000Z", "count"=>6017},
 {"date_trunc"=>"2023-01-28T05:00:00.000Z", "count"=>5324},
 {"date_trunc"=>"2023-01-28T06:00:00.000Z", "count"=>5693},
 {"date_trunc"=>"2023-01-28T07:00:00.000Z", "count"=>5630},
 {"date_trunc"=>"2023-01-28T08:00:00.000Z", "count"=>8454},
 {"date_trunc"=>"2023-01-28T09:00:00.000Z", "count"=>6095},
 {"date_trunc"=>"2023-01-28T10:00:00.000Z", "count"=>6191},
 {"date_trunc"=>"2023-01-28T11:00:00.000Z", "count"=>6059},
 {"date_trunc"=>"2023-01-28T12:00:00.000Z", "count"=>6192},
 {"date_trunc"=>"2023-01-28T13:00:00.000Z", "count"=>39156},
 {"date_trunc"=>"2023-01-28T14:00:00.000Z", "count"=>19982},
 {"date_trunc"=>"2023-01-28T15:00:00.000Z", "count"=>4652},
 {"date_trunc"=>"2023-01-28T16:00:00.000Z", "count"=>4844},
 {"date_trunc"=>"2023-01-28T17:00:00.000Z", "count"=>4501},
 {"date_trunc"=>"2023-01-28T18:00:00.000Z", "count"=>9211},
 {"date_trunc"=>"2023-01-28T19:00:00.000Z", "count"=>10556},
 {"date_trunc"=>"2023-01-28T20:00:00.000Z", "count"=>5867},
 {"date_trunc"=>"2023-01-28T21:00:00.000Z", "count"=>5217},
 {"date_trunc"=>"2023-01-28T22:00:00.000Z", "count"=>7441},
 {"date_trunc"=>"2023-01-28T23:00:00.000Z", "count"=>6516},
 {"date_trunc"=>"2023-01-29T00:00:00.000Z", "count"=>6169},
 {"date_trunc"=>"2023-01-29T01:00:00.000Z", "count"=>8033},
 {"date_trunc"=>"2023-01-29T02:00:00.000Z", "count"=>9285},
 {"date_trunc"=>"2023-01-29T03:00:00.000Z", "count"=>5897},
 {"date_trunc"=>"2023-01-29T04:00:00.000Z", "count"=>12038},
 {"date_trunc"=>"2023-01-29T05:00:00.000Z", "count"=>7046},
 {"date_trunc"=>"2023-01-29T06:00:00.000Z", "count"=>9354},
 {"date_trunc"=>"2023-01-29T07:00:00.000Z", "count"=>11706},
 {"date_trunc"=>"2023-01-29T08:00:00.000Z", "count"=>15192},
 {"date_trunc"=>"2023-01-29T09:00:00.000Z", "count"=>10413},
 {"date_trunc"=>"2023-01-29T10:00:00.000Z", "count"=>11147},
 {"date_trunc"=>"2023-01-29T11:00:00.000Z", "count"=>10778},
 {"date_trunc"=>"2023-01-29T12:00:00.000Z", "count"=>10323},
 {"date_trunc"=>"2023-01-29T13:00:00.000Z", "count"=>43119},
 {"date_trunc"=>"2023-01-29T14:00:00.000Z", "count"=>6739},
 {"date_trunc"=>"2023-01-29T15:00:00.000Z", "count"=>6277},
 {"date_trunc"=>"2023-01-29T16:00:00.000Z", "count"=>10768},
 {"date_trunc"=>"2023-01-29T17:00:00.000Z", "count"=>14020},
 {"date_trunc"=>"2023-01-29T18:00:00.000Z", "count"=>9518},
 {"date_trunc"=>"2023-01-29T19:00:00.000Z", "count"=>7995},
 {"date_trunc"=>"2023-01-29T20:00:00.000Z", "count"=>10505},
 {"date_trunc"=>"2023-01-29T21:00:00.000Z", "count"=>11742},
 {"date_trunc"=>"2023-01-29T22:00:00.000Z", "count"=>10603},
 {"date_trunc"=>"2023-01-29T23:00:00.000Z", "count"=>10903},
 {"date_trunc"=>"2023-01-30T00:00:00.000Z", "count"=>11246},
 {"date_trunc"=>"2023-01-30T01:00:00.000Z", "count"=>11896},
 {"date_trunc"=>"2023-01-30T02:00:00.000Z", "count"=>12190},
 {"date_trunc"=>"2023-01-30T03:00:00.000Z", "count"=>12084},
 {"date_trunc"=>"2023-01-30T04:00:00.000Z", "count"=>12665},
 {"date_trunc"=>"2023-01-30T05:00:00.000Z", "count"=>11166},
 {"date_trunc"=>"2023-01-30T06:00:00.000Z", "count"=>12479},
 {"date_trunc"=>"2023-01-30T07:00:00.000Z", "count"=>11962},
 {"date_trunc"=>"2023-01-30T08:00:00.000Z", "count"=>13768},
 {"date_trunc"=>"2023-01-30T09:00:00.000Z", "count"=>12051},
 {"date_trunc"=>"2023-01-30T10:00:00.000Z", "count"=>1715}]

@bensheldon (Owner)

@remoteweb hmmm. It looks like there are some spikes in executions, I dunno if they line up with the resource usage that you're seeing:

[Screenshot: chart of the hourly execution counts above, showing spikes]

You might also try the Queue Select Limit feature of GoodJob: https://github.com/bensheldon/good_job#queue-performance-with-queue-select-limit
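
Roughly, per that README section, it's a single initializer setting (the number is just a starting point to tune):

# config/initializers/good_job.rb — sketch: cap how many candidate rows the
# fetch-and-lock query considers (Queue Select Limit)
Rails.application.configure do
  config.good_job.queue_select_limit = 1_000
end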

I'd be curious if you find additional indexing opportunities. This was the most recent index added: #726

@remoteweb (Author) commented Jan 31, 2023

[Screenshot: 7-day SQL CPU resource usage chart]

Attached above is a 7-day SQL CPU resource usage chart, in relation to the previous results.

Thanks for referencing the PR with the indexes. I am thinking that we might also not be getting the most out of the indexes, as we are using Postgres 11.
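
For comparison with that PR, a quick way to dump the table's current indexes (sketch):

# Rails console sketch: list existing indexes on good_jobs to compare with the PR
ActiveRecord::Base.connection
  .select_all("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'good_jobs'")
  .to_a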
