
Uncommon/alarming increased latency for GoodJob query on production #813

remoteweb opened this issue Jan 27, 2023 · 4 comments

@remoteweb

Hey guys,

thanks for the work done on GoodJob.

Description

We had an issue in our production Kubernetes Ruby on Rails stack, caught by our Datadog APM monitor, related to the following query (parameters shown obfuscated by Datadog). The problem lasted for almost 24 hours.

SELECT good_jobs.*
FROM good_jobs
WHERE good_jobs.id IN (
  WITH rows AS (
    SELECT good_jobs.id, good_jobs.active_job_id
    FROM good_jobs
    WHERE good_jobs.finished_at IS ? AND (good_jobs.scheduled_at <= ? OR good_jobs.scheduled_at IS ?)
    ORDER BY priority DESC NULLS LAST, created_at ASC
  )
  SELECT rows.id
  FROM rows
  WHERE pg_try_advisory_lock((? || substr(md5(? || ? || rows.active_job_id::text), ?))::bit(?)::bigint)
  LIMIT ?
)
ORDER BY priority DESC NULLS LAST, created_at ASC

Problem

The query usually has around 30 ms of latency, but during this timeframe the average increased to 1.7 s.

Facts

  • The workload of the GoodJob queue was normal
  • Postgres resource usage showed only a small increase while the latency remained elevated
  • The total number of processed jobs is around 4.5M

Possibly Related Issue

We also see huge page load times on the /good_job/jobs admin page (~4 min).

Possible cause

It could be that we need to add some queue limits, but we don't want to apply them blindly in a trial-and-error fashion; we need to fully understand what could be causing this first.

Trigger

Every day at that time, a similar load of scheduled jobs is enqueued.

Incident

Started at 15:08 on Jan 21st.

On Jan 21st, over a 24 h timespan, there were 1.4M executions of this select with a p99 of 43 ms.

[Screenshots: p99 latency, Datadog resource usage for Postgres, GCloud CPU usage]

On a normal day (Jan 19th), over a 24 h timespan, there were 2.4M executions of this select with a p99 of 43 ms.

[Screenshots: p99 latency, Datadog resource usage for Postgres, GCloud CPU usage]

Any advice/help/information would be much appreciated.

App Configuration

application.config.active_job.queue_adapter = :good_job
application.config.good_job.preserve_job_records = true
application.config.good_job.retry_on_unhandled_error = false
application.config.good_job.enable_cron = true
application.config.good_job.max_threads = 10

Stack Configuration

GoodJob processes are decoupled from the web server ones
5 pods for Rails Puma
5 pods of GoodJob workers => 5 GoodJob processes
4 vCPUs per pod
10-thread limit
20-connection database.yml pool
4 GB RAM per pod

GCloud Postgres Flags
cloudsql.enable_pglogical on
cloudsql.logical_decoding on
max_replication_slots 22
max_wal_senders 22
max_worker_processes 16
@bensheldon (Owner)

Yikes! That's not good! I have a couple of ideas for what could be causing the spike; it sounds a little similar to #809.

When I see slow job-locking queries like that, it's usually when there is a very deep queue. Are you able to run a report like this to count cumulative executions:

SELECT date_trunc('hour', created_at), count(*)
FROM good_jobs
WHERE created_at > NOW() - '7 days'::interval
GROUP BY date_trunc('hour', created_at)
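
If it's easier from a Rails console, something roughly like this should return it as an array of hashes (just a sketch):

# Rails console sketch for the report above
sql = <<~SQL
  SELECT date_trunc('hour', created_at) AS hour, count(*) AS count
  FROM good_jobs
  WHERE created_at > NOW() - '7 days'::interval
  GROUP BY 1
  ORDER BY 1
SQL
ActiveRecord::Base.connection.select_all(sql).to_a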

I wonder if for some reason you have 100k+ executions (e.g. retries on the same job or something) leading to that performance degradation.

Also, would you be able to pull tracing data for the Slow Dashboard problem? I'm curious if there is a specific query going very slowly, or whether it's the Rails Application blowing up trying to generate models or something for too much data.

@remoteweb (Author) commented Jan 30, 2023

It seems that there are no retries listed in the GoodJob admin.

[Screenshot: GoodJob admin dashboard showing no retries]

However, we dealt with another incident over the weekend, and we managed to resolve it by cleaning up all finished jobs older than 7 days.

We were using the default preservation period (14 days).
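
In case it's useful to others hitting this, here's a sketch of how that cleanup could be automated (assuming the README's cleanup_preserved_jobs_before_seconds_ago setting and the GoodJob.cleanup_preserved_jobs helper are the right knobs; not yet verified on our side):

# config/initializers/good_job.rb — sketch: shrink the preservation window to 7 days
Rails.application.configure do
  config.good_job.preserve_job_records = true
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 7.days.to_i
end

# One-off cleanup from a Rails console (sketch)
GoodJob.cleanup_preserved_jobs(older_than: 7.days)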

This is a major concern for our project, as the current data load is about 5% of what we expect by the end of the year.

One possible solution could be to improve the indexing of the good_jobs table. We'll try investigating this approach, but please let us know if we are missing something, or whether we could try a different approach or configuration improvement.

Below are the results of the query you suggested

SELECT date_trunc('hour', created_at), count(*)
FROM good_jobs
WHERE created_at > NOW() - '7 days'::interval
GROUP BY date_trunc('hour', created_at)

as run today.

Do note that the same incident started again on Saturday and ended just a few minutes ago, when we cleared the old job records.

[{"date_trunc"=>"2023-01-27T10:00:00.000Z", "count"=>13186},
 {"date_trunc"=>"2023-01-27T11:00:00.000Z", "count"=>13385},
 {"date_trunc"=>"2023-01-27T12:00:00.000Z", "count"=>13603},
 {"date_trunc"=>"2023-01-27T13:00:00.000Z", "count"=>94565},
 {"date_trunc"=>"2023-01-27T14:00:00.000Z", "count"=>18039},
 {"date_trunc"=>"2023-01-27T15:00:00.000Z", "count"=>17452},
 {"date_trunc"=>"2023-01-27T16:00:00.000Z", "count"=>14841},
 {"date_trunc"=>"2023-01-27T17:00:00.000Z", "count"=>13566},
 {"date_trunc"=>"2023-01-27T18:00:00.000Z", "count"=>11319},
 {"date_trunc"=>"2023-01-27T19:00:00.000Z", "count"=>10310},
 {"date_trunc"=>"2023-01-27T20:00:00.000Z", "count"=>10597},
 {"date_trunc"=>"2023-01-27T21:00:00.000Z", "count"=>9916},
 {"date_trunc"=>"2023-01-27T22:00:00.000Z", "count"=>8234},
 {"date_trunc"=>"2023-01-27T23:00:00.000Z", "count"=>8545},
 {"date_trunc"=>"2023-01-28T00:00:00.000Z", "count"=>7355},
 {"date_trunc"=>"2023-01-28T01:00:00.000Z", "count"=>8558},
 {"date_trunc"=>"2023-01-28T02:00:00.000Z", "count"=>6452},
 {"date_trunc"=>"2023-01-28T03:00:00.000Z", "count"=>9651},
 {"date_trunc"=>"2023-01-28T04:00:00.000Z", "count"=>6017},
 {"date_trunc"=>"2023-01-28T05:00:00.000Z", "count"=>5324},
 {"date_trunc"=>"2023-01-28T06:00:00.000Z", "count"=>5693},
 {"date_trunc"=>"2023-01-28T07:00:00.000Z", "count"=>5630},
 {"date_trunc"=>"2023-01-28T08:00:00.000Z", "count"=>8454},
 {"date_trunc"=>"2023-01-28T09:00:00.000Z", "count"=>6095},
 {"date_trunc"=>"2023-01-28T10:00:00.000Z", "count"=>6191},
 {"date_trunc"=>"2023-01-28T11:00:00.000Z", "count"=>6059},
 {"date_trunc"=>"2023-01-28T12:00:00.000Z", "count"=>6192},
 {"date_trunc"=>"2023-01-28T13:00:00.000Z", "count"=>39156},
 {"date_trunc"=>"2023-01-28T14:00:00.000Z", "count"=>19982},
 {"date_trunc"=>"2023-01-28T15:00:00.000Z", "count"=>4652},
 {"date_trunc"=>"2023-01-28T16:00:00.000Z", "count"=>4844},
 {"date_trunc"=>"2023-01-28T17:00:00.000Z", "count"=>4501},
 {"date_trunc"=>"2023-01-28T18:00:00.000Z", "count"=>9211},
 {"date_trunc"=>"2023-01-28T19:00:00.000Z", "count"=>10556},
 {"date_trunc"=>"2023-01-28T20:00:00.000Z", "count"=>5867},
 {"date_trunc"=>"2023-01-28T21:00:00.000Z", "count"=>5217},
 {"date_trunc"=>"2023-01-28T22:00:00.000Z", "count"=>7441},
 {"date_trunc"=>"2023-01-28T23:00:00.000Z", "count"=>6516},
 {"date_trunc"=>"2023-01-29T00:00:00.000Z", "count"=>6169},
 {"date_trunc"=>"2023-01-29T01:00:00.000Z", "count"=>8033},
 {"date_trunc"=>"2023-01-29T02:00:00.000Z", "count"=>9285},
 {"date_trunc"=>"2023-01-29T03:00:00.000Z", "count"=>5897},
 {"date_trunc"=>"2023-01-29T04:00:00.000Z", "count"=>12038},
 {"date_trunc"=>"2023-01-29T05:00:00.000Z", "count"=>7046},
 {"date_trunc"=>"2023-01-29T06:00:00.000Z", "count"=>9354},
 {"date_trunc"=>"2023-01-29T07:00:00.000Z", "count"=>11706},
 {"date_trunc"=>"2023-01-29T08:00:00.000Z", "count"=>15192},
 {"date_trunc"=>"2023-01-29T09:00:00.000Z", "count"=>10413},
 {"date_trunc"=>"2023-01-29T10:00:00.000Z", "count"=>11147},
 {"date_trunc"=>"2023-01-29T11:00:00.000Z", "count"=>10778},
 {"date_trunc"=>"2023-01-29T12:00:00.000Z", "count"=>10323},
 {"date_trunc"=>"2023-01-29T13:00:00.000Z", "count"=>43119},
 {"date_trunc"=>"2023-01-29T14:00:00.000Z", "count"=>6739},
 {"date_trunc"=>"2023-01-29T15:00:00.000Z", "count"=>6277},
 {"date_trunc"=>"2023-01-29T16:00:00.000Z", "count"=>10768},
 {"date_trunc"=>"2023-01-29T17:00:00.000Z", "count"=>14020},
 {"date_trunc"=>"2023-01-29T18:00:00.000Z", "count"=>9518},
 {"date_trunc"=>"2023-01-29T19:00:00.000Z", "count"=>7995},
 {"date_trunc"=>"2023-01-29T20:00:00.000Z", "count"=>10505},
 {"date_trunc"=>"2023-01-29T21:00:00.000Z", "count"=>11742},
 {"date_trunc"=>"2023-01-29T22:00:00.000Z", "count"=>10603},
 {"date_trunc"=>"2023-01-29T23:00:00.000Z", "count"=>10903},
 {"date_trunc"=>"2023-01-30T00:00:00.000Z", "count"=>11246},
 {"date_trunc"=>"2023-01-30T01:00:00.000Z", "count"=>11896},
 {"date_trunc"=>"2023-01-30T02:00:00.000Z", "count"=>12190},
 {"date_trunc"=>"2023-01-30T03:00:00.000Z", "count"=>12084},
 {"date_trunc"=>"2023-01-30T04:00:00.000Z", "count"=>12665},
 {"date_trunc"=>"2023-01-30T05:00:00.000Z", "count"=>11166},
 {"date_trunc"=>"2023-01-30T06:00:00.000Z", "count"=>12479},
 {"date_trunc"=>"2023-01-30T07:00:00.000Z", "count"=>11962},
 {"date_trunc"=>"2023-01-30T08:00:00.000Z", "count"=>13768},
 {"date_trunc"=>"2023-01-30T09:00:00.000Z", "count"=>12051},
 {"date_trunc"=>"2023-01-30T10:00:00.000Z", "count"=>1715}]

@bensheldon (Owner)

@remoteweb hmmm. It looks like there are some spikes in executions, I dunno if they line up with the resource usage that you're seeing:

[Screenshot: chart of the hourly execution counts above, showing spikes]

You might also try the Queue Select Limit feature of GoodJob: https://github.com/bensheldon/good_job#queue-performance-with-queue-select-limit
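
Roughly, per that README section, it's a single initializer setting (the number is just a starting point to tune):

# config/initializers/good_job.rb — sketch: cap how many candidate rows the
# fetch-and-lock query considers (Queue Select Limit)
Rails.application.configure do
  config.good_job.queue_select_limit = 1_000
end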

I'd be curious if you find additional indexing opportunities. This was the most recent index added: #726

@remoteweb (Author) commented Jan 31, 2023

[Screenshot: 7-day SQL CPU resource usage chart]

Attached above is a 7-day SQL CPU resource usage chart, in relation to the previous results.

Thanks for referencing the PR with the indexes. I am thinking that we might also not be getting the most out of the indexes, as we are using Postgres 11.
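
For comparison with that PR, a quick way to dump the table's current indexes (sketch):

# Rails console sketch: list existing indexes on good_jobs to compare with the PR
ActiveRecord::Base.connection
  .select_all("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'good_jobs'")
  .to_a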
