Notifier keepalives #1417
After 4 days of running without issues on fly.io, it happened again (and it is happening every hour). I opened a thread on Fly's community forum.
I have some more findings that I think are worth mentioning. I have a cron task that runs every 30 minutes and collects some data. Now, back to square one: the processes "died", disappearing from the list but still running. I stopped the machine on Fly and restarted it, which brought the process back onto the list, but strangely enough I'm also seeing these errors:

2024-07-16T11:30:00.009 app[28674d2c9214e8] ams [info] E, [2024-07-16T11:30:00.009660 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

It's almost as if a phantom process is still picking up the cron task. I can't say for sure, because Fly always uses PID 314, so that's what's always printed in the logs. Now, this is interesting: the job runs every 30 minutes, and I'm looking at the timestamps of things:
Testing GoodJob 4.1 now. I will let it run for some time and see whether it fails again.
Running without issues for 2 days now. Seems like the keepalive fixed it!
🙌 Yay! Thank you so much for reporting this and the follow-up!
Thank you as always for the quick help and fix!
tl;dr: maybe GoodJob's Notifier should emit a no-op SQL query every N seconds to keep the connection alive.

I also looked at the sequence of operations the Notifier runs on that connection. If the connection is dead, GoodJob may still be trying to run those further operations on it, which makes things worse. If it hits a ConnectionBad, GoodJob should just throw the connection away rather than trying to clean it up.
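A rough sketch of that idea, assuming a raw `PG::Connection` and illustrative method/statement names (this is not GoodJob's actual cleanup code):

```ruby
# Illustrative only -- not GoodJob's internals. Once the socket has raised
# PG::ConnectionBad, further statements (e.g. UNLISTEN) will also fail, so
# just drop the connection instead of trying to "clean it up".
require "pg"

def shutdown_listener(connection)
  connection.exec("UNLISTEN *") # normal cleanup on a healthy connection
rescue PG::ConnectionBad
  # The connection is already gone; don't issue more commands on it.
  connection.finish rescue nil # discard it and check out a fresh one later
end
```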
But the underlying issue here is...
When the Postgres connection is in LISTEN mode and just looping with wait_for_notify(1.second), the connection doesn't look active to the database/proxy, so the proxy closes it after 60 seconds.
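For reference, a minimal standalone loop (not GoodJob's actual Notifier; the channel name and connection setup are assumptions) showing why nothing crosses the wire while listening:

```ruby
# After LISTEN, the client only waits on the socket for NOTIFY events, so no
# bytes are sent to the server and an idle-timeout proxy may drop the
# connection as "inactive".
require "pg"

conn = PG.connect(ENV.fetch("DATABASE_URL"))
conn.exec("LISTEN good_job") # channel name is illustrative

loop do
  # Blocks for up to 1 second waiting for a notification; sends nothing.
  conn.wait_for_notify(1) do |channel, _pid, payload|
    puts "NOTIFY on #{channel}: #{payload.inspect}"
  end
end
```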
So that means maybe GoodJob should do something dumb like running `SELECT 1 as one` every 15 seconds or so, which would be enough to keep the connection from being cut. Does that seem valid?
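A minimal sketch of what that keepalive could look like, assuming a raw `pg` connection and an illustrative 15-second interval (not GoodJob's actual implementation):

```ruby
# Same wait loop as above, but issue a trivial query on a fixed interval so
# traffic crosses the connection and the proxy's idle timer is reset.
# Interval, query, and channel name are assumptions for illustration.
require "pg"

KEEPALIVE_INTERVAL = 15 # seconds

conn = PG.connect(ENV.fetch("DATABASE_URL"))
conn.exec("LISTEN good_job")

last_keepalive = Process.clock_gettime(Process::CLOCK_MONOTONIC)
loop do
  conn.wait_for_notify(1) do |channel, _pid, payload|
    puts "NOTIFY on #{channel}: #{payload.inspect}"
  end

  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  if now - last_keepalive >= KEEPALIVE_INTERVAL
    conn.exec("SELECT 1 AS one") # no-op query that keeps the connection looking active
    last_keepalive = now
  end
end
```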