
Notifier keepalives #1417

Open · bensheldon opened this issue Jul 10, 2024 · 6 comments

Comments

@bensheldon
Owner

bensheldon commented Jul 10, 2024

tl;dr: Maybe GoodJob's Notifier should emit a no-op SQL query every N seconds to keep the connection alive. Also:

  • Better lifecycle handling, so that if the Notifier shuts down, the entire process shuts down (or at least stops running jobs)

> does anyone have experience with good_job deployed on fly.io?
> here's my problem: I have an app deployed there with good_job running on it. after 30 or so minutes, the process crashes with this error:
>
> PQsocket() can't get socket descriptor
> ...
> PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly (PG::ConnectionBad)
>
> I read that people have had the same issue, and it seems Fly's Postgres proxy has a 30min timeout or something like that and kills idle connections (e.g. no job ran in the past 30min). I solved it by enqueuing a cron job that collects some pg metrics every 10min.
> now, after a couple of days with clear skies, this error popped back up and the process crashed... except it didn't? the Processes tab shows no processes at all, but all my jobs are up and running, being enqueued, and everything is working as expected.
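For reference, that workaround maps onto GoodJob's cron configuration. A minimal sketch, assuming the CaptureQueryStatsJob class that appears in the logs later in this thread:

```ruby
# config/initializers/good_job.rb
# Sketch of the workaround described above: a periodic job so the Postgres
# proxy sees regular traffic on otherwise-idle connections. The job class
# and 10-minute schedule are taken from the report; adjust to taste.
Rails.application.configure do
  config.good_job.enable_cron = true
  config.good_job.cron = {
    query_stats: {
      cron: "*/10 * * * *",          # every 10 minutes
      class: "CaptureQueryStatsJob"  # collects pg metrics, per the report
    }
  }
end
```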

Related: https://community.fly.io/t/postgres-dropping-connection-after-30s-when-using-listen-notify/12188

I noticed the sequence of operations in the Notifier is:

  • UNLISTEN the connection
  • Drop the advisory lock from the process
  • Destroy the process

So if the connection is dead, maybe GoodJob is still trying to run those further operations on it, which makes things worse. If the connection is in a ConnectionBad state, GoodJob should probably just throw it away rather than trying to clean it up; see the sketch below.
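A rough sketch of what that defensive teardown could look like (illustrative only; the helper names here are not GoodJob's actual internals):

```ruby
# Illustrative shutdown sketch: only run connection-level cleanup when the
# connection is still usable; otherwise discard it instead of issuing more
# queries on a dead socket.
def shutdown_notifier(connection)
  if connection.active?
    connection.execute("UNLISTEN *")  # only meaningful on a live connection
    release_advisory_lock(connection) # hypothetical helper
  end
rescue ActiveRecord::ConnectionNotEstablished, PG::ConnectionBad
  # The socket is already gone; further cleanup queries would only raise again.
ensure
  connection.disconnect! rescue nil   # throw the connection away either way
  destroy_process_record              # hypothetical helper; should use a fresh connection
end
```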

But the underlying issue here is...

When the Postgres connection is in LISTEN mode and just looping with wait_for_notify(1.second), the connection doesn't look active to the database/proxy, so it closes the connection after 60 seconds.
So maybe GoodJob should do something dumb like running SELECT 1 AS one every 15 seconds; that would be enough to keep the connection from being cut. Does that seem valid?
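A minimal sketch of that idea, assuming a wait_for_notify-style listen loop like the Notifier's (the interval, connection handling, and notification handler are illustrative):

```ruby
# Illustrative keepalive sketch: piggyback a periodic no-op query onto a
# LISTEN/wait_for_notify loop so the connection never looks idle to a proxy.
KEEPALIVE_INTERVAL = 15 # seconds; the interval proposed above

last_keepalive = Process.clock_gettime(Process::CLOCK_MONOTONIC)
loop do
  # Block for up to 1 second waiting for a NOTIFY, as the listen loop does.
  connection.raw_connection.wait_for_notify(1) do |channel, _pid, payload|
    handle_notification(channel, payload) # hypothetical handler
  end

  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  if now - last_keepalive >= KEEPALIVE_INTERVAL
    connection.execute("SELECT 1 AS one") # no-op, but visible traffic to the proxy
    last_keepalive = now
  end
end
```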

@luizkowalski
Contributor

After 4 days of running without issues on fly.io, it happened again (and it is happening every hour)

Opened a thread on Fly's community forum

@luizkowalski
Contributor

luizkowalski commented Jul 16, 2024

I have some more findings that I think are worth mentioning.

So, I have a cron task that runs every 30min and collects some data. Now, back to square one: the processes "died", disappearing from the list but still running. What I did then was stop the machine on Fly and restart it, which brought the process back up on the list, but strangely enough I'm seeing these errors too:

2024-07-16T11:30:00.009 app[28674d2c9214e8] ams [info] E, [2024-07-16T11:30:00.009660 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

almost like there's a phantom process still picking up the cron task... I can't say for sure, because Fly always uses PID 314, so that's what's always printed in the logs.

now, this is interesting: I mentioned the job runs every 30min, and looking at the timestamps:

2024-07-16T13:30:00.059 app[28674d2c9214e8] ams [info] I, [2024-07-16T13:30:00.059861 #314] INFO -- : GoodJob destroyed 0 preserved job execution records finished before 2024-07-02 15:30:00 +0200.

2024-07-16T13:59:59.975 app[28674d2c9214e8] ams [info] I, [2024-07-16T13:59:59.975112 #314] INFO -- : Notifier unsubscribed with UNLISTEN

2024-07-16T13:59:59.979 app[28674d2c9214e8] ams [info] I, [2024-07-16T13:59:59.979226 #314] INFO -- : [ActiveJob] Enqueued CaptureQueryStatsJob (Job ID: b3701d5e-ef07-4041-b907-7b470971c67b) to GoodJob(default)

2024-07-16T13:59:59.986 app[28674d2c9214e8] ams [info] E, [2024-07-16T13:59:59.981809 #314] ERROR -- : Notifier errored: ActiveRecord::ConnectionNotEstablished: PQsocket() can't get socket descriptor

2024-07-16T13:59:59.986 app[28674d2c9214e8] ams [info] ["/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:56:in `exec'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:56:in `block (2 levels) in raw_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb:1004:in `block in with_raw_connection'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb:976:in `with_raw_connection'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:55:in `block in raw_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications/instrumenter.rb:58:in `instrument'", "/rails/vendor/ruby/3.3.0/gems/sentry-rails-5.18.1/lib/sentry/rails/tracing.rb:54:in `instrument'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb:1119:in `log'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:54:in `raw_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract/database_statements.rb:538:in `internal_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract/database_statements.rb:137:in `execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract/query_cache.rb:26:in `execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `execute'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:239:in `block (5 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications.rb:210:in `block in instrument'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications/instrumenter.rb:58:in `instrument'", "/rails/vendor/ruby/3.3.0/gems/sentry-rails-5.18.1/lib/sentry/rails/tracing.rb:54:in `instrument'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications.rb:210:in `instrument'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:237:in `block (4 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/callbacks.rb:110:in `run_callbacks'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:235:in `block (3 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/execution_wrapper.rb:91:in `wrap'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:234:in `ensure in block (2 levels) in create_listen_task'", 
"/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:234:in `block (2 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:260:in `with_connection'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:201:in `block in create_listen_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:24:in `block in execute'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `block in synchronize'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:22:in `execute'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/ivar.rb:170:in `safe_execute'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/scheduled_task.rb:298:in `process_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/timer_set.rb:98:in `block in ns_post_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:359:in `run_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:350:in `block (3 levels) in create_worker'", "<internal:kernel>:187:in `loop'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:341:in `block (2 levels) in create_worker'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `catch'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `block in create_worker'"]

2024-07-16T14:00:00.009 app[28674d2c9214e8] ams [info] E, [2024-07-16T14:00:00.009459 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

2024-07-16T14:00:00.009 app[28674d2c9214e8] ams [info] DETAIL: Key (cron_key, cron_at)=(query_stats, 2024-07-16 14:00:00) already exists.

2024-07-16T14:00:00.009 app[28674d2c9214e8] ams [info] )

2024-07-16T14:00:04.993 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:04.993913 #314] INFO -- : Notifier subscribed with LISTEN

2024-07-16T14:00:08.528 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.528055 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [b3701d5e-ef07-4041-b907-7b470971c67b] Performing CaptureQueryStatsJob (Job ID: b3701d5e-ef07-4041-b907-7b470971c67b) from GoodJob(default) enqueued at 2024-07-16T13:59:59.966956766Z

2024-07-16T14:00:08.575 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.575681 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [b3701d5e-ef07-4041-b907-7b470971c67b] Performed CaptureQueryStatsJob (Job ID: b3701d5e-ef07-4041-b907-7b470971c67b) from GoodJob(default) in 47.68ms

2024-07-16T14:00:08.580 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.576431 #314] INFO -- : Executed GoodJob b3701d5e-ef07-4041-b907-7b470971c67b

2024-07-16T14:00:08.591 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.591800 #314] INFO -- : GoodJob destroyed 0 preserved job execution records finished before 2024-07-02 16:00:08 +0200.

2024-07-16T14:29:59.973 app[28674d2c9214e8] ams [info] D, [2024-07-16T14:29:59.973645 #314] DEBUG -- : Notifier received payload: {"queue_name":"default","scheduled_at":"2024-07-16T16:29:59.967+02:00"}

2024-07-16T14:29:59.975 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:29:59.975119 #314] INFO -- : [ActiveJob] Enqueued CaptureQueryStatsJob (Job ID: c1a9c689-f2b8-4576-aeba-7202d3072e8e) to GoodJob(default)

2024-07-16T14:29:59.988 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:29:59.987955 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [c1a9c689-f2b8-4576-aeba-7202d3072e8e] Performing CaptureQueryStatsJob (Job ID: c1a9c689-f2b8-4576-aeba-7202d3072e8e) from GoodJob(default) enqueued at 2024-07-16T14:29:59.967247969Z

2024-07-16T14:30:00.012 app[28674d2c9214e8] ams [info] E, [2024-07-16T14:30:00.012721 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

2024-07-16T14:30:00.012 app[28674d2c9214e8] ams [info] DETAIL: Key (cron_key, cron_at)=(query_stats, 2024-07-16 14:30:00) already exists.

2024-07-16T14:30:00.012 app[28674d2c9214e8] ams [info] )

2024-07-16T14:30:00.033 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:30:00.033113 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [c1a9c689-f2b8-4576-aeba-7202d3072e8e] Performed CaptureQueryStatsJob (Job ID: c1a9c689-f2b8-4576-aeba-7202d3072e8e) from GoodJob(default) in 45.22ms

2024-07-16T14:30:00.033 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:30:00.033415 #314] INFO -- : Executed GoodJob c1a9c689-f2b8-4576-aeba-7202d3072e8e

2024-07-16T14:30:00.049 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:30:00.049093 #314] INFO -- : GoodJob destroyed 0 preserved job execution records finished before 2024-07-02 16:30:00 +0200.

  • 13:59:59.975112: milliseconds before the cron task is enqueued/processed: UNLISTEN
  • 13:59:59.979226: cron task tries to enqueue
  • 13:59:59.986: Notifier errors (ConnectionNotEstablished)
  • 14:00:00.009459: failed to enqueue the job: unique index violation
  • 14:30:00.033113: job picked up (by God knows what) and processed
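For what it's worth, the RecordNotUnique in that timeline appears to be the unique index on (cron_key, cron_at) doing its intended job: when two enqueue attempts race for the same cron occurrence, only one INSERT can succeed. A hedged sketch of that pattern:

```ruby
# Illustrative: the unique index turns a duplicate cron enqueue into a benign
# race. Whichever process inserts second gets RecordNotUnique and moves on.
begin
  enqueue_cron_execution(cron_key: "query_stats", cron_at: cron_at) # hypothetical helper
rescue ActiveRecord::RecordNotUnique
  # Another process already enqueued this occurrence; nothing to do.
end
```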

@luizkowalski
Contributor

Testing GoodJob 4.1 now. I will let it run for some time and see if it fails again or not.

@luizkowalski
Contributor

Running without issues for 2 days now. Seems like the keepalive fixed it!

@bensheldon
Owner Author

🙌 Yay! Thank you so much for reporting this and the follow-up!

@luizkowalski
Contributor

thank you as always for the quick help and fix!
