Notifier keepalives #1417
After 4 days of running without issues on fly.io, it happened again (and it is happening every hour). I opened a thread on Fly's community forum.
I have some more findings that I think are worth mentioning. I have a cron task that runs every 30 minutes and collects some data. Now, back to square one: the processes "died", disappearing from the list but still running. I stopped the machine on Fly and restarted it, which brought the process back onto the list, but strangely enough I'm also seeing these errors:

2024-07-16T11:30:00.009 app[28674d2c9214e8] ams [info] E, [2024-07-16T11:30:00.009660 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

It's almost as if a phantom process is still picking up the cron task. I can't say for sure, because Fly always uses PID 314, so that's what's always printed in the logs. Now, this is interesting: the job runs every 30 minutes, and I'm looking at the timestamps of things:
Testing GoodJob 4.1 now. I will let it run for some time and see whether it fails again.
Running without issues for 2 days now. Seems like the keepalive fixed it!
🙌 Yay! Thank you so much for reporting this and the follow-up!
Thank you as always for the quick help and fix!
tl;dr: maybe GoodJob's Notifier should emit a no-op SQL query every N seconds to keep the connection alive.

I also looked at the sequence of operations the Notifier runs on that connection. If the connection is dead, GoodJob may still be trying to run those further operations on it, which makes things worse. If it hits a ConnectionBad, GoodJob should just throw the connection away rather than trying to clean it up.
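A rough sketch of that idea, assuming a raw `PG::Connection` and illustrative method/statement names (this is not GoodJob's actual cleanup code):

```ruby
# Illustrative only -- not GoodJob's internals. Once the socket has raised
# PG::ConnectionBad, further statements (e.g. UNLISTEN) will also fail, so
# just drop the connection instead of trying to "clean it up".
require "pg"

def shutdown_listener(connection)
  connection.exec("UNLISTEN *") # normal cleanup on a healthy connection
rescue PG::ConnectionBad
  # The connection is already gone; don't issue more commands on it.
  connection.finish rescue nil # discard it and check out a fresh one later
end
```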
But the underlying issue here is...
When the Postgres connection is in LISTEN mode and just looping with wait_for_notify(1.second), the connection doesn't look active to the database/proxy, so the proxy closes it after 60 seconds.
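For reference, a minimal standalone loop (not GoodJob's actual Notifier; the channel name and connection setup are assumptions) showing why nothing crosses the wire while listening:

```ruby
# After LISTEN, the client only waits on the socket for NOTIFY events, so no
# bytes are sent to the server and an idle-timeout proxy may drop the
# connection as "inactive".
require "pg"

conn = PG.connect(ENV.fetch("DATABASE_URL"))
conn.exec("LISTEN good_job") # channel name is illustrative

loop do
  # Blocks for up to 1 second waiting for a notification; sends nothing.
  conn.wait_for_notify(1) do |channel, _pid, payload|
    puts "NOTIFY on #{channel}: #{payload.inspect}"
  end
end
```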
So that means maybe GoodJob should do something dumb like running `SELECT 1 as one` every 15 seconds or so, which would be enough to keep the connection from being cut. Does that seem valid?
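A minimal sketch of what that keepalive could look like, assuming a raw `pg` connection and an illustrative 15-second interval (not GoodJob's actual implementation):

```ruby
# Same wait loop as above, but issue a trivial query on a fixed interval so
# traffic crosses the connection and the proxy's idle timer is reset.
# Interval, query, and channel name are assumptions for illustration.
require "pg"

KEEPALIVE_INTERVAL = 15 # seconds

conn = PG.connect(ENV.fetch("DATABASE_URL"))
conn.exec("LISTEN good_job")

last_keepalive = Process.clock_gettime(Process::CLOCK_MONOTONIC)
loop do
  conn.wait_for_notify(1) do |channel, _pid, payload|
    puts "NOTIFY on #{channel}: #{payload.inspect}"
  end

  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  if now - last_keepalive >= KEEPALIVE_INTERVAL
    conn.exec("SELECT 1 AS one") # no-op query that keeps the connection looking active
    last_keepalive = now
  end
end
```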