Consider making `tobira worker` deal with loss of DB connection
#732
Labels: area:backend, area:database, area:sync, kind:improvement
In #201 I decided to let our `worker` command just fail & exit when anything goes wrong. As it should run as a service anyway, configuring it to restart automatically should be easy. However, the standard restart behavior of systemd, for example, is not particularly helpful. It just restarts Tobira for a max of 10 times or whatever before stopping. And since the DB usually doesn't come back for some time, all those restarts fail too. I think we can improve the out-of-the-box situation here.
One simple idea would be to have all processes always establish a new DB connection before they become active (i.e. to have no long-lived DB connections). That would mean that every sync attempt or whatever would fail, but we would make sure not to bring down the whole process. But I don't think this is optimal, as it would result in lots of log spam since these processes would retry every 3s (search index) or 15s (sync) or something like that.
Instead, I think I would catch DB connection errors at the top level. If that happens, I would try to reestablish the connection with some exponential backoff or something. Further, there we could also find out if the DB is in a special state (e.g. hot standby) and keep retrying until we have a DB connection in a state that works for `worker`.
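To make the backoff idea a bit more concrete, here is a rough sketch using only the standard library. Everything in it is hypothetical and not taken from the Tobira code base: the `ConnectError` enum, `next_delay`, and `connect_with_backoff` are made-up names, and the `connect` closure stands in for whatever actually opens the connection and checks that the DB is writable. The sleep function is passed in so the loop can be tested without waiting.

```rust
use std::time::Duration;

/// Hypothetical reasons why a connection attempt is not usable for the worker.
#[derive(Debug)]
enum ConnectError {
    /// The database could not be reached at all.
    Unreachable,
    /// The database is reachable but read-only (e.g. a hot standby).
    ReadOnly,
}

/// Doubles the delay after a failed attempt, capped at `max`.
fn next_delay(current: Duration, max: Duration) -> Duration {
    std::cmp::min(current.saturating_mul(2), max)
}

/// Calls `connect` until it succeeds, waiting via `sleep` between attempts.
/// Returns the connection together with the number of failed attempts.
fn connect_with_backoff<C>(
    mut connect: impl FnMut() -> Result<C, ConnectError>,
    mut sleep: impl FnMut(Duration),
    initial: Duration,
    max: Duration,
) -> (C, u32) {
    let mut delay = initial;
    let mut failures = 0;
    loop {
        match connect() {
            Ok(conn) => return (conn, failures),
            Err(err) => {
                failures += 1;
                // One log line per failed attempt; the growing delay limits log spam.
                eprintln!("DB not usable ({err:?}), retrying in {delay:?}");
                sleep(delay);
                delay = next_delay(delay, max);
            }
        }
    }
}
```

In the worker, `sleep` would be something like `std::thread::sleep` (or an async equivalent), and the top-level loop would call this again whenever a task fails with a connection error. Whether a read-only DB should be retried forever or reported differently is an open design question; the sketch simply treats it like any other failure.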