When switching from standby to primary, we first send an HTTP PATCH to the Patroni REST API to tell the database to become a `Leader`, then we update the custom resource accordingly.
When the custom resource is updated to replication leader, the operator immediately tries to update the `ROLES` in the database. If the Patroni REST call has not taken effect before that role update, the database is still read-only and the update of the `ROLES` fails, hence the `SyncFailed` status.
This status remains until the next resync cycle, which happens every 30 minutes.
Brainstorming:
To fix the actual problem, we should introduce a small wait between the REST call and the update of the custom resource. That wait is only necessary when the status actually changed (i.e. we performed a switch), so we need to identify whether a switch was performed. Here is my idea: instead of always doing the PATCH request, we should perform a GET and check the actual Patroni status. If it already matches the desired state, continue as before. If it does not, perform the PATCH and reschedule the postgres reconcile. At that point we can decide how long to wait before the next reconcile (via `Result.RequeueAfter`).
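A minimal sketch of the GET-then-PATCH idea, not the operator's actual code. `PatroniClient`, `fakePatroni`, `reconcilePatroni` and the 10 second delay are hypothetical names and values chosen for illustration, and `Result` is a local stand-in for the controller-runtime type so the snippet compiles on its own. Returning the GET error is just one option; what to do in that case is still open.

```go
package main

import (
	"errors"
	"time"
)

// Result is a local stand-in for controller-runtime's reconcile.Result,
// reduced to the one field this sketch needs.
type Result struct {
	RequeueAfter time.Duration
}

// PatroniClient is a hypothetical abstraction over the Patroni REST API.
type PatroniClient interface {
	// IsStandby performs the GET and reports the current state.
	IsStandby() (bool, error)
	// SetStandby performs the PATCH.
	SetStandby(standby bool) error
}

// switchRequeueDelay is an assumed value, not taken from the operator.
const switchRequeueDelay = 10 * time.Second

// errUnreachable simulates Patroni not responding.
var errUnreachable = errors.New("patroni unreachable")

// reconcilePatroni only sends the PATCH when the observed state differs from
// the desired one, and asks for a delayed requeue in that case.
func reconcilePatroni(c PatroniClient, wantStandby bool) (Result, error) {
	isStandby, err := c.IsStandby()
	if err != nil {
		// One possible choice: surface the error to the caller.
		return Result{}, err
	}
	if isStandby == wantStandby {
		// Nothing changed, continue as before without waiting.
		return Result{}, nil
	}
	if err := c.SetStandby(wantStandby); err != nil {
		return Result{}, err
	}
	// A switch was performed: give Patroni time before the next reconcile.
	return Result{RequeueAfter: switchRequeueDelay}, nil
}

// fakePatroni is an in-memory fake for trying out the sketch.
type fakePatroni struct {
	standby    bool
	getErr     error
	patchCalls int
}

func (f *fakePatroni) IsStandby() (bool, error) { return f.standby, f.getErr }

func (f *fakePatroni) SetStandby(standby bool) error {
	f.patchCalls++
	f.standby = standby
	return nil
}
```

With the fake starting as a standby, the first call to `reconcilePatroni(f, false)` sends one PATCH and requests a requeue; a second call finds the state already correct and returns an empty `Result`.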
The only problem I see right now is the case where Patroni is not responding: we would never get to updating the custom resource and would reconcile forever. The same reconcile would also be triggered on initial creation of the database. So maybe we should simply continue when the GET to Patroni fails?
Update: To remedy the last problem, we defer the reconcile depending on when and where the update of the Patroni config fails. That way, we can continue with updating the custom resource if desired and reconcile afterwards (to try updating the Patroni config again).
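A rough sketch of the deferred variant, covering only the simplest case where the Patroni step fails as a whole. `reconcileWithDeferredRetry`, `patroniRetryDelay` and `errPatroniUnavailable` are hypothetical, the two steps are passed in as plain functions, and `Result` is again a local stand-in for the controller-runtime type.

```go
package main

import (
	"errors"
	"time"
)

// Result is a local stand-in for controller-runtime's reconcile.Result.
type Result struct {
	RequeueAfter time.Duration
}

// patroniRetryDelay is an assumed value, not taken from the operator.
const patroniRetryDelay = 30 * time.Second

// errPatroniUnavailable simulates Patroni not responding.
var errPatroniUnavailable = errors.New("patroni not responding")

// reconcileWithDeferredRetry remembers a failure of the Patroni step instead
// of returning it right away, so the custom resource is still updated. The
// retry of the Patroni step is then scheduled through RequeueAfter.
func reconcileWithDeferredRetry(updatePatroni, updateCustomResource func() error) (Result, error) {
	patroniErr := updatePatroni()

	if err := updateCustomResource(); err != nil {
		return Result{}, err
	}
	if patroniErr != nil {
		// Custom resource is up to date; come back later for Patroni.
		return Result{RequeueAfter: patroniRetryDelay}, nil
	}
	return Result{}, nil
}
```

This keeps initial creation working even while Patroni is not reachable yet, at the cost of one extra reconcile after the delay.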