Fix SyncFailed after promote-to-primary #397

Open
eberlep opened this issue Jul 27, 2022 · 0 comments · May be fixed by #423
Comments

Collaborator

eberlep commented Jul 27, 2022

When switching from standby to primary, we first send an HTTP PATCH to the Patroni REST API to tell the database to become a Leader, then we update the custom resource accordingly.

When the custom resource is updated to replication leader, the operator immediately tries to update the ROLES in the database. If the Patroni REST call has not taken effect before that role update, the database is still read-only and the update of the ROLES fails, hence the SyncFailed status.

This status will remain until the next resync cycle, which happens every 30 minutes.

Brainstorming:

To fix the actual problem, we should introduce a small wait between the REST call and the update of the custom resource. That wait would only be necessary when the status actually changed (i.e. we performed a switch), so we need to identify whether a switch was performed. Here is my idea: instead of always doing the PATCH request, we should perform a GET and check the actual Patroni status. If all is correct, continue as before. If, however, the status is not what it should be, perform the PATCH and reschedule the postgres reconcile. Here, we can decide how long to wait before the next reconcile (via Result.RequeueAfter).

The only problem I see right now is when Patroni is not responding: we wouldn't be updating the custom resource anymore but would reconcile forever. A reconcile would also be performed on initial creation of the database. So maybe we should simply continue when the GET to Patroni fails?

Update: To remedy the last problem, we defer the reconcile depending on when and where the update of the Patroni config fails. That way, we can continue with updating the custom resource if desired and reconcile afterwards (to try updating the Patroni config again).

@eberlep eberlep linked a pull request Aug 31, 2022 that will close this issue