[1.17] Retry on leader lease renewal failure #9639
Merged
Description
Backport of solo-io#9563
After a container becomes the leader, any Kube API server unavailability causes the container to crash, because the leader is unable to renew its lease. This is by design, as outlined here.
However, crashing the gloo pods can also lead to an outage during scaling: the gateway-proxy pods that come up cannot fetch any config, and all routes return 404s.
Now, instead of crashing, the gloo pod falls back to being a follower. This prevents an outage, and any other pod can take over as leader if possible.
This is safe because a leader only writes reports / statuses, over here and here. On any failure, the pod becomes a follower and, if re-elected as leader, resumes writing reports.
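The fallback described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: the `Identity` type, `Elect`, `IsLeader`, `RunElection`, the `campaign` callback, and the `maxRetries` bound are all hypothetical names chosen for the example. Only the idea of a `Reset` method that returns an identity to follower state comes from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Identity is a hypothetical stand-in for the identity implementation
// mentioned in this PR. A closed channel means "this pod is the leader".
type Identity struct {
	mu      sync.Mutex
	elected chan struct{}
}

func NewIdentity() *Identity {
	return &Identity{elected: make(chan struct{})}
}

// Elect marks the identity as leader. Calling it twice is harmless.
func (i *Identity) Elect() {
	i.mu.Lock()
	defer i.mu.Unlock()
	select {
	case <-i.elected: // already leader
	default:
		close(i.elected)
	}
}

// IsLeader reports whether the identity currently holds leadership.
func (i *Identity) IsLeader() bool {
	i.mu.Lock()
	defer i.mu.Unlock()
	select {
	case <-i.elected:
		return true
	default:
		return false
	}
}

// Reset demotes the identity to follower by swapping in a fresh, open channel.
func (i *Identity) Reset() {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.elected = make(chan struct{})
}

var errLeaseLost = errors.New("failed to renew leader lease")

// RunElection calls campaign repeatedly. campaign (an assumption of this
// sketch) invokes onElected when leadership is acquired and returns an error
// when the lease is lost. Instead of crashing on that error, the identity is
// reset to follower and the election is retried, up to maxRetries failures.
// It returns the number of lease failures that were survived.
func RunElection(id *Identity, campaign func(onElected func()) error, maxRetries int) int {
	failures := 0
	for failures < maxRetries {
		if err := campaign(id.Elect); err == nil {
			return failures // clean exit, no lease failure
		}
		id.Reset() // fall back to follower rather than exiting the process
		failures++
	}
	return failures
}

func main() {
	id := NewIdentity()
	calls := 0
	// Simulated election: the first two terms end with a lost lease.
	campaign := func(onElected func()) error {
		calls++
		onElected()
		if calls <= 2 {
			return errLeaseLost
		}
		return nil
	}
	failures := RunElection(id, campaign, 5)
	fmt.Println("lease failures survived:", failures, "leader:", id.IsLeader())
}
```

The retry bound exists only so the example terminates. How many retries the real implementation allows, and how it waits between them, is not stated in this description.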
Code changes
A `Reset` method on the identity implementation that allows an identity to fall back to a follower
CI changes
Instead, cilium is installed as CNI as we need to test Kube API server unavailability
Context
Kube API unavailability results in a gloo container crash
When leader election fails, gloo crashes
Design Doc
Interesting decisions
Testing steps
Kube2e tests to verify the following:
Checklist:
BOT NOTES:
resolves solo-io#8107