Add an alert to catch mismatch of StatefulSet replicas vs expected/ready #654
We have had a couple of issues where something went wrong with one of the nodes that Thanos receive was scheduled on. The replica goes into a terminating state, which has a grace period of 900 seconds.
If the node does not become healthy after 10 minutes, the OpenShift machine API operator removes it from the set. This appears to leave the replica in a zombie state, where it gets stuck in terminating and won't reschedule due to affinity.
In this case force deleting seems to work, but we want to avoid rolling out in such cases and affecting hashring stability. I've set the value to 20m to extend past the grace period and only alert in cases where it really is a problem.
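
For reviewers, here is a minimal Python sketch of the timing reasoning above. It is not the alert rule itself. The constant and function names are hypothetical, and it assumes the 20m value acts as a hold duration: the replica mismatch must persist that long before the alert fires.

```python
from datetime import timedelta

# Values taken from the description above.
GRACE_PERIOD = timedelta(seconds=900)          # termination grace period (15 minutes)
NODE_REMOVAL_TIMEOUT = timedelta(minutes=10)   # unhealthy node is removed after this long

# Assumption: the 20m value is how long the mismatch must persist before alerting.
ALERT_HOLD = timedelta(minutes=20)


def should_alert(mismatch_duration: timedelta, hold: timedelta = ALERT_HOLD) -> bool:
    """Return True once the replica mismatch has lasted at least `hold`."""
    return mismatch_duration >= hold


# A replica that terminates within its grace period never triggers the alert.
print(should_alert(GRACE_PERIOD))
# A replica still missing after 25 minutes does.
print(should_alert(timedelta(minutes=25)))
```

Because the hold duration exceeds both the 10 minute node removal window and the 900 second grace period, a normal termination resolves before the alert can fire.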