Add an alert to catch mismatch of StatefulSet replicas vs expected/ready #654

Merged 1 commit on Nov 29, 2023

philipgough
Contributor

We have had a couple of issues where something went wrong with one of the nodes that Thanos Receive was scheduled on. The replica goes into a terminating state, which has a grace period of 900 seconds.

If the node does not become healthy after 10 minutes, the OpenShift Machine API Operator removes it from the set. This appears to leave the replica in a zombie state, where it gets stuck in terminating and won't reschedule due to affinity.

In this case force deleting seems to work, but we want to avoid rolling out in such cases and affecting hashring stability. I've set the value to 20m so that it extends past the 900-second (15-minute) grace period and the alert only fires in cases where it really is a problem.
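
To illustrate the timing reasoning, here is a minimal Python sketch of how a 20-minute hold interacts with the 900-second grace period. It is not the alert rule from this PR; `Sample`, `simulate`, and `would_alert` are hypothetical names, and the one-minute sampling interval and replica count of 3 are assumptions made only for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

GRACE_PERIOD_SECONDS = 900    # termination grace period mentioned in the description
ALERT_HOLD_SECONDS = 20 * 60  # the 20m value chosen in this PR


@dataclass
class Sample:
    timestamp: int  # seconds since the start of the incident
    desired: int    # replicas the StatefulSet expects
    ready: int      # replicas currently ready


def simulate(recovers_at: Optional[int], duration: int = 1800,
             step: int = 60, desired: int = 3) -> List[Sample]:
    """Build samples where one replica is missing until `recovers_at`.

    Passing None models a replica stuck in terminating that never recovers.
    """
    samples = []
    for t in range(0, duration + 1, step):
        missing = recovers_at is None or t < recovers_at
        samples.append(Sample(t, desired, desired - 1 if missing else desired))
    return samples


def would_alert(samples: List[Sample],
                hold_seconds: int = ALERT_HOLD_SECONDS) -> bool:
    """Return True if desired != ready held continuously for hold_seconds."""
    mismatch_since = None
    for s in sorted(samples, key=lambda s: s.timestamp):
        if s.desired != s.ready:
            if mismatch_since is None:
                mismatch_since = s.timestamp
            if s.timestamp - mismatch_since >= hold_seconds:
                return True
        else:
            # Any recovery resets the timer.
            mismatch_since = None
    return False


if __name__ == "__main__":
    # Replica comes back once the grace period has elapsed: no alert.
    print(would_alert(simulate(recovers_at=GRACE_PERIOD_SECONDS)))
    # Replica stays stuck in terminating: alert.
    print(would_alert(simulate(recovers_at=None)))
```

Under these assumptions, a replica that recovers by the end of the grace period never reaches the 20-minute threshold, while one that stays stuck does.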

Member

saswatamcode left a comment

LGTM! Thanks for adding runbook too!

philipgough merged commit 1ce21b2 into rhobs:main on Nov 29, 2023
3 checks passed