Add an alert to catch mismatch of StatefulSet replicas vs expected/ready #654

philipgough · 2023-11-29T10:58:30Z

We have had a couple of incidents in which something went wrong with one of the nodes that Thanos Receive was scheduled on. The replica goes into a terminating state, which has a grace period of 900 seconds.

If the node does not become healthy after 10 minutes, the OpenShift Machine API Operator removes it from the set. This appears to leave the replica in a zombie state: it gets stuck in terminating and won't reschedule due to affinity.

In this case force deleting seems to work, but we want to avoid rolling out in such cases and affecting hashring stability. I've set the alert's `for` duration to 20m so that it extends past the grace period and only alerts in cases where it really is a problem.
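
To illustrate why a 20m hold time sits safely past the 900-second grace period, here is a minimal Python sketch of the "condition must hold continuously" behaviour. It is not the rule added in this PR: the `Sample` type, the helper names, the 60-second sampling step and the three-replica series are all hypothetical, and the timer logic only approximates how a Prometheus `for` clause behaves.

```python
from dataclasses import dataclass
from typing import List, Optional

GRACE_PERIOD_SECONDS = 900      # termination grace period quoted in the description
FOR_DURATION_SECONDS = 20 * 60  # the 20m hold time quoted in the description


@dataclass
class Sample:
    timestamp: int  # seconds since an arbitrary start
    expected: int   # desired StatefulSet replicas
    ready: int      # replicas currently ready


def first_firing_time(samples: List[Sample],
                      for_seconds: int = FOR_DURATION_SECONDS) -> Optional[int]:
    """Return the timestamp at which the alert would fire, or None.

    The mismatch (ready != expected) must hold on every consecutive sample
    for at least for_seconds; any healthy sample resets the timer.
    """
    pending_since = None
    for s in samples:
        if s.ready != s.expected:
            if pending_since is None:
                pending_since = s.timestamp
            if s.timestamp - pending_since >= for_seconds:
                return s.timestamp
        else:
            pending_since = None
    return None


def replica_outage(duration_seconds: int, step: int = 60,
                   expected: int = 3) -> List[Sample]:
    """Hypothetical series: one replica is not ready for duration_seconds."""
    total = duration_seconds + 30 * 60
    return [
        Sample(t, expected, expected - 1 if t < duration_seconds else expected)
        for t in range(0, total + 1, step)
    ]


if __name__ == "__main__":
    # A replica that recovers once the grace period ends never fires the alert.
    print(first_firing_time(replica_outage(GRACE_PERIOD_SECONDS)))
    # A replica stuck for an hour fires once the mismatch has lasted 20 minutes.
    print(first_firing_time(replica_outage(3600)))
```

Under these assumptions, a replica that is simply working through its termination grace period recovers before the timer expires, while one stuck in terminating keeps the mismatch alive long enough to alert.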

saswatamcode

LGTM! Thanks for adding runbook too!

Add an alert to catch mismatch of sts replicas vs expected/ready

d59d484

saswatamcode approved these changes Nov 29, 2023

philipgough merged commit 1ce21b2 into rhobs:main Nov 29, 2023
3 checks passed