IOCs need a readiness probe #22
@gilesknap The case we are considering is a failing IOC. I do the following to get a failing ioc-instance: in the config ioc.yaml I make an entity "goodluck" that ibek will not recognise.
I deploy this to my personal namespace in argus. This results in these logs (not reproduced here). After 3 minutes I get an email; at this point there were 5 restarts. After 15 minutes I get another email, after a further 15 minutes the same again, after 30 minutes another, and so on.

Discussion:
Our liveness check definition does have `"initialDelaySeconds": 120`.
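For context, a readiness probe would sit alongside that liveness probe in the IOC pod spec. A minimal sketch follows; the script path and the timing values other than the 120-second initial delay are illustrative assumptions, not this repo's actual configuration:

```yaml
# Hypothetical pod-spec fragment -- script path and timings are illustrative.
readinessProbe:
  exec:
    command: ["/bin/sh", "-c", "/epics/scripts/ready_check.sh"]  # assumed location
  initialDelaySeconds: 120   # matching the existing liveness delay
  periodSeconds: 10          # assumed polling interval
  timeoutSeconds: 35         # must exceed the ~30 s internal deadline
  failureThreshold: 3
```

Until the probe succeeds, Kubernetes marks the pod NotReady, which is what would suppress the "it started, then crashed" alerting pattern described below.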
Including @tghartland, as my assertion that we need a readiness probe is based on a conversation in a helpdesk issue (unfortunately I asked Thomas to resolve that issue and now cannot find it). I don't think 15-minute emails are acceptable, because they go to lots of people and therefore waste lots of people's time, so I think we want to fix this. Thomas, please can you remind me of your reasoning for why you believed we needed a readiness probe to resolve this?
The ticket Giles mentions was for one IOC pod in p45 crash-looping because one of the other devices it connects to was powered off. I'll copy my analysis from that ticket:
In the same way as a standard webserver readiness probe is to have it make a GET request to …

@marcelldls when you say IOCs are not self-healing and need manual input to restart, does that include this situation above, where the IOC is failing to start at all until another dependency comes up? Or would this be a case where the try-until-success loop works and is desirable? Even though the main functional purpose of readiness is to indicate being ready to receive requests (from Kubernetes services), I think this state is prolific in enough interfaces (…
Regarding self-healing: the example Marcell used was to deploy a broken IOC, and that clearly would not self-heal. But once a working IOC has been deployed, the most likely cause of boot loops is that its device is unavailable. In that instance we would want it to keep trying until its device became available. But even for broken IOCs, reducing the number of alerts is still desirable, and it seems the readiness probe can allow that.

An aside re IOC failure behaviour:
When an IOC is failing we get many many messages from K8S.
That is because the IOC takes long enough to start and crash that K8S defaults to considering it READY.
We should add a readiness probe along the same lines as our liveness probe except that it loops until the PV is available and exits with fail after 30 secs or so if it does not become available.
Warning: if the IOC crashes after ioc_init then the status PVs may appear briefly, so we'd need to cope with that — perhaps make sure the PV stays available for some count of seconds before returning success.
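The loop described above can be sketched as a small POSIX shell function. This is only an illustration: the `ready_check` name, the `${IOC_NAME}:SysReset` PV, and the use of `caget` are assumptions, not the repo's actual probe; the real check would presumably reuse whatever PV the existing liveness probe queries.

```shell
# ready_check: sketch of the readiness loop described above (illustrative only).
#   $1 = check command (e.g. a caget of a status PV)
#   $2 = overall timeout in seconds (the "fail after 30 secs or so")
#   $3 = consecutive successful checks required before reporting ready
ready_check() {
  check="$1"; timeout="$2"; need="$3"
  deadline=$(( $(date +%s) + timeout ))
  ok=0
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if $check >/dev/null 2>&1; then
      ok=$(( ok + 1 ))
      # Require the PV to stay up for several consecutive checks, to cope
      # with status PVs that appear briefly before a post-ioc_init crash.
      [ "$ok" -ge "$need" ] && return 0
    else
      ok=0   # any failure resets the stability counter
    fi
    sleep 1
  done
  return 1   # not ready within the timeout
}

# In the container the probe might run something like (hypothetical PV name):
#   ready_check "caget -t -w 1 ${IOC_NAME}:SysReset" 30 5
```

Resetting the counter on any failure is what handles the "PV appears briefly, then the IOC crashes" case: the probe only succeeds once the PV has been continuously readable for the required number of checks.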