feat(server): Report unhealthy instead of terminating on panic #4255
Conversation
Co-authored-by: Sebastian Zivota <[email protected]>
This reverts commit d6bbdcc.
This reverts commit 82729a8.
This reverts commit d0b4af9.
This reverts commit d46f370.
Force-pushed from 0555648 to 43eca3f.
```rust
HealthCheck::IsHealthy(message, sender) => {
    let update = update_rx.borrow();
    sender.send(if matches!(message, IsHealthy::Liveness) {
        Status::Healthy
```
We considered failing the `Liveness` check once a service has crashed, but IIUC this would mean that Kubernetes would kill the process immediately. We want to keep the process alive so other services still have a chance to finish their work.

> If the liveness probe fails, the kubelet kills the container

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#types-of-probe
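To illustrate the distinction, here is a minimal sketch with hypothetical types (not Relay's actual health check code): the liveness probe always reports healthy so the kubelet never restarts the pod, while the readiness probe reflects whether all services are still running.

```rust
// Minimal sketch, hypothetical types -- not Relay's actual implementation.

#[derive(Clone, Copy, PartialEq, Debug)]
enum IsHealthy {
    Liveness,
    Readiness,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Healthy,
    Unhealthy,
}

fn check(message: IsHealthy, all_services_running: bool) -> Status {
    match message {
        // Never fail liveness: a failing liveness probe makes the kubelet
        // restart the container, losing whatever other services could still flush.
        IsHealthy::Liveness => Status::Healthy,
        // Fail readiness instead: the pod is taken out of rotation but stays alive.
        IsHealthy::Readiness if all_services_running => Status::Healthy,
        IsHealthy::Readiness => Status::Unhealthy,
    }
}

fn main() {
    assert_eq!(check(IsHealthy::Liveness, false), Status::Healthy);
    assert_eq!(check(IsHealthy::Readiness, false), Status::Unhealthy);
}
```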
Does this actually make the situation better? To me it seems like it introduces more problems than it solves.

You potentially solve the problem of 'leftover' data on crash. But in most cases this does not actually work: if one of the fundamental services crashes, e.g. `ProjectCache`, `Upstream`, or `Store`, this does not help.

At the same time, you introduce another failure mode into the cluster: pods that are forever unhealthy and never self-recover, effectively forcing immediate manual intervention. This is especially bad for Proxy and Managed Relays, as well as for Self-Hosted.

I think making the services more resilient (e.g. isolating message processing) solves this problem better.
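For reference, a rough sketch of what "isolating message processing" could look like (hypothetical, not Relay's actual service code): each message handler is wrapped in `catch_unwind`, so a panic only drops that one message instead of killing the whole service loop.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical sketch of per-message isolation: a panic in one handler call
// poisons only that message; the service loop keeps running.
fn run_service<M>(messages: impl IntoIterator<Item = M>, mut handle: impl FnMut(&M)) {
    for message in messages {
        let result = panic::catch_unwind(AssertUnwindSafe(|| handle(&message)));
        if result.is_err() {
            // Log and continue instead of crashing or staying unhealthy forever.
            eprintln!("message handler panicked; dropping message and continuing");
        }
    }
}

fn main() {
    run_service(1..=3, |n: &i32| {
        if *n == 2 {
            panic!("boom");
        }
        println!("processed {n}");
    });
}
```

With this kind of isolation the service recovers on its own, without involving the health check at all.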
This PR treats an edge case that we've seen in production only once. The team agrees that crashing Relay is better than silent service failure (see #4249), but does not unanimously agree that a health check failure is better than crashing. Handling service panics through the health check forces Relay users outside of SaaS to monitor the health check endpoint in the same way we do.
Follow-up to #4249: instead of terminating the process when a service panics, report unhealthy on the health check endpoint. This gives other services the chance to finish processing and/or flush out data.
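Conceptually, the change amounts to something like the following simplified, hypothetical sketch (not the actual implementation in this PR): when a service panics, a supervisor flips a shared flag that the health check endpoint reports, instead of aborting the process.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // Shared flag that a health check endpoint would read (simplified).
    let healthy = Arc::new(AtomicBool::new(true));

    // Simulated service that panics at some point.
    let service = thread::spawn(|| {
        panic!("service crashed");
    });

    // The supervisor observes the panic through the join result and reports
    // unhealthy instead of terminating, so other services can keep flushing data.
    if service.join().is_err() {
        healthy.store(false, Ordering::Relaxed);
    }

    println!(
        "readiness: {}",
        if healthy.load(Ordering::Relaxed) { "healthy" } else { "unhealthy" }
    );
}
```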
NOTE: This is the same as PR #4250, which I accidentally merged by pushing to the wrong branch.
Fixes #4037.