
feat(server): Report unhealthy instead of terminating on panic #4255

Closed
wants to merge 30 commits

Conversation

@jjbayer (Member) commented Nov 15, 2024

Follow-up to #4249: instead of terminating the process when a service panics, report unhealthy on the health check endpoint. This gives other services the chance to finish processing and/or flush out data.

NOTE: This is the same as PR #4250, which I accidentally merged by pushing to the wrong branch.

Fixes #4037.
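
Conceptually, the change can be sketched as follows. This is an illustrative Rust sketch, not Relay's actual code; `ServiceHealth` and its methods are hypothetical names. A panicking service records its failure in shared state instead of aborting the process, and the health check endpoint reports that state.

```rust
use std::panic::catch_unwind;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared flag read by the health check endpoint (hypothetical type).
#[derive(Clone, Default)]
struct ServiceHealth(Arc<AtomicBool>);

impl ServiceHealth {
    fn mark_crashed(&self) {
        self.0.store(true, Ordering::Relaxed);
    }

    fn is_healthy(&self) -> bool {
        !self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let health = ServiceHealth::default();

    // Run a "service" on its own thread. If it panics, flip the shared flag
    // instead of terminating the whole process, so other services still get a
    // chance to finish processing and flush out data.
    let monitored = health.clone();
    std::thread::spawn(move || {
        let result = catch_unwind(|| {
            // Simulate a panicking service.
            panic!("service crashed");
        });
        if result.is_err() {
            monitored.mark_crashed();
        }
    })
    .join()
    .unwrap();

    // The health check endpoint would now report unhealthy.
    println!("healthy: {}", health.is_healthy());
}
```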

Base automatically changed from joris/join to master November 19, 2024 07:22
```rust
HealthCheck::IsHealthy(message, sender) => {
    let update = update_rx.borrow();
    sender.send(if matches!(message, IsHealthy::Liveness) {
        Status::Healthy
        // … (excerpt truncated)
```
@jjbayer (Member, Author) commented:

We considered failing the Liveness check once a service has crashed, but IIUC this would mean that Kubernetes would kill the process immediately. We want to keep the process alive so other services still have a chance to finish their work.

If the liveness probe fails, the kubelet kills the container

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#types-of-probe
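
The distinction can be illustrated with a small, self-contained sketch. The enum variants mirror the excerpt above, but `check` and `services_healthy` are illustrative names rather than Relay's actual API: liveness always reports healthy so the kubelet does not restart the pod, while readiness reflects crashed services.

```rust
enum IsHealthy {
    Liveness,
    Readiness,
}

#[derive(Debug, PartialEq)]
enum Status {
    Healthy,
    Unhealthy,
}

fn check(message: IsHealthy, services_healthy: bool) -> Status {
    match message {
        // Failing liveness would make the kubelet kill the container, so a
        // crashed service is only surfaced through readiness.
        IsHealthy::Liveness => Status::Healthy,
        IsHealthy::Readiness if services_healthy => Status::Healthy,
        IsHealthy::Readiness => Status::Unhealthy,
    }
}

fn main() {
    // With a crashed service, the pod stays alive (liveness passes) but is
    // taken out of rotation (readiness fails).
    assert_eq!(check(IsHealthy::Liveness, false), Status::Healthy);
    assert_eq!(check(IsHealthy::Readiness, false), Status::Unhealthy);
}
```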

@jjbayer marked this pull request as ready for review November 26, 2024 13:20
@jjbayer requested a review from a team as a code owner November 26, 2024 13:20
@Dav1dde (Member) left a comment:

Does this actually make the situation better? To me it seems to introduce more problems than it solves.

You potentially solve the problem of 'leftover' data on a crash. But in most cases this does not actually work: if one of the fundamental services (ProjectCache, Upstream, or Store) crashes, reporting unhealthy does not help.

At the same time, you introduce another failure mode into the cluster: pods that are forever unhealthy and never self-recover, basically forcing immediate manual intervention.

This is especially bad for Proxy and Managed Relays as well as Self-Hosted.

I think making the services more resilient (e.g. isolating message processing) solves this problem better.
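
For context, "isolating message processing" could look roughly like the following sketch, where `handle_message` is a hypothetical handler rather than Relay's actual service code: a panic while handling one message is caught, the message is dropped and logged, and the service keeps running instead of staying permanently unhealthy.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical message handler; the name and signature are illustrative.
fn handle_message(msg: &str) {
    if msg == "bad" {
        panic!("failed to process message");
    }
    println!("processed: {msg}");
}

fn main() {
    // Isolate each message: a panic is contained to that message, so the
    // service self-recovers instead of requiring manual intervention.
    // (The default panic hook still prints the panic message to stderr.)
    for msg in ["ok", "bad", "still ok"] {
        if catch_unwind(AssertUnwindSafe(|| handle_message(msg))).is_err() {
            eprintln!("message dropped after panic: {msg}");
        }
    }
}
```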

@jjbayer (Member, Author) commented Nov 26, 2024

This PR addresses an edge case that we have seen in production only once. The team agrees that crashing Relay is better than silent service failure (see #4249), but does not unanimously agree that failing the health check is better than crashing. Handling service panics through the health check would also force Relay users outside of SaaS to monitor the health check endpoint in the same way we do.

@jjbayer closed this Nov 26, 2024