
feat(server): Report unhealthy instead of terminating on panic #4255

Closed
wants to merge 30 commits

Conversation

@jjbayer (Member) commented Nov 15, 2024

Follow-up to #4249: instead of terminating the process when a service panics, report unhealthy on the health check endpoint. This gives other services the chance to finish processing and/or flush out data.

NOTE: This is the same as PR #4250, which I accidentally merged by pushing to the wrong branch.

Fixes #4037.
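
Conceptually, the change can be sketched as follows. This is an illustrative Rust sketch, not Relay's actual code; `ServiceHealth` and its methods are hypothetical names. A panicking service records its failure in shared state instead of aborting the process, and the health check endpoint reports that state.

```rust
use std::panic::catch_unwind;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared flag read by the health check endpoint (hypothetical type).
#[derive(Clone, Default)]
struct ServiceHealth(Arc<AtomicBool>);

impl ServiceHealth {
    fn mark_crashed(&self) {
        self.0.store(true, Ordering::Relaxed);
    }

    fn is_healthy(&self) -> bool {
        !self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let health = ServiceHealth::default();

    // Run a "service" on its own thread. If it panics, flip the shared flag
    // instead of terminating the whole process, so other services still get a
    // chance to finish processing and flush out data.
    let monitored = health.clone();
    std::thread::spawn(move || {
        let result = catch_unwind(|| {
            // Simulate a panicking service.
            panic!("service crashed");
        });
        if result.is_err() {
            monitored.mark_crashed();
        }
    })
    .join()
    .unwrap();

    // The health check endpoint would now report unhealthy.
    println!("healthy: {}", health.is_healthy());
}
```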

Base automatically changed from joris/join to master November 19, 2024 07:22
```rust
HealthCheck::IsHealthy(message, sender) => {
    let update = update_rx.borrow();
    sender.send(if matches!(message, IsHealthy::Liveness) {
        Status::Healthy
        // … (excerpt truncated)
```
@jjbayer (Member, Author) commented:

We considered failing the Liveness check once a service has crashed, but IIUC this would mean that Kubernetes would kill the process immediately. We want to keep the process alive so other services still have a chance to finish their work.

If the liveness probe fails, the kubelet kills the container

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#types-of-probe
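
The distinction can be illustrated with a small, self-contained sketch. The enum variants mirror the excerpt above, but `check` and `services_healthy` are illustrative names rather than Relay's actual API: liveness always reports healthy so the kubelet does not restart the pod, while readiness reflects crashed services.

```rust
enum IsHealthy {
    Liveness,
    Readiness,
}

#[derive(Debug, PartialEq)]
enum Status {
    Healthy,
    Unhealthy,
}

fn check(message: IsHealthy, services_healthy: bool) -> Status {
    match message {
        // Failing liveness would make the kubelet kill the container, so a
        // crashed service is only surfaced through readiness.
        IsHealthy::Liveness => Status::Healthy,
        IsHealthy::Readiness if services_healthy => Status::Healthy,
        IsHealthy::Readiness => Status::Unhealthy,
    }
}

fn main() {
    // With a crashed service, the pod stays alive (liveness passes) but is
    // taken out of rotation (readiness fails).
    assert_eq!(check(IsHealthy::Liveness, false), Status::Healthy);
    assert_eq!(check(IsHealthy::Readiness, false), Status::Unhealthy);
}
```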

@jjbayer marked this pull request as ready for review November 26, 2024 13:20
@jjbayer requested a review from a team as a code owner November 26, 2024 13:20
@Dav1dde (Member) left a comment:

Does this actually make the situation better? To me it seems to introduce more problems than it solves.

You potentially solve the problem of 'leftover' data on a crash. But in most cases this does not actually work: if one of the fundamental services (ProjectCache, Upstream, or Store) crashes, reporting unhealthy does not help.

At the same time, you introduce another failure mode into the cluster: pods that are forever unhealthy and never self-recover, basically forcing immediate manual intervention.

This is especially bad for Proxy and Managed Relays as well as Self-Hosted.

I think making the services more resilient (e.g. isolating message processing) solves this problem better.
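
For context, "isolating message processing" could look roughly like the following sketch, where `handle_message` is a hypothetical handler rather than Relay's actual service code: a panic while handling one message is caught, the message is dropped and logged, and the service keeps running instead of staying permanently unhealthy.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical message handler; the name and signature are illustrative.
fn handle_message(msg: &str) {
    if msg == "bad" {
        panic!("failed to process message");
    }
    println!("processed: {msg}");
}

fn main() {
    // Isolate each message: a panic is contained to that message, so the
    // service self-recovers instead of requiring manual intervention.
    // (The default panic hook still prints the panic message to stderr.)
    for msg in ["ok", "bad", "still ok"] {
        if catch_unwind(AssertUnwindSafe(|| handle_message(msg))).is_err() {
            eprintln!("message dropped after panic: {msg}");
        }
    }
}
```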

@jjbayer (Member, Author) commented Nov 26, 2024

This PR addresses an edge case that we have seen in production only once. The team agrees that crashing Relay is better than silent service failure (see #4249), but does not unanimously agree that failing the health check is better than crashing. Handling service panics through the health check would also force Relay users outside of SaaS to monitor the health check endpoint in the same way we do.

@jjbayer closed this Nov 26, 2024