Graceful shutdown period #2105
@keeferrourke @swankjesse Do either of you have opinions on this?
For Kubernetes+Istio specifically, we may not need to do this directly inside of Misk, because Kubernetes will immediately remove the Pod from the Endpoints object when terminating, and Istio will pick that up. For any sort of external load balancer, though (such as F5 or AWS ALB), it would still be beneficial to gracefully signal shutdown to the LB before dying.
I was looking into this, and noticed that Misk already configures Jetty to do a graceful shutdown (SO). Then, because
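For context, Jetty's own graceful shutdown is usually wired up along these lines (a sketch against the Jetty 9.x API, not Misk's actual configuration): wrap the handler tree in a StatisticsHandler so Jetty can track in-flight requests, and set a stop timeout so stop() closes the connectors and then waits for those requests to drain.

```kotlin
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.StatisticsHandler
import org.eclipse.jetty.servlet.ServletContextHandler

// Sketch of Jetty's built-in graceful shutdown (Jetty 9.x API), not Misk's
// actual wiring. StatisticsHandler tracks in-flight requests; stopTimeout
// tells stop() how long to wait for them after the connectors close.
fun buildServer(): Server {
  val server = Server(8080)

  val stats = StatisticsHandler()
  stats.handler = ServletContextHandler() // the real application handlers go here
  server.handler = stats

  // On stop(): refuse new connections, then wait up to 10s for in-flight
  // requests to complete before tearing the server down.
  server.stopTimeout = 10_000L
  return server
}
```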
I just noticed a poor shutdown experience in merbro. It starts like this (each block is newest to oldest messages due to LQ's ordering, sorry):
Then 20 milliseconds later we start shutting down the DB:
Then 80ms later we work on shutting down the containers:
At this point we have our first error caused by work happening after the DB has been shut down. This is part of servicing an ongoing web action request (underlying error is
300ms later we report this error to the caller of the action:
We then begin to shut down Jetty:
20ms later we encounter another failure when processing an async SQS job that's caused by the DB having been shut down:
Jetty is reported stopped 400ms after that:
At this point we then shut down the SQS poller:
And we finally trigger a graceful termination of istio:
And a readiness probe begins to return 503:
The ordering here seems unfortunate. The DB gets shut down before other services that depend upon it are shut down, and it's not clear whether the readiness probe actually returns a failure before Jetty finishes shutting down.
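For illustration only (hypothetical interface, not Misk's CoordinatedService API), the ordering being asked for is essentially "reverse of startup": the DB should only stop once everything that still depends on it has stopped.

```kotlin
// Illustrative only: stopping services in the reverse of their startup /
// dependency order keeps the DB alive until the web server and the SQS
// poller -- which still use it -- have fully stopped.
interface AppService {
  val name: String
  fun stop()
}

fun stopInReverseDependencyOrder(startupOrder: List<AppService>) {
  // startupOrder might be [database, sqsPoller, jettyWebServer];
  // reversing it drains Jetty and the poller before the DB goes away.
  for (service in startupOrder.asReversed()) {
    println("stopping ${service.name}")
    service.stop()
  }
}
```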
I caught another case where we see weird errors because graceful shutdowns do not appear to be taking place. First merbro receives the signal to start shutting down:
Then a web action that is servicing a request sees an HTTP stream reset error:
Then Jetty reports that it is starting to shut down:
Why are we shutting down the server before we've finished servicing requests? Why can't we make the server stop accepting new requests, then have a reasonable timeout period before shutting down the server and other services like the DB?
This is a good idea, we do it on our internal service container. The best process is roughly:
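As a sketch of that kind of drain-then-stop sequence (made-up names, and not necessarily the internal container's exact steps): fail readiness first, give the load balancer time to re-route, drain in-flight work, and only then stop the services that work depends on.

```kotlin
import java.time.Duration

// Hypothetical drain-then-stop sequence (made-up names, not an existing Misk
// API). Assumes the load balancer or kubelet is probing `healthy` and stops
// routing new requests once it flips to false.
class GracefulShutdown(
  private val gracePeriod: Duration = Duration.ofSeconds(30)
) {
  @Volatile var healthy: Boolean = true
    private set

  fun shutdown(
    stopAcceptingRequests: () -> Unit,  // e.g. close Jetty connectors
    drainInFlightRequests: () -> Unit,  // wait for ongoing web actions / jobs
    stopBackgroundWork: () -> Unit,     // e.g. SQS pollers, cron tasks
    closeDatabase: () -> Unit           // the DB goes last
  ) {
    // 1. Start failing readiness checks so no new traffic is routed here.
    healthy = false
    // 2. Give the load balancer time to notice and re-route.
    Thread.sleep(gracePeriod.toMillis())
    // 3. Refuse new requests, then let in-flight work finish.
    stopAcceptingRequests()
    drainInFlightRequests()
    // 4. Stop consumers that still need the DB, then the DB itself.
    stopBackgroundWork()
    closeDatabase()
  }
}
```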
I found something interesting when poking at this:
Would setting a non-default health_port in our service config allow us to experiment with whether a shutdown sleep would address this issue? I'm having trouble figuring out the implications of the 10s timeout during the JettyService shutdown with the health executor enabled.
Update: #2238 merged, and is a stopgap solution. Affected services (and there are a lot of them) can set the
When deploying a new version of a service behind a load balancer, it's useful to "gracefully" terminate the instances of the old version while directing all new traffic to the instances of the new version. In Kubernetes-land, this is usually done by signaling to the soon-to-be-terminated instances to stop responding successfully to health checks. When the load balancer notices they're failing health checks, it will stop sending new requests to them and direct all traffic to the new instances that are passing health checks. Then, after some reasonable grace period, such as 30 seconds, the old instances can shut down entirely.
From what I understand, Misk's current shutdown behavior is to just race to shut down (while preserving CoordinatedService ordering), with no provision for a grace period of 1) intentionally failing health checks, 2) refusing new requests, or 3) both, to give the upstream load balancer an opportunity to re-route traffic. This manifests as spurious errors in services that depend on Misk services, because sometimes they make calls during the brief shutdown period and those calls fail in weird ways due to the rushed shutdown.
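As a concrete sketch of the health-check-based approach described above (hypothetical endpoint path and names, using only the JDK's built-in HttpServer rather than Misk's actual health check machinery): flip the readiness endpoint to 503 as soon as shutdown begins, keep serving in-flight traffic, and wait out the grace period before actually stopping.

```kotlin
import com.sun.net.httpserver.HttpServer
import java.net.InetSocketAddress
import java.util.concurrent.atomic.AtomicBoolean

// Sketch of the readiness-flip pattern (hypothetical names; Misk serves its
// health checks differently). The load balancer probes /_readiness and stops
// routing to this instance once it starts returning 503.
val ready = AtomicBoolean(true)

fun main() {
  val server = HttpServer.create(InetSocketAddress(8081), 0)
  server.createContext("/_readiness") { exchange ->
    val code = if (ready.get()) 200 else 503
    exchange.sendResponseHeaders(code, -1L)  // -1 = no response body
    exchange.close()
  }
  server.start()

  // On SIGTERM, start failing readiness so the load balancer stops sending
  // new requests, then wait out the grace period before actually exiting.
  Runtime.getRuntime().addShutdownHook(Thread {
    ready.set(false)
    Thread.sleep(30_000L)  // grace period for the LB to deregister this pod
    server.stop(0)
  })
}
```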