Fix deadlock during shutdown which prevented leader election cleanup #1688
Conversation
Force-pushed from 368b1e8 to ca6c29e.
go func() {
    defer cancel()

    <-postStartContext.StopCh
This is where the deadlock was happening. This StopCh channel is not going to be closed until our pre-shutdown hook finishes running. But our pre-shutdown hook calls shutdown.Wait(), which is effectively waiting for this goroutine to end (because this goroutine cancels the context which allows the runControllers() call to stop blocking, which in turn ends the WaitGroup that the pre-shutdown hook is waiting for).
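To make the cycle concrete, here is a minimal, runnable Go sketch of the shape described above. The names (runControllers, stopCh, preShutdownHook) and the hook wiring are simplified stand-ins for illustration, not the actual Pinniped source. Running it trips Go's runtime deadlock detector, since every goroutine ends up blocked waiting on another:

package main

import (
    "context"
    "sync"
)

// runControllers stands in for the real controller loop: it blocks until its context is cancelled.
func runControllers(ctx context.Context) { <-ctx.Done() }

func main() {
    // stopCh is only closed after the pre-shutdown hook returns.
    stopCh := make(chan struct{})
    ctx, cancel := context.WithCancel(context.Background())

    var shutdown sync.WaitGroup

    // Post-start hook: run the controllers until ctx is cancelled.
    shutdown.Add(1)
    go func() {
        defer shutdown.Done()
        runControllers(ctx)
    }()

    // Background goroutine from the post-start hook: cancel ctx once stopCh closes.
    go func() {
        defer cancel()
        <-stopCh // blocks until the pre-shutdown hook finishes... which is waiting on us
    }()

    // Pre-shutdown hook: wait for the controllers to stop before letting shutdown continue.
    preShutdownHook := func() {
        shutdown.Wait() // never returns, because cancel() never runs, so runControllers never exits
    }

    preShutdownHook() // deadlock: blocks forever
    close(stopCh)     // never reached
}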
🤯
Codecov Report
@@            Coverage Diff             @@
##             main    #1688      +/-   ##
==========================================
- Coverage   79.21%   79.15%   -0.06%
==========================================
  Files         163      163
  Lines       15758    15769      +11
==========================================
  Hits        12482    12482
- Misses       2961     2972      +11
  Partials      315      315
Nice find!
LGTM
Force-pushed from 843189f to 389adfa.
// Skip tailing pod logs for test runs that are using alternate group suffixes. There seems to be a bug in our
// kubeclient package which causes an "unable to find resp serialier" (sic) error for pod log API responses when
// the middleware is active. Since we do not tail pod logs in production code (or anywhere else at this time),
// we don't need to fix that bug right now just for this test.
if env.APIGroupSuffix == "pinniped.dev" {
This new integration test failed for CI test jobs which use an alternate API group suffix. It looks like there is a bug in our kubeclient package for tailing logs when the middleware is active. For now, I just worked around it by only tailing logs in this test when the API group suffix is equal to the default.
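For context, tailing pod logs from Go generally looks something like the sketch below, written against plain client-go rather than our kubeclient package, with the same kind of guard as in the diff above. The env var, namespace, and pod name are illustrative placeholders, not the project's actual test helpers:

package main

import (
    "context"
    "fmt"
    "io"
    "os"
    "path/filepath"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/homedir"
)

func main() {
    // Illustrative assumption: read the suffix from an env var; the real test reads it from its test env helpers.
    apiGroupSuffix := os.Getenv("API_GROUP_SUFFIX")
    if apiGroupSuffix == "" {
        apiGroupSuffix = "pinniped.dev"
    }

    // Mirror the guard from the diff: skip log tailing when an alternate suffix is in use,
    // to avoid the kubeclient middleware bug mentioned above.
    if apiGroupSuffix != "pinniped.dev" {
        fmt.Println("skipping pod log tailing for alternate API group suffix")
        return
    }

    config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(homedir.HomeDir(), ".kube", "config"))
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // Follow the logs of a hypothetical pod; namespace and pod name are placeholders.
    req := client.CoreV1().Pods("concierge-namespace").GetLogs("some-pod-name", &corev1.PodLogOptions{Follow: true})
    stream, err := req.Stream(context.Background())
    if err != nil {
        panic(err)
    }
    defer stream.Close()
    _, _ = io.Copy(os.Stdout, stream)
}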
Force-pushed from 389adfa to 5e06c6d.
// Only run this test in CI on Kind clusters, because something about restarting the pods
// in this test breaks the "kubectl port-forward" commands that we are using in CI for
// AKS, EKS, and GKE clusters. The Go code that we wrote for graceful pod shutdown should
// not be sensitive to which distribution it runs on, so running this test only on Kind
// should give us sufficient coverage for what we are trying to test here.
I also had to work around an unexpected problem in CI where restarting the pods somehow breaks the kubectl port-forward that we use in CI during AKS/EKS/GKE integration testing.
Before this fix, the deadlock would prevent the leader pod from giving up its lease, which would make it take several minutes for new pods to be allowed to elect a new leader. During that time, no Pinniped controllers could write to the Kube API, so important resources were not being updated during that window. It would also make pod shutdown take about 1 minute.
After this fix, the leader gives up its lease immediately, and pod shutdown takes about 1 second. This improves restart/upgrade time and also fixes the problem where there was no leader for several minutes after a restart/upgrade.
The deadlock was between the post-start hook and the pre-shutdown hook. The pre-shutdown hook blocked until a certain background goroutine in the post-start hook finished, but that goroutine could not finish until the pre-shutdown hook finished. Thus, they were both blocked, waiting for each other indefinitely. Eventually the process would be externally killed.
This deadlock was most likely introduced by some change in Kube's generic api server package related to how the many complex channels used during server shutdown interact with each other, and was not noticed when we upgraded to the version which introduced the change.
The bug first appeared in Pinniped v0.18.0, in which the Kube libraries were updated from 0.23.6 to 0.24.1. The relevant code in Pinniped had no changes around the time of that release. However, the Kube library changed to wait for the pre-shutdown hooks to finish before continuing the remainder of the shutdown process (see the file genericapiserver.go in this diff), thus creating this deadlock.
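As a hedged illustration of how this kind of cycle can be avoided (not necessarily the exact change made in this PR), the signal that cancels the controllers' context must not depend on a channel that is only closed after the pre-shutdown hook returns. For example, the pre-shutdown hook itself can cancel the context before waiting; the names here are simplified stand-ins, not the actual Pinniped source:

package main

import (
    "context"
    "fmt"
    "sync"
)

// runControllers stands in for the real controller loop: it blocks until its context is cancelled.
func runControllers(ctx context.Context) { <-ctx.Done() }

func main() {
    ctx, cancel := context.WithCancel(context.Background())

    var shutdown sync.WaitGroup
    shutdown.Add(1)
    go func() {
        defer shutdown.Done()
        runControllers(ctx)
    }()

    // Pre-shutdown hook: trigger the controllers to stop first, then wait for them.
    // Nothing here waits on a channel that is closed only after this hook returns, so there is no cycle.
    preShutdownHook := func() {
        cancel()
        shutdown.Wait()
    }

    preShutdownHook()
    fmt.Println("controllers stopped; the leader election lease can now be released promptly")
}

With an ordering like this, the pre-shutdown hook finishes quickly, the server can continue its shutdown, and the leader pod can give up its lease right away instead of being killed while still holding it.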
Release note: