-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition between signalling pid file and certificate reading #250
Comments
ringerc
added a commit
to ringerc/spiffe-helper-PRs
that referenced
this issue
Feb 2, 2025
There is a race where the process pointed to by pid_file_name could be restarted and read the old certificates but not have its new pid written into pid_file_name yet. In this case, we would try to signal the old process and fail, then give up, so the new process would never get its certs refreshed. On initial startup, unavoidable errors may also be logged when the process to be launched has not yet been started (because its certs don't exist yet) so pid_file_name is missing or invalid. These shouldn't be logged as errors, but we don't really want to suppress all errors for all pid file signalling since that would hide misconfigurations. Handle both issues by adding a retry-backoff loop for signalling pid_file_name. On failure, messages are logged at Info level until the retry limit is exceeded, then an Error is logged. During startup the controlling process is expected to be watching for the creation of the certificates, and to launch the controlled process and create the pid file in a timely manner once the certs appear. So it becomes possible to start up without "normal" errors appearing. For the restart-race case, we will initially fail to signal the stale pid in the pid file, so we will retry until the new pid is written, so the new process that loaded stale certs will be properly signalled to reload. Fixes spiffe#250, spiffe#251
ringerc
added a commit
to ringerc/spiffe-helper-PRs
that referenced
this issue
Feb 4, 2025
There is a race where the process pointed to by pid_file_name could be restarted and read the old certificates but not have its new pid written into pid_file_name yet. In this case, we would try to signal the old process and fail, then give up, so the new process would never get its certs refreshed. On initial startup, unavoidable errors may also be logged when the process to be launched has not yet been started (because its certs don't exist yet) so pid_file_name is missing or invalid. These shouldn't be logged as errors, but we don't really want to suppress all errors for all pid file signalling since that would hide misconfigurations. Handle both issues by adding a retry-backoff loop for signalling pid_file_name. On failure, messages are logged at Info level until the retry limit is exceeded, then an Error is logged. During startup the controlling process is expected to be watching for the creation of the certificates, and to launch the controlled process and create the pid file in a timely manner once the certs appear. So it becomes possible to start up without "normal" errors appearing. For the restart-race case, we will initially fail to signal the stale pid in the pid file, so we will retry until the new pid is written, so the new process that loaded stale certs will be properly signalled to reload. Fixes spiffe#250, spiffe#251 Signed-off-by: Craig Ringer <[email protected]>
ringerc
added a commit
to ringerc/spiffe-helper-PRs
that referenced
this issue
Feb 5, 2025
There is a race where the process pointed to by pid_file_name could be restarted and read the old certificates but not have its new pid written into pid_file_name yet. In this case, we would try to signal the old process and fail, then give up, so the new process would never get its certs refreshed. On initial startup, unavoidable errors may also be logged when the process to be launched has not yet been started (because its certs don't exist yet) so pid_file_name is missing or invalid. These shouldn't be logged as errors, but we don't really want to suppress all errors for all pid file signalling since that would hide misconfigurations. Handle both issues by adding a retry-backoff loop for signalling pid_file_name. On failure, messages are logged at Info level until the retry limit is exceeded, then an Error is logged. During startup the controlling process is expected to be watching for the creation of the certificates, and to launch the controlled process and create the pid file in a timely manner once the certs appear. So it becomes possible to start up without "normal" errors appearing. For the restart-race case, we will initially fail to signal the stale pid in the pid file, so we will retry until the new pid is written, so the new process that loaded stale certs will be properly signalled to reload. To enable better testing for this, the retry count is exposed in the test channel for process signalling. Fixes spiffe#250, spiffe#251 Signed-off-by: Craig Ringer <[email protected]>
ringerc
added a commit
to ringerc/spiffe-helper-PRs
that referenced
this issue
Feb 5, 2025
There is a race where the process pointed to by pid_file_name could be restarted and read the old certificates but not have its new pid written into pid_file_name yet. In this case, we would try to signal the old process and fail, then give up, so the new process would never get its certs refreshed. On initial startup, unavoidable errors may also be logged when the process to be launched has not yet been started (because its certs don't exist yet) so pid_file_name is missing or invalid. These shouldn't be logged as errors, but we don't really want to suppress all errors for all pid file signalling since that would hide misconfigurations. Handle both issues by adding a retry-backoff loop for signalling pid_file_name. On failure, messages are logged at Info level until the retry limit is exceeded, then an Error is logged. During startup the controlling process is expected to be watching for the creation of the certificates, and to launch the controlled process and create the pid file in a timely manner once the certs appear. So it becomes possible to start up without "normal" errors appearing. For the restart-race case, we will initially fail to signal the stale pid in the pid file, so we will retry until the new pid is written, so the new process that loaded stale certs will be properly signalled to reload. To enable better testing for this, the retry count is exposed in the test channel for process signalling. Fixes spiffe#250, spiffe#251 Signed-off-by: Craig Ringer <[email protected]>
ringerc
added a commit
to ringerc/spiffe-helper-PRs
that referenced
this issue
Feb 5, 2025
There is a race where the process pointed to by pid_file_name could be restarted and read the old certificates but not have its new pid written into pid_file_name yet. In this case, we would try to signal the old process and fail, then give up, so the new process would never get its certs refreshed. On initial startup, unavoidable errors may also be logged when the process to be launched has not yet been started (because its certs don't exist yet) so pid_file_name is missing or invalid. These shouldn't be logged as errors, but we don't really want to suppress all errors for all pid file signalling since that would hide misconfigurations. Handle both issues by adding a retry-backoff loop for signalling pid_file_name. On failure, messages are logged at Info level until the retry limit is exceeded, then an Error is logged. During startup the controlling process is expected to be watching for the creation of the certificates, and to launch the controlled process and create the pid file in a timely manner once the certs appear. So it becomes possible to start up without "normal" errors appearing. For the restart-race case, we will initially fail to signal the stale pid in the pid file, so we will retry until the new pid is written, so the new process that loaded stale certs will be properly signalled to reload. To enable better testing for this, the retry count is exposed in the test channel for process signalling. Fixes spiffe#250, spiffe#251 Signed-off-by: Craig Ringer <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
There's a race between the process in
pid_file_name
being restarted andspiffe-helper
signalling it to reload its certificates.The problem
If the process being controlled is restarted, and it reads its certs before its pid file gets updated,
spiffe-helper
might replace the certs after the process has read the old copies of the certs. But spiffe-helper might fail to signal the process to reload the certificates because the pid file hasn't been updated yet.This is likely to occur in cases where the pid file is managed by a separate supervisor rather than the process itself, e.g. a shell script that is managing both spiffe-helper and the process that uses the certs, e.g.
here, if
mycmd
exits just as spiffe-helper renews a cert, the newmycmd
might read the old cert andspiffe-helper
might signal the old pid inpid_file_name
before the scriptecho
s the new pid intomycmd.pid
.This wouldn't be an issue if
mycmd
did its own pid-file management and wrote its pid file before reading its certs, but neither can be broadly guaranteed for all commands.Potential solutions
This race can be almost completely closed by having
spiffe-helper
retry if signalling the pid inpid_file_name
fails due to the file being missing or the pid it points to not existing.That would also help the issue with "normal errors" on startup in #251
It's not a completely perfect fix because in theory the old pid in
pid_file_name
could be re-used by a new process after the old one exits, before the pid file contents are rewritten to point to the new pid. Signalling it would thus not cause an error, but not deliver a signal to the right process. That's exceedingly unlikely to the point of not being worth caring about unless it could somehow be an exploit, and I don't see how it could be, so it can IMO be ignored.The text was updated successfully, but these errors were encountered: