Add logging to DD, attempt to make shutdown more graceful #14
PROBLEM
Scripts from our CI pipeline that run using this plugin don't receive signals from the Buildkite agent running them. These include signals that Buildkite sends when a user manually cancels a job or when a job reaches its timeout as well as cases where an agent dies.
SOLUTION
This PR attempts to make shutdown more reliable, but that attempt was not very successful. My findings are explained in detail below. The TL;DR is that while the reasons we don't receive the signals seem clear, the obvious fixes actually made things worse. This PR therefore adds logging directly to Datadog (DD), the same way we do in our pipeline scripts in the monorepo. From my testing, this should succeed the vast majority of the time. However, we cannot differentiate between a manual cancellation, a timeout, or another reason for SIGTERM. This could likely be improved in the future, and combined with other data in DD from the CI executions we should be able to get what we need.
TESTING
See the monorepo PR https://github.com/joinhandshake/handshake/pull/69867, which updates to this version. It will remain a draft until a new release of this plugin is made and the PR can be updated to reference that release version.
FINDINGS
In some cases the plugin receives no signal at all. In most cases the plugin does receive the signal, but it is not forwarded to the docker compose process that is running our script. The cases where the plugin receives no signal are likely caused either by bugs in Buildkite's agent code or by the agent dying before the signal propagated. I made many attempts to figure out why the signal is not forwarded to the script running in docker compose (see #13 for more).
This plugin runs the docker compose command in a subshell. This should prevent docker compose from receiving a signal. However, when run in the subshell, it is killed correctly even though we don't see the signal. I assume this is because the subshell itself is killed, which SIGKILLs its child processes. When run outside a subshell, we still don't see the signal AND the docker compose process is not killed at all.
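For reference, the generic shell pattern for relaying a signal from a wrapper to the command it runs looks roughly like the sketch below. This is an illustration of the technique, not the plugin's code: `run_with_forwarding` is a hypothetical helper name, and as described in these findings, this kind of forwarding did not by itself make the script inside docker compose see the signal.

```shell
#!/bin/sh
# Sketch only: relay SIGTERM from a wrapper shell to the command it runs.
# run_with_forwarding is a hypothetical helper, not part of the plugin.

run_with_forwarding() {
  forwarded=0
  status=0

  # Run the command in the background. With a foreground child the shell
  # would only run its trap after the child had already exited.
  "$@" &
  child_pid=$!

  # Relay TERM to the child. The PID is expanded now, when the trap is set.
  trap "forwarded=1; kill -TERM $child_pid 2>/dev/null" TERM

  # A trapped signal interrupts wait with a status above 128.
  wait "$child_pid" || status=$?

  # If we were interrupted, wait again to collect the child's own exit status.
  if [ "$forwarded" -eq 1 ] && [ "$status" -gt 128 ]; then
    status=0
    wait "$child_pid" || status=$?
  fi

  trap - TERM
  return "$status"
}
```

A call such as `run_with_forwarding sh -c 'exit 7'` returns 7. If the wrapper receives SIGTERM while waiting, the child receives SIGTERM too and the function returns the child's exit status.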
This plugin runs shell scripts in docker compose with /bin/sh -e -c. This should cause the script process to not be PID 1, which would mean that it does not receive signals. According to Docker's docs, a script can be run directly with docker compose, without the /bin/sh command. This makes the script process PID 1, and that is indeed the behavior seen when the code is changed, but the signals still don't seem to propagate.
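The effect of the `sh -c` wrapper on process identity can be seen on the host, without Docker. The sketch below is an illustration only: the use of `exec` is my addition to show the contrast and is not something this PR changes. It prints the wrapper's PID and the script's PID for both cases. Inside a container the wrapper would be PID 1, so the first case corresponds to the script not being PID 1.

```shell
#!/bin/sh
# Sketch: how a `sh -c` wrapper changes which process is "the script".
# Runs on the host; inside a container the wrapper shell would be PID 1.

# Without exec: the wrapper shell stays alive and the script is its child.
# The trailing `:` stops shells from optimising the last command into an exec.
no_exec=$(sh -c 'echo $$; sh -c "echo \$\$"; :')

# With exec: the script replaces the wrapper and keeps the wrapper's PID.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

# Each variable holds two PIDs: the wrapper's first, then the script's.
echo "without exec (wrapper, script):" $no_exec
echo "with exec    (wrapper, script):" $with_exec
```

Without `exec` the two PIDs differ, so a signal delivered to the wrapper only reaches the script if the wrapper relays it. With `exec` the two PIDs are the same.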
So the 4 possibilities and results are:
Another confusing thing is that our script ALWAYS restarts when run in a subshell after a TERM signal is sent. I have been unable to explain this behavior. It seems that only the docker compose command is re-running, but I'm not sure where this setting comes from. The behavior is more visible due to the changes in this PR. We now attempt to docker compose stop the service that is running our script rather than the entire docker container. This seems to make it a bit more likely that our script will exit gracefully, but it also means the script has time to run a little after restarting.