-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add logging to DD, attempt to make shutdown more graceful #12
Conversation
… service startups
Add --include-deps option to Docker image pulls and retries to Docker service startups
…stop has returned
…not sure how to fix...
|
||
if [[ "${BUILDKITE_PLUGIN_DOCKER_COMPOSE_COLLAPSE_LOGS:-false}" = "true" ]]; then | ||
group_type="---" | ||
else | ||
group_type="+++" | ||
fi | ||
|
||
# Disable -e to prevent cancelling step if the command fails for whatever reason | ||
set +e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unneccessary if we use the || exitcode=$?
syntax.
@davkutalek For visibility and because we're having conversations with BuildKite around renewel, I flagged this PR with @cclear and maybe we can get a bit of their help with some of these edge cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent research. The solution looks good given the conditions. My only question is whether we should merge this into the merge-upstream branch or the version specific branch (v4.x, I believe).
IMO the changes that we've made should be on our fork's I made the base
I'm happy to do this, but given that you've been managing this for some time and will be going forward, let me know what you actually want. |
Closing in favor of #14 which is based on 4.16.0.1 instead of upstream which has breaking changes to our build steps. |
PROBLEM
Scripts from our CI pipeline that run using this plugin don't receive signals from the Buildkite agent running them. These include signals that Buildkite sends when a user manually cancels a job or when a job reaches its timeout as well as cases where an agent dies.
SOLUTION
This attempts to make shutdown more reliable, however this attempt was not very successful. Below is a detailed explanation of my findings. The TL;DR is that while the reasons that we don't receive the signals seem clear, the obvious fixes actually made things worse. Therefore this PR adds logging to datadog directly, the same way that we do in our pipeline scripts in the monorepo. From my testing this should succeed the vast majority of the time. However we cannot differentiate between a manual cancelation, a timeout, or another reason for SIGTERM. This could likely be improved in various ways in the future, and combined with other data in DD from the CI executions we should be able to get what we need.
TESTING
See the monorepo PR which updates to this version. That will remain as a draft until a new release of this plugin is made and it can be updated to reference this release version.
FINDINGS
In some cases the plugin receives no signal, but in most cases the plugin does receive the signal but it is not forwarded to the docker compose process that is running our script. The cases where the plugin does not receive a signal seem likely to be bugs in Buildkite's agent code, or because the agent died before the signal propagated. I made a ton of attempts to figure out why the signal is not being forwarded to the script running in docker compose (see PR for more).
bin/sh -e -c
. This should cause the process to not be PID 1, which would mean that it does not receive signals. According to docker's docs, a script can be run directly with docker compose without the bin/sh command. This makes the process PID 1, and indeed that is the behavior seen if the code is changed, but the signals still don't seem to propagate.So the 4 possibilities and results are:
bin/sh
(upstream version): Signal is received by plugin but not our script, termination works. Plugin exits gracefully, script doesn't.bin/sh
: Signal is not received by plugin or script. Nothing exits gracefully.bin/sh
: Signal is received by plugin, and sometimes, but not usually, by our script. Plugin exits gracefully, script sometimes does. This also breaks all the plugins tests.bin/sh
: SIgnal is not received by plugin or script. Nothing exits gracefully.An additionally confusing thing is that our script ALWAYS restarts when run in a subshell after a term signal is sent. I have been unable to explain this behavior. It seems that the docker compose command is re-running and nothing else, but where it is getting this setting I'm not sure. This behavior is more visible due to the changes in this PR. We are now attempting to
docker compose stop
the service that is running our script rather than the entire docker container, which seems to make it a bit more likely that our script will exit gracefully, but also means the script has time to run a little after restarting.