
Add logging to DD, attempt to make shutdown more graceful #12

Closed
wants to merge 30 commits

Conversation

@davkutalek davkutalek commented Apr 24, 2024

PROBLEM
Scripts from our CI pipeline that run using this plugin don't receive signals from the Buildkite agent running them. These include the signals that Buildkite sends when a user manually cancels a job or when a job reaches its timeout, as well as cases where the agent dies.

SOLUTION
This PR attempts to make shutdown more reliable; however, the attempt was not very successful. Below is a detailed explanation of my findings. The TL;DR is that while the reasons we don't receive the signals seem clear, the obvious fixes actually made things worse. Therefore this PR adds logging to Datadog directly, the same way that we do in our pipeline scripts in the monorepo. From my testing this should succeed the vast majority of the time. However, we cannot differentiate between a manual cancellation, a timeout, or another reason for SIGTERM. This could likely be improved in various ways in the future, and combined with other data in DD from the CI executions we should be able to get what we need.
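
Roughly, the kind of shutdown logging this refers to is a trap that records the event in Datadog before exiting. The following is only a sketch: the Datadog events endpoint, payload fields, and DD_API_KEY handling are illustrative assumptions, not necessarily the exact code in this PR.

# Sketch only: log the received signal to Datadog, then exit with the
# conventional 128+signum code. Endpoint/payload/env vars are assumptions.
log_shutdown_to_datadog() {
  local signal="$1"
  curl -fsS -X POST "https://api.datadoghq.com/api/v1/events" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"title\":\"Buildkite step received ${signal}\",\"text\":\"Job ${BUILDKITE_JOB_ID:-unknown} is shutting down\",\"tags\":[\"pipeline:${BUILDKITE_PIPELINE_SLUG:-unknown}\"]}" \
    || true   # never let the logging call itself fail the hook
}

trap 'log_shutdown_to_datadog SIGTERM; exit 143' TERM
trap 'log_shutdown_to_datadog SIGINT; exit 130' INT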

TESTING
See the monorepo PR which updates to this version. That will remain as a draft until a new release of this plugin is made and it can be updated to reference this release version.

FINDINGS
In some cases the plugin receives no signal at all; in most cases the plugin does receive the signal, but it is not forwarded to the docker compose process that is running our script. The cases where the plugin does not receive a signal seem likely to be bugs in Buildkite's agent code, or because the agent died before the signal propagated. I made a ton of attempts to figure out why the signal is not being forwarded to the script running in docker compose (see PR for more).

  • This plugin runs the docker compose command in a subshell. This should prevent it from receiving a signal. However, when run in the subshell, it is killed correctly even though we don't see the signal. I assume this is because the subshell itself is killed, SIGKILLing its child processes. When run outside a subshell we still don't see the signal AND the docker compose process is not killed at all. (See the signal-forwarding sketch after this list.)
  • This plugin runs shell scripts in docker compose with /bin/sh -e -c. This should cause the script's process to not be PID 1, which would mean that it does not receive signals. According to Docker's docs, a script can be run directly with docker compose without the /bin/sh command. This makes the process PID 1, and indeed that is the behavior seen if the code is changed, but the signals still don't seem to propagate.
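
For reference, this is the usual trap-and-forward pattern the investigation was circling around: run the child in the background so the shell can handle traps, forward TERM to it, then wait for it twice. This is a sketch, not what the plugin does today; the service name and compose file are placeholders.

#!/bin/bash
# Sketch: forward SIGTERM from a wrapper script to a child docker compose process.
set -u

docker compose -f docker-compose.yml up app &   # background so the shell can handle traps
child_pid=$!

forward_term() {
  echo "forwarding SIGTERM to pid ${child_pid}"
  kill -TERM "${child_pid}" 2>/dev/null || true
}
trap forward_term TERM INT

# The first wait is interrupted by the trapped signal; the second wait collects
# the child's real exit status after the trap handler has run.
wait "${child_pid}"
wait "${child_pid}"
echo "docker compose exited with $?"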

So the 4 possibilities and results are:

  • Run in subshell with bin/sh (upstream version): Signal is received by plugin but not our script, termination works. Plugin exits gracefully, script doesn't.
  • Run without subshell with bin/sh: Signal is not received by plugin or script. Nothing exits gracefully.
  • Run in subshell without bin/sh: Signal is received by plugin, and sometimes, but not usually, by our script. Plugin exits gracefully, script sometimes does. This also breaks all of the plugin's tests.
  • Run without subshell or bin/sh: Signal is not received by plugin or script. Nothing exits gracefully.

An additionally confusing thing is that our script ALWAYS restarts after a TERM signal is sent when run in a subshell. I have been unable to explain this behavior. It seems that only the docker compose command is re-running, but where it is getting this setting I'm not sure. This behavior is more visible due to the changes in this PR: we now attempt to docker compose stop the service that is running our script rather than the entire container, which seems to make it a bit more likely that our script will exit gracefully, but also means the script has time to run a little after restarting.
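
Roughly, that change amounts to targeting the single service instead of tearing everything down, along the lines of the following sketch (the service name "app" and the 30s grace period are placeholders, not the plugin's actual values):

# docker compose stop sends SIGTERM to the service's containers and escalates
# to SIGKILL after the timeout (10s by default).
docker compose -f docker-compose.yml stop --timeout 30 app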

bjreath and others added 3 commits January 17, 2024 14:00
@davkutalek davkutalek changed the base branch from master to merge-upstream April 25, 2024 20:04

if [[ "${BUILDKITE_PLUGIN_DOCKER_COMPOSE_COLLAPSE_LOGS:-false}" = "true" ]]; then
group_type="---"
else
group_type="+++"
fi

# Disable -e to prevent cancelling step if the command fails for whatever reason
set +e
davkutalek (Author) commented on this diff:
This is unnecessary if we use the || exitcode=$? syntax.
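
For context, a sketch of the pattern being referred to (the command name is a placeholder): under set -e, a command on the left-hand side of || does not abort the script, so the exit code can be captured without toggling -e off and back on.

# Instead of wrapping the command in `set +e` / `set -e`:
exitcode=0
run_the_command || exitcode=$?
echo "command exited with ${exitcode}"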

@davkutalek davkutalek requested review from a team and rayalan May 7, 2024 19:05
@davkutalek davkutalek changed the title Fix for graceful shutdown option. Add logging to DD, attempt to make shutdown more graceful May 7, 2024

rayalan commented May 7, 2024

@davkutalek For visibility, and because we're having conversations with Buildkite around renewal, I flagged this PR with @cclear and maybe we can get a bit of their help with some of these edge cases.

@rayalan rayalan left a comment

Excellent research. The solution looks good given the conditions. My only question is whether we should merge this into the merge-upstream branch or the version specific branch (v4.x, I believe).

davkutalek commented May 8, 2024

Excellent research. The solution looks good given the conditions. My only question is whether we should merge this into the merge-upstream branch or the version specific branch (v4.x, I believe).

IMO the changes that we've made should be on our fork's master branch so that it is clear what we've done. Then we should be creating releases from master that include the upstream version plus our own suffix. (I have strong and somewhat unusual opinions about release versioning because I've had to manage multiple simultaneously versioned mobile SDKs, where it needs to be very clear what/who each version is for and how it is different.)

I made the base merge-upstream just so my changes were clear, but what I think we should do is:

  • Merge upstream to our master
  • Merge this PR to our master
  • Merge whatever other changes we've made from our prev versions to our master
  • Make a new release 5.2.0-handshake6 (6 because it appears to be our sixth release with changes. However, if some of those releases were just updating from upstream it could be less)

I'm happy to do this, but given that you've been managing this for some time and will be going forward, let me know what you actually want.

davkutalek commented May 9, 2024

Closing in favor of #14, which is based on 4.16.0.1 instead of upstream, since upstream has breaking changes to our build steps.

@davkutalek davkutalek closed this May 9, 2024