
Add logging to DD, attempt to make shutdown more graceful #12

Closed
wants to merge 30 commits

Conversation

@davkutalek davkutalek commented Apr 24, 2024

PROBLEM
Scripts from our CI pipeline that run using this plugin don't receive signals from the Buildkite agent running them. These include the signals that Buildkite sends when a user manually cancels a job or when a job reaches its timeout, as well as cases where the agent dies.

SOLUTION
This PR attempts to make shutdown more reliable; however, the attempt was not very successful. Below is a detailed explanation of my findings. The TL;DR is that while the reasons we don't receive the signals seem clear, the obvious fixes actually made things worse. Therefore this PR adds logging to Datadog directly, the same way that we do in our pipeline scripts in the monorepo. From my testing this should succeed the vast majority of the time. However, we cannot differentiate between a manual cancellation, a timeout, or another reason for SIGTERM. This could likely be improved in various ways in the future, and combined with other data in DD from the CI executions we should be able to get what we need.
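
Roughly, the kind of shutdown logging this refers to is a trap that records the event in Datadog before exiting. The following is only a sketch: the Datadog events endpoint, payload fields, and DD_API_KEY handling are illustrative assumptions, not necessarily the exact code in this PR.

# Sketch only: log the received signal to Datadog, then exit with the
# conventional 128+signum code. Endpoint/payload/env vars are assumptions.
log_shutdown_to_datadog() {
  local signal="$1"
  curl -fsS -X POST "https://api.datadoghq.com/api/v1/events" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"title\":\"Buildkite step received ${signal}\",\"text\":\"Job ${BUILDKITE_JOB_ID:-unknown} is shutting down\",\"tags\":[\"pipeline:${BUILDKITE_PIPELINE_SLUG:-unknown}\"]}" \
    || true   # never let the logging call itself fail the hook
}

trap 'log_shutdown_to_datadog SIGTERM; exit 143' TERM
trap 'log_shutdown_to_datadog SIGINT; exit 130' INT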

TESTING
See the monorepo PR which updates to this version. That will remain as a draft until a new release of this plugin is made and it can be updated to reference this release version.

FINDINGS
In some cases the plugin receives no signal at all; in most cases the plugin does receive the signal, but it is not forwarded to the docker compose process that is running our script. The cases where the plugin does not receive a signal seem likely to be bugs in Buildkite's agent code, or because the agent died before the signal propagated. I made a ton of attempts to figure out why the signal is not being forwarded to the script running in docker compose (see PR for more).

  • This plugin runs the docker compose command in a subshell. This should prevent it from receiving a signal. However, when run in the subshell, it is killed correctly even though we don't see the signal. I assume this is because the subshell itself is killed, SIGKILLing its child processes. When run outside a subshell we still don't see the signal AND the docker compose process is not killed at all. (See the signal-forwarding sketch after this list.)
  • This plugin runs shell scripts in docker compose with /bin/sh -e -c. This should cause the script's process to not be PID 1, which would mean that it does not receive signals. According to Docker's docs, a script can be run directly with docker compose without the /bin/sh command. This makes the process PID 1, and indeed that is the behavior seen if the code is changed, but the signals still don't seem to propagate.
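
For reference, this is the usual trap-and-forward pattern the investigation was circling around: run the child in the background so the shell can handle traps, forward TERM to it, then wait for it twice. This is a sketch, not what the plugin does today; the service name and compose file are placeholders.

#!/bin/bash
# Sketch: forward SIGTERM from a wrapper script to a child docker compose process.
set -u

docker compose -f docker-compose.yml up app &   # background so the shell can handle traps
child_pid=$!

forward_term() {
  echo "forwarding SIGTERM to pid ${child_pid}"
  kill -TERM "${child_pid}" 2>/dev/null || true
}
trap forward_term TERM INT

# The first wait is interrupted by the trapped signal; the second wait collects
# the child's real exit status after the trap handler has run.
wait "${child_pid}"
wait "${child_pid}"
echo "docker compose exited with $?"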

So the 4 possibilities and results are:

  • Run in subshell with bin/sh (upstream version): Signal is received by plugin but not our script, termination works. Plugin exits gracefully, script doesn't.
  • Run without subshell with bin/sh: Signal is not received by plugin or script. Nothing exits gracefully.
  • Run in subshell without bin/sh: Signal is received by plugin, and sometimes, but not usually, by our script. Plugin exits gracefully, script sometimes does. This also breaks all of the plugin's tests.
  • Run without subshell or bin/sh: Signal is not received by plugin or script. Nothing exits gracefully.

An additionally confusing thing is that our script ALWAYS restarts after a TERM signal is sent when run in a subshell. I have been unable to explain this behavior. It seems that only the docker compose command is re-running, but where it is getting this setting I'm not sure. This behavior is more visible due to the changes in this PR: we now attempt to docker compose stop the service that is running our script rather than the entire container, which seems to make it a bit more likely that our script will exit gracefully, but also means the script has time to run a little after restarting.
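
Roughly, that change amounts to targeting the single service instead of tearing everything down, along the lines of the following sketch (the service name "app" and the 30s grace period are placeholders, not the plugin's actual values):

# docker compose stop sends SIGTERM to the service's containers and escalates
# to SIGKILL after the timeout (10s by default).
docker compose -f docker-compose.yml stop --timeout 30 app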

bjreath and others added 3 commits January 17, 2024 14:00
@davkutalek davkutalek changed the base branch from master to merge-upstream April 25, 2024 20:04

if [[ "${BUILDKITE_PLUGIN_DOCKER_COMPOSE_COLLAPSE_LOGS:-false}" = "true" ]]; then
group_type="---"
else
group_type="+++"
fi

# Disable -e to prevent cancelling step if the command fails for whatever reason
set +e
davkutalek (Author) commented on this diff:
This is unnecessary if we use the || exitcode=$? syntax.
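
For context, a sketch of the pattern being referred to (the command name is a placeholder): under set -e, a command on the left-hand side of || does not abort the script, so the exit code can be captured without toggling -e off and back on.

# Instead of wrapping the command in `set +e` / `set -e`:
exitcode=0
run_the_command || exitcode=$?
echo "command exited with ${exitcode}"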

@davkutalek davkutalek requested review from a team and rayalan May 7, 2024 19:05
@davkutalek davkutalek changed the title Fix for graceful shutdown option. Add logging to DD, attempt to make shutdown more graceful May 7, 2024

rayalan commented May 7, 2024

@davkutalek For visibility, and because we're having conversations with Buildkite around renewal, I flagged this PR with @cclear and maybe we can get a bit of their help with some of these edge cases.

@rayalan rayalan left a comment

Excellent research. The solution looks good given the conditions. My only question is whether we should merge this into the merge-upstream branch or the version specific branch (v4.x, I believe).

davkutalek commented May 8, 2024

Excellent research. The solution looks good given the conditions. My only question is whether we should merge this into the merge-upstream branch or the version specific branch (v4.x, I believe).

IMO the changes that we've made should be on our fork's master branch so that it is clear what we've done. Then we should be creating releases from master that include the upstream version plus our own suffix. (I have strong and somewhat unusual opinions about release versioning because I've had to manage multiple simultaneously versioned mobile SDKs, where it needs to be very clear what/who each version is for and how it is different.)

I made the base merge-upstream just so my changes were clear, but what I think we should do is:

  • Merge upstream to our master
  • Merge this PR to our master
  • Merge whatever other changes we've made from our prev versions to our master
  • Make a new release 5.2.0-handshake6 (6 because it appears to be our sixth release with changes. However, if some of those releases were just updating from upstream it could be less)

I'm happy to do this, but given that you've been managing this for some time and will be going forward, let me know what you actually want.

davkutalek commented May 9, 2024

Closing in favor of #14, which is based on 4.16.0.1 instead of upstream, since upstream has breaking changes to our build steps.

@davkutalek davkutalek closed this May 9, 2024