
Add logging to DD, attempt to make shutdown more graceful #14

Merged 1 commit into v4.16.0.1-merge on May 10, 2024

Conversation


@davkutalek davkutalek commented May 8, 2024

PROBLEM
Scripts from our CI pipeline that run using this plugin don't receive signals from the Buildkite agent running them. These include the signals that Buildkite sends when a user manually cancels a job or when a job reaches its timeout, as well as cases where an agent dies.

SOLUTION
This PR attempts to make shutdown more reliable; however, the attempt was not very successful. Below is a detailed explanation of my findings. The TL;DR is that while the reasons we don't receive the signals seem clear, the obvious fixes actually made things worse. Therefore this PR adds logging to Datadog directly, the same way that we do in our pipeline scripts in the monorepo. From my testing this should succeed the vast majority of the time. However, we cannot differentiate between a manual cancellation, a timeout, or another reason for SIGTERM. This could likely be improved in various ways in the future, and combined with other data in DD from the CI executions, we should be able to get what we need.
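As a rough sketch of what "logging to Datadog directly" can look like from a bash hook (illustrative only, assuming Datadog's public HTTP logs intake API and a DD_API_KEY environment variable; the service name and message format are placeholders, not necessarily what this PR does):

```bash
#!/bin/bash
# Illustrative only: post a log line to Datadog when the hook receives a
# signal. Assumes DD_API_KEY is set in the agent environment; the endpoint is
# Datadog's public v2 logs intake, and the source/service values are placeholders.
log_signal_to_datadog() {
  local signal="$1"
  curl -s -X POST "https://http-intake.logs.datadoghq.com/api/v2/logs" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -d "{\"ddsource\":\"buildkite\",\"service\":\"docker-compose-plugin\",\"message\":\"Received ${signal} in build ${BUILDKITE_BUILD_ID:-unknown}\"}" || true
}

# We only know which signal arrived, which is why a manual cancellation, a
# timeout, and any other source of SIGTERM all look the same in the logs.
trap 'log_signal_to_datadog SIGTERM' TERM
trap 'log_signal_to_datadog SIGINT' INT
```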

TESTING
See the monorepo PR https://github.com/joinhandshake/handshake/pull/69867, which updates to this version. It will remain a draft until a new release of this plugin is made, at which point it can be updated to reference that release version.

FINDINGS
In some cases the plugin receives no signal at all, but in most cases the plugin does receive the signal and it is simply not forwarded to the docker compose process that is running our script. The cases where the plugin does not receive a signal seem likely to be bugs in Buildkite's agent code, or cases where the agent died before the signal propagated. I made many attempts to figure out why the signal is not being forwarded to the script running in docker compose (see #13 for more).
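For reference, the kind of forwarding that was attempted looks roughly like this (a sketch only; $compose_file and $service are placeholders, and this is not the plugin's actual hook code):

```bash
# Sketch of explicit signal forwarding; $compose_file and $service are
# placeholders. The compose command runs in the background so the hook can
# trap TERM/INT and pass the signal on instead of letting it stop at the shell.
docker compose -f "$compose_file" run "$service" &
compose_pid=$!

forward_signal() {
  kill -s "$1" "$compose_pid" 2>/dev/null || true
}
trap 'forward_signal TERM' TERM
trap 'forward_signal INT' INT

# The first wait returns as soon as a trapped signal arrives; waiting again
# collects the child's real exit status after it has had a chance to clean up.
wait "$compose_pid"
wait "$compose_pid"
```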

This plugin runs the docker compose command in a subshell. This should prevent it from receiving a signal. However, when run in the subshell, the process is killed correctly even though we don't see the signal. I assume this is because the subshell itself is killed, SIGKILLing its child processes. When run outside a subshell we still don't see the signal AND the docker compose process is not killed at all.
This plugin runs shell scripts in docker compose with /bin/sh -e -c. This should cause the process to not be PID 1, which would mean that it does not receive signals. According to Docker's docs, a script can be run directly with docker compose without the /bin/sh wrapper. This makes the process PID 1, and indeed that is the behavior seen if the code is changed, but the signals still don't seem to propagate. (Both invocation forms are sketched after the list below.)
So the 4 possibilities and results are:

  • Run in a subshell with /bin/sh (upstream version): Signal is received by the plugin but not our script; termination works. Plugin exits gracefully, script doesn't.
  • Run without a subshell, with /bin/sh: Signal is not received by the plugin or the script. Nothing exits gracefully.
  • Run in a subshell without /bin/sh: Signal is received by the plugin, and sometimes, but not usually, by our script. Plugin exits gracefully, script sometimes does. This also breaks all of the plugin's tests.
  • Run without a subshell or /bin/sh: Signal is not received by the plugin or the script. Nothing exits gracefully.
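
For concreteness, the four variants above correspond roughly to the following invocations (a sketch with placeholder names, not a verbatim copy of the plugin's code):

```bash
# Placeholders: $compose_file, $service, $command.

# 1. Subshell, with /bin/sh (upstream behaviour):
( docker compose -f "$compose_file" run "$service" /bin/sh -e -c "$command" )

# 2. No subshell, with /bin/sh:
docker compose -f "$compose_file" run "$service" /bin/sh -e -c "$command"

# 3. Subshell, command run directly (it becomes PID 1 inside the container):
( docker compose -f "$compose_file" run "$service" $command )

# 4. No subshell, command run directly:
docker compose -f "$compose_file" run "$service" $command
```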

An additional point of confusion is that our script ALWAYS restarts after a TERM signal when run in a subshell. I have been unable to explain this behavior. It seems that only the docker compose command re-runs and nothing else, but I'm not sure where it gets this setting from. This behavior is more visible due to the changes in this PR: we now attempt to docker compose stop the service that is running our script rather than the entire container, which seems to make it a bit more likely that our script will exit gracefully, but also means the script has time to run a little after restarting.
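The narrower cleanup mentioned above looks roughly like this (a sketch with placeholder names; the timeout value is an arbitrary example):

```bash
# Sketch of the narrower cleanup; $compose_file and $service are placeholders.
# Stopping only the service that runs our script gives it a window to exit
# gracefully, instead of tearing down the whole compose project at once.
cleanup() {
  docker compose -f "$compose_file" stop --timeout 30 "$service"
}
trap cleanup TERM INT
```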

@davkutalek davkutalek changed the title Old version test Add logging to DD, attempt to make shutdown more graceful May 9, 2024
@davkutalek davkutalek requested a review from rayalan May 9, 2024 19:00
@davkutalek davkutalek marked this pull request as ready for review May 9, 2024 19:00

@rayalan rayalan left a comment

++ to the amount of work you've put in here, including tests and documentation updates.

Let's talk this afternoon in person. My overall reaction is:

This is such a big change that we don't want to merge it into master. It introduces too much complexity and deviation from the baseline for us to maintain over the long term. Instead, we want to run it as a one-off for a while, get our log results, and then think about how to do something such as:

  • Submit a PR to BuildKite
  • Make a small portable change that achieves a similar result (given the scope of the PR, I'm not sure that's possible)
  • Convince BuildKite to solve this problem on their own
  • ???

For context, Handshake's v4 branch has only one or two small deviations, both of which I aspire to merge back to BuildKite at some point so that we don't maintain a fork of the docker compose plugin at all.

.buildkite/pipeline.yml (review thread; outdated, resolved)
@davkutalek davkutalek changed the base branch from master to apr/4.16.0-port May 10, 2024 16:25
@davkutalek (Author) commented

> Let's talk this afternoon in person. My overall reaction is:
>
> This is such a big change that we don't want to merge it into master. It introduces too much complexity and deviation from the baseline for us to maintain over the long term. Instead, we want to run it as a one-off for a while, get our log results, and then think about how to do something such as:

Yeah, happy to chat. My reaction is that we've done the work because we had to, and because we have done it, we have to continue maintaining it. Keeping it off of master just hides it, which is confusing and makes it more difficult to manage. In the future, if we are able to revert closer to the upstream, that is fairly easy to do. But hiding the work that would require doesn't help us.

@davkutalek davkutalek changed the base branch from apr/4.16.0-port to v4.16.0.1-merge May 10, 2024 16:40
@rayalan rayalan self-requested a review May 10, 2024 20:51

@rayalan rayalan left a comment

David and I chatted; I have a better sense for his concerns and vice versa. Approving this in case it needs to move forward / can move forward at David's discretion.

@davkutalek davkutalek merged commit f252caa into v4.16.0.1-merge May 10, 2024
@davkutalek davkutalek deleted the old-version-test branch May 10, 2024 21:03