Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failures stopping containers are too quiet #445

Open
ran-arigur opened this issue Jun 15, 2024 · 1 comment
Open

Failures stopping containers are too quiet #445

ran-arigur opened this issue Jun 15, 2024 · 1 comment

Comments

@ran-arigur
Copy link

ran-arigur commented Jun 15, 2024

Hi,

My company recently had an internal incident where our builds started failing in mysterious ways. The problem turned out to be that the Docker Compose Buildkite Plugin's pre-exit hook script was failing to stop containers (for reasons outside the scope of this issue), so they were sticking around, causing a buildup of running Docker containers that ultimately consumed too many resources.

The issue is, the script in question swallows these failures with || true [code link], which meant that we didn't discover the issue when we were testing the change that ended up triggering the incident, and that even once the incident was underway, it took a long time to locate the issue.

(The command output does appear in the logs, but it's doubly collapsed: firstly because the build step succeeds, and secondly because it's inside a collapsed section. So the errors are completely invisible until you've specifically discovered that containers aren't being stopped and specifically go looking for why.)

I do think it makes sense to avoid failing the build step at this point, but I think the script should emit a warning annotation, and should de-collapse the relevant section of the build log.

To do that, I think the relevant function (compose_cleanup) should have a local had_failure variable; the || true should be replaced with || had_failure=yes; and the function should end with something like this:

if [[ "$had_failure" ]] ; then
  local job_desc="$BUILDKITE_LABEL"
  if (( ${BUILDKITE_PARALLEL_JOB_COUNT-1} > 1 )) ; then
    job_desc+=" $((BUILDKITE_PARALLEL_JOB+1))/$BUILDKITE_PARALLEL_JOB_COUNT"
  fi

  local job_link="[$job_desc](#$BUILDKITE_JOB_ID)"

  buildkite-agent annotate \
    "Error in Docker Compose Buildkite Plugin pre-exit hook in $job_link." \
    --style warning \
    --context docker-compose-buildkite-plugin-pre-exit-failure

  echo '^^^ +++'
fi

I think the same should also be done in ensure_stopped [code link].

Disclaimers:

  1. Not tested.
  2. I'm not sure if this should be done for all occurrences of || true, or if some errors should really remain silent. (docker logs, for example, doesn't seem worth warning about.) I'll defer to your judgment. 🙂

Thanks in advance!

@tomowatt
Copy link
Contributor

Hey @ran-arigur

Thank you for raising this and for the example code, we will raise this internally to improve the usage of the Plugin itself as definitely debugging of errors is harder when messages useful output is hidden and I think annotations and re-openening build log group are a great way to highlight for those types of scenarios

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants