You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
My company recently had an internal incident where our builds started failing in mysterious ways. The problem turned out to be that the Docker Compose Buildkite Plugin's pre-exit hook script was failing to stop containers (for reasons outside the scope of this issue), so they were sticking around, causing a buildup of running Docker containers that ultimately consumed too many resources.
The issue is, the script in question swallows these failures with || true [code link], which meant that we didn't discover the issue when we were testing the change that ended up triggering the incident, and that even once the incident was underway, it took a long time to locate the issue.
(The command output does appear in the logs, but it's doubly collapsed: firstly because the build step succeeds, and secondly because it's inside a collapsed section. So the errors are completely invisible until you've specifically discovered that containers aren't being stopped and specifically go looking for why.)
I do think it makes sense to avoid failing the build step at this point, but I think the script should emit a warning annotation, and should de-collapse the relevant section of the build log.
To do that, I think the relevant function (compose_cleanup) should have a local had_failure variable; the || true should be replaced with || had_failure=yes; and the function should end with something like this:
if [[ "$had_failure" ]] ; then
local job_desc="$BUILDKITE_LABEL"
if (( ${BUILDKITE_PARALLEL_JOB_COUNT-1} > 1 )) ; then
job_desc+=" $((BUILDKITE_PARALLEL_JOB+1))/$BUILDKITE_PARALLEL_JOB_COUNT"
fi
local job_link="[$job_desc](#$BUILDKITE_JOB_ID)"
buildkite-agent annotate \
"Error in Docker Compose Buildkite Plugin pre-exit hook in $job_link." \
--style warning \
--context docker-compose-buildkite-plugin-pre-exit-failure
echo '^^^ +++'
fi
I think the same should also be done in ensure_stopped [code link].
Disclaimers:
Not tested.
I'm not sure if this should be done for all occurrences of || true, or if some errors should really remain silent. (docker logs, for example, doesn't seem worth warning about.) I'll defer to your judgment. 🙂
Thanks in advance!
The text was updated successfully, but these errors were encountered:
Thank you for raising this and for the example code, we will raise this internally to improve the usage of the Plugin itself as definitely debugging of errors is harder when messages useful output is hidden and I think annotations and re-openening build log group are a great way to highlight for those types of scenarios
Hi,
My company recently had an internal incident where our builds started failing in mysterious ways. The problem turned out to be that the Docker Compose Buildkite Plugin's pre-exit hook script was failing to stop containers (for reasons outside the scope of this issue), so they were sticking around, causing a buildup of running Docker containers that ultimately consumed too many resources.
The issue is, the script in question swallows these failures with
|| true
[code link], which meant that we didn't discover the issue when we were testing the change that ended up triggering the incident, and that even once the incident was underway, it took a long time to locate the issue.(The command output does appear in the logs, but it's doubly collapsed: firstly because the build step succeeds, and secondly because it's inside a collapsed section. So the errors are completely invisible until you've specifically discovered that containers aren't being stopped and specifically go looking for why.)
I do think it makes sense to avoid failing the build step at this point, but I think the script should emit a warning annotation, and should de-collapse the relevant section of the build log.
To do that, I think the relevant function (
compose_cleanup
) should have a localhad_failure
variable; the|| true
should be replaced with|| had_failure=yes
; and the function should end with something like this:I think the same should also be done in
ensure_stopped
[code link].Disclaimers:
|| true
, or if some errors should really remain silent. (docker logs
, for example, doesn't seem worth warning about.) I'll defer to your judgment. 🙂Thanks in advance!
The text was updated successfully, but these errors were encountered: