docker dind sidecar iptables issue #3159
Comments
Updating and pinning …
@iamcaleberic yes, we had the same issue and the workaround works 👍
Seeing the same issue here running in GKE. We're also dealing with a problem where this morning we ended up with 10,000 runners (triggering secondary rate limiting), and the vast majority of them were 'offline'. Is there any chance that there is a relationship between this and runners being left in an offline state as they fail to come online cleanly and the ARC controller (…)? EDIT/UPDATE: After we implemented the fix to pin the docker sidecar to …
Sorry for the naive question, but where are you specifying the pinned dind image?
We are experiencing the same issue.
@verult was able to patch the command directly, like this, on line 34342 in version 0.26 of actions-runner-controller.yaml.
Thanks for the suggestions, everyone.
@LaloLoop we ran into the issue of pods getting stuck in the Terminating phase after we deleted the runner controller, because there were finalizers left on these pods. Is your controller running when your pods are stuck?
Thanks for pointing that out, @verult. We reached the rate limit as described by @billimek. That caused the controller to panic continuously and fail to reconcile. We're using 0.26.0; not sure if newer versions have better error/retry handling.
@joshgc you can find it here: https://github.com/actions/actions-runner-controller/blob/master/charts/actions-runner-controller/values.yaml#L55
Thanks @iamcaleberic, it did the trick and worked for us as well.
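For reference, a minimal sketch of that values.yaml override for the actions-runner-controller chart; the exact tag pinned here (docker:24.0.7-dind) is an assumption, use whichever dind tag predates the breaking change and works in your environment:

    # values.yaml override for the actions-runner-controller Helm chart:
    # pin the dind sidecar to a fixed tag instead of the floating docker:dind.
    image:
      dindSidecarRepositoryAndTag: "docker:24.0.7-dind"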
Same error for me.
For those running an autoscaling runner set, I tried to update template.spec.containers.dind to 24.0.7-dind-alpine3.18 and it didn't work; it retained the value of docker:dind. I know my syntax is correct because I also pin our custom runner image in containers the same way. I manually updated the CRD … My question is: why is this not pinned to a stable version instead of "latest"? It exposes us to unstable updates that can lead to downtime or interruption.
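In case it helps others hitting the same thing, one way to confirm which image the dind sidecar actually ended up with (the arc-runners namespace below is an assumption, adjust to your setup):

    # Print each runner pod with the images of all its containers, so you can
    # see whether the dind sidecar is still on the floating docker:dind tag.
    kubectl get pods -n arc-runners \
      -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'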
If this is still an issue for some folks and you are still dealing with ~10,000 offline runners triggering the secondary rate limiting, the following script snippet may be useful to remove the offline runners:

#!/bin/bash
while true; do
echo "Fetching more runners"
RESPONSE=$(gh api \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
/orgs/<YOUR ORG>/actions/runners)
echo "Total runners: $(echo "$RESPONSE" | jq '.total_count')"
OFFLINE_RUNNERS="$(echo "$RESPONSE" | jq '.runners | map(select(.status == "offline"))')"
  RUNNERS="$(echo "$OFFLINE_RUNNERS" | jq '.[].id')"
# Loop for each runner
for RUNNER in $RUNNERS; do
echo "Removing runner: $RUNNER"
gh api \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-X DELETE \
"/orgs/<YOUR ORG>/actions/runners/$RUNNER" >> removal.logs
done
  # If there were no offline runners left, break
if [ -z "$RUNNERS" ]; then
echo "Done!"
break
fi
done

... or the following action may accomplish the same thing as well (just don't run it on self-hosted runners where you are experiencing this issue!): some-natalie/runner-reaper. It's my understanding that GitHub should automatically remove offline runners after 24h, but the symptom of this issue seems to be that it very quickly ramps up the number of offline runners, making that automation not viable unless or until you correct the pinned docker version. It also looks like the upstream …
As @iamcaleberic pointed out, if you're deploying the … If you're running the newer …
Running the newer chart; has anyone managed to find a workaround for it?
Update the CRD manually under …
This worked for me:
So, how do we do that? I have the following file below and I don't know where to add it:

apiVersion: actions.summerwind.dev/v1alpha1
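For anyone in the same spot, a minimal sketch of where the pin could live in a RunnerDeployment like the one above, assuming the summerwind RunnerSpec's dockerImage field; the names example-runnerdeploy and your-org/your-repo are placeholders:

    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
      name: example-runnerdeploy
    spec:
      replicas: 1
      template:
        spec:
          repository: your-org/your-repo
          # Pin the dind sidecar image here instead of relying on docker:dind.
          dockerImage: docker:24.0.7-dind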
image.dindSidecarRepositoryAndTag is set at the Helm level.
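If it helps, a sketch of setting that value at upgrade time; the release name, chart alias, and namespace below are assumptions from a typical install:

    # Pin the dind sidecar image via the chart value mentioned above.
    helm upgrade --install actions-runner-controller \
      actions-runner-controller/actions-runner-controller \
      --namespace actions-runner-system \
      --set image.dindSidecarRepositoryAndTag=docker:24.0.7-dind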
Is this fixed now, or should we stick to the pinned version of the image?
For the gha scale set, I've ended up leaving containerMode empty and updating the template to include the same spec that gets created when containerMode is dind, only with the new docker tag.
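Roughly, that looks like the following in the scale-set values; this is only a sketch of the spec the chart generates for containerMode: dind (container names, mounts, and runner image are assumptions to check against your chart version), with the dind tag pinned:

    # gha-runner-scale-set values.yaml: containerMode left unset, dind sidecar
    # declared explicitly so its image tag can be pinned.
    template:
      spec:
        initContainers:
          - name: init-dind-externals
            image: ghcr.io/actions/actions-runner:latest
            command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
            volumeMounts:
              - name: dind-externals
                mountPath: /home/runner/tmpDir
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]
            env:
              - name: DOCKER_HOST
                value: unix:///var/run/docker.sock
            volumeMounts:
              - name: work
                mountPath: /home/runner/_work
              - name: dind-sock
                mountPath: /var/run
          - name: dind
            image: docker:24.0.7-dind-alpine3.18   # pinned instead of docker:dind
            securityContext:
              privileged: true
            volumeMounts:
              - name: work
                mountPath: /home/runner/_work
              - name: dind-sock
                mountPath: /var/run
              - name: dind-externals
                mountPath: /home/runner/externals
        volumes:
          - name: work
            emptyDir: {}
          - name: dind-sock
            emptyDir: {}
          - name: dind-externals
            emptyDir: {}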
A fix has been implemented upstream in docker:dind; however, it now requires this Helm chart (or us as users) to set a new variable: docker-library/docker#468 (comment). Set DOCKER_IPTABLES_LEGACY=1 inside your dind pod via an override of the Helm chart's default variables (this should get added to the Helm chart, if someone wants an easy PR). The change should go right after these lines, for the PR to the chart, if someone has a minute to open it: https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/templates/_helpers.tpl#L106 and https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/values.yaml#L142
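Until that lands in the chart, a minimal sketch of the override, assuming the same explicit-template approach as in the earlier comment; only the env entry is the new part, the rest mirrors the generated dind container:

    # Add DOCKER_IPTABLES_LEGACY=1 to the dind container via the template override.
    template:
      spec:
        containers:
          - name: dind
            image: docker:dind
            securityContext:
              privileged: true
            env:
              - name: DOCKER_IPTABLES_LEGACY
                value: "1"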
Checks
Controller Version
v0.27.6
Helm Chart Version
0.23.6
CertManager Version
1.13.2
Deployment Method
Helm
cert-manager installation
Are you sure you've installed cert-manager from an official source?
Yes, using the official Jetstack Helm repo.
Checks
Resource Definitions
To Reproduce
Describe the bug
The docker dind sidecar errors out and does not start, and the runner pods end up restarting every 120 seconds, which is the docker timeout.
Might be related to
docker-library/docker@4c2674d
docker-library/docker#437
Describe the expected behavior
The dind sidecar should start.
Whole Controller Logs