Succeeded workflows are not cleaned up even if TTL is set to 0 #10947
@CiprianAnton Can you check the k8s API log to make sure the delete call from the workflow controller succeeded?
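For reference when checking the kube-apiserver audit log, here is a minimal Go sketch of the kind of delete call to look for (group argoproj.io, resource workflows, verb delete). The kubeconfig setup, namespace, and workflow name below are illustrative, not taken from the actual controller code:

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative client setup from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(config)

	// The Workflow CRD that TTL cleanup deletes: argoproj.io/v1alpha1, workflows.
	gvr := schema.GroupVersionResource{
		Group:    "argoproj.io",
		Version:  "v1alpha1",
		Resource: "workflows",
	}

	// A request of this shape (verb=delete, resource=workflows) is what should
	// appear in the API audit log when the controller garbage-collects a
	// workflow. "hello-world-abc12" is a made-up name.
	if err := client.Resource(gvr).Namespace("default").
		Delete(context.Background(), "hello-world-abc12", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
}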
I can consistently reproduce this, on multiple clusters.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 # any non-root user
  templates:
    - name: whalesay
      container:
        image: docker/whalesay:latest
        command: [cowsay]
        args: ["hello world"]

PowerShell script:

$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"
while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"
    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }
    Start-Sleep -Seconds 5
}

This should reproduce approximately 20 minutes after the Argo controller starts. Remember to restart the controller in order to reproduce this.
Update: the issue reproduces for failed workflows as well; I don't think the state matters. Pod cleanup is also affected. Based on the logs, I'm inclined to believe the issue comes from the k8s client. There is also a hardcoded 20-minute resync period. I'm not familiar with the codebase, so some questions for others who might know (like @terrytangyuan):
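A minimal client-go sketch of the suspected mechanism, assuming cleanup is driven by a shared informer created with a hard-coded 20-minute resync period (the factory wiring and handler below are illustrative, not the actual Argo controller code):

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// A factory with a 20-minute default resync, mirroring the hard-coded
	// resync period mentioned above. Roughly 20 minutes after start, every
	// informer built from this factory relists/resyncs.
	factory := informers.NewSharedInformerFactory(client, 20*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Cleanup work is enqueued from events like this one; if the
			// informer stalls around the resync, nothing is enqueued and
			// completed objects sit untouched until the informer recovers.
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}

If the stall really happens inside client-go around the resync, that would match the observed timeline: the controller restarts, runs normally for about 20 minutes, then stops cleaning up for roughly 15 minutes before recovering.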
The problem also reproduces on 3.4.11.
It might also be worth upgrading the client-go package to a newer version.
I can confirm the issue was introduced in 39b7f91, when …
It's been there for a while, so we may just wait for the fix in kubernetes/kubernetes#127964.
@CiprianAnton did you find a workaround (downgrade the k8s client, shorten the 20m workflowResyncPeriod, increase …)?
@tooptoop4 The workaround I used was to just ignore those succeeded workflows. After 20 minutes it will self-heal. This issue comes from the k8s Go client and happens once after the pod restarts.
Pre-requisites
What happened/what you expected to happen?
I've noticed this behavior in both v3.4.5 and v3.4.7.
After the Argo controller restarts, there is a point (approximately 20 minutes in) where the controller stops cleaning up workflows for a while. Workflows stay in the Succeeded state for approximately 15 minutes, after which cleanup resumes.
After this timeframe, everything seems to go back to normal, with succeeded workflows being cleaned up immediately.
I've noticed this happens only once after the controller is restarted.
The configuration we use for TTL: secondsAfterSuccess: 0, secondsAfterFailure: 86400.
I also have a graph that shows the evolution of workflows in the cluster.
We don't use artifactGC, so I've ruled out #10840.
Version
v3.4.7
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
PowerShell script that creates workflows (see the script above).
Logs from the workflow controller
I have some logs:
Logs from in your workflow's wait container