Succeeded workflows are not cleaned even if TTL is set 0 #10947

Open · 2 of 3 tasks
CiprianAnton opened this issue Apr 20, 2023 · 10 comments
Labels
  • area/controller: Controller issues, panics
  • area/gc: Garbage collection, such as TTLs, retentionPolicy, delays, and more
  • area/upstream: This is an issue with an upstream dependency, not Argo itself
  • P3: Low priority
  • solution/workaround: There's a workaround, might not be great, but exists
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Comments

@CiprianAnton
Contributor

CiprianAnton commented Apr 20, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I've noticed this behavior in both v3.4.5 and v3.4.7.
After the Argo controller restarts, there is a point (approximately 20 minutes in) where the controller temporarily stops cleaning up workflows. Workflows stay in the Succeeded state for approximately 15 minutes; after that, cleanup resumes.

After this window, everything seems to go back to normal, with succeeded workflows being cleaned up immediately. I've noticed this happening only once after the controller is restarted.

The configuration we use for TTL:

    ttlStrategy:
      secondsAfterFailure: 86400
      secondsAfterSuccess: 0
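
For reference, a quick way to confirm the ttlStrategy is actually present on a submitted workflow's spec (a rough check; the workflow name below is a placeholder):

$workflowName = "hello-world-abcde"   # placeholder: use a real workflow name from the namespace
kubectl get workflow $workflowName -n default -o jsonpath='{.spec.ttlStrategy}'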

I also have a graph that shows the evolution of workflows in the cluster:
[Graph: number of workflows in the cluster over time]

We don't use artifactGC, so I've ruled out #10840.

Version

v3.4.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 # any non-root user
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

PowerShell script that creates workflows

$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"

while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"

    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }
    Start-Sleep -Seconds 5
}

Logs from the workflow controller

I have some logs:

Line 11652: time="2023-04-20T11:40:06.810Z" level=info msg="Queueing Succeeded workflow default/94f2ce40-879f-41eb-a767-9b7cb93b541b-79xxp for delete in 0s due to TTL"
Line 16234: time="2023-04-20T11:56:39.390Z" level=info msg="Queueing Succeeded workflow default/f05c891b-3532-4197-9910-c6e7bc17c4e0-42l4v for delete in -15m52s due to TTL"
Line 16236: time="2023-04-20T11:56:39.391Z" level=info msg="Queueing Succeeded workflow default/7a41f47a-b60b-48f5-b7e4-9dc43ab5d7c6-5mzcs for delete in -19m5s due to TTL"
Line 16238: time="2023-04-20T11:56:39.392Z" level=info msg="Queueing Succeeded workflow default/ff42bc93-5145-4d84-9c78-a8a3f89108a5-2rkqd for delete in -18m52s due to TTL"
Line 16240: time="2023-04-20T11:56:39.393Z" level=info msg="Queueing Succeeded workflow default/ab5f6918-21da-4d5f-80f8-d4b18efa4df2-4fmvk for delete in -19m12s due to TTL"
Line 16242: time="2023-04-20T11:56:39.394Z" level=info msg="Queueing Succeeded workflow default/ccc7633b-2784-4f5a-91e1-fd279d4cfe97-psr8n for delete in -18m7s due to TTL"
Line 16244: time="2023-04-20T11:56:39.395Z" level=info msg="Queueing Succeeded workflow default/f4b9cc14-2c28-422d-bfbe-1ab4c41727a3-vbdkl for delete in -17m3s due to TTL"
Line 16246: time="2023-04-20T11:56:39.396Z" level=info msg="Queueing Succeeded workflow default/cc55b8c5-efdb-40d7-ae32-64d32c210336-k9ldv for delete in -16m3s due to TTL"
Line 16248: time="2023-04-20T11:56:39.396Z" level=info msg="Queueing Succeeded workflow default/b4f01c20-bfb7-4def-b7d4-5418c343df5d-gnpth for delete in -18m3s due to TTL"
Line 16250: time="2023-04-20T11:56:39.397Z" level=info msg="Queueing Succeeded workflow default/1cdbdfff-454b-43e0-b08a-1eb6268a4ff1-td8d6 for delete in -18m2s due to TTL"
Line 16252: time="2023-04-20T11:56:39.398Z" level=info msg="Queueing Succeeded workflow default/becf0184-066f-47c9-9f86-110b66165190-mxt96 for delete in -18m10s due to TTL"
Line 16254: time="2023-04-20T11:56:39.399Z" level=info msg="Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL"

Logs from in your workflow's wait container

Don't have.
@CiprianAnton CiprianAnton added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Apr 20, 2023
@sarabala1979 sarabala1979 added the P3 Low priority label Apr 20, 2023
@sarabala1979
Member

@CiprianAnton Can you check the k8s API log to make sure the delete call from the workflow controller succeeded?

@CiprianAnton
Contributor Author

After "Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL" appears in the controller logs, the workflow does get deleted from the cluster. kubectl and the Argo Server won't show it again.
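
For reference, one way to watch those deletes from the client side (assumes a kubectl version that supports --output-watch-events):

kubectl get workflows -n default --watch --output-watch-events | Select-String DELETED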

@CiprianAnton
Contributor Author

CiprianAnton commented Apr 21, 2023

I can consistently reproduce this on multiple clusters.
I'm attaching a workflow based on the hello-world example and a PowerShell script that schedules workflows continuously.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 # any non-root user
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

PowerShell script

$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"

while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"

    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }
    Start-Sleep -Seconds 5
}

This should reproduce approximately 20 minutes after the Argo controller starts. Remember to restart the controller in order to reproduce it.
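
For example, assuming the default install (a Deployment named workflow-controller in the argo namespace), the restart can be done with:

kubectl rollout restart deployment workflow-controller -n argo
kubectl rollout status deployment workflow-controller -n argo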

@CiprianAnton CiprianAnton changed the title Sometimes succeeded workflows are not cleaned even if TTL is set 0 Succeeded workflows are not cleaned even if TTL is set 0 Apr 24, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Jun 18, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@CiprianAnton
Contributor Author

CiprianAnton commented Oct 18, 2023

Update: the issue reproduces for failed workflows as well, so I don't think the state matters. Pod cleanup is also affected. Based on log lines like "Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL", I think the problem is on the enqueue side: workflows are not being scheduled for cleanup.

I'm inclined to believe the issue comes from the k8s client. There is also a hardcoded workflowResyncPeriod of 20 minutes, which explains why the issue self-heals over time.

I'm not familiar with the codebase, so here are some questions for others who might know (like @terrytangyuan):

  • What implications would decreasing the resync period have? Is it a heavy operation?
  • I see that wfInformer.Run is being run twice on the same object, once from controller.go and once from gc_controller.go. Is this valid usage? A warning is being reported (one way to spot it in the controller logs is shown below): W1018 10:40:37.539223 1 shared_informer.go:401] The sharedIndexInformer has started, run more than once is not allowed
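
One way to spot that warning, assuming the controller runs as the workflow-controller Deployment in the argo namespace:

kubectl logs -n argo deploy/workflow-controller | Select-String "run more than once is not allowed"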

The problem also reproduces on 3.4.11.
Comparing Argo v3.3.8 with 3.4.11, k8s.io/client-go changed from v0.23.3 to v0.24.3.
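
That can be checked from a clone of the argo-workflows repo (assuming the release tags are named v3.3.8 and v3.4.11):

git diff v3.3.8 v3.4.11 -- go.mod | Select-String "k8s.io/client-go"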

@CiprianAnton
Contributor Author

It might also be worth upgrading the client-go package to a newer version.
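
As a sketch only (the target version here is illustrative, and the other k8s.io modules typically need to move in lockstep):

go get k8s.io/client-go@v0.26.3
go mod tidy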

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Oct 19, 2023
@CiprianAnton
Contributor Author

I can confirm the issue was introduced in 39b7f91, when k8s.io/client-go was upgraded from v0.23.5 to v0.24.3

@terrytangyuan
Member

It's been there for a while, so we may just wait for the fix in kubernetes/kubernetes#127964.

@agilgur5 agilgur5 added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Feb 18, 2024
@tooptoop4
Contributor

tooptoop4 commented Oct 1, 2024

@CiprianAnton did you find a workaround (downgrade the k8s client, shorten the 20m workflowResyncPeriod, increase RECENTLY_STARTED_POD_DURATION, set INFORMER_WRITE_BACK to false, something else)? I think I'm facing a similar issue in #13671; I get "The sharedIndexInformer has started, run more than once is not allowed" in my logs too.

@CiprianAnton
Contributor Author

@tooptoop4 The workaround I used was to just ignore those succeeded workflows; after 20 minutes it self-heals. This issue comes from the k8s Go client and happens once after the pod restarts.
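
If the backlog needs clearing sooner, the stuck workflows can also be deleted manually. A rough example, assuming the controller has applied the standard workflows.argoproj.io/phase label (verify on your cluster before relying on it):

kubectl delete workflows -n default -l workflows.argoproj.io/phase=Succeeded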

@agilgur5 agilgur5 added area/upstream This is an issue with an upstream dependency, not Argo itself solution/workaround There's a workaround, might not be great, but exists labels Oct 28, 2024