Succeeded workflows are not cleaned even if TTL is set 0 #10947

Open · 2 of 3 tasks
CiprianAnton opened this issue Apr 20, 2023 · 10 comments
Labels
  • area/controller: Controller issues, panics
  • area/gc: Garbage collection, such as TTLs, retentionPolicy, delays, and more
  • area/upstream: This is an issue with an upstream dependency, not Argo itself
  • P3: Low priority
  • solution/workaround: There's a workaround, might not be great, but exists
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Comments

@CiprianAnton
Contributor

CiprianAnton commented Apr 20, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I've noticed this behavior in both v3.4.5 and v3.4.7.
After the Argo controller restarts, there is a point (approximately 20 minutes in) where the controller temporarily stops cleaning up workflows. Workflows stay in the Succeeded state for approximately 15 minutes; after that, cleanup resumes.

After this window, everything seems to go back to normal, with succeeded workflows being cleaned up immediately. I've noticed this happening only once after the controller is restarted.

The configuration we use for TTL:

    ttlStrategy:
      secondsAfterFailure: 86400
      secondsAfterSuccess: 0
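
For reference, a quick way to confirm the ttlStrategy is actually present on a submitted workflow's spec (a rough check; the workflow name below is a placeholder):

$workflowName = "hello-world-abcde"   # placeholder: use a real workflow name from the namespace
kubectl get workflow $workflowName -n default -o jsonpath='{.spec.ttlStrategy}'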

I also have a graph that shows the evolution of workflows in the cluster:
[Graph: number of workflows in the cluster over time]

We don't use artifactGC, so I've ruled out #10840.

Version

v3.4.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 # any non-root user
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

PowerShell script that creates workflows

$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"

while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"

    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }
    Start-Sleep -Seconds 5
}

Logs from the workflow controller

I have some logs:

Line 11652: time="2023-04-20T11:40:06.810Z" level=info msg="Queueing Succeeded workflow default/94f2ce40-879f-41eb-a767-9b7cb93b541b-79xxp for delete in 0s due to TTL"
Line 16234: time="2023-04-20T11:56:39.390Z" level=info msg="Queueing Succeeded workflow default/f05c891b-3532-4197-9910-c6e7bc17c4e0-42l4v for delete in -15m52s due to TTL"
Line 16236: time="2023-04-20T11:56:39.391Z" level=info msg="Queueing Succeeded workflow default/7a41f47a-b60b-48f5-b7e4-9dc43ab5d7c6-5mzcs for delete in -19m5s due to TTL"
Line 16238: time="2023-04-20T11:56:39.392Z" level=info msg="Queueing Succeeded workflow default/ff42bc93-5145-4d84-9c78-a8a3f89108a5-2rkqd for delete in -18m52s due to TTL"
Line 16240: time="2023-04-20T11:56:39.393Z" level=info msg="Queueing Succeeded workflow default/ab5f6918-21da-4d5f-80f8-d4b18efa4df2-4fmvk for delete in -19m12s due to TTL"
Line 16242: time="2023-04-20T11:56:39.394Z" level=info msg="Queueing Succeeded workflow default/ccc7633b-2784-4f5a-91e1-fd279d4cfe97-psr8n for delete in -18m7s due to TTL"
Line 16244: time="2023-04-20T11:56:39.395Z" level=info msg="Queueing Succeeded workflow default/f4b9cc14-2c28-422d-bfbe-1ab4c41727a3-vbdkl for delete in -17m3s due to TTL"
Line 16246: time="2023-04-20T11:56:39.396Z" level=info msg="Queueing Succeeded workflow default/cc55b8c5-efdb-40d7-ae32-64d32c210336-k9ldv for delete in -16m3s due to TTL"
Line 16248: time="2023-04-20T11:56:39.396Z" level=info msg="Queueing Succeeded workflow default/b4f01c20-bfb7-4def-b7d4-5418c343df5d-gnpth for delete in -18m3s due to TTL"
Line 16250: time="2023-04-20T11:56:39.397Z" level=info msg="Queueing Succeeded workflow default/1cdbdfff-454b-43e0-b08a-1eb6268a4ff1-td8d6 for delete in -18m2s due to TTL"
Line 16252: time="2023-04-20T11:56:39.398Z" level=info msg="Queueing Succeeded workflow default/becf0184-066f-47c9-9f86-110b66165190-mxt96 for delete in -18m10s due to TTL"
Line 16254: time="2023-04-20T11:56:39.399Z" level=info msg="Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL"

Logs from in your workflow's wait container

Don't have.
@CiprianAnton CiprianAnton added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Apr 20, 2023
@sarabala1979 sarabala1979 added the P3 Low priority label Apr 20, 2023
@sarabala1979
Member

@CiprianAnton Can you check the k8s API log to make sure the delete call from the workflow controller succeeded?

@CiprianAnton
Contributor Author

After "Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL" appears in the controller logs, the workflow does get deleted from the cluster. kubectl and the Argo Server won't show it again.
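
For reference, one way to watch those deletes from the client side (assumes a kubectl version that supports --output-watch-events):

kubectl get workflows -n default --watch --output-watch-events | Select-String DELETED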

@CiprianAnton
Contributor Author

CiprianAnton commented Apr 21, 2023

I can consistently reproduce this on multiple clusters.
I'm attaching a workflow based on the hello-world example and a PowerShell script that schedules workflows continuously.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 # any non-root user
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

PowerShell script

$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"

while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"

    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }
    Start-Sleep -Seconds 5
}

This should reproduce approximately 20 minutes after the Argo controller starts. Remember to restart the controller in order to reproduce it.
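
For example, assuming the default install (a Deployment named workflow-controller in the argo namespace), the restart can be done with:

kubectl rollout restart deployment workflow-controller -n argo
kubectl rollout status deployment workflow-controller -n argo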

@CiprianAnton CiprianAnton changed the title Sometimes succeeded workflows are not cleaned even if TTL is set 0 Succeeded workflows are not cleaned even if TTL is set 0 Apr 24, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Jun 18, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@CiprianAnton
Contributor Author

CiprianAnton commented Oct 18, 2023

Update: the issue reproduces for failed workflows as well, so I don't think the state matters. Pod cleanup is also affected. Based on log lines like "Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL", I think the problem is on the enqueue side: workflows are not being scheduled for cleanup.

I'm inclined to believe the issue comes from the k8s client. There is also a hardcoded workflowResyncPeriod of 20 minutes, which explains why the issue self-heals over time.

I'm not familiar with the codebase, so here are some questions for others who might know (like @terrytangyuan):

  • What implications would decreasing the resync period have? Is it a heavy operation?
  • I see that wfInformer.Run is being run twice on the same object, once from controller.go and once from gc_controller.go. Is this valid usage? A warning is being reported (one way to spot it in the controller logs is shown below): W1018 10:40:37.539223 1 shared_informer.go:401] The sharedIndexInformer has started, run more than once is not allowed
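
One way to spot that warning, assuming the controller runs as the workflow-controller Deployment in the argo namespace:

kubectl logs -n argo deploy/workflow-controller | Select-String "run more than once is not allowed"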

The problem also reproduces on 3.4.11.
Comparing Argo v3.3.8 with 3.4.11, k8s.io/client-go changed from v0.23.3 to v0.24.3.
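
That can be checked from a clone of the argo-workflows repo (assuming the release tags are named v3.3.8 and v3.4.11):

git diff v3.3.8 v3.4.11 -- go.mod | Select-String "k8s.io/client-go"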

@CiprianAnton
Contributor Author

It might also be worth upgrading the client-go package to a newer version.
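
As a sketch only (the target version here is illustrative, and the other k8s.io modules typically need to move in lockstep):

go get k8s.io/client-go@v0.26.3
go mod tidy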

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Oct 19, 2023
@CiprianAnton
Contributor Author

I can confirm the issue was introduced in 39b7f91, when k8s.io/client-go was upgraded from v0.23.5 to v0.24.3

@terrytangyuan
Member

It's been there for a while, so we may just wait for the fix in kubernetes/kubernetes#127964.

@agilgur5 agilgur5 added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Feb 18, 2024
@tooptoop4
Contributor

tooptoop4 commented Oct 1, 2024

@CiprianAnton did you find a workaround (downgrade the k8s client, shorten the 20m workflowResyncPeriod, increase RECENTLY_STARTED_POD_DURATION, set INFORMER_WRITE_BACK to false, something else)? I think I'm facing a similar issue in #13671; I get "The sharedIndexInformer has started, run more than once is not allowed" in my logs too.

@CiprianAnton
Contributor Author

@tooptoop4 The workaround I used was to just ignore those succeeded workflows; after 20 minutes it self-heals. This issue comes from the k8s Go client and happens once after the pod restarts.
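
If the backlog needs clearing sooner, the stuck workflows can also be deleted manually. A rough example, assuming the controller has applied the standard workflows.argoproj.io/phase label (verify on your cluster before relying on it):

kubectl delete workflows -n default -l workflows.argoproj.io/phase=Succeeded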

@agilgur5 agilgur5 added area/upstream This is an issue with an upstream dependency, not Argo itself solution/workaround There's a workaround, might not be great, but exists labels Oct 28, 2024