Cannot delete failed Workflow due to ArtifactGC finalizer #13499

Closed
4 tasks done
chengjoey opened this issue Aug 24, 2024 · 9 comments · Fixed by #13500
Labels
area/artifacts (S3/GCP/OSS/Git/HDFS etc), area/gc (Garbage collection, such as TTLs, retentionPolicy, delays, and more), type/bug

Comments

@chengjoey
Contributor

chengjoey commented Aug 24, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Run the example examples/artifact-gc-workflow.yaml:

  1. kubectl create -f examples/artifact-gc-workflow.yaml
  2. Because I am using a linux/arm64 system, the workflow failed:
kubectl get workflow
NAME                STATUS   AGE   MESSAGE
artifact-gc-qzcrv   Failed   10s   Error (exit code 64): failed to start command: fork/exec /bin/sh: exec format error
  3. So I decided to delete the workflow, but the deletion gets stuck and never completes.
    Get the Workflow:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2024-08-24T10:08:29Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-08-24T10:13:21Z"
  finalizers:
  - workflows.argoproj.io/artifact-gc
  generateName: artifact-gc-
  generation: 5
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Failed
  name: artifact-gc-qzcrv
  namespace: default
  resourceVersion: "8957"
  uid: 4e159dab-2888-4097-a4c3-9ff67758d28d
spec:
...

I found that the ArtifactGC finalizer still exists; it should have been removed.
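
As a temporary workaround, the stuck Workflow can be deleted by clearing the finalizer by hand; a sketch, assuming the ArtifactGC finalizer is the only entry in metadata.finalizers:

kubectl patch workflow artifact-gc-qzcrv --type=json \
  -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'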

Version(s)

v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Run examples/artifact-gc-workflow.yaml on an ARM Mac, and then delete the workflow.

Logs from the workflow controller

time="2024-08-24T10:08:40.025Z" level=info msg="node changed" namespace=default new.message="Error (exit code 64): failed to start command: fork/exec /bin/sh: exec format error" new.phase=Failed new.progress=0/1 nodeID=artifact-gc-qzcrv old.message= old.phase=Pending old.progress=0/1 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg=reconcileAgentPod namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Updated message  -> Error (exit code 64): failed to start command: fork/exec /bin/sh: exec format error" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Marking workflow completed" namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.026Z" level=info msg="Workflow to be dehydrated" Workflow Size=1689
time="2024-08-24T10:08:40.048Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=8669 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:40.069Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/artifact-gc-qzcrv/labelPodCompleted
time="2024-08-24T10:08:45.068Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8669 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:45.069Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:45.069Z" level=info msg="Workflow to be dehydrated" Workflow Size=1715
time="2024-08-24T10:08:45.088Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=8680 workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:50.093Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8680 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:08:50.093Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv
time="2024-08-24T10:12:52.475Z" level=info msg="Alloc=5451 TotalAlloc=18959 Sys=24165 NumGC=8 Goroutines=175"
time="2024-08-24T10:13:31.589Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8957 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:31.589Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:31.590Z" level=info msg="Workflow to be dehydrated" Workflow Size=1751
time="2024-08-24T10:13:31.604Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=8967 workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:36.610Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=8967 namespace=default workflow=artifact-gc-qzcrv
time="2024-08-24T10:13:36.610Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=artifact-gc-qzcrv

Logs from your workflow's wait container

time="2024-08-24T10:08:31.403Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-08-24T10:08:31.404Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-24T10:08:31.404Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=artifact-gc-qzcrv templateName=main version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:12:26Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/arm64,}"
time="2024-08-24T10:08:31.414Z" level=info msg="Starting deadline monitor"
time="2024-08-24T10:08:32.414Z" level=info msg="Main container completed" error="<nil>"
time="2024-08-24T10:08:32.414Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-24T10:08:32.414Z" level=info msg="No output parameters"
time="2024-08-24T10:08:32.414Z" level=info msg="Saving output artifacts"
time="2024-08-24T10:08:32.415Z" level=info msg="Staging artifact: on-completion"
time="2024-08-24T10:08:32.415Z" level=info msg="Copying /tmp/on-completion.txt from container base image layer to /tmp/argo/outputs/artifacts/on-completion.tgz"
time="2024-08-24T10:08:32.415Z" level=info msg="/var/run/argo/outputs/artifacts/tmp/on-completion.txt.tgz -> /tmp/argo/outputs/artifacts/on-completion.tgz"
time="2024-08-24T10:08:32.415Z" level=error msg="executor error: open /var/run/argo/outputs/artifacts/tmp/on-completion.txt.tgz: no such file or directory"
time="2024-08-24T10:08:32.437Z" level=info msg="Alloc=8902 TotalAlloc=13535 Sys=24165 NumGC=3 Goroutines=8"
time="2024-08-24T10:08:32.456Z" level=fatal msg="open /var/run/argo/outputs/artifacts/tmp/on-completion.txt.tgz: no such file or directory"
@chengjoey
Contributor Author

chengjoey commented Aug 24, 2024

kubectl get pod artifact-gc-qzcrv -o yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: main
    workflows.argoproj.io/node-id: artifact-gc-qzcrv
    workflows.argoproj.io/node-name: artifact-gc-qzcrv
  creationTimestamp: "2024-08-24T10:08:30Z"
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/workflow: artifact-gc-qzcrv
  name: artifact-gc-qzcrv
  namespace: default

The label workflows.argoproj.io/completed is "true", so the pod cannot be retrieved via the pod informer:

func (wfc *WorkflowController) newWorkflowPodWatch(ctx context.Context) *cache.ListWatch {
    c := wfc.kubeclientset.CoreV1().Pods(wfc.GetManagedNamespace())
    // completed=false
    labelSelector := labels.NewSelector().
        Add(*workflowReq).
        Add(*incompleteReq).
        Add(wfc.instanceIDReq())
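
In other words, the informer only lists pods whose completed label is still false; roughly the set a selector like this would match (my approximation from the requirement names above, not the exact selector string):

kubectl get pods -l 'workflows.argoproj.io/workflow,workflows.argoproj.io/completed=false'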

This causes anyPodSuccess to always be false

pods, err := woc.controller.podInformer.GetIndexer().ByIndex(indexes.WorkflowIndex, woc.wf.GetNamespace()+"/"+woc.wf.GetName())
if err != nil {
    return fmt.Errorf("failed to get pods from informer: %w", err)
}
anyPodSuccess := false
for _, obj := range pods {

removeFinalizer = anyPodSuccess && woc.allArtifactsDeleted()

@agilgur5 agilgur5 changed the title from "delete gc-artifact workflow stuck" to "Cannot delete failed Workflow due to ArtifactGC finalizer" Aug 24, 2024
@agilgur5 agilgur5 added the area/artifacts (S3/GCP/OSS/Git/HDFS etc) and area/gc (Garbage collection, such as TTLs, retentionPolicy, delays, and more) labels Aug 24, 2024
@agilgur5

See my review comment. This seems like the exact use-case for the forceFinalizerRemoval field
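
i.e. something along these lines in the Workflow spec (a sketch; check the ArtifactGC docs for the exact field placement):

spec:
  artifactGC:
    strategy: OnWorkflowDeletion
    forceFinalizerRemoval: true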

@agilgur5 agilgur5 added the type/support (User support issue - likely not a bug) label and removed the type/bug label Aug 24, 2024
@agilgur5

agilgur5 commented Aug 24, 2024

Looks like this might be caused by a failure of TaskResult reconciliation (so ArtifactGC didn't run yet), and so would share a root cause with #12993 and be fixed by #13454. See my new comment on the PR

@agilgur5

Try running your Controller image with :latest, which will include #13454

@agilgur5 agilgur5 added the type/bug, type/regression (Regression from previous behavior, a specific type of bug), and solution/superseded (This PR or issue has been superseded by another one, slightly different from a duplicate) labels and removed the type/support (User support issue - likely not a bug) label Aug 24, 2024
@chengjoey
Contributor Author

chengjoey commented Aug 24, 2024

I have tried it. In fact, I have debugged it locally with the latest code, and the Workflow cannot be deleted successfully unless force is used. However, I don't think force should be necessary in this scenario. Looking at the code, woc.allArtifactsDeleted() is actually true, but anyPodSuccess prevents the deletion of finalizers. I think zero pods and anyPodSuccess == true are equivalent.

removeFinalizer = anyPodSuccess && woc.allArtifactsDeleted()
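
The idea would be roughly the following; just a sketch of that equivalence, not necessarily the exact change in #13500:

// Treat "no ArtifactGC pods were ever created" the same as "at least one pod succeeded",
// so a Workflow that never produced any artifacts can still drop its finalizer.
removeFinalizer = (anyPodSuccess || len(pods) == 0) && woc.allArtifactsDeleted()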

@agilgur5 agilgur5 removed the type/regression (Regression from previous behavior, a specific type of bug) and solution/superseded (This PR or issue has been superseded by another one, slightly different from a duplicate) labels Aug 25, 2024
@agilgur5

To clarify: so in this scenario, no ArtifactGC Pods were launched, since no artifacts were created?

Looking at the code, woc.allArtifactsDeleted() is actually true, but anyPodSuccess prevents the deletion of finalizers. I think zero pods and anyPodSuccess == true are equivalent.

Ah, I see what you mean, thanks for elaborating!

Yes, if all artifacts were already deleted, then I think this check is just to make sure that there are no failed ArtifactGC Pods laying around? 🤔 cc @juliev0

@juliev0
Contributor

juliev0 commented Aug 25, 2024

Hey all. This is really interesting. You've definitely happened upon at least one bug. But you know, while the LabelKeyCompleted label is being set to "false", it seems I never actually set it to "true" or deleted it, so it shouldn't actually be bypassed by that Informer. Do you agree?

(in which case, we can decide if it should even have that label)

@juliev0
Contributor

juliev0 commented Aug 25, 2024

Looking at the anyPodSuccess code, I think it was probably just there to eliminate unnecessary work, by the way: to avoid doing the woc.allArtifactsDeleted() check if none of the ArtifactGC Pods had even finished and succeeded yet.

@juliev0
Contributor

juliev0 commented Aug 25, 2024

Now I'm seeing in your PR that you mention that the Workflow failed due to not being able to run the image. Okay, let me respond over there...

So, it seems like root cause is really that you never had any ArtifactGC Pods to begin with, right? And the anyPodSuccess code is written with the expectation that if you have ArtifactGC configured in the Workflow, you'll have at least one Artifact that gets GC'ed.
