
ArtifactGC fails if output artifact with non-existent file is configured with optional: true #13583

Closed · Fixed by #13678
encigem opened this issue Sep 9, 2024 · 9 comments
Labels: area/artifacts, area/gc, type/bug

encigem commented Sep 9, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

What happened?:

  • When Artifact Garbage Collection is configured in a workflow, and the workflow reaches its deadline before a step manages to create a file that is declared as an output artifact with optional: true, then the ArtGC pod which runs after the steps are terminated fails with the message: Error: You need to configure artifact storage. More information on how to do this can be found in the docs: https://argoproj.github.io/argo-workflows/configure-artifact-repository/
  • If you remove the optional: true config from the offending artifact and re-run, then no artifact GC pod is created when the workflow terminates from deadline exceeded.
  • If you remove the offending artifact item entirely and re-run, no such Error: You need to configure artifact storage messages appear and the other artifacts are cleaned as expected when timeout occurs.

What did I expect to happen:

  • I expected the existing artifacts to be cleaned up and the non-existent artifact to be ignored by the ArtGC pod, which should then finish successfully.

Version(s)

v3.4.10, v3.5.10, latest (sha256:4f03ff7ecaef4061dddd2c08f80de4d766b253aa3a57a87e69dd3a797bb42b1e)

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-artgc-
  namespace: argo
spec:
  entrypoint: main
  activeDeadlineSeconds: 20

  volumes:
    - name: emptydir
      emptyDir: {}

  templates:
    - name: main
      dag:
        tasks:
          - name: prepare-data
            template: prepare-data
            arguments:
              parameters: 
              - name: index
                value: "{{item}}"
            withParam: '[1, 2, 3]'
          
    - name: prepare-data
      inputs:
        parameters:
          - name: index
          - name: image
            value: "alpine:latest"
      outputs:
        artifacts:
          # Comment out the 'late-artifact' item and re-run to get a
          # successful artifact garbage collection pod execution
          - name: late-artifact-{{inputs.parameters.index}}
            path: "/empty-dir/late-artifact.txt"
            artifactGC:
              strategy: OnWorkflowCompletion
            optional: true  # removing this will result in no ArtGC pod being run
          - name: abcartifact-{{inputs.parameters.index}}
            path: "/empty-dir/abcartifact.txt"
            artifactGC:
              strategy: OnWorkflowCompletion
            optional: true
          - name: 123artifact-{{inputs.parameters.index}}
            path: "/empty-dir/123artifact.txt"
            artifactGC:
              strategy: OnWorkflowCompletion
            optional: true
      container:
        image: "{{inputs.parameters.image}}"
        volumeMounts:
          - name: emptydir
            mountPath: /empty-dir
        workingDir: /empty-dir
        command: ["sh", "-c"]
        args:
          - |

            set -x
            
            echo Hi > abcartifact.txt
            echo Hi > 123artifact.txt

            # cause timeout
            sleep 20

            echo Hi > late-artifact.txt

Logs from the workflow controller

$  kubectl logs -n argo deploy/workflow-controller | grep test-artgc-pv7h2

time="2024-09-09T15:50:26.593Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=209475215 namespace=argo workflow=test-artgc-pv7h2
time="2024-09-09T15:50:26.594Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-artgc-pv7h2
time="2024-09-09T15:50:36.472Z" level=info msg="Processing workflow" Phase=Failed ResourceVersion=209475215 namespace=argo workflow=test-artgc-pv7h2
time="2024-09-09T15:50:36.472Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-artgc-pv7h2

Logs from your workflow's wait container

$  kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=test-artgc-pv7h2,workflow.argoproj.io/phase!=Succeeded

error: container wait is not valid for pod test-artgc-pv7h2-artgc-wfcomp-2166136261
agilgur5 changed the title from "ArtifactGC fails if an output artifact entry with non-existent file in the path is configured with 'optional: true' parameter" to "ArtifactGC fails if output artifact with non-existent file is configured with optional: true" on Sep 9, 2024
encigem (Author) commented Sep 10, 2024

More info:

  • Workflows that have a failed artifact GC pod due to this issue cannot be deleted without forcing removal of the workflow's finalizer (see the example below the list).
  • When the issue occurs due to the optional: true config, none of the output artifacts are cleaned up.
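
For reference, a hedged example of unsticking such a Workflow by clearing its finalizers. The workflow name is the one from the logs in this thread; the patch removes all finalizers on the object, so only use it when Artifact GC is known to be stuck. (The ForceFinalizerRemoval flag mentioned later in this thread opts into similar behavior up front.)

# Force-remove the finalizers so `kubectl delete workflow` can complete.
# Workflow name and namespace are the ones from this reproduction.
kubectl patch workflow test-artgc-pv7h2 -n argo \
  --type json \
  --patch '[{"op": "remove", "path": "/metadata/finalizers"}]'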

Logs of succeeded ArtGC pod, when non-existent artifact is commented out:

$ kubectl logs -n argo test-artgc-mqhjs-artgc-wfcomp-2166136261
time="2024-09-10T07:02:51.393Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/123artifact-2.tgz"
time="2024-09-10T07:02:51.393Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.393Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/123artifact-2.tgz
time="2024-09-10T07:02:51.479Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/abcartifact-2.tgz"
time="2024-09-10T07:02:51.479Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.479Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-2830858817/abcartifact-2.tgz
time="2024-09-10T07:02:51.483Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/123artifact-3.tgz"
time="2024-09-10T07:02:51.483Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.484Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/123artifact-3.tgz
time="2024-09-10T07:02:51.487Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/abcartifact-3.tgz"
time="2024-09-10T07:02:51.487Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.487Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-4098831283/abcartifact-3.tgz
time="2024-09-10T07:02:51.492Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/123artifact-1.tgz"
time="2024-09-10T07:02:51.492Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.492Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/123artifact-1.tgz
time="2024-09-10T07:02:51.495Z" level=info msg="S3 Delete artifact: key: test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/abcartifact-1.tgz"
time="2024-09-10T07:02:51.495Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2024-09-10T07:02:51.495Z" level=info msg="Deleting object from s3" bucket=my-bucket endpoint="minio:9000" key=test-artgc-mqhjs/test-artgc-mqhjs-prepare-data-668465731/abcartifact-1.tgz

Logs of failed ArtGC pod when WorkFlow fails due to timeout and non-existent artifact is included in the WF YAML:

$ kubectl logs -n argo test-artgc-pv7h2-artgc-wfcomp-2166136261
Error: You need to configure artifact storage. More information on how to do this can be found in the docs: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/
You need to configure artifact storage. More information on how to do this can be found in the docs: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/

agilgur5 commented Sep 10, 2024

Hmm, I thought this would have been fixed by #13066; it sounds like the optional: true case wasn't covered? I don't think that PR even considers it, since if the artifact isn't there, there's nothing to delete (as you said as well) 🤔 cc @juliev0

agilgur5 added the area/artifacts and area/gc labels on Sep 10, 2024
juliev0 self-assigned this on Sep 11, 2024
juliev0 (Contributor) commented Sep 11, 2024

I have to admit that I wasn't aware of the optional parameter until now. :)

Now I see what you're saying, and I see that my PR has issues with both:

  1. prematurely returning when any single artifact errors out, instead of continuing with the other output artifacts, and
  2. not accounting for the Optional artifact case (see the sketch after this list)
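
For illustration, a minimal Go sketch of the shape such a fix could take. Artifact, Driver, and errNotFound are hypothetical stand-ins, not the actual Argo Workflows types: per-artifact errors are collected rather than returned immediately, and a missing optional artifact is treated as a no-op.

package gcsketch

import (
	"errors"
	"fmt"
)

// errNotFound stands in for whatever "object does not exist" error the
// real storage driver returns.
var errNotFound = errors.New("artifact not found")

// Artifact and Driver are illustrative stand-ins, not Argo's types.
type Artifact struct {
	Name     string
	Key      string
	Optional bool
}

type Driver interface {
	Delete(key string) error
}

// deleteArtifacts visits every artifact instead of returning on the first
// failure, and skips optional artifacts that were never written.
func deleteArtifacts(d Driver, artifacts []Artifact) error {
	var errs []error
	for _, a := range artifacts {
		err := d.Delete(a.Key)
		switch {
		case err == nil:
			// deleted successfully
		case a.Optional && errors.Is(err, errNotFound):
			// the file was never produced, so there is nothing to clean up
		default:
			// record the failure but keep going, so the remaining
			// artifacts are still garbage-collected
			errs = append(errs, fmt.Errorf("delete %s: %w", a.Name, err))
		}
	}
	return errors.Join(errs...) // nil if every delete succeeded or was skipped
}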

I assigned it to myself to address sometime soon. (Or otherwise let me know if you'd like to work on it @encigem )

agilgur5 commented Sep 11, 2024

  1. Yes, this is closely related to #13066 (comment) ("fix: don't necessarily include all artifacts from templates in node outputs"). This general issue would be handled by parallelization (#11768, "feat: Parallel artifact GC" / #12442, "Implement parallelization to speed up S3 artifacts upload and download"), as the artifacts would no longer be handled serially, so I'd probably just recommend rewriting it in parallel.

encigem (Author) commented Sep 11, 2024

I'm afraid that, w.r.t. Argo Workflows, my skills are in finding bugs, not fixing them :)
Please go ahead as assignee @juliev0. Thank you! 👍

juliev0 (Contributor) commented Sep 13, 2024

You know, I'm thinking about this comment you wrote after I wrote up that PR, @agilgur5.

If I did this (i.e. just ignored the error on the Artifact GC side) and essentially reverted my previous PR, then the WorkflowTaskResult would go back to including all artifacts, whether or not they exist and whether or not they're optional. So we may sometimes create an ArtifactGC Pod which essentially does nothing, and that seems okay, I guess.

What do you think?
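
A minimal sketch of that "ignore it on the GC side" idea, assuming a hypothetical deleteFn standing in for the storage driver call (not the actual Argo code): a not-found response is treated the same as a successful delete, so a GC pod handed an artifact that was never written still succeeds.

package gcignore

import "errors"

// errNotFound stands in for the driver's "object does not exist" error.
var errNotFound = errors.New("artifact not found")

// deleteIgnoringMissing treats "not found" as success: if the artifact was
// never uploaded there is nothing to clean up, and the GC pod should not fail.
func deleteIgnoringMissing(key string, deleteFn func(string) error) error {
	if err := deleteFn(key); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}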

agilgur5 commented

Ah I thought I had a more specific comment on this exact scenario, thanks for finding it!

I still think parallelized deletion and saving would be more optimal and would force us to properly handle these scenarios instead of a premature return.

Although if you're looking for a quick fix, yes, that sounds like it could handle this scenario.
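
To make the parallel idea concrete, a rough Go sketch with hypothetical stand-in types (golang.org/x/sync/errgroup is assumed as the concurrency helper): every delete runs in its own goroutine, so one failure can no longer short-circuit the others, while the group still surfaces an error at the end.

package gcparallel

import "golang.org/x/sync/errgroup"

// Driver and Artifact are illustrative stand-ins, not Argo's types.
type Driver interface {
	Delete(key string) error
}

type Artifact struct {
	Key string
}

func deleteArtifactsParallel(d Driver, artifacts []Artifact) error {
	var g errgroup.Group
	g.SetLimit(8) // arbitrary cap on concurrent requests to the object store
	for _, a := range artifacts {
		a := a // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			return d.Delete(a.Key)
		})
	}
	// Wait blocks until every delete has run and returns the first
	// non-nil error, so all artifacts are attempted even if some fail.
	return g.Wait()
}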

juliev0 (Contributor) commented Sep 28, 2024

Just started to work on this, and realized that if I do revert my earlier change, all artifacts would be included in the WorkflowTaskResult. That means that during Artifact GC, even though the pod attempts deletion of all artifacts, it will still error trying to delete some of them. And without the ForceFinalizerRemoval flag set, the Workflow's finalizer will remain and deletion will be deemed to have failed.

For both Optional and non-Optional artifacts, it seems that we only want to attempt deletion of whatever exists, and we don't want to fail Artifact GC just because we're trying to delete an artifact that doesn't exist. If it was a non-Optional artifact, we will already have failed the Workflow itself, but that doesn't mean we should also fail Artifact GC.

Therefore, I'm thinking of instead maintaining the notion that a WorkflowTaskResult only includes the artifacts that were successfully written, but fixing the logic for it based on the problems identified by @encigem.

@agilgur5 @encigem feel free to push back on anything I'm saying here if I'm mistaken, thanks.
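
A sketch of that approach, with hypothetical names (save stands in for the wait container's artifact upload; Artifact is a stand-in type): the reported result only ever lists artifacts that were actually written, so the GC pod is never asked to delete something that does not exist.

package taskresult

// Artifact is an illustrative stand-in, not Argo's type.
type Artifact struct {
	Name     string
	Optional bool
}

// reportSavedArtifacts returns only the artifacts that were successfully
// uploaded; these are what a WorkflowTaskResult-style report would carry.
// A missing optional artifact is simply omitted; a missing non-optional
// one fails the node elsewhere, but either way it is never handed to GC.
func reportSavedArtifacts(artifacts []Artifact, save func(Artifact) error) []Artifact {
	var saved []Artifact
	for _, a := range artifacts {
		if err := save(a); err != nil {
			continue
		}
		saved = append(saved, a)
	}
	return saved
}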

agilgur5 commented Sep 29, 2024

Thanks for going through the logic! I agree, that sounds like the most correct way to handle it.
