
task still Running when Workflow Failed -- no repro #13253

Closed
heidongxianhua opened this issue Jun 27, 2024 · 12 comments
Labels
area/controller (Controller issues, panics), area/retryStrategy (Template-level retryStrategy), problem/more information needed (Not enough information has been provided to diagnose this issue), problem/stale (This has not had a response in some time), type/bug

Comments

@heidongxianhua
Contributor

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

When the workflow failed, all of its steps should be in a finished status (Stopped, Failed, or Succeeded), but one step is still stuck in Running.


When we mark the workflow as Failed, we should mark all of its steps to a finished status.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

It is related to node resources.

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
none

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
none
@jswxstw
Member

jswxstw commented Jun 27, 2024

A similar issue: #12703

When we mark the workflow as Failed, we should mark all of its steps to a finished status.

It makes sense to mark all active nodes, except the exit handler's nodes, as Failed when the workflow is marked Failed or Error directly.
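
For illustration only, here is a minimal workflow (not from this issue; all names and images are made up) with an onExit handler. Under the behavior proposed above, if this workflow were marked Failed or Error directly while main was still active, the main node would be marked Failed, but the cleanup exit-handler node would still be allowed to run to completion:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-handler-
spec:
  entrypoint: main
  onExit: cleanup                 # exit-handler template: its nodes would not be force-failed
  templates:
  - name: main
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["sleep 3600"]        # long-running step that may still be active when the workflow fails
  - name: cleanup
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["echo cleaning up"]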

@heidongxianhua
Contributor Author

Yes, and now the workflow is finished (Failed), but the step is still Running; we cannot retry the workflow and get the following error:
[screenshot of the retry error omitted]

@jswxstw
Member

jswxstw commented Jun 27, 2024

Yes, and now the workflow is finished (Failed), but the step is still Running

What is the reason for the failure of your workflow? Normally, it should be: node failed -> workflow failed.

@heidongxianhua
Contributor Author

I have no idea, but I guess the reason is the dependency. The following step depends on these three steps; the dependency is depends: >- fep-protein.Succeeded && fep-water.Succeeded && (fep-gas.Succeeded || fep-gas.Failed)
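
For context, a depends expression like that would appear on a DAG task. Below is a minimal sketch with hypothetical task and template names (analyze, echo) and a placeholder image, since the actual workflow was not shared in this issue:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fep-depends-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: fep-protein
        template: echo
      - name: fep-water
        template: echo
      - name: fep-gas
        template: echo
      - name: analyze                # hypothetical downstream task
        template: echo
        depends: >-
          fep-protein.Succeeded && fep-water.Succeeded && (fep-gas.Succeeded || fep-gas.Failed)
  - name: echo                       # placeholder template, not the real workload
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["echo hello"]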

@agilgur5

we cannot retry the workflow and get the following error:

Please use text instead of images when listing logs, configurations, or code, as text is much more accessible than images.

The following step depends on these three steps; the dependency is depends: >- fep-protein.Succeeded && fep-water.Succeeded && (fep-gas.Succeeded || fep-gas.Failed)

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

It is related to node resources.

You didn't provide a Workflow, so this is not reproducible. Please follow the issue template accurately.

@agilgur5 agilgur5 added the area/controller and problem/more information needed labels Jun 30, 2024
@agilgur5 agilgur5 changed the title from "workflow status update not right" to "task still Running when Workflow Failed" Jun 30, 2024
@github-actions bot

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale label Jul 15, 2024
@github-actions bot

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

@github-actions github-actions bot closed this as not planned Jul 29, 2024
@agilgur5 agilgur5 changed the title from "task still Running when Workflow Failed" to "task still Running when Workflow Failed -- no repro" Jul 29, 2024
@tooptoop4
Contributor

/reopen

@jswxstw jswxstw reopened this Nov 12, 2024
@jswxstw jswxstw removed the problem/stale label Nov 12, 2024
@isubasinghe
Member

Are we sure this is still an issue? 3.5.12 was released with a fix for this; could you please double-check?

Thanks

@jswxstw
Member

jswxstw commented Nov 12, 2024

I reproduced a similar issue by simulating a failure scenario for createWorkflowPod, but I'm not sure if it's related to this issue.
@heidongxianhua @tooptoop4 Could you confirm if you have encountered a similar problem?

Workflow demo:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-retry
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: retry
        template: retry
  - name: retry
    retryStrategy:
      limit: 3
    container:
      image: python:alpine3.6
      command: ["python", -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]


Controller logs:

time="2024-11-12T10:39:53.127Z" level=debug msg="Evaluating node test-retry[0].retry: template: *v1alpha1.WorkflowStep (retry), boundaryID: test-retry" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.127Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.127Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.WorkflowStep (retry)"
time="2024-11-12T10:39:53.127Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.WorkflowStep (retry)"
time="2024-11-12T10:39:53.127Z" level=debug msg="Getting the template by name: retry" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.WorkflowStep (retry)"
time="2024-11-12T10:39:53.127Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-11-12T10:39:53.128Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-11-12T10:39:53.128Z" level=debug msg="Getting the template by name: main" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-11-12T10:39:53.128Z" level=debug msg="Inject a retry node for node test-retry[0].retry" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=debug msg="Initializing node test-retry[0].retry: template: *v1alpha1.WorkflowStep (retry), boundaryID: test-retry" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="Retry node test-retry-1045228556 initialized Running" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=warning msg="Non-transient error: <nil>"
time="2024-11-12T10:39:53.128Z" level=debug msg="Initializing node test-retry[0].retry(0): template: *v1alpha1.WorkflowStep (retry), boundaryID: test-retry" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="Pod node test-retry-3000431751 initialized Pending" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-11-12T10:39:53.128Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-11-12T10:39:53.128Z" level=debug msg="Getting the template by name: main" base="*v1alpha1.Workflow (namespace=argo,name=test-retry)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-11-12T10:39:53.128Z" level=debug msg="Executing node test-retry[0].retry(0) with container template: retry\n" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=warning msg="Non-transient error: pod created failed!"
time="2024-11-12T10:39:53.128Z" level=error msg="Mark error node" error="pod created failed!" namespace=argo nodeName="test-retry[0].retry(0)" workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry-3000431751 phase Pending -> Error" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry-3000431751 message: pod created failed!" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry-3000431751 finished: 2024-11-12 02:39:53.12828131 +0000 UTC" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=error msg="Mark error node" error="step group deemed errored due to child test-retry[0].retry error: pod created failed!" namespace=argo nodeName="test-retry[0]" workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry-1999497484 phase Running -> Error" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry-1999497484 message: step group deemed errored due to child test-retry[0].retry error: pod created failed!" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry-1999497484 finished: 2024-11-12 02:39:53.128340075 +0000 UTC" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="step group test-retry-1999497484 was unsuccessful: step group deemed errored due to child test-retry[0].retry error: pod created failed!" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="Outbound nodes of test-retry-1045228556 is [test-retry-3000431751]" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="Outbound nodes of test-retry is [test-retry-3000431751]" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry phase Running -> Failed" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry message: step group deemed errored due to child test-retry[0].retry error: pod created failed!" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="node test-retry finished: 2024-11-12 02:39:53.128393312 +0000 UTC" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=debug msg="Checking daemoned children of test-retry" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=debug msg="Task results completion status: map[]" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="Updated message  -> step group deemed errored due to child test-retry[0].retry error: pod created failed!" namespace=argo workflow=test-retry
time="2024-11-12T10:39:53.128Z" level=info msg="Marking workflow completed" namespace=argo workflow=test-retry

@jswxstw jswxstw added the area/retryStrategy label Nov 12, 2024
@github-actions bot

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale label Nov 26, 2024
@github-actions bot

This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.

@github-actions github-actions bot closed this as not planned Dec 11, 2024