add task status named ReleasingFailed #2922
Conversation
Welcome @ycfnana! It looks like this is your first PR to volcano-sh/volcano 🎉
The branch was force-pushed from e4cf4c6 to d9a757a.
How does this PR fix the problems you raised?

From the code point of view, you just make the conditions for returning Releasing more stringent.

My fault, the title is vague. I'll update it.
pkg/scheduler/api/helpers.go (Outdated)

```diff
 		return Releasing
 	}

 	return Running
 case v1.PodPending:
-	if pod.DeletionTimestamp != nil {
+	if pod.DeletionTimestamp != nil &&
+		time.Now().Unix()-pod.DeletionTimestamp.Unix() <= gracePeriodSeconds {
```
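For context, a minimal sketch of how the proposed condition would sit inside `getTaskStatus` in `pkg/scheduler/api/helpers.go`. The surrounding cases, the existing `TaskStatus` constants of the package, and the value of `gracePeriodSeconds` are assumptions for illustration; only the changed condition is taken from the hunk above, and the discussion below suggests the same check is meant for the `PodRunning` case as well.

```go
// Sketch only: constants and surrounding cases are assumed, not copied from the PR.
package api

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// gracePeriodSeconds is the assumed window after DeletionTimestamp during
// which a terminating pod is still reported as Releasing.
const gracePeriodSeconds = 30

func getTaskStatus(pod *v1.Pod) TaskStatus {
	switch pod.Status.Phase {
	case v1.PodRunning:
		// Report Releasing only while the pod is inside its grace period;
		// afterwards keep reporting Running so its resources stay accounted.
		if pod.DeletionTimestamp != nil &&
			time.Now().Unix()-pod.DeletionTimestamp.Unix() <= gracePeriodSeconds {
			return Releasing
		}
		return Running
	case v1.PodPending:
		if pod.DeletionTimestamp != nil &&
			time.Now().Unix()-pod.DeletionTimestamp.Unix() <= gracePeriodSeconds {
			return Releasing
		}
		return Pending
	case v1.PodSucceeded:
		return Succeeded
	case v1.PodFailed:
		return Failed
	default:
		return Unknown
	}
}
```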
`time.Now().Unix()-pod.DeletionTimestamp.Unix()` keeps growing. Once the elapsed time exceeds gracePeriodSeconds, this condition is always false and the status will never be Releasing again. Is this what we want?
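To illustrate the concern, a small self-contained example (the names and values are made up, not taken from the PR): because the elapsed time since DeletionTimestamp only grows, the check flips to false once the grace period is passed and stays false on every later scheduling cycle.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const gracePeriodSeconds = 30
	deletionTime := time.Now().Add(-45 * time.Second) // pod was marked deleted 45s ago

	// elapsed only grows over time, so once it exceeds gracePeriodSeconds
	// the condition below is false forever for this pod.
	elapsed := time.Now().Unix() - deletionTime.Unix()
	fmt.Println(elapsed <= gracePeriodSeconds) // false, and it stays false
}
```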
In the Running case, if deleting the pod takes longer than gracePeriodSeconds, I think something is wrong with the pod, but the scheduler still counts its requested resources as allocated, so the pod should not stay in the Releasing state. In the Pending case, it means the pod has been deleted but a node name has already been assigned, so its resources also cannot be freed and reallocated by the scheduler. Although I have not hit this case, I think it should be handled as well.
> `time.Now().Unix()-pod.DeletionTimestamp.Unix()` keeps growing. Once the elapsed time exceeds gracePeriodSeconds, this condition is always false and the status will never be Releasing again. Is this what we want?

I think gracePeriodSeconds can be set larger, and when the waiting time exceeds gracePeriodSeconds the pod should move to some status other than Running, but I have no idea which status to use.
Sorry, I don't think this is a perfect fix.
Yeah, but this needs to be fixed. I think we should add a status like "ReleaseFailed"; in the resource calculation code it should be treated the same as Running. What do you think?
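A rough sketch of what that could look like, assuming the existing `TaskStatus` type and the `AllocatedStatus` helper in `pkg/scheduler/api`. The new constant, its value, and its inclusion in `AllocatedStatus` are hypothetical, not the PR's actual code; the PR title later settles on the name ReleasingFailed.

```go
// Hypothetical sketch: a status for pods whose graceful deletion has exceeded
// the grace period, counted like Running when computing allocated resources.
const ReleasingFailed TaskStatus = 100 // placeholder value, not the real one

// AllocatedStatus reports whether a task in the given status still occupies
// node resources. Treating ReleasingFailed like Running keeps its resources
// counted, so the scheduler does not place new tasks onto capacity that has
// not actually been freed yet.
func AllocatedStatus(status TaskStatus) bool {
	switch status {
	case Bound, Binding, Running, Allocated, ReleasingFailed:
		return true
	default:
		return false
	}
}
```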
@wangyang0616, can you help review this PR?
I fixed it in #2943, can you help review it again?
If a pod stays terminating for a long time (for example because of a zombie process), tasks will still be scheduled onto this node, and the job may then hang until the pod with the zombie process is force killed.