Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the conflict between preemption and antiAffinity #3070

Merged

Conversation

wangyang0616
Copy link
Member

@wangyang0616 wangyang0616 commented Aug 26, 2023

  1. fix: #3068
  2. Partial optimization based on #3051, Currently, preemption only considers extended resources such as cpu, memory, and gpu, and does not consider strategies such as affinity, topology, etc. When the predicate inherits the k8s filtering algorithm, if the return value is not Success, it returns an error, neither allcate nor preempt

Test Results:
The high-priority job anti-test preempts the resources of the low-priority job label-sh and runs successfully.
image

image

image

Pending pod event:
image

Pending pod msg:
image

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 26, 2023
@@ -411,45 +411,35 @@ func (pp *predicatesPlugin) OnSessionOpen(ssn *framework.Session) {
Reason: api.NodePodNumberExceeded,
}
predicateStatus = append(predicateStatus, podsNumStatus)
return predicateStatus, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't return, please filter out those reason is null string at

for _, status := range s {
all = append(all, status.Reason)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if nodeUnscheduleStatus.Code != api.Success {
predicateStatus = append(predicateStatus, nodeUnscheduleStatus)
return predicateStatus, false, nil
return predicateStatus, false, fmt.Errorf("plugin %s predicates failed %s", nodeUnscheduleFilter.Name(), status.Message())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if error is return, those code will not hit:

if statusSets.ContainsUnschedulable() || statusSets.ContainsUnschedulableAndUnresolvable() ||
statusSets.ContainsErrorSkipOrWait() {
return nil, api.NewFitError(task, node, statusSets.Message())
}

if statusSets.ContainsUnschedulableAndUnresolvable() || statusSets.ContainsErrorSkipOrWait() {
return nil, fmt.Errorf("predicates failed in preempt for task <%s/%s> on node <%s>, status is not success or unschedulable",
task.Namespace, task.Name, node.Name)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The filtering of the pod number in the predicate plugin will go to this part of the code logic
  • The gpu-related resource filtering in device FilterNode will go to this part of the code logic

For strategies such as nodeaffinity, podaffinity, and nodeport, an error is returned directly, and the preemption action is not currently supported

if nodeAffinityStatus.Code != api.Success {
predicateStatus = append(predicateStatus, nodeAffinityStatus)
return predicateStatus, false, nil
return predicateStatus, false, fmt.Errorf("plugin %s predicates failed %s", nodeAffinityFilter.Name(), status.Message())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if nodePortStatus.Code != api.Success {
predicateStatus = append(predicateStatus, nodePortStatus)
return predicateStatus, nil
return predicateStatus, fmt.Errorf("plugin %s predicates failed %s", nodePortFilter.Name(), status.Message())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@lowang-bh
Copy link
Member

Thanks for your hard work. BTW, Is there any test result about this PR, not only including preemtation works well, but also the pod/podgroup unscheduling message in their status.

@lowang-bh
Copy link
Member

lowang-bh commented Aug 26, 2023

And there is another enhancement can be used in preemption action: #3071

What's more, I think the current preemption action can not work well. In my opinion, it should following those steps:

  1. filter out those nodes which can not help preemption, according to the filter result stored in job's NodesFitErrors
  2. range from those nodes in step-1, consider the victims from a node and remove the victims from this node, and then re-run predications on the node; if result is success, then the node is a candidate node, put them in final node list;
  3. score on thoes candinate nodes, to choose a best one.

@wangyang0616
Copy link
Member Author

Thanks for your hard work. BTW, Is there any test result about this PR, not only including preemtation works well, but also the pod/podgroup unscheduling message in their status.

I will add the test results later.

@wangyang0616 wangyang0616 force-pushed the fix_preempt_antiaffinity branch 3 times, most recently from 14d1a8d to 5d6cc2d Compare August 27, 2023 13:19
…t does not support preemption by strategies such as antiAffinity and topologyspread

Signed-off-by: wangyang <[email protected]>
@wangyang0616
Copy link
Member Author

And there is another enhancement can be used in preemption action: #3071

What's more, I think the current preemption action can not work well. In my opinion, it should following those steps:

  1. filter out those nodes which can not help preemption, according to the filter result stored in job's NodesFitErrors
  2. range from those nodes in step-1, consider the victims from a node and remove the victims from this node, and then re-run predications on the node; if result is success, then the node is a candidate node, put them in final node list;
  3. score on thoes candinate nodes, to choose a best one.

I very much agree with your proposal, the preemption function still has a lot to optimize, we can iterate gradually.

@@ -126,15 +126,15 @@ func (ra *Action) Execute(ssn *framework.Session) {
var statusSets util.StatusSets
statusSets, err := ssn.PredicateFn(task, n)
if err != nil {
klog.V(3).Infof("reclaim predicates failed for task <%s/%s> on node <%s>: %v",
klog.V(5).Infof("reclaim predicates failed for task <%s/%s> on node <%s>: %v",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason of change level 3 to level 5?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a large number of pending pods in the cluster perform the reclaim operation and resource reclamation fails, a large number of repeated logs are generated in each round of scheduling. I think it is more appropriate to change the log level to debug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member

@william-wang william-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 28, 2023
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 28, 2023
@volcano-sh-bot volcano-sh-bot merged commit 92df2af into volcano-sh:master Aug 28, 2023
12 checks passed
@lowang-bh
Copy link
Member

What's more, I think the current preemption action can not work well. In my opinion, it should following those steps:

  1. filter out those nodes which can not help preemption, according to the filter result stored in job's NodesFitErrors
  2. range from those nodes in step-1, consider the victims from a node and remove the victims from this node, and then re-run predications on the node; if result is success, then the node is a candidate node, put them in final node list;
  3. score on thoes candinate nodes, to choose a best one.

I have open an enhancement requirement to trace it. enhancement for preemption action #3074

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preemption conflicts with the anti-affinity
4 participants