
Preempt performance #3825

Open · wants to merge 5 commits into master

Conversation

molei20021

No description provided.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign k82cn
You can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 15, 2024
@molei20021
Author

[image: preempt-performance]
With these changes, preempt is about 2x faster.

@@ -326,6 +326,10 @@ func (alloc *Action) predicate(task *api.TaskInfo, node *api.NodeInfo) error {
statusSets = append(statusSets, &api.Status{Code: api.Unschedulable, Reason: api.WrapInsufficientResourceReason(resources)})
return api.NewFitErrWithStatus(task, node, statusSets...)
}
if node.Allocatable.MaxTaskNum <= len(alloc.session.NodeMap[node.Name].Pods) {
Member

Why add it again? It already exists in the predicate plugin.

Author

Because it may influence the predicate result when there are too many pods running on the node, I put it before the new predicate cache to keep the cache accurate.

Member

Sorry, I didn't get the point. You didn't remove this code from the predicate plugin, so why do we need to duplicate it here?

@lowang-bh
Member

There are some gaps between this and my idea for improving the performance. Could you add a description of your design?

@molei20021
Author

There are some gaps between this and my idea for improving the performance. Could you add a description of your design?

[image: preempt-performance design graph]
I added a design graph; the parts marked in red are the modification points.

@@ -31,6 +31,8 @@ import (

type Action struct {
enablePredicateErrorCache bool
session *framework.Session
preemptableNodeMap map[api.QueueID]map[string]int
Member

Could you add comments documenting the key and value? For example: the key is the queue that the potentially evicted task belongs to; the value is another map whose key is the node name and whose value is the number of running/bound pods on that node that may be evicted.
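
For illustration, the field from the diff above with the kind of doc comment the reviewer is asking for (the wording is a sketch, not the author's):

type Action struct {
	enablePredicateErrorCache bool
	session                   *framework.Session

	// preemptableNodeMap counts, per queue, the evictable pods on each node.
	// Outer key:   the QueueID that a potentially evicted task belongs to.
	// Inner key:   node name.
	// Inner value: number of running/bound pods on that node that may be evicted.
	preemptableNodeMap map[api.QueueID]map[string]int
}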

@@ -230,5 +233,14 @@ func (ra *Action) Execute(ssn *framework.Session) {
}
}

func (ra *Action) predicate(task *api.TaskInfo, node *api.NodeInfo) error {
var statusSets api.StatusSets
if node.Allocatable.MaxTaskNum <= len(ra.session.NodeMap[node.Name].Pods) {
Member

Why not just add this in session's PredicateForPreemptAction?
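
To illustrate the suggestion, a minimal sketch of centralizing the capacity check so actions do not each duplicate it (api.NewFitError and api.NodePodNumberExceeded exist in upstream Volcano; the function name here is hypothetical):

import "volcano.sh/volcano/pkg/scheduler/api"

// checkNodePodCapacity is the check both actions currently duplicate; the
// reviewer suggests it live once in the session's PredicateForPreemptAction.
func checkNodePodCapacity(task *api.TaskInfo, node *api.NodeInfo) error {
	// Reject nodes already at their pod capacity, so every action sees the
	// same verdict and the predicate cache stays accurate.
	if node.Allocatable.MaxTaskNum <= len(node.Tasks) {
		return api.NewFitError(task, node, api.NodePodNumberExceeded)
	}
	return nil
}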

@@ -12,6 +12,47 @@ import (
"volcano.sh/volcano/pkg/scheduler/api"
)

type PredicateCache struct {
Cache map[api.JobID]map[api.TaskID]map[string]map[int64]error
Member

Too much map nesting. Please refactor this into nested structs and update the related code accordingly.
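
As an illustration of the requested refactor, one possible shape with hypothetical type names; only the four levels of keying come from the diff above:

// nodeResults caches predicate outcomes for one task on one node,
// keyed by the revision/timestamp at which the verdict was computed.
type nodeResults struct {
	ByRevision map[int64]error
}

// taskResults caches per-node outcomes for one task, keyed by node name.
type taskResults struct {
	ByNode map[string]*nodeResults
}

// jobResults caches per-task outcomes for one job.
type jobResults struct {
	ByTask map[api.TaskID]*taskResults
}

// PredicateCache nests named structs instead of four anonymous map levels.
type PredicateCache struct {
	ByJob map[api.JobID]*jobResults
}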

if err := sc.taskUnschedulable(taskInfo, reason, msg, nominatedNodeName); err != nil {
klog.ErrorS(err, "Failed to update unschedulable task status", "task", klog.KRef(taskInfo.Namespace, taskInfo.Name),
"reason", reason, "message", msg)
ts, exist := schedulingutil.GetPodStatusLastSetCache(job.UID, taskInfo.UID)
Member

Is this part of the code meant to reduce API requests sent to the api-server? That needs to be split out as a separate commit, not mixed into the preemption performance improvement.
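
For reference, GetPodStatusLastSetCache is not shown in this diff; a hypothetical sketch of such a last-set cache might look like this (the names and key scheme are assumptions):

import (
	"sync"

	"volcano.sh/volcano/pkg/scheduler/api"
)

// podStatusLastSet remembers when each task's unschedulable condition was
// last pushed to the api-server, so repeated updates can be throttled.
var podStatusLastSet sync.Map // key: "<jobUID>/<taskUID>", value: unix seconds

func GetPodStatusLastSetCache(job api.JobID, task api.TaskID) (int64, bool) {
	v, ok := podStatusLastSet.Load(string(job) + "/" + string(task))
	if !ok {
		return 0, false
	}
	return v.(int64), true
}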

klog.ErrorS(err, "Failed to update unschedulable task status", "task", klog.KRef(taskInfo.Namespace, taskInfo.Name),
"reason", reason, "message", msg)
ts, exist := schedulingutil.GetPodStatusLastSetCache(job.UID, taskInfo.UID)
if !exist || nowTs-ts > 60 {
Member

I remember we discussed making the interval configurable or using backoff. Hard-coding 60 here is too empirical.
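
Two hedged sketches of the alternatives the reviewer mentions (field and function names are hypothetical, not existing Volcano configuration):

import "time"

// Option 1: read the throttle interval from configuration instead of
// hard-coding 60 seconds.
type statusUpdateConfig struct {
	MinUpdateInterval time.Duration // e.g. defaulted to 60s, overridable in scheduler conf
}

func (c statusUpdateConfig) shouldUpdate(lastSet, now time.Time) bool {
	return lastSet.IsZero() || now.Sub(lastSet) > c.MinUpdateInterval
}

// Option 2: exponential backoff, doubling the interval after each repeated
// unschedulable update until a cap is reached.
func nextBackoff(cur, base, max time.Duration) time.Duration {
	if cur == 0 {
		return base
	}
	if next := 2 * cur; next < max {
		return next
	}
	return max
}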

@JesseStutler
Member

There are some gaps between this and my idea for improving the performance. Could you add a description of your design?

[image: preempt-performance design graph] I added a design graph; the parts marked in red are the modification points.

I didn't get why we need to record the last condition time. In SchedulerCache's taskUnschedulable there is already code that validates whether the condition is the same as last time:

updateCond := podConditionHaveUpdate(&pod.Status, condition)
// only update pod's nominatedNodeName when nominatedNodeName is not empty
// consider this situation:
// 1. at session 1, the pod A preempt another lower priority pod B, and we updated A's nominatedNodeName
// 2. at session 2, the pod B is still terminating, so the pod A is still pipelined, but it preempt none, so
// the nominatedNodeName is empty, but we should not override the A's nominatedNodeName to empty
updateNomiNode := len(nominatedNodeName) > 0 && podNominatedNodeNameNeedUpdate(&pod.Status, nominatedNodeName)
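
For context, the check JesseStutler points to boils down to a comparison like the following simplified sketch (not the actual podConditionHaveUpdate implementation):

import v1 "k8s.io/api/core/v1"

// conditionChanged reports whether the new condition differs from the one
// already recorded on the pod, ignoring the probe/heartbeat timestamps.
func conditionChanged(status *v1.PodStatus, cond *v1.PodCondition) bool {
	for i := range status.Conditions {
		old := &status.Conditions[i]
		if old.Type != cond.Type {
			continue
		}
		// Same condition type exists: update only if something meaningful changed.
		return old.Status != cond.Status ||
			old.Reason != cond.Reason ||
			old.Message != cond.Message
	}
	// No condition of this type yet; it needs to be added.
	return true
}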
