
Preempt performance #3825

Open · wants to merge 5 commits into master

Conversation

molei20021

No description provided.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign k82cn
You can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 15, 2024
@molei20021
Author

[image: preempt-performance]
With these changes, preempt is about 2x faster.

@@ -326,6 +326,10 @@ func (alloc *Action) predicate(task *api.TaskInfo, node *api.NodeInfo) error {
statusSets = append(statusSets, &api.Status{Code: api.Unschedulable, Reason: api.WrapInsufficientResourceReason(resources)})
return api.NewFitErrWithStatus(task, node, statusSets...)
}
if node.Allocatable.MaxTaskNum <= len(alloc.session.NodeMap[node.Name].Pods) {
Member

Why add it again? It already exists in the predicate plugin.

Author

Because it may influence the predicate result when there are too many pods running on the node, I put it before the new predicate cache to keep the cache accurate.

Member

Sorry, I didn't get the point. You didn't remove this code from the predicate plugin, so why do we need to duplicate it here?

@lowang-bh
Member

There are some gaps between this and my idea for improving the performance. Could you add a description of your design?

@molei20021
Author

There are some gaps between this and my idea for improving the performance. Could you add a description of your design?

[image: preempt-performance design graph]
I added a design graph; the parts marked in red are the modification points.

@@ -31,6 +31,8 @@ import (

type Action struct {
enablePredicateErrorCache bool
session *framework.Session
preemptableNodeMap map[api.QueueID]map[string]int
Member

Could you add comments documenting the key and value? For example: the key is the queue that the potentially evicted task belongs to; the value is another map whose key is the node name and whose value is the number of running/bound pods on that node that may be evicted.
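
For illustration, the field from the diff above with the kind of doc comment the reviewer is asking for (the wording is a sketch, not the author's):

type Action struct {
	enablePredicateErrorCache bool
	session                   *framework.Session

	// preemptableNodeMap counts, per queue, the evictable pods on each node.
	// Outer key:   the QueueID that a potentially evicted task belongs to.
	// Inner key:   node name.
	// Inner value: number of running/bound pods on that node that may be evicted.
	preemptableNodeMap map[api.QueueID]map[string]int
}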

@@ -230,5 +233,14 @@ func (ra *Action) Execute(ssn *framework.Session) {
}
}

func (ra *Action) predicate(task *api.TaskInfo, node *api.NodeInfo) error {
var statusSets api.StatusSets
if node.Allocatable.MaxTaskNum <= len(ra.session.NodeMap[node.Name].Pods) {
Member

Why not just add this in session's PredicateForPreemptAction?
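
To illustrate the suggestion, a minimal sketch of centralizing the capacity check so actions do not each duplicate it (api.NewFitError and api.NodePodNumberExceeded exist in upstream Volcano; the function name here is hypothetical):

import "volcano.sh/volcano/pkg/scheduler/api"

// checkNodePodCapacity is the check both actions currently duplicate; the
// reviewer suggests it live once in the session's PredicateForPreemptAction.
func checkNodePodCapacity(task *api.TaskInfo, node *api.NodeInfo) error {
	// Reject nodes already at their pod capacity, so every action sees the
	// same verdict and the predicate cache stays accurate.
	if node.Allocatable.MaxTaskNum <= len(node.Tasks) {
		return api.NewFitError(task, node, api.NodePodNumberExceeded)
	}
	return nil
}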

@@ -12,6 +12,47 @@ import (
"volcano.sh/volcano/pkg/scheduler/api"
)

type PredicateCache struct {
Cache map[api.JobID]map[api.TaskID]map[string]map[int64]error
Member

Too much map nesting. Please refactor this into nested structs and update the related code accordingly.
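
As an illustration of the requested refactor, one possible shape with hypothetical type names; only the four levels of keying come from the diff above:

// nodeResults caches predicate outcomes for one task on one node,
// keyed by the revision/timestamp at which the verdict was computed.
type nodeResults struct {
	ByRevision map[int64]error
}

// taskResults caches per-node outcomes for one task, keyed by node name.
type taskResults struct {
	ByNode map[string]*nodeResults
}

// jobResults caches per-task outcomes for one job.
type jobResults struct {
	ByTask map[api.TaskID]*taskResults
}

// PredicateCache nests named structs instead of four anonymous map levels.
type PredicateCache struct {
	ByJob map[api.JobID]*jobResults
}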

if err := sc.taskUnschedulable(taskInfo, reason, msg, nominatedNodeName); err != nil {
klog.ErrorS(err, "Failed to update unschedulable task status", "task", klog.KRef(taskInfo.Namespace, taskInfo.Name),
"reason", reason, "message", msg)
ts, exist := schedulingutil.GetPodStatusLastSetCache(job.UID, taskInfo.UID)
Member

Is this part of the code meant to reduce API requests sent to the api-server? That needs to be split out as a separate commit, not mixed into the preemption performance improvement.
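
For reference, GetPodStatusLastSetCache is not shown in this diff; a hypothetical sketch of such a last-set cache might look like this (the names and key scheme are assumptions):

import (
	"sync"

	"volcano.sh/volcano/pkg/scheduler/api"
)

// podStatusLastSet remembers when each task's unschedulable condition was
// last pushed to the api-server, so repeated updates can be throttled.
var podStatusLastSet sync.Map // key: "<jobUID>/<taskUID>", value: unix seconds

func GetPodStatusLastSetCache(job api.JobID, task api.TaskID) (int64, bool) {
	v, ok := podStatusLastSet.Load(string(job) + "/" + string(task))
	if !ok {
		return 0, false
	}
	return v.(int64), true
}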

klog.ErrorS(err, "Failed to update unschedulable task status", "task", klog.KRef(taskInfo.Namespace, taskInfo.Name),
"reason", reason, "message", msg)
ts, exist := schedulingutil.GetPodStatusLastSetCache(job.UID, taskInfo.UID)
if !exist || nowTs-ts > 60 {
Member

I remember we discussed making the interval configurable or using backoff. Hard-coding 60 here is too empirical.
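
Two hedged sketches of the alternatives the reviewer mentions (field and function names are hypothetical, not existing Volcano configuration):

import "time"

// Option 1: read the throttle interval from configuration instead of
// hard-coding 60 seconds.
type statusUpdateConfig struct {
	MinUpdateInterval time.Duration // e.g. defaulted to 60s, overridable in scheduler conf
}

func (c statusUpdateConfig) shouldUpdate(lastSet, now time.Time) bool {
	return lastSet.IsZero() || now.Sub(lastSet) > c.MinUpdateInterval
}

// Option 2: exponential backoff, doubling the interval after each repeated
// unschedulable update until a cap is reached.
func nextBackoff(cur, base, max time.Duration) time.Duration {
	if cur == 0 {
		return base
	}
	if next := 2 * cur; next < max {
		return next
	}
	return max
}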

@JesseStutler
Member

There are some gaps between this and my idea for improving the performance. Could you add a description of your design?

[image: preempt-performance design graph] I added a design graph; the parts marked in red are the modification points.

I didn't get why we need to record the last condition time. In SchedulerCache's taskUnschedulable there is already code that validates whether the condition is the same as last time:

updateCond := podConditionHaveUpdate(&pod.Status, condition)
// only update pod's nominatedNodeName when nominatedNodeName is not empty
// consider this situation:
// 1. at session 1, the pod A preempt another lower priority pod B, and we updated A's nominatedNodeName
// 2. at session 2, the pod B is still terminating, so the pod A is still pipelined, but it preempt none, so
// the nominatedNodeName is empty, but we should not override the A's nominatedNodeName to empty
updateNomiNode := len(nominatedNodeName) > 0 && podNominatedNodeNameNeedUpdate(&pod.Status, nominatedNodeName)
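
For context, the check JesseStutler points to boils down to a comparison like the following simplified sketch (not the actual podConditionHaveUpdate implementation):

import v1 "k8s.io/api/core/v1"

// conditionChanged reports whether the new condition differs from the one
// already recorded on the pod, ignoring the probe/heartbeat timestamps.
func conditionChanged(status *v1.PodStatus, cond *v1.PodCondition) bool {
	for i := range status.Conditions {
		old := &status.Conditions[i]
		if old.Type != cond.Type {
			continue
		}
		// Same condition type exists: update only if something meaningful changed.
		return old.Status != cond.Status ||
			old.Reason != cond.Reason ||
			old.Message != cond.Message
	}
	// No condition of this type yet; it needs to be added.
	return true
}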
