node-resource-reservation-proposal #3775

Member comment:
Please also update the annotation volcano.sh/is-reserve and volcano.sh/runsec-max in this picture

56 changes: 56 additions & 0 deletions docs/design/node-resource-reservation-design.md
# Volcano node resource reservation
## background
* Consider the following situation: there are thousands of pods to be scheduled every day. From 1 to 2 o'clock, 500 low-priority pods are created and scheduled, consuming 99% of the cluster's resources. From 2 to 3 o'clock, 10 high-priority pods are created; however, the low-priority pods are still running, so the high-priority pods cannot be scheduled due to lack of resources.
* Users want the high-priority tasks created between 2 and 3 o'clock to have resources available for immediate scheduling every day, and they do not want high-priority tasks to preempt low-priority pods, because some low-priority pods have already been running for many days.
## design
![annotation](images/node-resource-reservation-annotation.png)
### recognize high priority pods
There are two ways to recognize high priority pods:
* set the annotation `volcano.sh/reserveable: true` on the podgroup, which means all pods under the podgroup are reserve tasks
* set the annotation `volcano.sh/reserveable: true` on the pod, which means this pod is a reserve task
### recognize pod max running time
* set the annotation `volcano.sh/maximum-runtime: 500s` on the podgroup, which means the podgroup will run for a maximum of 500 seconds
* set the annotation `volcano.sh/maximum-runtime: 500s` on the pod, which means this pod will run for a maximum of 500 seconds
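Putting the two annotations together, a podgroup whose pods are all reserve tasks with a 500-second runtime cap could be annotated as below (a sketch; the metadata fields other than the annotations are illustrative):

```
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: reserve-example
  annotations:
    volcano.sh/reserveable: "true"
    volcano.sh/maximum-runtime: 500s
```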
### reserve plugin
#### configuration
```
- plugins:
  - name: reserve
    arguments:
      reservelabels:
      - nodeSelector:
          business_type: ebook
        startHour: 3
        endHour: 4
        resources:
          cpu: "32"
          memory: 64Gi
        startReserveAgo: 2h
        podNumToReserve: 10
        cron: daily
```
In the configuration, reservelabels consists of a nodeSelector, which represents a node list, and resources, which represents the resource reservation configuration. The overall meaning is: from 3 to 4 o'clock every day, 32 CPUs and 64Gi of memory must be reserved, and reservation should start 2 hours in advance. If 10 reserve tasks are scheduled during the reservation window, reservation stops, which saves resources for non-reserved tasks after the 10 reserved tasks have been scheduled.
#### OpenSession
* make a cache nodeForbidMap, which caches forbidden nodes so that non-reserved tasks cannot be scheduled on reserved nodes. The calculation algorithm is as follows: first, sort the nodes in descending order by node idle. Node idle consists of the node's currently unused resource plus the resource that will be released before the reservation start time, derived from the pod max-running-time annotation. Second, traverse the sorted nodes and accumulate each node's allocatable resource; while the accumulated resource is still less than the resource to be reserved, add the node to nodeForbidMap. This biases the system toward reserving a few large blocks of resource rather than many small ones.
* make a cache reservedTaskPendingResource, which accumulates the resource of pending reserve tasks
* make a cache reservedTaskAllocatedResource, which accumulates the resource of allocated reserve tasks
* make a cache resourceIdle, which accumulates the future idle resource of each node
* register the plugin function PredicateFn
* register the event handlers AllocateFunc and DeallocateFunc to dynamically update the reservedTaskPendingResource, reservedTaskAllocatedResource and resourceIdle caches
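The nodeForbidMap calculation can be sketched as follows. This is a minimal sketch, not the actual plugin code: `node`, `forbiddenNodes`, and the single scalar `Idle` field are illustrative stand-ins for Volcano's node and resource types.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical, simplified view of a cluster node; Idle stands in
// for the "future idle" resource (unused now, or released before reserve start).
type node struct {
	Name string
	Idle int // simplified to one scalar, e.g. milli-CPU
}

// forbiddenNodes sorts nodes by future idle resource in descending order and
// walks the list, accumulating resource; every node visited while the running
// sum is still below the reservation target is forbidden for non-reserved tasks.
func forbiddenNodes(nodes []node, toReserve int) map[string]bool {
	sorted := append([]node(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Idle > sorted[j].Idle })

	forbid := make(map[string]bool)
	accumulated := 0
	for _, n := range sorted {
		if accumulated >= toReserve {
			break // reservation target already covered by bigger nodes
		}
		forbid[n.Name] = true
		accumulated += n.Idle
	}
	return forbid
}

func main() {
	nodes := []node{{"n1", 8}, {"n2", 32}, {"n3", 16}}
	// n2 (32) and n3 (16) together cover a target of 40, so only they are forbidden
	fmt.Println(forbiddenNodes(nodes, 40))
}
```

Because the nodes are taken largest-first, the smallest set of nodes that covers the target is forbidden, which matches the stated preference for one big reserved block over many small ones.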
#### PredicateFn
Predicate is used to restrict other pods from being scheduled on reserved nodes. Reserved nodes are filtered out of the node list and change dynamically.
* check whether the task is a reserve task; if yes, permit the task to be scheduled on this node.
Member comment:
Can you rearrange the explanation here according to the order of the flow chart? If it is a reserve task, it can be scheduled to this node directly? Doesn't it check whether the resources are sufficient?

Author reply:
If it is a reserve task, PredicateFn will not forbid the task from being scheduled on the node; if, in the allocate action, the node does not have enough resources, the task will not be scheduled on it.

* check whether the current time is within the reserved time range; if not, permit the non-reserved task to be scheduled on this node.
* check whether the number of reserve pods already scheduled exceeds the configured max pod number; if yes, permit the non-reserved task to be scheduled on this node.
* check whether the node is in the reserve node list (from the nodeForbidMap cache); if yes, deny the non-reserved task from being scheduled on this node.
Member comment:
Why don't we just call ReserveNodesMap...? nodeForbidMap seems to have to be explained from the perspective of non-reserve tasks.

* check whether the node idle resource (from the resourceIdle cache) is larger than the reserve requirement (max(reservedTaskAllocatedResource + reservedTaskPendingResource, reserveResourcesConfig)); if yes, permit the non-reserved task to be scheduled on this node.
Member comment:
I have a question: what if the total resources of podNumToReserve are larger than the resources in the config? Do we base the reservation on the total resources of podNumToReserve, or on the configured resources (cpu: "32", memory: 64Gi)?


![predicate](images/node-resource-reservation-predicate.png)
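The decision flow of the checks above can be sketched as follows. This is a minimal sketch under simplifying assumptions: `predicateInput`, its fields, and the single scalar resources are illustrative stand-ins for the real caches and resource types, not the plugin's actual API.

```go
package main

import "fmt"

// predicateInput is a hypothetical bundle of the state PredicateFn consults;
// field names mirror the caches and config described above.
type predicateInput struct {
	isReserveTask     bool
	inReserveWindow   bool
	scheduledReserved int // reserve pods already scheduled
	podNumToReserve   int // from the plugin configuration
	nodeForbidden     bool // node present in nodeForbidMap
	nodeFutureIdle    int  // from the resourceIdle cache (one scalar for brevity)
	reserveRequired   int  // max(allocated+pending reserve resource, configured reserve)
}

// allowOnNode applies the predicate checks in the documented order.
func allowOnNode(in predicateInput) bool {
	if in.isReserveTask {
		return true // reserve tasks are never forbidden by this predicate
	}
	if !in.inReserveWindow {
		return true // outside the reservation window, nothing is restricted
	}
	if in.scheduledReserved > in.podNumToReserve {
		return true // reservation goal already met, stop restricting
	}
	if in.nodeForbidden {
		return false // node is held back for reserve tasks
	}
	// permit only if the node's idle resource exceeds the reserve requirement
	return in.nodeFutureIdle > in.reserveRequired
}

func main() {
	// a non-reserve task during the reserve window on a forbidden node is denied
	fmt.Println(allowOnNode(predicateInput{inReserveWindow: true, nodeForbidden: true, podNumToReserve: 1}))
}
```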
#### JobStarvingFn
JobStarving is used in the preempt action, which is an extension of reserve, because the reserved node resources may not always be completely accurate. If the podgroup or pod has the reserve annotation set, the job is considered starving and can preempt other eligible pods.
#### PreemptableFn
PreemptableFn cooperates with JobStarvingFn to filter the victims to be preempted. In the reserve situation, the preemptor can preempt tasks that have the same node label and whose creation time is later than the preemptor's. This means preempting only tasks that should not have been scheduled ahead of it, so the occupancy rate of the cluster is not affected.
Member comment:
Why does Reserve need preemption? Are JobStarvingFn and PreemptableFn newly added in this doc?