
node-resource-reservation-proposal #3775

Open
wants to merge 2 commits into base: master

Conversation

molei20021

Volcano node resource reservation proposal

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign kevin-wangzefeng
You can assign the PR to them by writing /assign @kevin-wangzefeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 16, 2024
@JesseStutler
Member

Hi @molei20021, thanks for your contribution! /cc @Monokaix @hwdef @lowang-bh

```
reserve.define.label2: {"business_type": "computer"}
reserve.resources.label2: [{"start_hour": 7, "end_hour": 9, "cpu": 24, "memory": 96, "start_reserve_ago": "30m", "pod_num": 15, "cron": "weekly 1,2,5"}]
```
In the configuration, nodeLabel selects the list of nodes whose labels match it, and resources is a list of resource reservation configurations. For example, reserve.resources.label1 means that from hour 3 to hour 4 every day, 32 CPU and 64 memory must be reserved for label1; reservation starts 2 hours in advance and stops once 10 reserve pods have been scheduled or hour 4 has passed.
Member

@Monokaix Monokaix Oct 18, 2024


reserve, recover?

Author


Node reservation means that during the reservation time interval, if scheduling another task would leave the reserved resources below the configured amount, that task is temporarily denied scheduling until the reserve tasks have been scheduled.
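
For illustration, a reservation entry like the ones in the excerpt above could be represented by Go types along these lines. This is a minimal sketch; all names and field types are assumptions, not the proposal's actual code:

```go
package reserve

import "time"

// ReserveWindow mirrors one entry of reserve.resources.<label>: a daily
// time window plus the amount of capacity to hold back during it.
type ReserveWindow struct {
	StartHour       int           // hour of day the reservation takes effect, e.g. 3
	EndHour         int           // hour of day the reservation ends, e.g. 4
	CPU             int           // CPU cores to hold back, e.g. 32
	Memory          int           // memory (GiB) to hold back, e.g. 64
	StartReserveAgo time.Duration // how early to begin holding capacity, e.g. 2h
	PodNum          int           // stop reserving once this many reserve pods are scheduled
	Cron            string        // optional recurrence, e.g. "weekly 1,2,5"
}

// InWindow reports whether t falls inside the reservation window,
// including the early-start period before StartHour.
func (w ReserveWindow) InWindow(t time.Time) bool {
	start := time.Date(t.Year(), t.Month(), t.Day(), w.StartHour, 0, 0, 0, t.Location())
	end := time.Date(t.Year(), t.Month(), t.Day(), w.EndHour, 0, 0, 0, t.Location())
	return !t.Before(start.Add(-w.StartReserveAgo)) && t.Before(end)
}
```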

@Monokaix
Member

Queue also has the ability to reserve resources. How about using different queues to associate with different jobs, and adjusting the queues' guaranteed resources regularly?

@molei20021
Author

Queue also has the ability to reserve resources. How about using different queues to associate with different jobs, and adjusting the queues' guaranteed resources regularly?

A queue cannot reserve resources for a specified time interval; after the specified time, when no more reserve tasks will be created, it still limits the resources of the non-guaranteed queues.

@Monokaix
Member


How about adjusting the queues' guaranteed resources dynamically?

@molei20021
Author


We need queues to limit the quotas of different users. If I put big reserve tasks of different users into one queue, the quotas cannot be limited; and even if the quota of one queue is reserved, a big reserve task may still fail to schedule immediately when the cluster's resources are heavily fragmented. In the document, I use the predicate plugin to rank the most idle nodes and forbid non-reserve tasks from scheduling onto them, which helps reduce fragmentation.


I see that a queue's guarantee is used in the proportion plugin and the capacity plugin. In the proportion plugin it can influence the queue's deserved value, but if I set a guarantee on many queues at the same time, the deserved value may end up less than the guarantee. In my situation, one queue holds not only small unimportant tasks but also some big reserve tasks. Different departments should have different queues to restrict their quotas, so we need to reserve important big tasks globally, and the reserved resources should be unfragmented so that a big reserve pod can be scheduled.
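
To make the anti-fragmentation idea concrete, here is a minimal sketch of ranking the most idle nodes and taking just enough of them to cover a reservation. The types and function names are hypothetical, not the proposal's code:

```go
package reserve

import "sort"

// NodeIdle is a simplified view of a node's currently free resources.
type NodeIdle struct {
	Name      string
	CPU       int // idle CPU cores
	MemoryGiB int // idle memory in GiB
}

// pickReserveNodes sorts candidates by idle capacity (most idle first) and
// takes nodes until the reservation target is covered. PredicateFn would
// then keep non-reserve tasks off the returned set, so the reserved
// capacity stays unfragmented for big reserve pods.
func pickReserveNodes(nodes []NodeIdle, cpuTarget, memTarget int) []NodeIdle {
	sort.Slice(nodes, func(i, j int) bool {
		if nodes[i].CPU != nodes[j].CPU {
			return nodes[i].CPU > nodes[j].CPU // primarily by idle CPU
		}
		return nodes[i].MemoryGiB > nodes[j].MemoryGiB // then by idle memory
	})
	var picked []NodeIdle
	cpu, mem := 0, 0
	for _, n := range nodes {
		if cpu >= cpuTarget && mem >= memTarget {
			break // the reservation is already covered
		}
		picked = append(picked, n)
		cpu += n.CPU
		mem += n.MemoryGiB
	}
	return picked
}
```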

@Monokaix
Member


So you want to avoid fragmentation at the node level rather than the queue level?
If fragmentation is not a concern, can guarantee meet your need?

Member

@JesseStutler JesseStutler left a comment


I have reviewed it; please take a look. There may be some inaccuracies in the Chinese-to-English translation that cause some ambiguity, so you may need to translate your proposal more precisely, thanks. But I think the reserve plugin is a great idea.

#### PredicateFn
Predicate is used to restrict other pods from being scheduled on reserved nodes. Reserved nodes are filtered out of the node list and change dynamically.
* check if the task is a reserve task; if yes, permit the task to be scheduled on this node.
Member


Can you rearrange the explanation here according to the order of the flow chart? If it is a reserve task, can it be scheduled to this node directly? Doesn't it check whether the resources are sufficient?

Author


If it is a reserve task, PredicateFn will not forbid it from being scheduled to the node; then, if in the allocate action the node does not have enough resources, the task will not be scheduled to the node.

Predicate is used to restrict other pods from being scheduled on reserved nodes. Reserved nodes are filtered out of the node list and change dynamically.
* check if the task is a reserve task; if yes, permit the task to be scheduled on this node.
* check if the current time is within the reserved time range; if not, permit the task to be scheduled on this node.
* check if the number of reserve pods already scheduled is larger than the configured max pod number; if yes, permit the task to be scheduled on this node.
Member


This is not quite right. If the number exceeds the maximum number of pods configured, scheduling should not be allowed, right?

Author


The max pod number is for the situation where the user knows that from 1 to 2 o'clock 10 reserve pods will be created. If by 1:30 all 10 reserve pods have been scheduled, then from 1:30 to 2 we needn't reserve any resources, so more resources are left for non-reserve tasks.
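
A minimal sketch of that early-stop rule, assuming the scheduler tracks how many reserve pods have been scheduled in the current window (names are illustrative, not the proposal's code):

```go
package reserve

import "time"

// shouldReserve reports whether capacity still needs to be held back.
// Reservation stops when the window ends, or early once all podNum
// expected reserve pods have been scheduled, freeing the remaining
// capacity for non-reserve tasks.
func shouldReserve(now, windowEnd time.Time, scheduledReservePods, podNum int) bool {
	if !now.Before(windowEnd) {
		return false // the reservation window is over
	}
	return scheduledReservePods < podNum // early stop once all expected pods have landed
}
```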

Signed-off-by: molei20021 <[email protected]>
@@ -0,0 +1,56 @@
# Volcano node resource reservation
## background
* Consider such situation: there are thousands of pods to be scheduled evert day, in 1 to 2 o'clock 500 low priority pods are created and schedulered which used 99% of cluster resource, in 2 to 3 o'clock 10 high priority pods are created, however, low priority pods are still running, high priority pods can not be scheduled due to lack of resource.
Member


typo evert, schedulered

* check if the task is a reserve task; if yes, permit the task to be scheduled on this node.
* check if the current time is within the reserved time range; if not, permit the non-reserved task to be scheduled on this node.
* check if the number of reserve pods that have been scheduled is larger than the configured max pod number; if yes, permit the non-reserved task to be scheduled on this node.
* check if the node is in the reserve node list (from the nodeForbidMap cache); if yes, deny the non-reserved task from being scheduled on this node.
Member


Why don't we just call it ReserveNodesMap...? nodeForbidMap seems to have to be explained from the perspective of non-reserve tasks.

* check if the current time is within the reserved time range; if not, permit the non-reserved task to be scheduled on this node.
* check if the number of reserve pods that have been scheduled is larger than the configured max pod number; if yes, permit the non-reserved task to be scheduled on this node.
* check if the node is in the reserve node list (from the nodeForbidMap cache); if yes, deny the non-reserved task from being scheduled on this node.
* check if the node's idle resources (from the resourceIdle cache) are larger than the reserve requirement, max(reservedTaskAllocatedResource + reservedTaskPendingResource, reserveResourcesConfig); if yes, permit the non-reserved task to be scheduled on this node.
Member


I have a question: what if the total resources of podNumToReserve are larger than the resources in the config? Do we base the reservation on the total resources of podNumToReserve, or on the configured resources (resources: cpu: "32", memory: 64Gi)?
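
Reading the four checks above together, one possible interpretation in Go looks like this. It is a sketch under assumptions: resource vectors are reduced to CPU and memory, the idle-capacity check is treated as an exception to the deny on reserved nodes, and none of the names come from the proposal:

```go
package reserve

// Resource is a minimal CPU/memory pair for this sketch.
type Resource struct {
	CPU, MemoryGiB int
}

// covers reports whether a is at least b in every dimension.
func (a Resource) covers(b Resource) bool {
	return a.CPU >= b.CPU && a.MemoryGiB >= b.MemoryGiB
}

// maxRes is the element-wise maximum, matching
// max(reservedTaskAllocatedResource+reservedTaskPendingResource, reserveResourcesConfig).
func maxRes(a, b Resource) Resource {
	if b.CPU > a.CPU {
		a.CPU = b.CPU
	}
	if b.MemoryGiB > a.MemoryGiB {
		a.MemoryGiB = b.MemoryGiB
	}
	return a
}

// allowOnNode mirrors the listed check order for one task on one node.
func allowOnNode(isReserveTask, inWindow bool, scheduledReservePods, podNum int,
	nodeReserved bool, idle, reservedAllocated, reservedPending, configured Resource) bool {
	if isReserveTask {
		return true // reserve tasks are never filtered here; allocate still checks capacity
	}
	if !inWindow {
		return true // no reservation window is active
	}
	if scheduledReservePods >= podNum {
		return true // all expected reserve pods already scheduled: stop reserving early
	}
	need := maxRes(Resource{
		CPU:       reservedAllocated.CPU + reservedPending.CPU,
		MemoryGiB: reservedAllocated.MemoryGiB + reservedPending.MemoryGiB,
	}, configured)
	if idle.covers(need) {
		return true // the node's idle resources still exceed the reserve requirement
	}
	return !nodeReserved // otherwise deny only on nodes held in the reserve set
}
```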

Member


Please also update the annotations volcano.sh/is-reserve and volcano.sh/runsec-max in this picture.

#### JobStarvingFn
JobStarving is used in the preempt action as an extension of reservation, because reserving node resources may not always be completely accurate. If a podgroup or pod carries the reserve annotation, the job is considered starving and can preempt other eligible pods.
#### PreemptableFn
PreemptableFn cooperates with JobStarvingFn to filter the victims to be preempted. In the reserve situation, the preemptor can preempt tasks that have the same node label and whose creation time is later than the preemptor's; that is, it preempts tasks that should not have been scheduled earlier, so the occupancy rate of the cluster is not affected.
Member


Why does Reserve need preemption? Are JobStarvingFn and PreemptableFn newly added to this doc?
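
To illustrate the victim filter described above, here is a minimal sketch with hypothetical types (the real implementation would presumably operate on Volcano's scheduler task objects):

```go
package reserve

import "time"

// TaskMeta is a simplified stand-in for a scheduled task.
type TaskMeta struct {
	Name       string
	NodeLabel  string    // label group the task belongs to
	CreateTime time.Time // when the task's pod was created
}

// filterVictims keeps only tasks a reserve preemptor may evict: they share
// the preemptor's node label and were created later than it, i.e. they took
// capacity that should have been held for the earlier reserve task.
func filterVictims(preemptor TaskMeta, candidates []TaskMeta) []TaskMeta {
	var victims []TaskMeta
	for _, c := range candidates {
		if c.NodeLabel == preemptor.NodeLabel && c.CreateTime.After(preemptor.CreateTime) {
			victims = append(victims, c)
		}
	}
	return victims
}
```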

Labels
retest-not-required-docs-only size/M Denotes a PR that changes 30-99 lines, ignoring generated files.