scheduler-simulate-proposal #3822

Open
wants to merge 2 commits into
base: master

Conversation

molei20021

No description provided.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign william-wang
You can assign the PR to them by writing /assign @william-wang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 15, 2024
Signed-off-by: molei20021 <[email protected]>
@@ -0,0 +1,32 @@
# Volcano scheduler simulate
## background
* Consider the following situation: a user changes a parameter of the nodeorder plugin and needs to know its effect on the production environment, for example, whether the average wait time of big tasks becomes shorter after changing mostrequested.weight.
Member

@JesseStutler JesseStutler Nov 19, 2024

Besides feature validation, we also need scenarios for node simulation. For example, we don't have GPU or NPU nodes, so if there are bugs in GPU or NPU scheduling, we can use simulated scheduling to debug them (we do meet this scenario in production). We also need to test the performance of the scheduler in large-scale clusters.

Author

We can add more GPU-related node features to nodes.csv, such as gpu_allocatable.
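
As a rough illustration of that idea, here is a minimal sketch of reading a nodes.csv that carries a GPU column. Only the file name nodes.csv and the column gpu_allocatable come from this thread; the other column names, the sample rows, and the helper `load_nodes` are hypothetical.

```python
import csv
import io

# Hypothetical nodes.csv content. Only the gpu_allocatable column is named
# in the discussion; every other column and value is an assumption.
NODES_CSV = """name,cpu_allocatable,memory_allocatable_gib,gpu_allocatable
node-1,64,256,8
node-2,32,128,0
"""


def load_nodes(text):
    """Parse the CSV text into {node name: allocatable resources}."""
    nodes = {}
    for row in csv.DictReader(io.StringIO(text)):
        nodes[row["name"]] = {
            "cpu": int(row["cpu_allocatable"]),
            "memory_gib": int(row["memory_allocatable_gib"]),
            # Treat a missing or empty GPU column as "no GPUs".
            "gpu": int(row.get("gpu_allocatable") or 0),
        }
    return nodes


nodes = load_nodes(NODES_CSV)
# Nodes the simulator could offer to GPU workloads.
gpu_nodes = sorted(name for name, res in nodes.items() if res["gpu"] > 0)
```

Defaulting the GPU count to zero keeps older node files without the extra column loadable.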

### time simulator
* The time simulator shortens simulation time because it does not read real-world time. Instead, it always advances to the next timestamp, which is the minimum of the next pod's create time and the next pod's finish time.
* Time-related parameters such as pod create time, pod finish time, and current time should be obtained from the time simulator.
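
The clock described above could be sketched roughly as follows. This is a minimal illustration, not the proposal's implementation: the names `SimClock`, `Event`, `schedule`, and `advance` are hypothetical, and pod events are assumed to be known up front as timestamps in seconds.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    timestamp: float
    kind: str = field(compare=False)  # "create" or "finish"
    pod: str = field(compare=False)


class SimClock:
    """Virtual clock that jumps straight to the next pod event."""

    def __init__(self, start=0.0):
        self._now = start
        self._events = []  # min-heap ordered by timestamp

    def now(self):
        # Components call this instead of reading wall-clock time.
        return self._now

    def schedule(self, timestamp, kind, pod):
        heapq.heappush(self._events, Event(timestamp, kind, pod))

    def advance(self):
        # Pop the earliest pending event, i.e. the minimum of the next
        # create time and the next finish time, and move the clock to it.
        # Returns None when no events remain.
        if not self._events:
            return None
        event = heapq.heappop(self._events)
        self._now = max(self._now, event.timestamp)
        return event


clock = SimClock()
clock.schedule(0.0, "create", "pod-a")
clock.schedule(5.0, "create", "pod-b")
clock.schedule(72000.0, "finish", "pod-a")  # pod-a runs for 20 hours

trace = []
while (event := clock.advance()) is not None:
    trace.append((clock.now(), event.kind, event.pod))
```

Because the clock only ever jumps between events, a pod that runs for 20 simulated hours costs a single heap pop rather than 20 hours of waiting.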
### kube-apiserver simulator
Member

@JesseStutler JesseStutler Nov 19, 2024

What about integrating with kwok? kwok can simulate thousands of nodes without consuming many resources. If we use kwok, I wonder whether we still need to simulate kube-apiserver and kube-controller-manager at all. I worry that simulating kube-apiserver and kcm yourself would be a lot of work.

Member

@JesseStutler JesseStutler Nov 19, 2024

The kube-scheduler-simulator project also uses kwok to simulate scheduling and do performance testing. Can we do it the same way? https://github.com/kubernetes-sigs/kube-scheduler-simulator

Author

The problem is how to speed up simulated time. For example, if a pod runs for 20 hours, do we also have to let it run for 20 hours in kwok?

Member

@JesseStutler JesseStutler Nov 20, 2024

In fact, kwok doesn't have a kubelet, so you can directly set when the pod ends and at what stage. See: https://kwok.sigs.k8s.io/docs/user/stages-configuration/
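
To make the 20-hour example concrete, here is a hedged sketch that computes a compressed completion delay and wraps it in a Stage-shaped manifest. The helper `pod_complete_stage` and the `speedup` factor are hypothetical, and the Stage field names are an assumption based on the stages documentation linked above; verify them against that page before use.

```python
import json


def pod_complete_stage(real_seconds, speedup):
    """Build a Stage-like manifest that completes a running pod early.

    real_seconds: how long the pod ran in the real cluster.
    speedup: hypothetical compression factor for simulated time.
    """
    delay_ms = int(real_seconds * 1000 / speedup)
    return {
        # Field names below are assumed from the kwok stages documentation.
        "apiVersion": "kwok.x-k8s.io/v1alpha1",
        "kind": "Stage",
        "metadata": {"name": "pod-complete"},
        "spec": {
            "resourceRef": {"apiGroup": "v1", "kind": "Pod"},
            "selector": {
                "matchExpressions": [
                    {"key": ".status.phase", "operator": "In", "values": ["Running"]},
                ]
            },
            "delay": {"durationMilliseconds": delay_ms},
            "next": {"statusTemplate": "phase: Succeeded\n"},
        },
    }


# A pod that ran for 20 hours, replayed 3600 times faster.
stage = pod_complete_stage(real_seconds=20 * 3600, speedup=3600)
print(json.dumps(stage, indent=2))
```

With these numbers the 20-hour pod would complete after 20 seconds of wall-clock time in the simulated cluster.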

Member

I created a PR, #3830, that adds two scripts: one installs kwok and the other creates fake nodes. Building on kwok is perhaps a better way to do the simulation; you can add your own stage configuration to simulate time or other useful behavior. I think this is better and saves us work, since kwok has been adopted by many schedulers as a tool for simulating scheduling.

Labels
retest-not-required-docs-only size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
3 participants