Proposal: Preempt action support topology #3995

Open
wants to merge 1 commit into base: master

bibibox
Contributor

@bibibox bibibox commented Feb 5, 2025

What type of PR is this?

/kind documentation

@volcano-sh-bot volcano-sh-bot added retest-not-required-docs-only kind/documentation Categorizes issue or PR as related to documentation. labels Feb 5, 2025
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign william-wang
You can assign the PR to them by writing /assign @william-wang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 5, 2025

When topology-sensitive resources like GPUs exist, the preemption process needs to consider resource topology relationships to ensure resource allocation after preemption still satisfies original topology constraints.

For example, suppose a node has 2 GPUs (8GB each), Pod A and Pod B each use 4GB, and Pod C needs 8GB. Direct scheduling of Pod C fails, which triggers preemption. After Pod A is removed, Pod C can be scheduled. However, when Pod A is re-added, the binpack strategy may place it differently, changing the topology. Pod C can then still be scheduled, so no pod is selected for eviction and preemption ultimately fails.
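
The scenario above can be replayed with a small toy model. This is only a sketch under stated assumptions: the `node` type, the `binpack` rule (pick the fullest GPU that still fits), and the `simulate` helper are hypothetical and are not Volcano code. It assumes Pod A starts on GPU 0, Pod B on GPU 1, and that Pod C needs its 8GB on a single GPU.

```go
package main

import "fmt"

// node is a toy model: free[i] is the free memory (GB) on GPU i.
type node struct{ free []int }

// binpack returns the index of the fullest GPU that can still hold req GB,
// or -1 if no single GPU fits. This placement rule is an assumption.
func (n *node) binpack(req int) int {
	best := -1
	for i, f := range n.free {
		if f >= req && (best == -1 || f < n.free[best]) {
			best = i
		}
	}
	return best
}

// fits reports whether some single GPU can hold req GB.
func (n *node) fits(req int) bool { return n.binpack(req) != -1 }

// simulate replays the example: it reports whether Pod C (8GB) fits before
// preemption, after Pod A is removed, and after Pod A is re-added, plus the
// GPU that Pod A lands on when re-added.
func simulate() (before, afterRemove, afterReAdd bool, reAddGPU int) {
	n := &node{free: []int{4, 4}} // Pod A on GPU 0, Pod B on GPU 1, 4GB each
	before = n.fits(8)

	n.free[0] += 4 // simulated removal of Pod A
	afterRemove = n.fits(8)

	reAddGPU = n.binpack(4) // simulated re-add of Pod A
	n.free[reAddGPU] -= 4
	afterReAdd = n.fits(8)
	return
}

func main() {
	before, afterRemove, afterReAdd, gpu := simulate()
	fmt.Println(before, afterRemove, gpu, afterReAdd)
}
```

In this model the re-added Pod A lands on GPU 1 rather than its original GPU 0, so Pod C still appears to fit and Pod A is never marked as a victim, even though on the real node Pod A still occupies GPU 0.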
Member

Does the example here describe the current behavior of Volcano preemption, or a challenge in the current optimization solution?

```go
type SimulateAddPodFn func(pod *api.TaskInfo, node *api.NodeInfo) error
```
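
As a rough illustration of how a function with this signature might be used, here is a minimal sketch. The `TaskInfo` and `NodeInfo` structs below are simplified stand-ins with hypothetical fields (`GPUMem`, `GPUFree`), not the real `api` types, and `simulateAddGPU` with its binpack rule is an assumption for illustration rather than the proposal's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// TaskInfo and NodeInfo are simplified stand-ins for the scheduler's api
// types; the fields are hypothetical.
type TaskInfo struct {
	Name   string
	GPUMem int // requested GPU memory in GB
}

type NodeInfo struct {
	Name    string
	GPUFree []int // simulated free memory (GB) per GPU
}

// SimulateAddPodFn mirrors the signature quoted from the proposal,
// using the stand-in types above.
type SimulateAddPodFn func(pod *TaskInfo, node *NodeInfo) error

var errNoFit = errors.New("no single GPU can hold the pod")

// simulateAddGPU updates the simulated per-GPU state as if pod were placed
// on node, choosing the fullest GPU that still fits (assumed binpack rule).
func simulateAddGPU(pod *TaskInfo, node *NodeInfo) error {
	best := -1
	for i, free := range node.GPUFree {
		if free >= pod.GPUMem && (best == -1 || free < node.GPUFree[best]) {
			best = i
		}
	}
	if best == -1 {
		return errNoFit
	}
	node.GPUFree[best] -= pod.GPUMem
	return nil
}

func main() {
	var add SimulateAddPodFn = simulateAddGPU
	node := &NodeInfo{Name: "node-1", GPUFree: []int{8, 4}}
	err := add(&TaskInfo{Name: "pod-a", GPUMem: 4}, node)
	fmt.Println(err, node.GPUFree)
}
```

The point of the sketch is that the callback mutates only simulated, topology-aware state, so a later feasibility check sees the placement the plugin would actually make.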

### Limitations
Member

The native Kubernetes scheduler has some limitations in preemption and makes certain trade-offs between functionality and performance. See: # limitations-of-preemption
Compared with kube-scheduler, does Volcano's affinity preemption behave the same or differently? If it differs, what exactly are the differences?

@Monokaix
Member

Monokaix commented Feb 8, 2025

There are Chinese characters in the image, and the subject of each process needs to be clearly identified.

@Monokaix
Member

Monokaix commented Feb 8, 2025

The process is hard for ordinary users to understand. We should make it clearer, perhaps by adding both the design process and an example.

@Monokaix
Member

Monokaix commented Feb 8, 2025

The three Key Functions do not appear in the process design above; we can give a more detailed description.

@Monokaix
Member

Monokaix commented Feb 8, 2025

What is the criterion for PreemptCostNodeOrder: the fewest evicted pods? We should specify one.

@Monokaix
Member

Monokaix commented Feb 8, 2025

In which cases should a plugin register the removal and addition functions? We should provide a guide.

Labels
kind/documentation Categorizes issue or PR as related to documentation. retest-not-required-docs-only size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants