Proposal of dynamic GPU slice plugin #3820

Open
wants to merge 1 commit into
base: master

sailorvii

NVIDIA's official GPU sharing methods are time-slicing, MPS, and MIG. Dynamic MPS and MIG are currently not supported; we want to add them as a Volcano scheduler plugin.

@volcano-sh-bot
Contributor

Welcome @sailorvii!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign lowang-bh
You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 14, 2024
@Monokaix
Member

Hi, please squash to one commit and sign off.

@sailorvii sailorvii force-pushed the master branch 2 times, most recently from e3ffd7e to 0000b26 Compare November 15, 2024 01:57
Member

@JesseStutler JesseStutler left a comment

I have reviewed it, please take a look~

docs/design/dynamic-gpu-slice.md (outdated)
Member

Where is the logic of this AddPod? Is it in the mig-agent of nos? I'm wondering whether our dynamic GPU slice plugin depends strongly on the nos project. The annotation carries the nos watermark, and the nos project is not updated frequently.

Author

  1. AddPod is in volcano/pkg/scheduler/api/node_info.go, in addResource.
  2. Three components can be reused from the nos project: the MIG agent, the MPS agent, and the MPS device plugin. They are not the most important part. If needed, we could rewrite them.

Member

/cc @Monokaix, I think we'd better rewrite them as part of Volcano so that they evolve with us.

docs/design/images/dynamicGPUSliceSlice.png (outdated)
docs/design/images/dynamicGPUSliceScore.png (outdated)
docs/design/dynamic-gpu-slice.md (outdated)
@sailorvii sailorvii force-pushed the master branch 2 times, most recently from 9c752ad to 7e85873 Compare November 20, 2024 06:22
@archlitchi
Contributor

A nice feature, but I have a few recommendations:

  1. Please add a user guide for using dynamic MIG and MPS.
  2. Please clarify whether the 'dynamicgpuslice' annotation is a pod annotation or a node annotation.

Refine as JesseStutler's comments
Address the comments by archlitchi.

Signed-off-by: sailorvii <[email protected]>
Signed-off-by: chenw66 <[email protected]>
@sailorvii
Author

@archlitchi

Thanks for your time and review.

  1. Added the usage part.
  2. They are all node annotations (the section title already says “Node labels and annotations”).

@sailorvii sailorvii closed this Nov 25, 2024
@sailorvii sailorvii reopened this Nov 25, 2024
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: dynamicgpuslice
Member

How about using the deviceshare plugin?

@Monokaix
Member

We should clarify which device plugin (dp) the user should deploy, and the relationship between dynamic MIG slicing and vGPU. The semantics of vGPU and dynamic MIG slicing are not completely consistent. Whether to use the NVIDIA dp or HAMi needs to be discussed again.

@JesseStutler
Member

Shall we discuss how to evolve this feature at the weekly meeting? Currently there seem to be three repos involved: volcano does the scheduling, hami provides the dp, and nos provides the MIG/MPS agent. It is too fragmented. @sailorvii @archlitchi @Monokaix

@sailorvii
Author

> Shall we discuss how to evolve this feature at the weekly meeting? Currently there seem to be three repos involved: volcano does the scheduling, hami provides the dp, and nos provides the MIG/MPS agent. It is too fragmented. @sailorvii @archlitchi @Monokaix

Thank you all for your time. It's good to discuss the details in the meeting.

stale bot commented Feb 1, 2025

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2025

## Introduction

NVIDIA GPUs have three built-in sharing methods: time-slicing, MPS, and MIG. Time-slice sharing wastes time on context switches, so we chose MPS and MIG. MIG profiles are variable: a user can acquire a MIG device of any profile the GPU defines, but the current implementation only sets up dedicated profiles in advance, before the user's request arrives. That limits the usefulness of MIG. We want to develop an automatic slicing plugin that creates the slice when the user requests it. We also add MPS support.
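
As a rough illustration of creating a slice on request instead of in advance, the sketch below picks the smallest MIG profile that satisfies a memory request. It is not the plugin's implementation: `MigProfile`, `pick_profile`, and the selection rule are hypothetical, the profile table is the A100 40GB set from NVIDIA's MIG documentation, and real MIG placement constraints are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MigProfile:
    """Hypothetical record for one MIG profile."""
    name: str
    compute_slices: int  # GPU compute slices consumed (7 in total on an A100)
    memory_gb: int       # memory of the resulting MIG device


# MIG profiles of an A100 40GB; other GPU models expose different sets.
A100_40GB_PROFILES: List[MigProfile] = [
    MigProfile("1g.5gb", 1, 5),
    MigProfile("2g.10gb", 2, 10),
    MigProfile("3g.20gb", 3, 20),
    MigProfile("4g.20gb", 4, 20),
    MigProfile("7g.40gb", 7, 40),
]


def pick_profile(
    requested_memory_gb: int,
    free_compute_slices: int,
    profiles: List[MigProfile] = A100_40GB_PROFILES,
) -> Optional[MigProfile]:
    """Return the smallest profile that fits the request, or None.

    Assumption: "smallest" means fewest compute slices, then least memory.
    Placement rules of real MIG hardware are not modelled here.
    """
    candidates = [
        p for p in profiles
        if p.memory_gb >= requested_memory_gb
        and p.compute_slices <= free_compute_slices
    ]
    return min(
        candidates,
        key=lambda p: (p.compute_slices, p.memory_gb),
        default=None,
    )
```

With a fully free GPU, `pick_profile(8, 7)` selects `2g.10gb`; with only two free compute slices, a 20 GB request returns `None`, which a scheduler could treat as "this GPU cannot host the pod".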
Member

Please use uppercase MIG consistently.

Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. retest-not-required-docs-only size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants