use group-min-member annotation to initialize PodGroup
Signed-off-by: sceneryback <[email protected]>
wangyang0616 authored and sceneryback committed Feb 12, 2025
1 parent 4dea29b commit 27e95ed
Showing 17 changed files with 500 additions and 94 deletions.
20 changes: 10 additions & 10 deletions .github/stale.yml
@@ -21,30 +21,30 @@ exemptAssignees: false
staleLabel: lifecycle/stale

pull:
daysUntilClose: 60
daysUntilStale: 90
daysUntilClose: 90
daysUntilStale: 180
markComment: >
Hello 👋 Looks like there was no activity on this amazing PR for last 90 days.
Hello 👋 Looks like there was no activity on this amazing PR for last 180 days.
**Do you mind updating us on the status?** Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen a PR if you get back to this!).
If there will be no activity for 90 days, this issue will be closed (we can always reopen a PR if you get back to this!).
#unmarkComment: No need for unmark comment.
closeComment: >
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
Closing for now as there was no activity for last 90 days after marked as stale, let us know if you need this to be reopened! 🤗
issues:
daysUntilClose: 60
daysUntilStale: 90
daysUntilClose: 90
daysUntilStale: 180
markComment: >
Hello 👋 Looks like there was no activity on this issue for last 90 days.
Hello 👋 Looks like there was no activity on this issue for last 180 days.
**Do you mind updating us on the status?** Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
If there will be no activity for 90 days, this issue will be closed (we can always reopen an issue if we need!).
#unmarkComment: No need for unmark comment.
closeComment: >
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
Closing for now as there was no activity for last 90 days after marked as stale, let us know if you need this to be reopened! 🤗
# Limit the number of actions per hour, from 1-30. Default is 30
limitPerRun: 30
1 change: 1 addition & 0 deletions OWNERS
@@ -19,6 +19,7 @@ reviewers:
- Monokaix
- lowang-bh
- archlitchi
- JesseStutler
approvers:
- k82cn
- kevin-wangzefeng
173 changes: 173 additions & 0 deletions docs/design/dynamic-mig.md
@@ -0,0 +1,173 @@
# NVIDIA GPU MPS and MIG dynamic slice plugin

## Special Thanks

This feature would not have been implemented without the help of @sailorvii.

## Introduction

The NVIDIA GPU built-in sharing methods include time-slicing, MPS, and MIG. The context switch overhead of time-slice sharing wastes time, so we chose MPS and MIG. The GPU MIG profile is variable: a user could acquire a MIG device by profile definition, but the current implementation only defines dedicated profiles before any user requirement arrives, which limits the usage of MIG. We want to develop an automatic slicing plugin that creates the slice when the user requires it.
For the scheduling method, node-level binpack and spread will be supported. Following the binpack plugin, we consider CPU, memory, GPU memory, and other user-defined resources.
Volcano has had a [vgpu feature](https://github.com/Project-HAMi/volcano-vgpu-device-plugin) for NVIDIA devices since v1.9, implemented with [hami-core](https://github.com/Project-HAMi/HAMi-core), a CUDA-hooking library that can dynamically share GPUs while ensuring both quality of service and resource isolation. But since MIG is also widely used across the world, supporting MIG mode alongside hami-core in volcano-vgpu would be helpful. A unified API over dynamic MIG and hami-core for volcano-vgpu is needed.

## Targets

- CPU, memory, and GPU combined scheduling
- GPU dynamic slicing: hami-core and MIG
- Support node-level binpack and spread by GPU memory, CPU, and memory
- A unified vGPU pool across different virtualization techniques
- Tasks can choose to use MIG, HAMi-core, or both

### Config maps
- volcano-device-configMap
This ConfigMap defines the plugin configuration, including resource names, MIG geometries, and node-level configuration.

```yaml
apiVersion: v1
data:
  volcano-device-share.conf: |
    nvidia:
      resourceCountName: volcano.sh/vgpu-number
      resourceMemoryName: volcano.sh/vgpu-memory
      resourceCoreName: volcano.sh/vgpu-cores
      knownMigGeometries:
      - models: [ "A30" ]
        allowedGeometries:
        - group: group1
          geometries:
          - name: 1g.6gb
            memory: 6144
            count: 4
        - group: group2
          geometries:
          - name: 2g.12gb
            memory: 12288
            count: 2
        - group: group3
          geometries:
          - name: 4g.24gb
            memory: 24576
            count: 1
      - models: [ "A100-SXM4-40GB", "A100-40GB-PCIe", "A100-PCIE-40GB" ]
        allowedGeometries:
        - group: group1
          geometries:
          - name: 1g.5gb
            memory: 5120
            count: 7
        - group: group2
          geometries:
          - name: 2g.10gb
            memory: 10240
            count: 3
          - name: 1g.5gb
            memory: 5120
            count: 1
        - group: group3
          geometries:
          - name: 3g.20gb
            memory: 20480
            count: 2
        - group: group4
          geometries:
          - name: 7g.40gb
            memory: 40960
            count: 1
      - models: [ "A100-SXM4-80GB", "A100-80GB-PCIe", "A100-PCIE-80GB" ]
        allowedGeometries:
        - group: group1
          geometries:
          - name: 1g.10gb
            memory: 10240
            count: 7
        - group: group2
          geometries:
          - name: 2g.20gb
            memory: 20480
            count: 3
          - name: 1g.10gb
            memory: 10240
            count: 1
        - group: group3
          geometries:
          - name: 3g.40gb
            memory: 40960
            count: 2
        - group: group4
          geometries:
          - name: 7g.79gb
            memory: 80896
            count: 1
```
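
To make the shape of this config concrete, below is a minimal sketch of Go types that could unmarshal the `nvidia` section of `volcano-device-share.conf`. The type and field names here are assumptions for illustration only, not the plugin's actual API.

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml" // converts YAML to JSON, then unmarshals via json tags
)

// MigGeometry is one MIG profile entry inside a geometry group.
type MigGeometry struct {
	Name   string `json:"name"`   // MIG profile, e.g. "2g.10gb"
	Memory int64  `json:"memory"` // memory per instance, in MiB
	Count  int    `json:"count"`  // instances of this profile in the group
}

// GeometryGroup is one allowed way to partition a GPU model.
type GeometryGroup struct {
	Group      string        `json:"group"`
	Geometries []MigGeometry `json:"geometries"`
}

// MigTemplate maps GPU models to their allowed geometry groups.
type MigTemplate struct {
	Models            []string        `json:"models"`
	AllowedGeometries []GeometryGroup `json:"allowedGeometries"`
}

// NvidiaConfig mirrors the nvidia section of volcano-device-share.conf.
type NvidiaConfig struct {
	ResourceCountName  string        `json:"resourceCountName"`
	ResourceMemoryName string        `json:"resourceMemoryName"`
	ResourceCoreName   string        `json:"resourceCoreName"`
	KnownMigGeometries []MigTemplate `json:"knownMigGeometries"`
}

// DeviceShareConfig is the top-level document.
type DeviceShareConfig struct {
	Nvidia NvidiaConfig `json:"nvidia"`
}

func main() {
	raw := []byte(`
nvidia:
  resourceCountName: volcano.sh/vgpu-number
  resourceMemoryName: volcano.sh/vgpu-memory
  resourceCoreName: volcano.sh/vgpu-cores
  knownMigGeometries:
  - models: [ "A30" ]
    allowedGeometries:
    - group: group1
      geometries:
      - name: 1g.6gb
        memory: 6144
        count: 4
`)
	var cfg DeviceShareConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.Nvidia.KnownMigGeometries[0].AllowedGeometries[0].Group) // group1
}
```
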
## Structure

<img src="./images/volcano-dynamic-mig-structure.png" width = "400" />

## Examples

Dynamic MIG is compatible with volcano-vgpu tasks, as the example below shows: just set `volcano.sh/vgpu-number` and `volcano.sh/vgpu-memory`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  containers:
    - name: ubuntu-container1
      image: ubuntu:20.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 2 # requesting 2 vGPUs
          volcano.sh/vgpu-memory: 8000 # each vGPU contains 8000M device memory (optional, integer)
```

A task can choose to use only `mig` or only `hami-core` by setting the `volcano.sh/vgpu-mode` annotation to the corresponding value, as the example below shows:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    volcano.sh/vgpu-mode: "mig"
spec:
  containers:
    - name: ubuntu-container1
      image: ubuntu:20.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 2 # requesting 2 vGPUs
          volcano.sh/vgpu-memory: 8000 # each vGPU contains 8000M device memory (optional, integer)
```

## Procedures

The procedure for a vGPU task that uses dynamic MIG is shown below:

<img src="./images/volcano-dynamic-mig-procedure.png" width = "800" />

Note that after a task is submitted, the deviceshare plugin iterates over the templates defined in the `volcano-device-share` ConfigMap and picks the first available template that fits. You can always change the contents of that ConfigMap and restart vc-scheduler to customize the templates.
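
As a rough illustration of that first-fit walk — the helper below is hypothetical, reusing the config types sketched earlier, and is not the plugin's actual code — a geometry group fits if it can supply the requested number of instances whose memory covers the per-vGPU request:

```go
// firstFit returns the first geometry group that can supply num MIG
// instances with at least memMiB memory each, or false if none fits.
// Sketch only: the real deviceshare plugin also subtracts instances
// that are already allocated on the GPU.
func firstFit(groups []GeometryGroup, num int, memMiB int64) (GeometryGroup, bool) {
	for _, g := range groups {
		available := 0
		for _, geo := range g.Geometries {
			if geo.Memory >= memMiB { // instance large enough for one vGPU request
				available += geo.Count
			}
		}
		if available >= num { // enough qualifying instances in this group
			return g, true
		}
	}
	return GeometryGroup{}, false
}
```

With the A100-40GB templates above and a request for two 8000M vGPUs, group1 is skipped (its 1g.5gb instances hold only 5120M) and group2 is the first fit, which matches the walkthrough that follows.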

If you submit the example above (a pod requesting 2 * 8G vGPUs) to a cluster that has an empty A100-PCIE-40GB node, scheduling follows the procedure below:

<img src="./images/dynamic-mig-example.png" width = "400" />

The walkthrough is shown with bold lines.

As the figure shows, after the procedure the scheduler adopts geometry `group2` for that GPU, with the definition below:

```yaml
group2:
  2g.10gb: 3
  1g.5gb: 1
```

There are four MIG instances in total: vc-scheduler returns two `2g.10gb` instances to the task and adds the remaining instances (one `2g.10gb` and one `1g.5gb`) to the pool of available empty MIG instances for future use.
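
A small sketch of that accounting step (again a hypothetical helper over the same assumed types, not the scheduler's actual code): subtract the instances handed to the task from the chosen group and keep the rest as spare capacity.

```go
// allocate hands num instances of profile from the chosen group to the
// task and returns the leftover instances that remain available for
// future tasks. Sketch only; the real scheduler records this in its
// per-node device state.
func allocate(chosen GeometryGroup, profile string, num int) map[string]int {
	spare := map[string]int{}
	for _, geo := range chosen.Geometries {
		n := geo.Count
		if geo.Name == profile {
			n -= num // these instances go to the task
		}
		if n > 0 {
			spare[geo.Name] = n
		}
	}
	return spare
}
```

For the walkthrough above, `allocate(group2, "2g.10gb", 2)` leaves `map[2g.10gb:1 1g.5gb:1]` as spare capacity.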

Finally, the container is started with two `2g.10gb` instances.

Binary file added docs/design/images/dynamic-mig-example.png
2 changes: 1 addition & 1 deletion go.mod
@@ -46,7 +46,7 @@ require (
sigs.k8s.io/controller-runtime v0.13.0
sigs.k8s.io/yaml v1.4.0
stathat.com/c/consistent v1.0.0
volcano.sh/apis v1.10.0-alpha.0.0.20241210014034-bf27f4e986d0
volcano.sh/apis v1.11.1-0.20250211082520-7f8222e881d9
)

require (
4 changes: 2 additions & 2 deletions go.sum
@@ -510,5 +510,5 @@ sigs.k8s.io/yaml v1.4.0 h1:Mk1wCc2gy/F0THH0TAp1QYyJNzRm2KCLy3o5ASXVI5E=
sigs.k8s.io/yaml v1.4.0/go.mod h1:Ejl7/uTz7PSA4eKMyQCUTnhZYNmLIl+5c2lQPGR2BPY=
stathat.com/c/consistent v1.0.0 h1:ezyc51EGcRPJUxfHGSgJjWzJdj3NiMU9pNfLNGiXV0c=
stathat.com/c/consistent v1.0.0/go.mod h1:QkzMWzcbB+yQBL2AttO6sgsQS/JSTapcDISJalmCDS0=
volcano.sh/apis v1.10.0-alpha.0.0.20241210014034-bf27f4e986d0 h1:qcQNg8mEsXU+7YYX6hff9JT+jDj2RJB4aEGwOoWwjBY=
volcano.sh/apis v1.10.0-alpha.0.0.20241210014034-bf27f4e986d0/go.mod h1:FOdmG++9+8lgENJ9XXDh+O3Jcb9YVRnlMSpgIh3NSVI=
volcano.sh/apis v1.11.1-0.20250211082520-7f8222e881d9 h1:FaXN5C42er0oqvmyviJ6QSQcs1uTUJ8/Txz0AI4QkAI=
volcano.sh/apis v1.11.1-0.20250211082520-7f8222e881d9/go.mod h1:FOdmG++9+8lgENJ9XXDh+O3Jcb9YVRnlMSpgIh3NSVI=
1 change: 1 addition & 0 deletions pkg/controllers/OWNERS
@@ -2,6 +2,7 @@ reviewers:
- hzxuzhonghu
- TommyLike
- hwdef
- JesseStutler
approvers:
- hzxuzhonghu
- TommyLike
20 changes: 5 additions & 15 deletions pkg/controllers/job/job_controller_util.go
@@ -327,10 +327,10 @@ func (p TasksPriority) CalcFirstCountResources(count int32) v1.ResourceList {

for _, task := range p {
if count <= task.Replicas {
minReq = quotav1.Add(minReq, calTaskRequests(&v1.Pod{Spec: task.Template.Spec}, count))
minReq = quotav1.Add(minReq, util.CalTaskRequests(&v1.Pod{Spec: task.Template.Spec}, count))
break
} else {
minReq = quotav1.Add(minReq, calTaskRequests(&v1.Pod{Spec: task.Template.Spec}, task.Replicas))
minReq = quotav1.Add(minReq, util.CalTaskRequests(&v1.Pod{Spec: task.Template.Spec}, task.Replicas))
count -= task.Replicas
}
}
@@ -353,7 +353,7 @@ func (p TasksPriority) CalcPGMinResources(jobMinAvailable int32) v1.ResourceList
if left := jobMinAvailable - podCnt; left < validReplics {
validReplics = left
}
minReq = quotav1.Add(minReq, calTaskRequests(&v1.Pod{Spec: task.Template.Spec}, validReplics))
minReq = quotav1.Add(minReq, util.CalTaskRequests(&v1.Pod{Spec: task.Template.Spec}, validReplics))
podCnt += validReplics
if podCnt >= jobMinAvailable {
break
@@ -377,10 +377,10 @@ }
}

if leftCnt >= left {
minReq = quotav1.Add(minReq, calTaskRequests(&v1.Pod{Spec: task.Template.Spec}, left))
minReq = quotav1.Add(minReq, util.CalTaskRequests(&v1.Pod{Spec: task.Template.Spec}, left))
leftCnt -= left
} else {
minReq = quotav1.Add(minReq, calTaskRequests(&v1.Pod{Spec: task.Template.Spec}, leftCnt))
minReq = quotav1.Add(minReq, util.CalTaskRequests(&v1.Pod{Spec: task.Template.Spec}, leftCnt))
leftCnt = 0
}
if leftCnt <= 0 {
@@ -390,16 +390,6 @@ func (p TasksPriority) CalcPGMinResources(jobMinAvailable int32) v1.ResourceList
return minReq
}

// calTaskRequests returns requests resource with validReplica replicas
func calTaskRequests(pod *v1.Pod, validReplica int32) v1.ResourceList {
minReq := v1.ResourceList{}
usage := *util.GetPodQuotaUsage(pod)
for i := int32(0); i < validReplica; i++ {
minReq = quotav1.Add(minReq, usage)
}
return minReq
}

// isInternalEvent checks if the event is an internal event
func isInternalEvent(event v1alpha1.Event) bool {
switch event {
39 changes: 36 additions & 3 deletions pkg/controllers/podgroup/pg_controller.go
@@ -23,8 +23,11 @@ import (
utilfeature "k8s.io/apiserver/pkg/util/feature"
"k8s.io/client-go/informers"
appinformers "k8s.io/client-go/informers/apps/v1"
batchinformers "k8s.io/client-go/informers/batch/v1"
coreinformers "k8s.io/client-go/informers/core/v1"
"k8s.io/client-go/kubernetes"
appslisters "k8s.io/client-go/listers/apps/v1"
batchlisters "k8s.io/client-go/listers/batch/v1"
corelisters "k8s.io/client-go/listers/core/v1"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/util/workqueue"
@@ -51,6 +54,9 @@ type pgcontroller struct {
podInformer coreinformers.PodInformer
pgInformer schedulinginformer.PodGroupInformer
rsInformer appinformers.ReplicaSetInformer
dsInformer appinformers.DaemonSetInformer
ssInformer appinformers.StatefulSetInformer
jobInformer batchinformers.JobInformer

informerFactory informers.SharedInformerFactory
vcInformerFactory vcinformer.SharedInformerFactory
@@ -63,9 +69,22 @@ type pgcontroller struct {
pgLister schedulinglister.PodGroupLister
pgSynced func() bool

// A store of replicaset
// A store of replicasets
rsLister appslisters.ReplicaSetLister
rsSynced func() bool

// A store of daemonsets
dsLister appslisters.DaemonSetLister
dsSynced func() bool

// A store of statefulsets
ssLister appslisters.StatefulSetLister
ssSynced func() bool

// A store of jobs
jobLister batchlisters.JobLister
jobSynced func() bool

queue workqueue.TypedRateLimitingInterface[podRequest]

schedulerNames []string
@@ -99,15 +118,29 @@ func (pg *pgcontroller) Initialize(opt *framework.ControllerOption) error {
AddFunc: pg.addPod,
})

pg.rsInformer = pg.informerFactory.Apps().V1().ReplicaSets()
pg.rsLister = pg.rsInformer.Lister()
pg.rsSynced = pg.rsInformer.Informer().HasSynced

pg.dsInformer = pg.informerFactory.Apps().V1().DaemonSets()
pg.dsLister = pg.dsInformer.Lister()
pg.dsSynced = pg.dsInformer.Informer().HasSynced

pg.ssInformer = pg.informerFactory.Apps().V1().StatefulSets()
pg.ssLister = pg.ssInformer.Lister()
pg.ssSynced = pg.ssInformer.Informer().HasSynced

pg.jobInformer = pg.informerFactory.Batch().V1().Jobs()
pg.jobLister = pg.jobInformer.Lister()
pg.jobSynced = pg.jobInformer.Informer().HasSynced

factory := opt.VCSharedInformerFactory
pg.vcInformerFactory = factory
pg.pgInformer = factory.Scheduling().V1beta1().PodGroups()
pg.pgLister = pg.pgInformer.Lister()
pg.pgSynced = pg.pgInformer.Informer().HasSynced

if utilfeature.DefaultFeatureGate.Enabled(features.WorkLoadSupport) {
pg.rsInformer = pg.informerFactory.Apps().V1().ReplicaSets()
pg.rsSynced = pg.rsInformer.Informer().HasSynced
pg.rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: pg.addReplicaSet,
UpdateFunc: pg.updateReplicaSet,