Commit

Minus refine
chenw66 committed Nov 14, 2024
1 parent bfde87c commit 596744d
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions docs/design/dynamic-gpu-slice.md
@@ -1,8 +1,8 @@
-# NVIDIA GPU MPS and MIG automatic slice plugin
+# NVIDIA GPU MPS and MIG dynamic slice plugin

## Introduction

-The NVIDIA GPU build-in sharing method includes: time-slice, MPS and MIG. The context switch for time slice sharing would waste some time, so we chose the MPS and MIG. The GPU MIG profile is variable, the user could acquire the mig device in the profile definition, but current implementation only defines the dedicated profile before the user requirement. That limits the usage of MIG. We want to develop an automatic slice plugin and create the slice when the user require it. And we also add the MPS support.
+The NVIDIA GPU build-in sharing method includes: time-slice, MPS and MIG. The context switch for time slice sharing would waste some time, so we chose the MPS and MIG. The GPU MIG profile is variable, the user could acquire the MIG device in the profile definition, but current implementation only defines the dedicated profile before the user requirement. That limits the usage of MIG. We want to develop an automatic slice plugin and create the slice when the user require it. And we also add the MPS support.
For the scheduling method, node-level binpack and spread will be supported. Referring to the binpack plugin, we consider the CPU, Mem, GPU memory and other user-defined resource.

## Targets
@@ -96,7 +96,7 @@ We use the plan ID for synchronization. When the scheduler plugin plans the GPU
- Nos device plugin: inherit from NVIDIA official device plugin; add MPS device support in the device plugin and other two containers.
* Config manager: read the config map mps-configmap, translate it into a config file, and share with the device plugin; kill the device plugin process and read the node annotation "status-gpu", write the annotation "status-plan" after the spec equals the status
* Device plugin: fake the MPS device by the config and interact with kubelet
-  * Mps server:NVIDIA official MPS server daemon
+  * MPS server:NVIDIA official MPS server daemon
- GPU agent: read the GPU usage by kubelet API and update the node annotation "status-gpu"
- MIG agent: call the nvml library to set MIG devices by the node annotation "spec-gpu" and set the "status-plan"; get the MIG devices usage and update the node annotation "status-gpu"
- MIG manager: MIG enable and disable; for MIG enable/disable configuration has some limitations, all MIG related Pods need to be stopped and the official MIG manager has those functions, so we reuse it but change the main process: just enable&disable MIG
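
The plan-ID handshake described in this hunk (write "status-plan" only after the spec equals the status) can be sketched as below. This is a minimal illustration, not the plugin's implementation: the function name `reconcile_plan`, the plan ID `plan-42`, and the annotation value format (`1g.5gb:2`) are hypothetical, and only the annotation keys "spec-gpu", "status-gpu" and "status-plan" come from the text above.

```python
# Annotation keys named in the design text; the value format is an assumption.
SPEC_KEY = "spec-gpu"
STATUS_KEY = "status-gpu"
PLAN_KEY = "status-plan"


def reconcile_plan(annotations: dict[str, str], plan_id: str) -> bool:
    """Record plan_id under "status-plan" once the spec equals the status.

    Returns True when the plan was acknowledged, and False while the node
    has not yet converged (or no spec has been written).
    """
    spec = annotations.get(SPEC_KEY)
    status = annotations.get(STATUS_KEY)
    if spec is None or spec != status:
        return False
    annotations[PLAN_KEY] = plan_id
    return True


# Hypothetical node annotations: the scheduler asked for two slices,
# but the agent has so far reported only one.
node = {SPEC_KEY: "1g.5gb:2", STATUS_KEY: "1g.5gb:1"}
pending = reconcile_plan(node, "plan-42")  # spec != status, nothing written

node[STATUS_KEY] = "1g.5gb:2"              # agent reports the second slice
done = reconcile_plan(node, "plan-42")     # spec == status, plan ID written
```

In this sketch the scheduler side would treat a matching "status-plan" value as the signal that the node has applied the plan it issued. That reading is an assumption based on the truncated hunk context.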