diff --git a/docs/design/dynamic-gpu-slice.md b/docs/design/dynamic-gpu-slice.md
index 0003f75d90..d0a86bb9d4 100644
--- a/docs/design/dynamic-gpu-slice.md
+++ b/docs/design/dynamic-gpu-slice.md
@@ -1,8 +1,8 @@
-# NVIDIA GPU MPS and MIG automatic slice plugin
+# NVIDIA GPU MPS and MIG dynamic slice plugin
 
 ## Introduction
 
-The NVIDIA GPU build-in sharing method includes: time-slice, MPS and MIG. The context switch for time slice sharing would waste some time, so we chose the MPS and MIG. The GPU MIG profile is variable, the user could acquire the mig device in the profile definition, but current implementation only defines the dedicated profile before the user requirement. That limits the usage of MIG. We want to develop an automatic slice plugin and create the slice when the user require it. And we also add the MPS support.
+The NVIDIA GPU built-in sharing methods include time-slicing, MPS and MIG. The context switching of time-slice sharing wastes some time, so we chose MPS and MIG. The MIG profile of a GPU is variable and the user can request a MIG device through a profile definition, but the current implementation only offers profiles that were defined in advance, before the user asks for them. That limits the usage of MIG. We want to develop a dynamic slice plugin that creates the slice when the user requests it, and we also add MPS support.
 
 For the scheduling method, node-level binpack and spread will be supported. Referring to the binpack plugin, we consider the CPU, memory, GPU memory and other user-defined resources.
 ## Targets
@@ -96,7 +96,7 @@ We use the plan ID for synchronization. When the scheduler plugin plans the GPU
 - Nos device plugin: inherit from the NVIDIA official device plugin; add MPS device support in the device plugin and two other containers.
   * Config manager: read the ConfigMap mps-configmap, translate it into a config file, and share it with the device plugin; kill the device plugin process, read the node annotation "status-gpu", and write the annotation "status-plan" once the spec equals the status
   * Device plugin: fake the MPS devices according to the config and interact with kubelet
-  * Mps server:NVIDIA official MPS server daemon
+  * MPS server: NVIDIA official MPS server daemon
 - GPU agent: read the GPU usage through the kubelet API and update the node annotation "status-gpu"
 - MIG agent: call the nvml library to set MIG devices according to the node annotation "spec-gpu" and set "status-plan"; get the MIG device usage and update the node annotation "status-gpu"
 - MIG manager: MIG enable and disable; since the MIG enable/disable configuration has some limitations (all MIG-related Pods need to be stopped) and the official MIG manager already has those functions, we reuse it but change the main process to just enable & disable MIG
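
As a rough illustration of the plan-ID synchronization described in the diff above, the following Go sketch (using client-go) shows how a node-side agent might read the desired plan from the "spec-gpu" node annotation and acknowledge it by patching "status-plan" once the plan has been applied. The `syncPlan` helper and the bare annotation keys are placeholders taken from the shorthand names in this document; the real implementation may use prefixed keys and a richer plan payload.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Placeholder annotation keys, following the shorthand names in this document.
const (
	specGPUKey    = "spec-gpu"    // desired GPU slice plan written by the scheduler plugin
	statusPlanKey = "status-plan" // plan ID acknowledged by the node agents
)

// syncPlan reads the desired plan from "spec-gpu" and, once it has been
// applied on the node, acknowledges it by patching "status-plan".
func syncPlan(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	plan := node.Annotations[specGPUKey]
	if plan == "" {
		return nil // nothing to apply
	}

	// ... apply the plan locally (create MIG slices / rewrite the MPS config) ...

	// Acknowledge the plan with a merge patch on the node annotations.
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{statusPlanKey: plan},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = cs.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, data, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := syncPlan(context.Background(), cs, "node-1"); err != nil {
		fmt.Println("plan sync failed:", err)
	}
}
```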
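"Faking" the MPS devices means advertising several virtual device IDs per physical GPU to kubelet through the device plugin API, so that more than one Pod can land on an MPS-shared GPU. A minimal sketch, assuming a fixed replica count per GPU; the `fakeMPSDevices` helper and the ID scheme are made up for illustration and are not the actual plugin code.

```go
package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// fakeMPSDevices expands each physical GPU UUID into `replicas` virtual
// device IDs, which the device plugin would report to kubelet as healthy
// schedulable devices backed by a single MPS-shared GPU.
func fakeMPSDevices(gpuUUIDs []string, replicas int) []*pluginapi.Device {
	var devs []*pluginapi.Device
	for _, uuid := range gpuUUIDs {
		for i := 0; i < replicas; i++ {
			devs = append(devs, &pluginapi.Device{
				ID:     fmt.Sprintf("%s-mps-%d", uuid, i),
				Health: pluginapi.Healthy,
			})
		}
	}
	return devs
}

func main() {
	devs := fakeMPSDevices([]string{"GPU-aaaa", "GPU-bbbb"}, 4)
	// These devices would be sent to kubelet in a ListAndWatch response.
	resp := &pluginapi.ListAndWatchResponse{Devices: devs}
	fmt.Println("advertising", len(resp.Devices), "virtual MPS devices")
}
```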
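On the MIG side, setting a slice through nvml boils down to creating a GPU instance for the requested profile and then a compute instance inside it. A rough sketch with the go-nvml bindings, assuming MIG mode is already enabled by the MIG manager and hard-coding a 1g profile for brevity; the `createOneSlice` helper is illustrative only, not the MIG agent implementation.

```go
package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// createOneSlice carves a single 1-slice MIG device on GPU index 0:
// first a GPU instance for the profile, then a compute instance in it.
func createOneSlice() error {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("nvml init failed: %v", ret)
	}
	defer nvml.Shutdown()

	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		return fmt.Errorf("get device failed: %v", ret)
	}

	giProfile, ret := device.GetGpuInstanceProfileInfo(nvml.GPU_INSTANCE_PROFILE_1_SLICE)
	if ret != nvml.SUCCESS {
		return fmt.Errorf("get GPU instance profile failed: %v", ret)
	}
	gi, ret := device.CreateGpuInstance(&giProfile)
	if ret != nvml.SUCCESS {
		return fmt.Errorf("create GPU instance failed: %v", ret)
	}

	// A compute instance is still needed before the slice is usable by Pods.
	ciProfile, ret := gi.GetComputeInstanceProfileInfo(
		nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE, nvml.COMPUTE_INSTANCE_ENGINE_PROFILE_SHARED)
	if ret != nvml.SUCCESS {
		return fmt.Errorf("get compute instance profile failed: %v", ret)
	}
	if _, ret := gi.CreateComputeInstance(&ciProfile); ret != nvml.SUCCESS {
		return fmt.Errorf("create compute instance failed: %v", ret)
	}
	return nil
}

func main() {
	if err := createOneSlice(); err != nil {
		fmt.Println(err)
	}
}
```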