From de770703a3f55f46366f386c692e859361fca01b Mon Sep 17 00:00:00 2001
From: jessestutler
Date: Fri, 1 Nov 2024 11:24:08 +0800
Subject: [PATCH] Add dynamicresources plugin design doc and user guide

Signed-off-by: JesseStutler
---
 docs/design/dynamicresources-plugin.md     | 117 +++++++++++++++
 .../how_to_use_dynamicresources_plugin.md  | 138 ++++++++++++++++++
 2 files changed, 255 insertions(+)
 create mode 100644 docs/design/dynamicresources-plugin.md
 create mode 100644 docs/user-guide/how_to_use_dynamicresources_plugin.md

diff --git a/docs/design/dynamicresources-plugin.md b/docs/design/dynamicresources-plugin.md
new file mode 100644
index 0000000000..9f32189d3f
--- /dev/null
+++ b/docs/design/dynamicresources-plugin.md
@@ -0,0 +1,117 @@
# Dynamic Resource Allocation (DRA) plugin
## Motivation
The kubelet device plugin interface allows extended devices to be attached to a node, but for many newer devices this way of requesting them is too limited. For example:

- Device initialization: When starting a workload that uses an accelerator such as an FPGA, the accelerator may need to be reconfigured or reprogrammed. Currently it is impossible to specify the device properties required for reconfiguration; for the FPGA example, a file containing the desired FPGA configuration would have to be referenced.
- Device cleanup: When a workload finishes, the device may need to be cleaned up. For example, an FPGA might have to be reset because its configuration for the workload was confidential. The device plugin interface currently provides no hook for cleaning up devices.
- Partial allocation: When workloads use only a portion of a device's capabilities, the device can be partitioned (e.g. using NVIDIA MIG or SR-IOV) to better match workload needs. Sharing devices this way can greatly increase hardware utilization and reduce costs. Currently there is no API to request partial allocation: devices must be pre-partitioned and advertised in the same way full, static devices are, and users must then select a pre-partitioned device instead of having one created on the fly for their particular resource constraints. Without the ability to create devices dynamically (i.e. at the time they are requested), the set of pre-defined devices must be carefully tuned to ensure device resources do not go unused because some of the pre-partitioned devices are in low demand.
- Over-the-fabric devices: Some devices are available over a fabric (network, special links, etc.). The device plugin API is designed for node-local resources discovered by a plugin running on the node; there is no universal design for this kind of cross-fabric scheduling.
- Limited parameter passing: `Requests` and `Limits` can only declare a quantity and cannot carry richer parameters, so today those parameters have to be passed through annotations.

Therefore, Kubernetes introduced DRA (Dynamic Resource Allocation), also called Device Plugin 2.0, and Volcano follows up with the same mechanism to enhance its extended-device capabilities. DRA has had two architectures. In Kubernetes v1.26~v1.29 it was implemented as "classic DRA", in which a resource controller and the scheduler had to cooperate on every allocation, which could lead to poor scheduling efficiency; classic DRA has since been withdrawn and all related code removed from the latest Kubernetes branch. Structured parameters DRA is the new architecture introduced in v1.30 and will be the only DRA architecture going forward. Volcano therefore introduces structured parameters DRA based on Kubernetes v1.31 and will only maintain this DRA architecture and its related APIs in the future.

The architecture of structured parameters DRA is as follows:

![Structured parameters DRA architecture](https://github.com/kubernetes/enhancements/raw/87dac43838c73966c05da2cb1a14c0ac0b66ceab/keps/sig-node/4381-dra-structured-parameters/components.png)

A `ResourceClaim` expresses a Pod's demand for an extended device, much as a PVC expresses a demand for storage. Users provide a custom kubelet DRA driver that publishes the attributes of the extended devices in `ResourceSlice` objects, and the scheduler then uses the information in those `ResourceSlice`s to make placement decisions.
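To make this concrete, below is a minimal sketch of a `ResourceSlice` that a hypothetical driver (`gpu.example.com`) might publish for one node. The driver name, pool and attributes are placeholders, and the fields follow the `resource.k8s.io/v1alpha3` API of Kubernetes v1.31, so the exact schema may differ in other releases:

```yaml
# Illustrative sketch only: a ResourceSlice a hypothetical DRA driver
# (gpu.example.com) could publish for one node. Attribute names and
# values are made up; the schema is resource.k8s.io/v1alpha3 (k8s v1.31).
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: node-1-gpu.example.com
spec:
  nodeName: node-1
  driver: gpu.example.com
  pool:
    name: node-1
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        model:
          string: "MOCK-GPU-100"
      capacity:
        memory: 80Gi
```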
> For more details, refer to the structured parameters DRA KEP: https://github.com/kubernetes/enhancements/tree/87dac43838c73966c05da2cb1a14c0ac0b66ceab/keps/sig-node/4381-dra-structured-parameters

## Non-goals
- Writing a kubelet DRA driver. Volcano only provides the ability to schedule pods that specify resource claims.
- Replacing the device plugin API. DRA is not mature yet; if an application's resource requirements are simple, they can still be served by a device plugin. For more complex parameter passing or more advanced scenarios, use DRA, which natively supports sharing extended resources such as GPUs.
## Implementation
A new plugin called `dynamicresources` will be added. The plugin struct embeds the `DynamicResources` plugin of the native kube-scheduler and directly calls its extension points such as `PreFilter`, `Filter`, `PreBind` and others:
```go
type dynamicResourcesPlugin struct {
	*dynamicresources.DynamicResources
}
```
The `dynamicresources` plugin registers the following functions:
- PrePredicateFn:
It first calls `PreEnqueue`, which verifies whether a Pod may be added to the scheduling queue; in the DRA plugin this checks whether the ResourceClaims specified by the pod exist, so PrePredicateFn is the most appropriate place for it. It then calls `PreFilter`, initializing a new cycleState that holds the pod's scheduling context and is passed between extension points. After `PreFilter` succeeds, the cycleState needs to be stored in the session; otherwise it could not be passed between extension points, which are scattered across different places (for example, `PreBind` is executed in the SchedulerCache). A new session attribute called `CycleStatesMap` stores those cycleStates (see the sketch after this list):
```go
type Session struct {
	...
	// The key is the task's UID, the value is the CycleState.
	CycleStatesMap sync.Map
	...
}
```
- PredicateFn:
It fetches the cycleState stored by `PrePredicateFn` from the session and then calls `Filter` to obtain the resource claim allocation result.
- PreBindFn:
This is a new function added to Volcano: resources such as `ResourceClaim`s and PVCs must be bound before the pod itself is bound to the node. It is called in the `Bind` stage in the `SchedulerCache`, but `Bind` currently only takes a TaskInfo as input, and `PreBindFn` is an attribute registered by the plugin in the session, so it cannot be carried into the `SchedulerCache` for execution. Therefore, `Bind` needs to carry more information in its input. A new structure called `BindContext` is added:
```go
type BindContext struct {
	TaskInfo *schedulingapi.TaskInfo

	// Before binding the task, we need to execute PreBind. If PreBind fails,
	// we need to execute the DeallocateFunc in EventHandler to roll back.
	NodeInfo      *schedulingapi.NodeInfo
	PreBindFns    []schedulingapi.PreBindFn
	EventHandlers []*util.EventHandler
}
```
`BindContext` is taken as the input parameter of `Bind`. Before the pod is bound, the `PreBindFn`s are called to do the prebind; if prebind fails, `DeallocateFunc` is called to roll it back:
```go
func (sc *SchedulerCache) BindTask() {
	...
	tmpBindCache := make([]*BindContext, len(sc.bindCache))
	copy(tmpBindCache, sc.bindCache)
	// Currently, bindContexts only contains one element.
	go func(bindContexts []*BindContext) {
		for _, bindContext := range bindContexts {
			for _, preBindFn := range bindContext.PreBindFns {
				err := preBindFn(bindContext.TaskInfo, bindContext.NodeInfo)
				if err != nil {
					klog.Errorf("task %s/%s execute prebind failed: %v", bindContext.TaskInfo.Namespace, bindContext.TaskInfo.Name, err)
					for _, eh := range bindContext.EventHandlers {
						if eh.DeallocateFunc != nil {
							eh.DeallocateFunc(&util.Event{
								Task: bindContext.TaskInfo,
							})
						}
					}
					sc.resyncTask(bindContext.TaskInfo)
					return
				}
			}
			...
		}
		sc.Bind(bindContexts)
	}(tmpBindCache)
	...
}
```
- EventHandler:
  - AllocateFunc: It calls `Reserve` to reserve the resource claim for the pod, which matters when multiple pods specify the same `ResourceClaim`.
  - DeallocateFunc: It calls `Unreserve` to roll back `Reserve` or `PreBind` if either of them fails.
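To show how these pieces fit together, here is a minimal sketch of the PrePredicateFn flow described above. It is illustrative only, not the actual Volcano code: the types are simplified stand-ins for the real Volcano and kube-scheduler interfaces, which carry far more context.

```go
package dynamicresources

import (
	"context"
	"fmt"
	"sync"
)

// CycleState stands in for the kube-scheduler framework's CycleState,
// which stores the pod's scheduling context for one scheduling cycle.
type CycleState struct {
	data sync.Map
}

// Task stands in for Volcano's api.TaskInfo.
type Task struct {
	UID string
	Pod interface{} // *v1.Pod in the real code
}

// Session stands in for Volcano's framework.Session; CycleStatesMap is the
// new attribute described in the design above.
type Session struct {
	CycleStatesMap sync.Map // key: task UID, value: *CycleState
}

// DRAPlugin stands in for the embedded kube-scheduler DynamicResources plugin.
type DRAPlugin interface {
	PreEnqueue(ctx context.Context, pod interface{}) error
	PreFilter(ctx context.Context, state *CycleState, pod interface{}) error
}

// prePredicateFn sketches the PrePredicateFn flow: check that the pod's
// ResourceClaims exist (PreEnqueue), run PreFilter with a fresh CycleState,
// and stash that state in the session so later extension points
// (Filter, PreBind) can retrieve it by task UID.
func prePredicateFn(ctx context.Context, dra DRAPlugin, ssn *Session, task *Task) error {
	if err := dra.PreEnqueue(ctx, task.Pod); err != nil {
		return fmt.Errorf("pre-enqueue check failed: %w", err)
	}
	state := &CycleState{}
	if err := dra.PreFilter(ctx, state, task.Pod); err != nil {
		return fmt.Errorf("prefilter failed: %w", err)
	}
	ssn.CycleStatesMap.Store(task.UID, state)
	return nil
}
```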
In `OnSessionOpen`, the kube-scheduler's `DynamicResources` plugin is initialized and the functions above are registered in the Session. Besides, a new string-slice session attribute called `BindContextEnabledPlugins` records the names of the plugins that need to carry `NodeInfo`, `PreBindFns` and `EventHandlers` together with the `TaskInfo` in the `BindContext`.
\ No newline at end of file
diff --git a/docs/user-guide/how_to_use_dynamicresources_plugin.md b/docs/user-guide/how_to_use_dynamicresources_plugin.md
new file mode 100644
index 0000000000..7b8d21bd96
--- /dev/null
+++ b/docs/user-guide/how_to_use_dynamicresources_plugin.md
@@ -0,0 +1,138 @@
# How to use the dynamicresources plugin
## Introduction
The dynamicresources plugin introduces DRA (Dynamic Resource Allocation) into Volcano, enabling users to consume extended devices such as GPUs/NPUs, FPGAs, etc. Users can declare `ResourceClaim`s in a Pod and pass parameters with them to enhance scheduling onto nodes with extended devices. For more detail, please see [dynamicresources-plugin.md](../design/dynamicresources-plugin.md).

## Environment setup

### Install volcano

Refer to the [Install Guide](https://github.com/volcano-sh/volcano/blob/master/installer/README.md) to install Volcano.

After installation, update the scheduler configuration:

```shell
kubectl edit cm -n volcano-system volcano-scheduler-configmap
```

Please make sure that:

- the allocate or backfill action is enabled;
- the dynamicresources plugin is enabled:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim, preempt"
    tiers:
    - plugins:
      - name: dynamicresources # the dynamicresources plugin should be enabled
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```
- the feature gate `DynamicResourceAllocation` is set to true in the scheduler args:
```yaml
- args:
  ...
  - --feature-gates=DynamicResourceAllocation=true
  ...
```

## Usage
### Deploy a kubelet DRA driver
The kubelet DRA driver runs as a DaemonSet that prepares the extended devices and publishes the `ResourceSlice` API objects. Here are some open-source projects for reference:
- https://github.com/NVIDIA/k8s-dra-driver: if you have NVIDIA GPUs on your nodes, you can use this driver for testing.
- https://github.com/intel/intel-resource-drivers-for-kubernetes.git
- https://github.com/kubernetes-sigs/dra-example-driver: the simplest repository for deploying a DRA driver. It provides access to a set of mock GPU devices and prints logs in the container when a mock GPU is allocated. You can refer to the Demo part of its README.md to deploy it (a sketch of what its claim-based demo pods look like follows this list).
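As promised above, here is a minimal sketch of a pod that requests a mock GPU through a `ResourceClaimTemplate`, modeled loosely on the dra-example-driver demo. The namespace, names and the `gpu.example.com` device class are placeholders taken from that driver's demo, and the API version (`resource.k8s.io/v1alpha3` in Kubernetes v1.31) may differ in your environment:

```yaml
# Illustrative sketch modeled on the dra-example-driver demo; names,
# namespace and device class are placeholders, not Volcano-defined APIs.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod0
spec:
  schedulerName: volcano    # let Volcano schedule the pod
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c", "export; sleep 9999"]
    resources:
      claims:
      - name: gpu           # references the claim declared below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```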
After the driver is deployed successfully, you can apply the demo pods requesting `ResourceClaim`s:
```shell
kubectl apply --filename=demo/gpu-test{1,2,3,4,5}.yaml
```
After all the pods are running, you can dump the logs of each pod to verify that GPUs were allocated:
```shell
for example in $(seq 1 5); do \
  echo "gpu-test${example}:"
  for pod in $(kubectl get pod -n gpu-test${example} --output=jsonpath='{.items[*].metadata.name}'); do \
    for ctr in $(kubectl get pod -n gpu-test${example} ${pod} -o jsonpath='{.spec.containers[*].name}'); do \
      echo "${pod} ${ctr}:"
      if [ "${example}" -lt 3 ]; then
        kubectl logs -n gpu-test${example} ${pod} -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
      else
        kubectl logs -n gpu-test${example} ${pod} -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
      fi
    done
  done
  echo ""
done
```
This should produce output similar to the following:
```shell
gpu-test1:
pod0 ctr0:
declare -x GPU_DEVICE_6="gpu-6"
pod1 ctr0:
declare -x GPU_DEVICE_7="gpu-7"

gpu-test2:
pod0 ctr0:
declare -x GPU_DEVICE_0="gpu-0"
declare -x GPU_DEVICE_1="gpu-1"

gpu-test3:
pod0 ctr0:
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2_TIMESLICE_INTERVAL="Default"
pod0 ctr1:
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2_TIMESLICE_INTERVAL="Default"

gpu-test4:
pod0 ctr0:
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"
pod1 ctr0:
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"

gpu-test5:
pod0 ts-ctr0:
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_4_TIMESLICE_INTERVAL="Long"
pod0 ts-ctr1:
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_4_TIMESLICE_INTERVAL="Long"
pod0 sp-ctr0:
declare -x GPU_DEVICE_5="gpu-5"
declare -x GPU_DEVICE_5_PARTITION_COUNT="10"
declare -x GPU_DEVICE_5_SHARING_STRATEGY="SpacePartitioning"
pod0 sp-ctr1:
declare -x GPU_DEVICE_5="gpu-5"
declare -x GPU_DEVICE_5_PARTITION_COUNT="10"
declare -x GPU_DEVICE_5_SHARING_STRATEGY="SpacePartitioning"
```
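As an optional sanity check (not part of the original demo), you can also inspect the DRA API objects directly, assuming the `resource.k8s.io/v1alpha3` API group described above is enabled in your cluster:

```shell
# List the devices the drivers have published on each node...
kubectl get resourceslices
# ...and check whether the demo pods' claims were allocated.
kubectl get resourceclaims --all-namespaces
```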