From de770703a3f55f46366f386c692e859361fca01b Mon Sep 17 00:00:00 2001
From: jessestutler
Date: Fri, 1 Nov 2024 11:24:08 +0800
Subject: [PATCH] Add dynamicresources plugin design doc and user guide

Signed-off-by: JesseStutler
---
 docs/design/dynamicresources-plugin.md     | 117 +++++++++++++++
 .../how_to_use_dynamicresources_plugin.md  | 138 ++++++++++++++++++
 2 files changed, 255 insertions(+)
 create mode 100644 docs/design/dynamicresources-plugin.md
 create mode 100644 docs/user-guide/how_to_use_dynamicresources_plugin.md

diff --git a/docs/design/dynamicresources-plugin.md b/docs/design/dynamicresources-plugin.md
new file mode 100644
index 0000000000..9f32189d3f
--- /dev/null
+++ b/docs/design/dynamicresources-plugin.md
@@ -0,0 +1,117 @@
# Dynamic Resource Allocation (DRA) plugin
## Motivation
The kubelet device plugin interface allows extended devices to be attached to a node, but for many newer devices this way of requesting them is too limited. For example:

- Device initialization: When starting a workload that uses an accelerator such as an FPGA, the accelerator may need to be reconfigured or reprogrammed. Currently it is impossible to specify the device properties required for reconfiguration; for the FPGA example, a file containing the desired FPGA configuration would have to be referenced.
- Device cleanup: When a workload finishes, the device may need to be cleaned up. For example, an FPGA might have to be reset because its configuration for the workload was confidential. The device plugin interface currently provides no hook for cleaning up devices.
- Partial allocation: When workloads use only a portion of a device's capabilities, the device can be partitioned (e.g. using NVIDIA MIG or SR-IOV) to better match workload needs. Sharing devices this way can greatly increase hardware utilization and reduce costs. Currently there is no API to request partial allocation: devices must be pre-partitioned and advertised in the same way full, static devices are, and users must then select a pre-partitioned device instead of having one created on the fly for their particular resource constraints. Without the ability to create devices dynamically (i.e. at the time they are requested), the set of pre-defined devices must be carefully tuned to ensure device resources do not go unused because some of the pre-partitioned devices are in low demand.
- Over-the-fabric devices: Some devices are available over a fabric (network, special links, etc.). The device plugin API is designed for node-local resources discovered by a plugin running on the node; there is no universal design for this kind of cross-fabric scheduling.
- Limited parameter passing: `Requests` and `Limits` can only declare a quantity and cannot carry richer parameters, so today those parameters have to be passed through annotations.

Therefore, Kubernetes introduced DRA (Dynamic Resource Allocation), also called Device Plugin 2.0, and Volcano follows up with the same mechanism to enhance its extended-device capabilities. DRA has had two architectures. In Kubernetes v1.26~v1.29 it was implemented as "classic DRA", in which a resource controller and the scheduler had to cooperate on every allocation, which could lead to poor scheduling efficiency; classic DRA has since been withdrawn and all related code removed from the latest Kubernetes branch. Structured parameters DRA is the new architecture introduced in v1.30 and will be the only DRA architecture going forward. Volcano therefore introduces structured parameters DRA based on Kubernetes v1.31 and will only maintain this DRA architecture and its related APIs in the future.

The architecture of structured parameters DRA is as follows:

![Structured parameters DRA architecture](https://github.com/kubernetes/enhancements/raw/87dac43838c73966c05da2cb1a14c0ac0b66ceab/keps/sig-node/4381-dra-structured-parameters/components.png)

A `ResourceClaim` expresses a Pod's demand for an extended device, much as a PVC expresses a demand for storage. Users provide a custom kubelet DRA driver that publishes the attributes of the extended devices in `ResourceSlice` objects, and the scheduler then uses the information in those `ResourceSlice`s to make placement decisions.
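To make this concrete, below is a minimal sketch of a `ResourceSlice` that a hypothetical driver (`gpu.example.com`) might publish for one node. The driver name, pool and attributes are placeholders, and the fields follow the `resource.k8s.io/v1alpha3` API of Kubernetes v1.31, so the exact schema may differ in other releases:

```yaml
# Illustrative sketch only: a ResourceSlice a hypothetical DRA driver
# (gpu.example.com) could publish for one node. Attribute names and
# values are made up; the schema is resource.k8s.io/v1alpha3 (k8s v1.31).
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: node-1-gpu.example.com
spec:
  nodeName: node-1
  driver: gpu.example.com
  pool:
    name: node-1
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        model:
          string: "MOCK-GPU-100"
      capacity:
        memory: 80Gi
```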
> For more details, refer to the structured parameters DRA KEP: https://github.com/kubernetes/enhancements/tree/87dac43838c73966c05da2cb1a14c0ac0b66ceab/keps/sig-node/4381-dra-structured-parameters

## Non-goals
- Writing a kubelet DRA driver. Volcano only provides the ability to schedule pods that specify resource claims.
- Replacing the device plugin API. DRA is not mature yet; if an application's resource requirements are simple, they can still be served by a device plugin. For more complex parameter passing or more advanced scenarios, use DRA, which natively supports sharing extended resources such as GPUs.
## Implementation
A new plugin called `dynamicresources` will be added. The plugin struct embeds the `DynamicResources` plugin of the native kube-scheduler and directly calls its extension points such as `PreFilter`, `Filter`, `PreBind` and others:
```go
type dynamicResourcesPlugin struct {
	*dynamicresources.DynamicResources
}
```
The `dynamicresources` plugin registers the following functions:
- PrePredicateFn:
It first calls `PreEnqueue`, which verifies whether a Pod may be added to the scheduling queue; in the DRA plugin this checks whether the ResourceClaims specified by the pod exist, so PrePredicateFn is the most appropriate place for it. It then calls `PreFilter`, initializing a new cycleState that holds the pod's scheduling context and is passed between extension points. After `PreFilter` succeeds, the cycleState needs to be stored in the session; otherwise it could not be passed between extension points, which are scattered across different places (for example, `PreBind` is executed in the SchedulerCache). A new session attribute called `CycleStatesMap` stores those cycleStates (see the sketch after this list):
```go
type Session struct {
	...
	// The key is the task's UID, the value is the CycleState.
	CycleStatesMap sync.Map
	...
}
```
- PredicateFn:
It fetches the cycleState stored by `PrePredicateFn` from the session and then calls `Filter` to obtain the resource claim allocation result.
- PreBindFn:
This is a new function added to Volcano: resources such as `ResourceClaim`s and PVCs must be bound before the pod itself is bound to the node. It is called in the `Bind` stage in the `SchedulerCache`, but `Bind` currently only takes a TaskInfo as input, and `PreBindFn` is an attribute registered by the plugin in the session, so it cannot be carried into the `SchedulerCache` for execution. Therefore, `Bind` needs to carry more information in its input. A new structure called `BindContext` is added:
```go
type BindContext struct {
	TaskInfo *schedulingapi.TaskInfo

	// Before binding the task, we need to execute PreBind. If PreBind fails,
	// we need to execute the DeallocateFunc in EventHandler to roll back.
	NodeInfo      *schedulingapi.NodeInfo
	PreBindFns    []schedulingapi.PreBindFn
	EventHandlers []*util.EventHandler
}
```
`BindContext` is taken as the input parameter of `Bind`. Before the pod is bound, the `PreBindFn`s are called to do the prebind; if prebind fails, `DeallocateFunc` is called to roll it back:
```go
func (sc *SchedulerCache) BindTask() {
	...
	tmpBindCache := make([]*BindContext, len(sc.bindCache))
	copy(tmpBindCache, sc.bindCache)
	// Currently, bindContexts only contains one element.
	go func(bindContexts []*BindContext) {
		for _, bindContext := range bindContexts {
			for _, preBindFn := range bindContext.PreBindFns {
				err := preBindFn(bindContext.TaskInfo, bindContext.NodeInfo)
				if err != nil {
					klog.Errorf("task %s/%s execute prebind failed: %v", bindContext.TaskInfo.Namespace, bindContext.TaskInfo.Name, err)
					for _, eh := range bindContext.EventHandlers {
						if eh.DeallocateFunc != nil {
							eh.DeallocateFunc(&util.Event{
								Task: bindContext.TaskInfo,
							})
						}
					}
					sc.resyncTask(bindContext.TaskInfo)
					return
				}
			}
			...
		}
		sc.Bind(bindContexts)
	}(tmpBindCache)
	...
}
```
- EventHandler:
  - AllocateFunc: It calls `Reserve` to reserve the resource claim for the pod, which matters when multiple pods specify the same `ResourceClaim`.
  - DeallocateFunc: It calls `Unreserve` to roll back `Reserve` or `PreBind` if either of them fails.
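To show how these pieces fit together, here is a minimal sketch of the PrePredicateFn flow described above. It is illustrative only, not the actual Volcano code: the types are simplified stand-ins for the real Volcano and kube-scheduler interfaces, which carry far more context.

```go
package dynamicresources

import (
	"context"
	"fmt"
	"sync"
)

// CycleState stands in for the kube-scheduler framework's CycleState,
// which stores the pod's scheduling context for one scheduling cycle.
type CycleState struct {
	data sync.Map
}

// Task stands in for Volcano's api.TaskInfo.
type Task struct {
	UID string
	Pod interface{} // *v1.Pod in the real code
}

// Session stands in for Volcano's framework.Session; CycleStatesMap is the
// new attribute described in the design above.
type Session struct {
	CycleStatesMap sync.Map // key: task UID, value: *CycleState
}

// DRAPlugin stands in for the embedded kube-scheduler DynamicResources plugin.
type DRAPlugin interface {
	PreEnqueue(ctx context.Context, pod interface{}) error
	PreFilter(ctx context.Context, state *CycleState, pod interface{}) error
}

// prePredicateFn sketches the PrePredicateFn flow: check that the pod's
// ResourceClaims exist (PreEnqueue), run PreFilter with a fresh CycleState,
// and stash that state in the session so later extension points
// (Filter, PreBind) can retrieve it by task UID.
func prePredicateFn(ctx context.Context, dra DRAPlugin, ssn *Session, task *Task) error {
	if err := dra.PreEnqueue(ctx, task.Pod); err != nil {
		return fmt.Errorf("pre-enqueue check failed: %w", err)
	}
	state := &CycleState{}
	if err := dra.PreFilter(ctx, state, task.Pod); err != nil {
		return fmt.Errorf("prefilter failed: %w", err)
	}
	ssn.CycleStatesMap.Store(task.UID, state)
	return nil
}
```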
In `OnSessionOpen`, the kube-scheduler's `DynamicResources` plugin is initialized and the functions above are registered in the Session. Besides, a new string-slice session attribute called `BindContextEnabledPlugins` records the names of the plugins that need to carry `NodeInfo`, `PreBindFns` and `EventHandlers` together with the `TaskInfo` in the `BindContext`.
\ No newline at end of file
diff --git a/docs/user-guide/how_to_use_dynamicresources_plugin.md b/docs/user-guide/how_to_use_dynamicresources_plugin.md
new file mode 100644
index 0000000000..7b8d21bd96
--- /dev/null
+++ b/docs/user-guide/how_to_use_dynamicresources_plugin.md
@@ -0,0 +1,138 @@
# How to use the dynamicresources plugin
## Introduction
The dynamicresources plugin introduces DRA (Dynamic Resource Allocation) into Volcano, enabling users to consume extended devices such as GPUs/NPUs, FPGAs, etc. Users can declare `ResourceClaim`s in a Pod and pass parameters with them to enhance scheduling onto nodes with extended devices. For more detail, please see [dynamicresources-plugin.md](../design/dynamicresources-plugin.md).

## Environment setup

### Install volcano

Refer to the [Install Guide](https://github.com/volcano-sh/volcano/blob/master/installer/README.md) to install Volcano.

After installation, update the scheduler configuration:

```shell
kubectl edit cm -n volcano-system volcano-scheduler-configmap
```

Please make sure that:

- the allocate or backfill action is enabled;
- the dynamicresources plugin is enabled:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim, preempt"
    tiers:
    - plugins:
      - name: dynamicresources # the dynamicresources plugin should be enabled
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```
- the feature gate `DynamicResourceAllocation` is set to true in the scheduler args:
```yaml
- args:
  ...
  - --feature-gates=DynamicResourceAllocation=true
  ...
```

## Usage
### Deploy a kubelet DRA driver
The kubelet DRA driver runs as a DaemonSet that prepares the extended devices and publishes the `ResourceSlice` API objects. Here are some open-source projects for reference:
- https://github.com/NVIDIA/k8s-dra-driver: if you have NVIDIA GPUs on your nodes, you can use this driver for testing.
- https://github.com/intel/intel-resource-drivers-for-kubernetes.git
- https://github.com/kubernetes-sigs/dra-example-driver: the simplest repository for deploying a DRA driver. It provides access to a set of mock GPU devices and prints logs in the container when a mock GPU is allocated. You can refer to the Demo part of its README.md to deploy it (a sketch of what its claim-based demo pods look like follows this list).
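As promised above, here is a minimal sketch of a pod that requests a mock GPU through a `ResourceClaimTemplate`, modeled loosely on the dra-example-driver demo. The namespace, names and the `gpu.example.com` device class are placeholders taken from that driver's demo, and the API version (`resource.k8s.io/v1alpha3` in Kubernetes v1.31) may differ in your environment:

```yaml
# Illustrative sketch modeled on the dra-example-driver demo; names,
# namespace and device class are placeholders, not Volcano-defined APIs.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod0
spec:
  schedulerName: volcano    # let Volcano schedule the pod
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c", "export; sleep 9999"]
    resources:
      claims:
      - name: gpu           # references the claim declared below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```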
After the driver is deployed successfully, you can apply the demo pods requesting `ResourceClaim`s:
```shell
kubectl apply --filename=demo/gpu-test{1,2,3,4,5}.yaml
```
After all the pods are running, you can dump the logs of each pod to verify that GPUs were allocated:
```shell
for example in $(seq 1 5); do \
  echo "gpu-test${example}:"
  for pod in $(kubectl get pod -n gpu-test${example} --output=jsonpath='{.items[*].metadata.name}'); do \
    for ctr in $(kubectl get pod -n gpu-test${example} ${pod} -o jsonpath='{.spec.containers[*].name}'); do \
      echo "${pod} ${ctr}:"
      if [ "${example}" -lt 3 ]; then
        kubectl logs -n gpu-test${example} ${pod} -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
      else
        kubectl logs -n gpu-test${example} ${pod} -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
      fi
    done
  done
  echo ""
done
```
This should produce output similar to the following:
```shell
gpu-test1:
pod0 ctr0:
declare -x GPU_DEVICE_6="gpu-6"
pod1 ctr0:
declare -x GPU_DEVICE_7="gpu-7"

gpu-test2:
pod0 ctr0:
declare -x GPU_DEVICE_0="gpu-0"
declare -x GPU_DEVICE_1="gpu-1"

gpu-test3:
pod0 ctr0:
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2_TIMESLICE_INTERVAL="Default"
pod0 ctr1:
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2_TIMESLICE_INTERVAL="Default"

gpu-test4:
pod0 ctr0:
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"
pod1 ctr0:
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"

gpu-test5:
pod0 ts-ctr0:
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_4_TIMESLICE_INTERVAL="Long"
pod0 ts-ctr1:
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_4_TIMESLICE_INTERVAL="Long"
pod0 sp-ctr0:
declare -x GPU_DEVICE_5="gpu-5"
declare -x GPU_DEVICE_5_PARTITION_COUNT="10"
declare -x GPU_DEVICE_5_SHARING_STRATEGY="SpacePartitioning"
pod0 sp-ctr1:
declare -x GPU_DEVICE_5="gpu-5"
declare -x GPU_DEVICE_5_PARTITION_COUNT="10"
declare -x GPU_DEVICE_5_SHARING_STRATEGY="SpacePartitioning"
```
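As an optional sanity check (not part of the original demo), you can also inspect the DRA API objects directly, assuming the `resource.k8s.io/v1alpha3` API group described above is enabled in your cluster:

```shell
# List the devices the drivers have published on each node...
kubectl get resourceslices
# ...and check whether the demo pods' claims were allocated.
kubectl get resourceclaims --all-namespaces
```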