Add dynamicresources plugin design doc and user guide
Signed-off-by: JesseStutler <[email protected]>
# Dynamic Resource Allocation (DRA) plugin

## Motivation

The kubelet device plugin interface allows extended devices to be attached to a node, but for many newer devices this way of requesting them is too limited. For example:

- Device initialization: When starting a workload that uses an accelerator such as an FPGA, the accelerator may need to be reconfigured or reprogrammed. Currently, it is impossible to specify the desired device properties that are required for reconfiguring devices. For the FPGA example, a file containing the desired configuration of the FPGA has to be referenced.
- Device cleanup: Sometimes a device needs to be cleaned up after the workload finishes. For example, an FPGA might have to be reset because its configuration for the workload was confidential. Currently, the device plugin interface provides no way to clean up devices.
- Partial allocation: When workloads use only a portion of a device's capabilities, devices can be partitioned (e.g. using NVIDIA MIG or SR-IOV) to better match workload needs. Sharing devices in this way can greatly increase hardware utilization and reduce costs. Currently, there is no API to request partial device allocation; devices need to be pre-partitioned and advertised the same way full/static devices are. Users must then select a pre-partitioned device instead of having one created for them on the fly based on their particular resource constraints. Without the ability to create devices dynamically (i.e. at the time they are requested), the set of pre-defined devices must be carefully tuned to ensure that device resources do not go unused because some of the pre-partitioned devices are in low demand.
- Over-the-fabric devices: Some devices may be available over a fabric (network, special links, etc.). Currently, the device plugin API is designed for node-local resources that get discovered by a plugin running on the node. There is no universal design for this kind of cross-fabric scheduling.
- Currently, `Requests` and `Limits` can only declare a quantity and cannot carry additional parameters, so users have to rely too heavily on annotations to pass them.
Therefore, Kubernetes introduced DRA (Dynamic Resource Allocation), also called Device Plugin 2.0. Volcano follows up and introduces the DRA mechanism to enhance its extended device access capabilities. DRA has had two architectures. In Kubernetes v1.26~v1.29 it was called classic DRA: a resource controller and the scheduler needed to collaborate to schedule, which could lead to poor scheduling efficiency. This architecture has been withdrawn from the latest Kubernetes branch and all of its related code has been removed. Structured parameters DRA is the new architecture introduced with v1.30, and it will be the only DRA architecture in the future. Therefore, Volcano introduces structured parameters DRA with Kubernetes v1.31, and will only maintain this DRA architecture and its related APIs going forward.
The architecture of structured parameters DRA is as follows:

![Structured parameters DRA architecture](https://github.com/kubernetes/enhancements/raw/87dac43838c73966c05da2cb1a14c0ac0b66ceab/keps/sig-node/4381-dra-structured-parameters/components.png)

A `ResourceClaim` expresses a Pod's demand for an extended device, much like a PVC expresses a demand for storage. Users need a custom kubelet DRA driver that carries the attributes of the extended devices in `ResourceSlice` objects and reports them to the scheduler; the scheduler then uses the information in the `ResourceSlice` to schedule.

> For more details, refer to the structured parameters DRA KEP: https://github.com/kubernetes/enhancements/tree/87dac43838c73966c05da2cb1a14c0ac0b66ceab/keps/sig-node/4381-dra-structured-parameters
## Non-goals

- Writing a kubelet DRA driver. Volcano only provides the capability of scheduling pods that specify resource claims.
- Replacing the device plugin API. At present, DRA is not mature yet. If the application's resource requirements are simple, they can still be satisfied using a device plugin. However, for more complex parameter passing and more advanced scenarios, you can use DRA, which natively supports sharing of extended resources such as GPUs.
## Implementation

A new plugin called `dynamicresources` will be added. The struct of the `dynamicresources` plugin embeds the dynamicResources plugin of the native kube-scheduler, and directly calls its extension points such as `PreFilter`, `Filter`, `PreBind` and others:

```go
type dynamicResourcesPlugin struct {
	*dynamicresources.DynamicResources
}
```
The `dynamicresources` plugin will register the following fns:

- PrePredicateFn:
It first calls `PreEnqueue`, which verifies whether a Pod can be added to the scheduler queue; in the DRA plugin it checks whether the resourceClaims specified by the pod exist, so it is most appropriate to place `PreEnqueue` in PrePredicateFn. It then calls `PreFilter`, and a new cycleState that saves the pod scheduling context is initialized to be passed between the different extension points. After `PreFilter` succeeds, the cycleState needs to be stored in the session; otherwise it cannot be passed between extension points, which are scattered in different places (for example, `PreBind` is executed in the SchedulerCache). A new attribute called `CycleStatesMap` is added to the session to store the cycleStates that need to be passed between these scattered extension points:
```go
type Session struct {
	...
	// The key is the task's UID, the value is the CycleState.
	CycleStatesMap sync.Map
	...
}
```
- PredicateFn:
It gets the cycleState stored by `PrePredicateFn` from the session and then calls `Filter` to get the resource claim allocation result.
- PreBindFn:
This is a new fn added to Volcano; it is necessary to bind some resources such as `ResourceClaim` and `PVC` before binding the pod to the node. It is called in the `Bind` stage in the `SchedulerCache`, but currently `Bind` only takes a TaskInfo as its input parameter, while `PreBindFn` is an attribute registered by the plugin in the session and cannot be carried to the `SchedulerCache` for execution. Therefore, `Bind` needs to carry more information in its input parameters. A new structure called `BindContext` is added:
```go
type BindContext struct {
	TaskInfo *schedulingapi.TaskInfo

	// Before binding the task, we need to execute PreBind. If PreBind fails,
	// we need to execute DeallocateFunc in EventHandler to roll back.
	NodeInfo      *schedulingapi.NodeInfo
	PreBindFns    []schedulingapi.PreBindFn
	EventHandlers []*util.EventHandler
}
```
`BindContext` is then taken as the input parameter when executing `Bind`. Before binding the pod, it calls `PreBindFn` to do the prebind; if the prebind fails, DeallocateFunc is called to roll it back:
```go
func (sc *SchedulerCache) BindTask() {
	...
	var tmpBindCache []*BindContext = make([]*BindContext, len(sc.bindCache))
	copy(tmpBindCache, sc.bindCache)
	// Currently, bindContexts only contain 1 element.
	go func(bindContexts []*BindContext) {
		for _, bindContext := range bindContexts {
			for _, preBindFn := range bindContext.PreBindFns {
				err := preBindFn(bindContext.TaskInfo, bindContext.NodeInfo)
				if err != nil {
					klog.Errorf("task %s/%s execute prebind failed: %v", bindContext.TaskInfo.Namespace, bindContext.TaskInfo.Name, err)
					for _, eh := range bindContext.EventHandlers {
						if eh.DeallocateFunc != nil {
							eh.DeallocateFunc(&util.Event{
								Task: bindContext.TaskInfo,
							})
						}
					}
					sc.resyncTask(bindContext.TaskInfo)
					return
				}
			}
			...
		}
		sc.Bind(bindContexts)
	}(tmpBindCache)
	...
}
```
- EventHandler:
  - AllocateFunc: It calls `Reserve` to reserve the resource claim for the pod. This is useful when multiple pods specify the same `ResourceClaim`.
  - DeallocateFunc: It calls `Unreserve` to roll back `PreBind` or `Reserve` if they failed.

In `OnSessionOpen`, the `DynamicResources` plugin of kube-scheduler will be initialized and the above fns will be registered in the Session. Besides, a new string slice attribute called BindContextEnabledPlugins will be added to the session; it records the names of the plugins that need to carry `NodeInfo`, `PreBindFns` and `EventHandlers` together with the `TaskInfo` in the `BindContext`.
# How to use the dynamicresources plugin

## Introduction

The dynamicresources plugin introduces DRA (Dynamic Resource Allocation) into Volcano, enabling users to use extended devices such as GPU/NPU, FPGA, etc. Users can declare a `ResourceClaim` for a Pod and carry parameters in it to enhance the scheduling of nodes with extended devices. For more detail, please see [dynamicresources-plugin.md](../design/dynamicresources-plugin.md).
## Environment setup

### Install volcano

Refer to the [Install Guide](https://github.com/volcano-sh/volcano/blob/master/installer/README.md) to install volcano.

After installation, update the scheduler configuration:

```shell
kubectl edit cm -n volcano-system volcano-scheduler-configmap
```

Please make sure that:

- the allocate or backfill action is enabled;
- the dynamicresources plugin is enabled:
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim, preempt"
    tiers:
    - plugins:
      - name: dynamicresources # dynamicresources plugin should be enabled
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```
- the feature gate DynamicResourceAllocation is set to true in the scheduler args:
```yaml
- args:
  ...
  - --feature-gates=DynamicResourceAllocation=true
  ...
```

## Usage

### Deploy a kubelet DRA driver

A kubelet DRA driver runs as a daemonset to prepare extended devices and report the `ResourceSlice` API. Here are some sample open source projects for reference:

- https://github.com/NVIDIA/k8s-dra-driver: If you have NVIDIA GPUs on your nodes, you can use this to test.
- https://github.com/intel/intel-resource-drivers-for-kubernetes.git
- https://github.com/kubernetes-sigs/dra-example-driver:
  This is the simplest repo for deploying a DRA driver. It provides access to a set of mock GPU devices and prints logs in a container when the container is allocated a mock GPU. You can refer to the Demo part of its README.md to deploy a simple DRA driver. After the driver is deployed successfully, you can apply the demo pods requesting `ResourceClaim`s:
```shell
kubectl apply --filename=demo/gpu-test{1,2,3,4,5}.yaml
```
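For orientation, a pod requests a device by referencing a `ResourceClaim` that names a device class. The following is a minimal sketch in the style of the dra-example-driver demo; the `gpu.example.com` device class, the namespace, names and image are illustrative assumptions, and the `resource.k8s.io` API version must match what your cluster serves (e.g. `v1alpha3` on Kubernetes v1.31, `v1beta1` on v1.32):

```yaml
apiVersion: resource.k8s.io/v1beta1 # use the version served by your cluster
kind: ResourceClaim
metadata:
  name: single-gpu
  namespace: gpu-demo
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com # device class installed by the DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: pod0
  namespace: gpu-demo
spec:
  schedulerName: volcano          # let Volcano schedule the pod
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu # reference the claim above
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c", "env | grep -i gpu; sleep 9999"]
    resources:
      claims:
      - name: gpu                 # attach the claim to this container
```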
After all the pods are running, you can dump the logs of each pod to verify that GPUs were allocated:
```shell
for example in $(seq 1 5); do \
  echo "gpu-test${example}:"
  for pod in $(kubectl get pod -n gpu-test${example} --output=jsonpath='{.items[*].metadata.name}'); do \
    for ctr in $(kubectl get pod -n gpu-test${example} ${pod} -o jsonpath='{.spec.containers[*].name}'); do \
      echo "${pod} ${ctr}:"
      if [ "${example}" -lt 3 ]; then
        kubectl logs -n gpu-test${example} ${pod} -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
      else
        kubectl logs -n gpu-test${example} ${pod} -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
      fi
    done
  done
  echo ""
done
```
This should produce output similar to the following:
```shell
gpu-test1:
pod0 ctr0:
declare -x GPU_DEVICE_6="gpu-6"
pod1 ctr0:
declare -x GPU_DEVICE_7="gpu-7"

gpu-test2:
pod0 ctr0:
declare -x GPU_DEVICE_0="gpu-0"
declare -x GPU_DEVICE_1="gpu-1"

gpu-test3:
pod0 ctr0:
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2_TIMESLICE_INTERVAL="Default"
pod0 ctr1:
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2_TIMESLICE_INTERVAL="Default"

gpu-test4:
pod0 ctr0:
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"
pod1 ctr0:
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"

gpu-test5:
pod0 ts-ctr0:
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_4_TIMESLICE_INTERVAL="Long"
pod0 ts-ctr1:
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_4_TIMESLICE_INTERVAL="Long"
pod0 sp-ctr0:
declare -x GPU_DEVICE_5="gpu-5"
declare -x GPU_DEVICE_5_PARTITION_COUNT="10"
declare -x GPU_DEVICE_5_SHARING_STRATEGY="SpacePartitioning"
pod0 sp-ctr1:
declare -x GPU_DEVICE_5="gpu-5"
declare -x GPU_DEVICE_5_PARTITION_COUNT="10"
declare -x GPU_DEVICE_5_SHARING_STRATEGY="SpacePartitioning"
```