Feat: add external scaler (openkruise#39)
* Feat: add external scaler

Signed-off-by: ChrisLiu <[email protected]>

* Add docs for autoscaling

Signed-off-by: ChrisLiu <[email protected]>

---------

Signed-off-by: ChrisLiu <[email protected]>
chrisliu1995 authored Apr 26, 2023
1 parent 1cb53de commit 7c072bc
Showing 16 changed files with 1,212 additions and 1 deletion.
2 changes: 2 additions & 0 deletions config/default/kustomization.yaml
@@ -24,6 +24,8 @@ bases:
# [PROMETHEUS] To enable prometheus monitor, uncomment all sections with 'PROMETHEUS'.
# - ../prometheus

- ../scaler

patchesStrategicMerge:
# Protect the /metrics endpoint by putting it behind auth.
# If you want your controller-manager to expose the /metrics
2 changes: 2 additions & 0 deletions config/scaler/kustomization.yaml
@@ -0,0 +1,2 @@
resources:
- service.yaml
12 changes: 12 additions & 0 deletions config/scaler/service.yaml
@@ -0,0 +1,12 @@
---
apiVersion: v1
kind: Service
metadata:
  name: external-scaler
  namespace: kruise-game-system
spec:
  ports:
  - port: 6000
    targetPort: 6000
  selector:
    control-plane: controller-manager
99 changes: 99 additions & 0 deletions docs/en/user_manuals/autoscale.md
@@ -0,0 +1,99 @@
## Feature overview

Compared with stateless workloads, game servers place much stricter demands on automatic scaling, and those demands are mostly about scaling down.

Because game servers are strongly stateful, the differences between them grow more pronounced over time, so scale-down must be extremely precise: a coarse-grained scaling mechanism can easily disconnect players and cause serious losses for the business.

The horizontal scaling mechanism in native Kubernetes is shown in the following figure:

![autoscaling-k8s-en.png](../../images/autoscaling-k8s-en.png)

In game scenarios, its main problems are:

- At the Pod level, it cannot perceive the game server's business state, so it cannot set deletion priority based on that state.
- At the workload level, it cannot select which game servers to remove based on their business state.
- At the autoscaler level, it cannot take the game servers' business state into account when calculating the appropriate number of replicas.

As a result, autoscaling based on native Kubernetes causes two major problems in game scenarios:

- The scale-down count is inaccurate: it is easy to delete too many or too few game servers.
- The scale-down targets are inaccurate: it is easy to delete game servers that are under heavy player load.


The automatic scaling mechanism of OKG is shown in the following figure:

![autoscaling-okg-en.png](../../images/autoscaling-okg-en.png)

- At the game server level, each game server can report its own state and expose whether it is in the WaitToBeDeleted state through custom service quality or an external component.
- At the workload level, the GameServerSet decides which game servers to remove based on the business state they report. As described in Game Server Horizontal Scaling, game servers in the WaitToBeDeleted state have the highest deletion priority and are removed first during scale-down.
- At the autoscaler level, the scaler accurately counts the game servers in the WaitToBeDeleted state and uses that count as the scale-down quantity, so no game server is deleted by mistake.

In this way, OKG's autoscaler deletes only game servers in the WaitToBeDeleted state during the scale-down window, achieving targeted and precise scale-down.
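Concretely, the external scaler (pkg/externalscaler in this commit) reports the GameServerSet's `currentReplicas - waitToBeDeletedReplicas` as the metric value, so KEDA converges the workload onto exactly the game servers that are not marked for deletion. A minimal sketch of that arithmetic, with hypothetical numbers rather than the scaler's actual code:

```go
package main

import "fmt"

// desiredReplicas mirrors the calculation the external scaler performs:
// game servers whose opsState is WaitToBeDeleted are excluded from the
// desired count, so only those servers are removed on scale-down.
func desiredReplicas(currentReplicas, waitToBeDeleted int32) int32 {
    return currentReplicas - waitToBeDeleted
}

func main() {
    // 5 game servers, 2 of them marked WaitToBeDeleted -> scale down to 3.
    fmt.Println(desiredReplicas(5, 2))
}
```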

## Usage Example

_**Prerequisites: Install [KEDA](https://keda.sh/docs/2.10/deploy/) in the cluster.**_

Deploy a ScaledObject to define the autoscaling policy; see the [ScaledObject API](https://github.com/kedacore/keda/blob/main/apis/keda/v1alpha1/scaledobject_types.go) for the meaning of each field.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: minecraft # Fill in the name of the corresponding GameServerSet
spec:
  scaleTargetRef:
    name: minecraft # Fill in the name of the corresponding GameServerSet
    apiVersion: game.kruise.io/v1alpha1
    kind: GameServerSet
  pollingInterval: 30
  minReplicaCount: 0
  advanced:
    horizontalPodAutoscalerConfig:
      behavior: # Inherit from HPA behavior, refer to https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior
        scaleDown:
          stabilizationWindowSeconds: 45 # Set the scaling-down stabilization window to 45 seconds
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
  triggers:
  - type: external
    metricType: Value
    metadata:
      scalerAddress: kruise-game-external-scaler.kruise-game-system:6000
```
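The `scalerAddress` above points at the external-scaler Service installed with OKG (port 6000 in the kruise-game-system namespace). If you want to query the scaler directly while debugging, a rough client sketch follows; it assumes the gRPC stubs in `pkg/externalscaler` use the standard protoc-generated names (`NewExternalScalerClient`, `ScaledObjectRef`, `GetMetricsRequest`) and that the address is reachable from wherever the program runs (for example through a port-forward), so treat it as a starting point rather than a supported tool.

```go
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/openkruise/kruise-game/pkg/externalscaler"
    "google.golang.org/grpc"
)

func main() {
    // Assumed address: the in-cluster Service DNS name; from a workstation you
    // would typically port-forward the Service and dial localhost instead.
    conn, err := grpc.Dial("kruise-game-external-scaler.kruise-game-system:6000", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    client := externalscaler.NewExternalScalerClient(conn)
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Ask for the current metric value of the targeted GameServerSet.
    // The namespace here is hypothetical; use the namespace of your ScaledObject.
    resp, err := client.GetMetrics(ctx, &externalscaler.GetMetricsRequest{
        ScaledObjectRef: &externalscaler.ScaledObjectRef{
            Name:      "minecraft",
            Namespace: "default",
        },
        MetricName: "gssReplicas",
    })
    if err != nil {
        log.Fatal(err)
    }
    for _, mv := range resp.MetricValues {
        fmt.Printf("%s = %d\n", mv.MetricName, mv.MetricValue)
    }
}
```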

After deployment, change the opsState of the GameServer minecraft-0 to WaitToBeDeleted (see [Custom Service Quality](service_qualities.md) for setting game server states automatically); a programmatic sketch follows the snippet below.

```bash
kubectl edit gs minecraft-0

...
spec:
  deletionPriority: 0
  opsState: WaitToBeDeleted # Set to None initially, and change it to WaitToBeDeleted
  updatePriority: 0
...

```
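Editing interactively is fine for a quick test; for automation the same change can be made from code. The sketch below uses controller-runtime and assumes the `GameServer` type in `apis/v1alpha1` exposes `Spec.OpsState` and an `AddToScheme` helper, as the field names in the YAML above suggest — verify against the actual API before relying on it.

```go
package main

import (
    "context"
    "log"

    gamekruiseiov1alpha1 "github.com/openkruise/kruise-game/apis/v1alpha1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
    scheme := runtime.NewScheme()
    _ = clientgoscheme.AddToScheme(scheme)
    _ = gamekruiseiov1alpha1.AddToScheme(scheme)

    // Build a client from the local kubeconfig (or in-cluster config).
    c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
    if err != nil {
        log.Fatal(err)
    }

    // Namespace and name are hypothetical; point them at your own GameServer.
    gs := &gamekruiseiov1alpha1.GameServer{}
    if err := c.Get(context.TODO(), types.NamespacedName{Namespace: "default", Name: "minecraft-0"}, gs); err != nil {
        log.Fatal(err)
    }

    // Mark the server for deletion; the external scaler counts it on the next poll.
    gs.Spec.OpsState = gamekruiseiov1alpha1.OpsState("WaitToBeDeleted")
    if err := c.Update(context.TODO(), gs); err != nil {
        log.Fatal(err)
    }
}
```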

After the scaling-down window period, the game server minecraft-0 is automatically deleted.

```bash
kubectl get gs
NAME          STATE      OPSSTATE          DP    UP
minecraft-0   Deleting   WaitToBeDeleted   0     0
minecraft-1   Ready      None              0     0
minecraft-2   Ready      None              0     0

# After a while
...

kubectl get gs
NAME          STATE      OPSSTATE          DP    UP
minecraft-1   Ready      None              0     0
minecraft-2   Ready      None              0     0

```
Binary file added docs/images/autoscaling-k8s-en.png
Binary file added docs/images/autoscaling-k8s.png
Binary file added docs/images/autoscaling-okg-en.png
Binary file added docs/images/autoscaling-okg.png
94 changes: 94 additions & 0 deletions docs/中文/用户手册/自动伸缩.md
@@ -0,0 +1,94 @@
## Feature overview

Unlike stateless workloads, game servers place stricter demands on automatic scaling, and those demands are mostly about scaling down.

Because games are strongly stateful workloads, the differences between game servers grow more pronounced over time, so scale-down must be extremely precise: a coarse-grained scaling mechanism can easily disconnect players and cause serious losses for the business.

The horizontal scaling mechanism in native Kubernetes is shown in the figure below:

![autoscaling-k8s.png](../../images/autoscaling-k8s.png)

In game scenarios, its main problems are:

- At the Pod level, it cannot perceive the game server's business state, so it cannot set deletion priority based on that state.
- At the workload level, it cannot select which game servers to remove based on their business state.
- At the autoscaler level, it cannot take the game servers' business state into account when calculating the appropriate number of replicas.

As a result, autoscaling based on native Kubernetes causes two major problems in game scenarios:

- The scale-down count is inaccurate: it is easy to delete too many or too few game servers.
- The scale-down targets are inaccurate: it is easy to delete game servers that are under heavy player load.

The OKG autoscaling mechanism is shown below:

![autoscaling-okg.png](../../images/autoscaling-okg.png)

- At the game server level, each game server can report its own state and expose whether it is in the WaitToBeDeleted state through custom service quality or an external component.
- At the workload level, the GameServerSet decides which game servers to remove based on the business state they report. As described in [GameServer Horizontal Scaling](../快速开始/游戏服水平伸缩.md), game servers in the WaitToBeDeleted state have the highest deletion priority and are removed first during scale-down.
- At the autoscaler level, the scaler accurately counts the game servers in the WaitToBeDeleted state and uses that count as the scale-down quantity, so no game server is deleted by mistake.

In this way, OKG's autoscaler deletes only game servers in the WaitToBeDeleted state during the scale-down window, achieving targeted and precise scale-down.

## Usage Example

_**Prerequisite: install [KEDA](https://keda.sh/docs/2.10/deploy/) in the cluster.**_

Deploy a ScaledObject to define the autoscaling policy; see the [ScaledObject API](https://github.com/kedacore/keda/blob/main/apis/keda/v1alpha1/scaledobject_types.go) for the meaning of each field.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: minecraft # Fill in the name of the corresponding GameServerSet
spec:
  scaleTargetRef:
    name: minecraft # Fill in the name of the corresponding GameServerSet
    apiVersion: game.kruise.io/v1alpha1
    kind: GameServerSet
  pollingInterval: 30
  minReplicaCount: 0
  advanced:
    horizontalPodAutoscalerConfig:
      behavior: # Inherit from HPA behavior, refer to https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior
        scaleDown:
          stabilizationWindowSeconds: 45 # Set the scale-down stabilization window to 45 seconds
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
  triggers:
  - type: external
    metricType: Value
    metadata:
      scalerAddress: kruise-game-external-scaler.kruise-game-system:6000
```
After deployment, change the opsState of the GameServer minecraft-0 to WaitToBeDeleted (see [Custom Service Quality](自定义服务质量.md) for setting game server states automatically).

```bash
kubectl edit gs minecraft-0

...
spec:
  deletionPriority: 0
  opsState: WaitToBeDeleted # Initially None; change it to WaitToBeDeleted
  updatePriority: 0
...
```

After the scale-down window passes, the game server minecraft-0 is deleted automatically.
```bash
kubectl get gs
NAME          STATE      OPSSTATE          DP    UP
minecraft-0   Deleting   WaitToBeDeleted   0     0
minecraft-1   Ready      None              0     0
minecraft-2   Ready      None              0     0

# After a while
...

kubectl get gs
NAME          STATE      OPSSTATE          DP    UP
minecraft-1   Ready      None              0     0
minecraft-2   Ready      None              0     0
```
4 changes: 3 additions & 1 deletion go.mod
@@ -8,6 +8,8 @@ require (
github.com/onsi/ginkgo v1.16.5
github.com/onsi/gomega v1.18.1
github.com/openkruise/kruise-api v1.3.0
google.golang.org/grpc v1.40.0
google.golang.org/protobuf v1.27.1
k8s.io/api v0.24.0
k8s.io/apimachinery v0.24.0
k8s.io/client-go v0.24.0
@@ -75,7 +77,7 @@ require (
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
gomodules.xyz/jsonpatch/v2 v2.2.0 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/protobuf v1.27.1 // indirect
google.golang.org/genproto v0.0.0-20220107163113-42d7afdf6368 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
2 changes: 2 additions & 0 deletions go.sum
@@ -865,6 +865,7 @@ google.golang.org/genproto v0.0.0-20210319143718-93e7006c17a6/go.mod h1:FWY/as6D
google.golang.org/genproto v0.0.0-20210402141018-6c239bbf2bb1/go.mod h1:9lPAdzaEmUacj36I+k7YKbEc5CXzPIeORRgDAUOu28A=
google.golang.org/genproto v0.0.0-20210602131652-f16073e35f0c/go.mod h1:UODoCrxHCcBojKKwX1terBiRUaqAsFqJiF615XL43r0=
google.golang.org/genproto v0.0.0-20210831024726-fe130286e0e2/go.mod h1:eFjDcFEctNawg4eG61bRv87N7iHBWyVhJu7u1kqDUXY=
google.golang.org/genproto v0.0.0-20220107163113-42d7afdf6368 h1:Et6SkiuvnBn+SgrSYXs/BrUpGB4mbdwt4R3vaPIlicA=
google.golang.org/genproto v0.0.0-20220107163113-42d7afdf6368/go.mod h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc=
google.golang.org/grpc v1.19.0/go.mod h1:mqu4LbDTu4XGKhr4mRzUsmM4RtVoemTSY81AxZiDr8c=
google.golang.org/grpc v1.20.1/go.mod h1:10oTOabMzJvdu6/UiuZezV6QK5dSlG84ov/aaiqXj38=
@@ -887,6 +888,7 @@ google.golang.org/grpc v1.36.0/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAG
google.golang.org/grpc v1.36.1/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAGRRjU=
google.golang.org/grpc v1.37.0/go.mod h1:NREThFqKR1f3iQ6oBuvc5LadQuXVGo9rkm5ZGrQdJfM=
google.golang.org/grpc v1.38.0/go.mod h1:NREThFqKR1f3iQ6oBuvc5LadQuXVGo9rkm5ZGrQdJfM=
google.golang.org/grpc v1.40.0 h1:AGJ0Ih4mHjSeibYkFGh1dD9KJ/eOtZ93I6hoHhukQ5Q=
google.golang.org/grpc v1.40.0/go.mod h1:ogyxbiOoUXAkP+4+xa6PZSE9DZgIHtSpzjDTB9KAK34=
google.golang.org/protobuf v0.0.0-20200109180630-ec00e32a8dfd/go.mod h1:DFci5gLYBciE7Vtevhsrf46CRTquxDuWsQurQQe4oz8=
google.golang.org/protobuf v0.0.0-20200221191635-4d8936d0db64/go.mod h1:kwYJMbMJ01Woi6D6+Kah6886xMZcty6N08ah7+eCXa0=
16 changes: 16 additions & 0 deletions main.go
@@ -23,7 +23,10 @@ import (
"github.com/openkruise/kruise-game/cloudprovider"
cpmanager "github.com/openkruise/kruise-game/cloudprovider/manager"
controller "github.com/openkruise/kruise-game/pkg/controllers"
"github.com/openkruise/kruise-game/pkg/externalscaler"
"github.com/openkruise/kruise-game/pkg/webhook"
"google.golang.org/grpc"
"net"
"os"
"time"

@@ -66,6 +69,7 @@ func main() {
    var probeAddr string
    var namespace string
    var syncPeriodStr string
    var scaleServerAddr string
    flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
    flag.StringVar(&probeAddr, "health-probe-bind-address", ":8082", "The address the probe endpoint binds to.")
    flag.BoolVar(&enableLeaderElection, "leader-elect", false,
@@ -74,6 +78,7 @@
    flag.StringVar(&namespace, "namespace", "",
        "Namespace if specified restricts the manager's cache to watch objects in the desired namespace. Defaults to all namespaces.")
    flag.StringVar(&syncPeriodStr, "sync-period", "", "Determines the minimum frequency at which watched resources are reconciled.")
    flag.StringVar(&scaleServerAddr, "scale-server-bind-address", ":6000", "The address the scale server endpoint binds to.")

    // Add cloud provider flags
    cloudprovider.InitCloudProviderFlags()
@@ -164,6 +169,17 @@ func main() {
        }
    }()

    externalScaler := externalscaler.NewExternalScaler(mgr.GetClient())
    go func() {
        grpcServer := grpc.NewServer()
        lis, _ := net.Listen("tcp", scaleServerAddr)
        externalscaler.RegisterExternalScalerServer(grpcServer, externalScaler)
        if err := grpcServer.Serve(lis); err != nil {
            setupLog.Error(err, "unable to setup ExternalScalerServer")
            os.Exit(1)
        }
    }()

    setupLog.Info("starting kruise-game-manager")

    if err := mgr.Start(signal); err != nil {
75 changes: 75 additions & 0 deletions pkg/externalscaler/externalscaler.go
@@ -0,0 +1,75 @@
package externalscaler

import (
    "context"
    "fmt"
    gamekruiseiov1alpha1 "github.com/openkruise/kruise-game/apis/v1alpha1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/klog/v2"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type ExternalScaler struct {
    client client.Client
}

func (e *ExternalScaler) mustEmbedUnimplementedExternalScalerServer() {
}

func (e *ExternalScaler) IsActive(ctx context.Context, scaledObject *ScaledObjectRef) (*IsActiveResponse, error) {
    return &IsActiveResponse{
        Result: true,
    }, nil
}

func (e *ExternalScaler) StreamIsActive(scaledObject *ScaledObjectRef, epsServer ExternalScaler_StreamIsActiveServer) error {
    return nil
}

func (e *ExternalScaler) GetMetricSpec(ctx context.Context, scaledObjectRef *ScaledObjectRef) (*GetMetricSpecResponse, error) {
    name := scaledObjectRef.GetName()
    ns := scaledObjectRef.GetNamespace()
    gss := &gamekruiseiov1alpha1.GameServerSet{}
    err := e.client.Get(ctx, types.NamespacedName{Namespace: ns, Name: name}, gss)
    if err != nil {
        klog.Error(err)
        return nil, err
    }
    desireReplicas := gss.Spec.Replicas
    klog.Infof("GameServerSet %s/%s TargetSize is %d", ns, name, *desireReplicas)
    return &GetMetricSpecResponse{
        MetricSpecs: []*MetricSpec{{
            MetricName: "gssReplicas",
            TargetSize: int64(*desireReplicas),
        }},
    }, nil
}

func (e *ExternalScaler) GetMetrics(ctx context.Context, metricRequest *GetMetricsRequest) (*GetMetricsResponse, error) {
    name := metricRequest.ScaledObjectRef.GetName()
    ns := metricRequest.ScaledObjectRef.GetNamespace()
    gss := &gamekruiseiov1alpha1.GameServerSet{}
    err := e.client.Get(ctx, types.NamespacedName{Namespace: ns, Name: name}, gss)
    if err != nil {
        klog.Error(err)
        return nil, err
    }
    currentReplicas := gss.Status.CurrentReplicas
    numWaitToBeDeleted := gss.Status.WaitToBeDeletedReplicas
    if numWaitToBeDeleted == nil || currentReplicas == 0 {
        return nil, fmt.Errorf("GameServerSet %s/%s has not inited", ns, name)
    }
    klog.Infof("GameServerSet %s/%s desire replicas is %d", ns, name, currentReplicas-*numWaitToBeDeleted)
    return &GetMetricsResponse{
        MetricValues: []*MetricValue{{
            MetricName:  "gssReplicas",
            MetricValue: int64(currentReplicas - *numWaitToBeDeleted),
        }},
    }, nil
}

func NewExternalScaler(client client.Client) *ExternalScaler {
    return &ExternalScaler{
        client: client,
    }
}