operator deploy on cloud
catpineapple committed Jan 23, 2025
1 parent a763109 commit 467d8ad
Showing 9 changed files with 721 additions and 423 deletions.
105 changes: 105 additions & 0 deletions docs/ecosystem/doris-operator/doris-operator-overview.md
@@ -0,0 +1,105 @@
---
{
"title": "Doris Kubernetes Operator",
"language": "en"
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

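The [Kubernetes Operator](https://github.com/apache/doris-operator) for Doris (Doris Operator for short) was created to meet users' needs for efficient deployment and operation of Doris on Kubernetes.
It integrates the management of native Kubernetes resources with accumulated experience in distributed coordination among Doris components and on-demand customization of cluster topologies, giving users a simpler, more efficient, and easier-to-use containerized deployment solution.
It aims to manage Doris efficiently on Kubernetes, reducing operational and learning costs while still providing powerful features and flexible configuration.

Doris Operator uses Kubernetes CustomResourceDefinitions (CRDs) to configure, manage, and schedule Doris on Kubernetes. Based on the desired state defined by the user, Doris Operator automatically creates Pods and other resources to start the services. Through an automatic registration mechanism, all started services are joined into one complete Doris cluster. This significantly reduces the complexity and learning cost of operations that are essential in production, such as handling configuration, node discovery and registration, access and communication, and health checks.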

## Doris Operator Architecture

The design of Doris Operator is based on the principle of a two-layer scheduler. The first-layer scheduling of each component uses native StatefulSet and Service resources to directly manage the corresponding Pod service, which makes it fully compatible with open source Kubernetes clusters, including public clouds, private clouds, and self-built Kubernetes platforms.

Based on the deployment definition provided by Doris Operator, users can customize the Doris deployment state and send it to the Kubernetes cluster through the kubectl management command of Kubernetes. Doris Operator converts the deployment of each service into StatefulSet and its affiliated resources (such as Service) according to the customized state, and then schedules the desired Pods through StatefulSet. It simplifies unnecessary configuration in the StatefulSet specification by abstracting the final state of the Doris cluster, thereby reducing the user's learning cost.
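As a minimal sketch of this flow, the script below writes a small DorisCluster manifest to a file and defines a helper that sends it to the cluster with kubectl. The API group and version `doris.selectdb.com/v1`, the resource name `doriscluster-sample`, the image tags, and the file name are assumptions for illustration; check them against the CRD version installed in your cluster.

```shell
#!/bin/bash
# Sketch: describe the desired Doris cluster state in a manifest file.
# Assumed for illustration: API group/version, resource name, image tags, file name.
manifest="doriscluster-sample.yaml"

cat > "$manifest" <<'EOF'
apiVersion: doris.selectdb.com/v1
kind: DorisCluster
metadata:
  name: doriscluster-sample
spec:
  feSpec:
    replicas: 3
    image: apache/doris:fe-3.0.3
  beSpec:
    replicas: 3
    image: apache/doris:be-3.0.3
EOF

# Send the desired state to Kubernetes. Doris Operator then converts it into
# StatefulSets and Services for each component.
apply_doris_cluster() {
    kubectl apply -f "$1"
}
```

Calling `apply_doris_cluster "$manifest"` submits the manifest. It requires kubectl access to a cluster in which the Doris Operator CRD and controller are already installed.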

## Key capabilities

- **Final state deployment**:

Kubernetes manages services declaratively, by their desired (final) state, and Doris Operator defines a resource type, DorisCluster, that describes a Doris cluster. Users can refer to the relevant documents and usage examples to configure the cluster they need.
Users send the configuration to the Kubernetes cluster with the Kubernetes command-line tool kubectl. Doris Operator then builds the required cluster automatically and writes the cluster status back to the corresponding resource in real time. This keeps cluster management and monitoring efficient and greatly simplifies operation and maintenance.

- **Easy to expand**:

Doris Operator supports concurrent, real-time horizontal scaling in environments backed by cloud disks. All Doris component services are deployed and managed through Kubernetes StatefulSets. When deploying or scaling out, Pods are created in the StatefulSet's Parallel mode, so in theory all replicas can start within the time it takes to start one node. Replicas start independently of each other, so a replica that fails to start does not affect the others.
Because Doris Operator starts services concurrently and Doris has a built-in distributed architecture, scaling out is greatly simplified: users only need to set the number of replicas to complete the expansion.

- **Unnoticeable changes**:

In a distributed environment, restarting a service can cause temporary instability. For services such as databases, which have extremely high stability requirements, keeping the service stable during a restart is especially important. Doris uses the following three mechanisms on Kubernetes to keep restarts stable, so that restarts and upgrades are imperceptible to the business:

1. Graceful exit
2. Rolling restart
3. Actively stop query allocation

- **Host system configuration**:

In some scenarios, host system parameters must be configured for Apache Doris to reach its ideal performance. In containerized deployments, it is uncertain which host a Pod lands on and host parameters are hard to modify, which is a challenge for users. To solve this problem, Doris Operator uses Kubernetes init containers to make host parameters configurable.
Doris Operator lets users specify commands to be executed on the host and applies them through an init container. To improve usability, Doris Operator abstracts the configuration of Kubernetes init containers, making host commands simpler and more intuitive to set.

- **Persistent configuration**:

Doris Operator uses the Kubernetes StorageClass mode to provide storage configuration for each service. It allows users to customize the mount directory. When customizing the startup configuration, if the storage directory is modified, the directory can be set as a persistent location in the custom resource, so that the service uses the specified directory in the container to store data.

- **Runtime debugging**:

One of the biggest challenges in troubleshooting containerized services is debugging at runtime. While pursuing availability and ease of use, Doris Operator also makes it easier to locate problems. The Doris base image ships with a variety of pre-installed troubleshooting tools. When you need to check the status in real time, you can enter the container with the `exec` command provided by kubectl and use the built-in tools.
When a service cannot start for unknown reasons, Doris Operator provides a Debug running mode. When a Pod is set to Debug startup mode, the container enters the running state automatically. You can then enter the container with the `exec` command, start the service manually, and locate the problem. For details, please refer to [this document](../../install/cluster-deployment/k8s-deploy/compute-storage-coupled/cluster-operation.md#How to enter the container in the case of service-crash).
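
For the scaling capability described under "Easy to expand", changing the replica count on the DorisCluster resource is enough. The following is a minimal sketch, assuming a resource named `doriscluster-sample` in a namespace named `doris` (both hypothetical) and a component spec key such as `feSpec` or `beSpec` with a `replicas` field:

```shell
#!/bin/bash
# Build a JSON merge patch that sets the replica count of one component.
# $1: component spec key (for example feSpec or beSpec), $2: desired replicas.
build_scale_patch() {
    printf '{"spec":{"%s":{"replicas":%d}}}' "$1" "$2"
}

# Apply the patch to a DorisCluster resource.
# The resource name and namespace below are hypothetical placeholders.
scale_doris_component() {
    kubectl patch doriscluster doriscluster-sample -n doris \
        --type merge -p "$(build_scale_patch "$1" "$2")"
}
```

For example, `scale_doris_component beSpec 5` would request five BE replicas. The patch only changes the desired state; creating the additional Pods is left to Doris Operator.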

## Compatibility

Doris Operator is developed in accordance with standard K8s specifications and is compatible with all standard K8s platforms, including those provided by mainstream cloud vendors and self-built platforms that follow the standard.

### Cloud vendor compatibility

Fully compatible with the containerized service platforms of mainstream cloud vendors. For environment preparation and usage suggestions for Doris Operator, please refer to the following documents:

- [Alibaba Cloud](./on-alibaba)

- [AWS](./on-aws)

## Installation and management

### Prerequisites

Before deployment, you need to check the host system. Refer to [Operating System Check](../../install/preparation/os-checking.md).

### Deploy Doris Operator

Before deploying Doris Operator on Kubernetes, you need to install Doris Operator CRD and Doris Operator management components.

* For detailed installation documents, please refer to: [Doris Operator Installation](../../install/cluster-deployment/k8s-deploy/compute-storage-coupled/install-doris-operator.md)

### Deploy Doris cluster

* For cluster configuration documents, please refer to: [Doris Operator Cluster Configuration](../../install/cluster-deployment/k8s-deploy/compute-storage-coupled/install-config-cluster.md)
* For installation documents, please refer to: [Doris Cluster Installation](../../install/cluster-deployment/k8s-deploy/compute-storage-coupled/install-doris-cluster.md)

### Cluster operation and maintenance

* For cluster operation and maintenance documents, please refer to: [Doris Operator Cluster Operation](../../install/cluster-deployment/k8s-deploy/compute-storage-coupled/cluster-operation.md)
* For cluster access documents, please refer to: [Doris Operator Cluster Access](../../install/cluster-deployment/k8s-deploy/compute-storage-coupled/access-cluster.md)

158 changes: 158 additions & 0 deletions docs/ecosystem/doris-operator/on-alibaba.md
@@ -0,0 +1,158 @@
---
{
"title": "Recommendations on Alibaba Cloud",
"language": "en"
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

## Alibaba ACK

Alibaba Cloud Container Service for Kubernetes (ACK) is a managed container service that runs on ECS instances you purchase, so you have full access to the nodes and can adjust the relevant system parameters. Use the Alibaba Cloud Linux 3 instance image. Its default system parameters meet most of the requirements for running Doris, and any that do not can be corrected from inside the container through K8s privileged mode to ensure stable operation.
**When a Doris cluster is deployed on ACK with Doris Operator, most environment requirements are met by the default ECS configuration, and Doris Operator can correct the rest by itself.** Users can also correct them manually, as follows:

### Existing cluster

If the Container Service cluster has already been created, you can adjust it by referring to this document: [Cluster Environment OS Checking](../../install/preparation/os-checking.md).
Focus on the BE startup parameter requirements:

1. Disable swap: `swapon --show` prints nothing when swap is disabled.
2. Check the maximum number of open file handles: `ulimit -n`.
3. Check and, if necessary, modify the number of virtual memory areas: `sysctl vm.max_map_count`.
4. Check that transparent huge pages are disabled: `cat /sys/kernel/mm/transparent_hugepage/enabled` should show `[never]` as the selected value.

The default values of these parameters are as follows:

```shell
[root@iZj6c12a1czxk5oer9rbp8Z ~]# swapon --show
[root@iZj6c12a1czxk5oer9rbp8Z ~]# ulimit -n
65535
[root@iZj6c12a1czxk5oer9rbp8Z ~]# sysctl vm.max_map_count
vm.max_map_count = 262144
[root@iZj6c12a1czxk5oer9rbp8Z ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
```
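
The four checks above can be sketched as a small set of shell helpers. This is an illustration rather than an official tool: the function names are hypothetical, and the thresholds (1000000 open files, 2000000 memory map areas) are taken from the reference script in this document.

```shell
#!/bin/bash
# Succeeds when the THP setting selects "never", e.g. "always madvise [never]".
# The kernel marks the active value with square brackets.
thp_disabled() {
    case "$1" in
        *"[never]"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when swap is off: `swapon --show` prints nothing in that case.
swap_disabled() {
    [ -z "$1" ]
}

# Succeeds when the value ($1) is "unlimited" or at least the minimum ($2).
at_least() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

# Run all four checks on the current host and report the settings that fail.
check_be_host() {
    swap_disabled "$(swapon --show)" || echo "swap is enabled"
    at_least "$(ulimit -n)" 1000000 || echo "open file limit is too low"
    at_least "$(sysctl -n vm.max_map_count)" 2000000 || echo "vm.max_map_count is too low"
    thp_disabled "$(cat /sys/kernel/mm/transparent_hugepage/enabled)" \
        || echo "transparent huge pages are enabled"
}
```

Running `check_be_host` on a node prints one line per setting that still needs to be changed and prints nothing when all four checks pass.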

### Create a new cluster

If you have not yet purchased and created a cluster, click "Create Cluster" in the Alibaba Cloud Container Service ACK console and adjust the configuration as needed. The parameters above can be set by adding a system adjustment script to "Instance Pre-customized Data" in the "Node Pool Configuration" step of cluster creation.
After the cluster starts, restart the nodes to complete the configuration. A reference script is as follows:

```shell
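#!/bin/bash
# Make rc.local executable so the commands appended below run at every boot.
chmod +x /etc/rc.d/rc.local
# Stop and disable the firewall at boot.
echo "sudo systemctl stop firewalld.service" >> /etc/rc.d/rc.local
echo "sudo systemctl disable firewalld.service" >> /etc/rc.d/rc.local
# Raise the number of virtual memory areas and turn off swap at boot.
echo "sysctl -w vm.max_map_count=2000000" >> /etc/rc.d/rc.local
echo "swapoff -a" >> /etc/rc.d/rc.local
# Raise the open-file limit unless it already has the desired value.
current_limit=$(ulimit -n)
desired_limit=1000000
config_file="/etc/security/limits.conf"
# Compare as strings, because `ulimit -n` may print "unlimited".
if [ "$current_limit" != "$desired_limit" ]; then
    echo "* soft nofile $desired_limit" >> "$config_file"
    echo "* hard nofile $desired_limit" >> "$config_file"
fi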
```

## Alibaba ACS

Alibaba Cloud Container Compute Service (ACS) is a cloud computing service that uses K8s as its user interface and provides elastic container computing resources billed on demand. Unlike ACK above, you do not need to manage the underlying ECS instances.
Note the following points when using ACS:

### Image repository

When using ACS, it is recommended to use the accompanying Alibaba Cloud [Container Registry](https://www.alibabacloud.com/en/product/container-registry) (ACR). The personal and enterprise editions can be enabled on demand.

After configuring ACR and the image transfer environment, migrate the official images provided by Doris to the corresponding ACR.

If you use a private ACR with authentication enabled, follow these steps:

1. Create a `secret` of type `docker-registry` in advance to hold the credentials for accessing the image registry.

```shell
kubectl create secret docker-registry image-hub-secret --docker-server={your-server} --docker-username={your-username} --docker-password={your-pwd}
```

2. Reference the secret created in the previous step in the DorisCluster resource (DCR):

```yaml
spec:
  feSpec:
    replicas: 1
    image: crpi-4q6quaxa0ta96k7h-vpc.cn-hongkong.personal.cr.aliyuncs.com/selectdb-test/doris.fe-ubuntu:3.0.3
    imagePullSecrets:
    - name: image-hub-secret
  beSpec:
    replicas: 3
    image: crpi-4q6quaxa0ta96k7h-vpc.cn-hongkong.personal.cr.aliyuncs.com/selectdb-test/doris.be-ubuntu:3.0.3
    imagePullSecrets:
    - name: image-hub-secret
    systemInitialization:
      initImage: crpi-4q6quaxa0ta96k7h-vpc.cn-hongkong.personal.cr.aliyuncs.com/selectdb-test/alpine:latest
```
### BE systemInitialization

Alibaba Cloud is gradually rolling out the ability to enable privileged mode on the fully managed ACS service. Some regions may not have it yet; you can submit a ticket to request it.
Starting a Doris BE node requires some special environment parameters, for example raising the number of virtual memory areas with `sysctl -w vm.max_map_count=2000000`.
Setting this parameter from inside a container modifies the host configuration, so on a regular K8s cluster the pod must run in privileged mode. Doris Operator adds an `InitContainer` to the BE pod through `systemInitialization` to perform such operations.

:::tip Tip
**If the current cluster cannot use privileged mode, the BE node cannot be started.** In that case, you can deploy the cluster on the ACK container service with ECS hosts instead.
:::

### Service

Because ACS provides container computing resources through the K8s interface, its nodes are virtual computing resources that users do not need to manage. Resources are billed by usage and can scale elastically. In other words, there is no physical node in the conventional sense:

```shell
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
virtual-kubelet-cn-hongkong-d Ready agent 27h v1.31.1-aliyun.1
```

Therefore, when deploying a Doris cluster on ACS, the NodePort serviceType cannot be used. Only the ClusterIP and LoadBalancer modes are available.

- ClusterIP mode:

ClusterIP mode is the default network mode of Doris Operator. For specific usage and access methods, please refer to [this document](https://kubernetes.io/docs/concepts/services-networking/service/#type-clusterip).

- Load balancing mode:

It can be configured in either of the following ways:

- Configure LB access through the DCR service annotations provided by Doris Operator. The steps are as follows:
  1. Create a CLB or NLB instance in the load balancing console, in the same region as the ACK cluster. If you have not created one yet, see [Create and manage a CLB instance](https://www.alibabacloud.com/help/en/slb/classic-load-balancer/user-guide/create-and-manage-a-clb-instance) and [Create and manage an NLB instance](https://www.alibabacloud.com/help/en/slb/network-load-balancer/user-guide/create-and-manage-an-nlb-instance).
  2. Configure the access annotations of the LB in the DCR, in the following format:
```yaml
feSpec:
  replicas: 3
  image: crpi-4q6quaxa0ta96k7h-vpc.cn-hongkong.personal.cr.aliyuncs.com/selectdb-test/doris.fe-ubuntu:3.0.3
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "intranet"
```

- Host the LB service through the ACS console, generating a `service` bound to the StatefulSet that manages the corresponding FE or BE. The steps are as follows:
  1. Keep serviceType as ClusterIP (the default policy).
  2. Create a load balancing service in the Alibaba Cloud console: Container Compute Service ACS -> Cluster List -> Cluster -> Service, then click the `Create` button.
  3. On the `service` creation page, select the newly created LB. It is bound to the `service` and is deregistered together with it. Note that this `service` is not controlled by Doris Operator.
