From e59a29450a886c8d998b61158d8d77a4b44248e4 Mon Sep 17 00:00:00 2001 From: Jialin Ma <281648921@qq.com> Date: Tue, 10 Dec 2024 17:03:58 +0800 Subject: [PATCH] New Table Model Deployment and Operations Document --- .../Cluster-Deployment_timecho.md | 377 ++++++++++ .../Database-Resources.md | 194 +++++ .../Environment-Requirements.md | 191 +++++ .../IoTDB-Package_timecho.md | 42 ++ .../Monitoring-panel-deployment.md | 680 +++++++++++++++++ .../Stand-Alone-Deployment_timecho.md | 244 +++++++ .../Cluster-Deployment_timecho.md | 362 ++++++++++ .../Database-Resources.md | 193 +++++ .../Environment-Requirements.md | 205 ++++++ .../IoTDB-Package_timecho.md | 45 ++ .../Monitoring-panel-deployment.md | 682 ++++++++++++++++++ .../Stand-Alone-Deployment_timecho.md | 217 ++++++ 12 files changed, 3432 insertions(+) create mode 100644 src/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md create mode 100644 src/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md create mode 100644 src/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md create mode 100644 src/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md create mode 100644 src/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md create mode 100644 src/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md create mode 100644 src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md create mode 100644 src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md create mode 100644 src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md create mode 100644 src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md create mode 100644 src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md create mode 100644 src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md diff --git a/src/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md new file mode 100644 index 00000000..19e2f6f6 --- /dev/null +++ b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md @@ -0,0 +1,377 @@ + +# Cluster Deployment + +This section describes how to manually deploy an instance that includes 3 ConfigNodes and 3 DataNodes, commonly known as a 3C3D cluster. + +
+ +
+ + +## Note + +1. Before installation, ensure that the system is complete by referring to [System Requirements](./Environment-Requirements.md) + +2. It is recommended to prioritize using `hostname` for IP configuration during deployment, which can avoid the problem of modifying the host IP in the later stage and causing the database to fail to start. To set the host name, you need to configure /etc/hosts on the target server. For example, if the local IP is 192.168.1.3 and the host name is iotdb-1, you can use the following command to set the server's host name and configure the `cn_internal_address` and `dn_internal_address` of IoTDB using the host name. + + ``` shell + echo "192.168.1.3 iotdb-1" >> /etc/hosts + ``` + +3. Some parameters cannot be modified after the first startup. Please refer to the "Parameter Configuration" section below for settings. + +4. Whether in linux or windows, ensure that the IoTDB installation path does not contain Spaces and Chinese characters to avoid software exceptions. + +5. Please note that when installing and deploying IoTDB (including activating and using software), it is necessary to use the same user for operations. You can: + +- Using root user (recommended): Using root user can avoid issues such as permissions. +- Using a fixed non root user: + - Using the same user operation: Ensure that the same user is used for start, activation, stop, and other operations, and do not switch users. + - Avoid using sudo: Try to avoid using sudo commands as they execute commands with root privileges, which may cause confusion or security issues. + +6. It is recommended to deploy a monitoring panel, which can monitor important operational indicators and keep track of database operation status at any time. The monitoring panel can be obtained by contacting the business department,The steps for deploying a monitoring panel can refer to:[Monitoring Panel Deployment](./Monitoring-panel-deployment.md) + +## Preparation Steps + +1. Prepare the IoTDB database installation package: timechodb-{version}-bin.zip(The installation package can be obtained from:[IoTDB-Package](./IoTDB-Package_timecho.md)) +2. Configure the operating system environment according to environmental requirements(The system environment configuration can be found in:[Environment Requirement](./Environment-Requirements.md)) + +## Installation Steps + +Assuming there are three Linux servers now, the IP addresses and service roles are assigned as follows: + +| Node IP | Host Name | Service | +| ------------- | --------- | -------------------- | +| 11.101.17.224 | iotdb-1 | ConfigNode、DataNode | +| 11.101.17.225 | iotdb-2 | ConfigNode、DataNode | +| 11.101.17.226 | iotdb-3 | ConfigNode、DataNode | + +### Set Host Name + +On three machines, configure the host names separately. To set the host names, configure `/etc/hosts` on the target server. 
Use the following command:

```Bash
echo "11.101.17.224 iotdb-1" >> /etc/hosts
echo "11.101.17.225 iotdb-2" >> /etc/hosts
echo "11.101.17.226 iotdb-3" >> /etc/hosts
```

### Configuration

Unzip the installation package and enter the installation directory:

```Plain
unzip timechodb-{version}-bin.zip
cd timechodb-{version}-bin
```

#### Environment script configuration

- `./conf/confignode-env.sh` configuration

  | **Configuration** | **Description** | **Default** | **Recommended value** | **Note** |
  | :---------------- | :----------------------------------------------------------- | :---------- | :----------------------------------------------------------- | :---------------------------------- |
  | MEMORY_SIZE | The total amount of memory that the IoTDB ConfigNode can use | - | Fill in as needed; the system will allocate memory based on this value | Takes effect after restarting the service |

- `./conf/datanode-env.sh` configuration

  | **Configuration** | **Description** | **Default** | **Recommended value** | **Note** |
  | :---------------- | :----------------------------------------------------------- | :---------- | :----------------------------------------------------------- | :---------------------------------- |
  | MEMORY_SIZE | The total amount of memory that the IoTDB DataNode can use | - | Fill in as needed; the system will allocate memory based on this value | Takes effect after restarting the service |

#### General Configuration (./conf/iotdb-system.properties)

- Cluster Configuration

  | **Configuration** | **Description** | 11.101.17.224 | 11.101.17.225 | 11.101.17.226 |
  | ------------------------- | ------------------------------------------------------------ | -------------- | -------------- | -------------- |
  | cluster_name | Cluster name | defaultCluster | defaultCluster | defaultCluster |
  | schema_replication_factor | The number of metadata replicas; the number of DataNodes should not be less than this value | 3 | 3 | 3 |
  | data_replication_factor | The number of data replicas; the number of DataNodes should not be less than this value | 2 | 2 | 2 |

#### ConfigNode Configuration

| **Configuration** | **Description** | **Default** | **Recommended value** | 11.101.17.224 | 11.101.17.225 | 11.101.17.226 | Note |
| ------------------- | ------------------------------------------------------------ | --------------- | ------------------------------------------------------------ | ------------- | ------------- | ------------- | ---------------------------------------- |
| cn_internal_address | The address used by ConfigNode for communication within the cluster | 127.0.0.1 | The IPv4 address or host name of the server; using the host name is recommended | iotdb-1 | iotdb-2 | iotdb-3 | Cannot be modified after initial startup |
| cn_internal_port | The port used by ConfigNode for communication within the cluster | 10710 | 10710 | 10710 | 10710 | 10710 | Cannot be modified after initial startup |
| cn_consensus_port | The port used for ConfigNode replica group consensus protocol communication | 10720 | 10720 | 10720 | 10720 | 10720 | Cannot be modified after initial startup |
| cn_seed_config_node | The address of the ConfigNode that the node connects to when registering to join the cluster, i.e. `cn_internal_address:cn_internal_port` | 127.0.0.1:10710 | The first ConfigNode's `cn_internal_address:cn_internal_port` | iotdb-1:10710 | iotdb-1:10710 | iotdb-1:10710 | Cannot be modified after initial startup |

#### DataNode Configuration

| **Configuration** | **Description** | **Default** | **Recommended value** | 11.101.17.224 | 11.101.17.225 | 11.101.17.226 | Note |
| ------------------------------- | ------------------------------------------------------------ | --------------- | ------------------------------------------------------------ | ------------- | ------------- | ------------- | ---------------------------------------- |
| dn_rpc_address | The address of the client RPC service | 127.0.0.1 | The IPv4 address or host name of the server; using the host name is recommended | iotdb-1 | iotdb-2 | iotdb-3 | Takes effect after restarting the service |
| dn_rpc_port | The port of the client RPC service | 6667 | 6667 | 6667 | 6667 | 6667 | Takes effect after restarting the service |
| dn_internal_address | The address used by DataNode for communication within the cluster | 127.0.0.1 | The IPv4 address or host name of the server; using the host name is recommended | iotdb-1 | iotdb-2 | iotdb-3 | Cannot be modified after initial startup |
| dn_internal_port | The port used by DataNode for communication within the cluster | 10730 | 10730 | 10730 | 10730 | 10730 | Cannot be modified after initial startup |
| dn_mpp_data_exchange_port | The port used by DataNode to receive data streams | 10740 | 10740 | 10740 | 10740 | 10740 | Cannot be modified after initial startup |
| dn_data_region_consensus_port | The port used by DataNode for data replica consensus protocol communication | 10750 | 10750 | 10750 | 10750 | 10750 | Cannot be modified after initial startup |
| dn_schema_region_consensus_port | The port used by DataNode for metadata replica consensus protocol communication | 10760 | 10760 | 10760 | 10760 | 10760 | Cannot be modified after initial startup |
| dn_seed_config_node | The address of the ConfigNode that the node connects to when registering to join the cluster, i.e. `cn_internal_address:cn_internal_port` | 127.0.0.1:10710 | The first ConfigNode's `cn_internal_address:cn_internal_port` | iotdb-1:10710 | iotdb-1:10710 | iotdb-1:10710 | Cannot be modified after initial startup |

> ❗️Attention: Editors such as VSCode Remote do not save files automatically. Please make sure the modified files are saved persistently, otherwise the configuration items will not take effect.

### Start ConfigNode

Start the ConfigNode on iotdb-1 first, making sure the seed ConfigNode starts before any other node, then start the ConfigNodes on the second and third nodes in sequence:

```Bash
cd sbin

./start-confignode.sh -d    # the -d option starts the process in the background
```

If the startup fails, please refer to [Common Questions](#common-questions).
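Before starting the ConfigNodes on the other two machines, you can optionally confirm that the seed ConfigNode on iotdb-1 is running. A minimal check, assuming the default `logs` directory (the exact log file name may differ slightly between versions):

```Bash
# The ConfigNode JVM process should appear in the process list
jps | grep ConfigNode

# Inspect the most recent ConfigNode log entries for startup errors
tail -n 50 logs/log_confignode_all.log
```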
### Start DataNode

Enter the `sbin` directory of IoTDB and start the three DataNodes in sequence:

```Bash
cd sbin

./start-datanode.sh -d    # the -d option starts the process in the background
```

### Activate Database

#### Method 1: File Copy Activation

- After starting the three ConfigNodes and three DataNodes in sequence, copy the `system_info` file from the `activation` folder of each machine and send them to Timecho staff;

- The staff will return a license file for each node; three license files will be returned in total;

- Put each of the three license files into the `activation` folder of the corresponding ConfigNode;

#### Method 2: Script Activation

- Retrieve the machine codes of the three machines in sequence: enter the CLI of the IoTDB tree model (`./start-cli.sh` / `start-cli.bat`) and execute the following statement:
  - Note: this is temporarily not supported when `-sql_dialect` is set to `table`

```Bash
show system info
```

- The following information is displayed, showing the machine code of the current machine:

```Bash
+--------------------------------------------------------------+
|                                                    SystemInfo|
+--------------------------------------------------------------+
|01-TE5NLES4-UDDWCMYE,01-GG5NLES4-XXDWCMYE,01-FF5NLES4-WWWWCMYE|
+--------------------------------------------------------------+
Total line number = 1
It costs 0.030s
```

- Enter the CLI of the IoTDB tree model on the other two nodes in the same way, execute the statement, and send the machine codes of all three machines to Timecho staff

- The staff will return three activation codes, which normally correspond to the order of the three machine codes provided. Paste each activation code into the corresponding CLI, as prompted below:

  - Note: The activation code must be wrapped in `'` symbols before and after, as shown:

  ```Bash
  IoTDB> activate '01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA===,01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA===,01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA==='
  ```

### Verify Activation

When the `Result` field shows `success`, activation is successful:

![](https://alioss.timecho.com/docs/img/%E9%9B%86%E7%BE%A4-%E9%AA%8C%E8%AF%81.png)

## Node Maintenance Steps

### ConfigNode Node Maintenance

ConfigNode node maintenance consists of two kinds of operations, adding and removing ConfigNodes, with two common use cases:

- Cluster expansion: for example, when there is only one ConfigNode in the cluster, you can add two more ConfigNodes for high availability, giving the cluster three ConfigNodes in total.

- Cluster failure recovery: when the machine hosting a ConfigNode fails and the ConfigNode can no longer run normally, you can remove that ConfigNode and add a new ConfigNode to the cluster.
> ❗️Note: after completing ConfigNode maintenance, make sure that either 1 or 3 ConfigNodes are running normally in the cluster. Two ConfigNodes do not provide high availability, and more than three ConfigNodes cause performance loss.

#### Adding ConfigNode Nodes

Script command:

```shell
# Linux / MacOS
# First switch to the IoTDB root directory
sbin/start-confignode.sh

# Windows
# First switch to the IoTDB root directory
sbin/start-confignode.bat
```

#### Removing ConfigNode Nodes

First connect to the cluster through the CLI and confirm the internal address and port number of the ConfigNode you want to remove by using `show confignodes`:

```Bash
IoTDB> show confignodes
+------+-------+---------------+------------+--------+
|NodeID| Status|InternalAddress|InternalPort|    Role|
+------+-------+---------------+------------+--------+
|     0|Running|      127.0.0.1|       10710|  Leader|
|     1|Running|      127.0.0.1|       10711|Follower|
|     2|Running|      127.0.0.1|       10712|Follower|
+------+-------+---------------+------------+--------+
Total line number = 3
It costs 0.030s
```

Then use the script to remove the ConfigNode. Script command:

```Bash
# Linux / MacOS
sbin/remove-confignode.sh [confignode_id]

# Windows
sbin/remove-confignode.bat [confignode_id]
```

### DataNode Node Maintenance

There are two common scenarios for DataNode node maintenance:

- Cluster expansion: add new DataNodes to the cluster to expand its capacity

- Cluster failure recovery: when the machine hosting a DataNode fails and the DataNode can no longer run normally, you can remove that DataNode and add a new DataNode to the cluster

> ❗️Note: for the cluster to work normally, both during DataNode maintenance and after it is completed, the total number of DataNodes running normally should be no less than the number of data replicas (usually 2) and no less than the number of metadata replicas (usually 3).

#### Adding DataNode Nodes

Script command:

```Bash
# Linux / MacOS
# First switch to the IoTDB root directory
sbin/start-datanode.sh

# Windows
# First switch to the IoTDB root directory
sbin/start-datanode.bat
```

Note: After adding a DataNode, as new writes arrive (and old data expires, if TTL is set), the cluster load will gradually balance towards the new DataNode, eventually achieving a balance of storage and computation resources on all nodes.

#### Removing DataNode Nodes

First connect to the cluster through the CLI and confirm the RPC address and port number of the DataNode you want to remove with `show datanodes`:

```Bash
IoTDB> show datanodes
+------+-------+----------+-------+-------------+---------------+
|NodeID| Status|RpcAddress|RpcPort|DataRegionNum|SchemaRegionNum|
+------+-------+----------+-------+-------------+---------------+
|     1|Running|   0.0.0.0|   6667|            0|              0|
|     2|Running|   0.0.0.0|   6668|            1|              1|
|     3|Running|   0.0.0.0|   6669|            1|              0|
+------+-------+----------+-------+-------------+---------------+
Total line number = 3
It costs 0.110s
```

Then use the script to remove the DataNode. Script command:

```Bash
# Linux / MacOS
sbin/remove-datanode.sh [datanode_id]

# Windows
sbin/remove-datanode.bat [datanode_id]
```

## Common Questions

1. 
Multiple prompts indicating activation failure during the deployment process

   - Use the `ls -al` command: check that the owner of the installation package root directory is the current user.

   - Check the activation directory: check all files in the `./activation` directory and make sure their owner is the current user.

2. ConfigNode failed to start

   Step 1: Check the startup log to see whether any parameter that cannot be changed after the first startup has been modified.

   Step 2: Check the startup log for any other abnormality. If there is any abnormal phenomenon in the log, contact Timecho technical support for advice.

   Step 3: If this is the first deployment, or if the data can be deleted, you can also clean up the environment as follows, redeploy, and restart.

   Step 4: Clean up the environment:

   a. Terminate all ConfigNode and DataNode processes.

   ```Bash
   # 1. Stop the ConfigNode and DataNode services
   sbin/stop-standalone.sh

   # 2. Check for any remaining processes
   jps
   # Or
   ps -ef|grep iotdb

   # 3. If there are any remaining processes, kill them manually
   kill -9 <pid>
   # If you are sure there is only one IoTDB instance on the machine, you can use the following command to clean up residual processes
   ps -ef|grep iotdb|grep -v grep|tr -s ' ' ' ' |cut -d ' ' -f2|xargs kill -9
   ```

   b. Delete the data and logs directories.

   Explanation: Deleting the data directory is necessary; deleting the logs directory only keeps the logs clean and is not mandatory.

   ```Bash
   cd /data/iotdb
   rm -rf data logs
   ```

## Appendix

### Introduction to ConfigNode Parameters

| Parameter | Description                                     | Is it required |
| :-------- | :---------------------------------------------- | :------------- |
| -d        | Start in daemon mode, running in the background | No             |

### Introduction to DataNode Parameters

| Abbreviation | Description                                                  | Is it required |
| :----------- | :----------------------------------------------------------- | :------------- |
| -v           | Show version information                                     | No             |
| -f           | Run the script in the foreground, do not put it in the background | No             |
| -d           | Start in daemon mode, i.e. run in the background             | No             |
| -p           | Specify a file to store the process ID for process management | No             |
| -c           | Specify the path to the configuration file folder; the script will load the configuration file from there | No             |
| -g           | Print detailed garbage collection (GC) information           | No             |
| -H           | Specify the path of the Java heap dump file, used when JVM memory overflows | No             |
| -E           | Specify the path of the JVM error log file                   | No             |
| -D           | Define system properties, in the format key=value            | No             |
| -X           | Pass -XX parameters directly to the JVM                      | No             |
| -h           | Help instruction                                             | No             |
\ No newline at end of file
diff --git a/src/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md
new file mode 100644
index 00000000..59a380db
--- /dev/null
+++ b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md
@@ -0,0 +1,194 @@
# Database Resources

## CPU
| Number of timeseries (frequency <= 1Hz) | CPU | Nodes (standalone mode) | Nodes (double active) | Nodes (distributed) |
| ---------------------------------------- | ------------- | ----------------------- | --------------------- | ------------------- |
| Within 100,000 | 2core-4core | 1 | 2 | 3 |
| Within 300,000 | 4core-8core | 1 | 2 | 3 |
| Within 500,000 | 8core-26core | 1 | 2 | 3 |
| Within 1,000,000 | 16core-32core | 1 | 2 | 3 |
| Within 2,000,000 | 32core-48core | 1 | 2 | 3 |
| Within 10,000,000 | 48core | 1 | 2 | Please contact Timecho Business for consultation |
| Over 10,000,000 | Please contact Timecho Business for consultation | | | |
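When checking an existing server against the table above, the available core count can be read with standard Linux tools, for example:

```Bash
# Number of logical CPU cores visible to the operating system
nproc
# More detail: sockets, cores per socket, CPU model
lscpu
```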
## Memory
| Number of timeseries (frequency <= 1Hz) | Memory | Nodes (standalone mode) | Nodes (double active) | Nodes (distributed) |
| ---------------------------------------- | --------- | ----------------------- | --------------------- | ------------------- |
| Within 100,000 | 4G-8G | 1 | 2 | 3 |
| Within 300,000 | 12G-32G | 1 | 2 | 3 |
| Within 500,000 | 24G-48G | 1 | 2 | 3 |
| Within 1,000,000 | 32G-96G | 1 | 2 | 3 |
| Within 2,000,000 | 64G-128G | 1 | 2 | 3 |
| Within 10,000,000 | 128G | 1 | 2 | Please contact Timecho Business for consultation |
| Over 10,000,000 | Please contact Timecho Business for consultation | | | |
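Once a memory budget has been chosen from the table, it is applied through the `MEMORY_SIZE` variable in the environment scripts described in the deployment documents. A minimal sketch (the example value is illustrative; follow the comments in the scripts of your release for the exact format):

```Bash
# conf/datanode-env.sh — example: give the DataNode a 32G memory budget
MEMORY_SIZE=32G
```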
## Storage (Disk)

### Storage space

Calculation formula: Number of measurement points * Sampling frequency (Hz) * Size of each data point (bytes; differs by data type, see the table below) * Storage time (seconds) * Number of replicas (usually 1 for a standalone node and 2 for a cluster) ÷ Compression ratio (5-10 can be used as an estimate, though it may be higher in practice)
Data point size calculation:

| Data type    | Timestamp (bytes) | Value (bytes) | Total size of a data point (bytes) |
| ------------ | ----------------- | ------------- | ---------------------------------- |
| Boolean      | 8                 | 1             | 9                                  |
| INT32/FLOAT  | 8                 | 4             | 12                                 |
| INT64/DOUBLE | 8                 | 8             | 16                                 |
| TEXT         | 8                 | a on average  | 8+a                                |
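As a quick sanity check of the formula, the worked example in the next paragraph can be reproduced with shell arithmetic (assuming GNU `bc` is installed):

```Bash
# 100,000 INT32 series at 1Hz for one year, 3 replicas, 10x compression
echo "1000 * 100 * 12 * 86400 * 365 * 3 / 10" | bc
# => 11352960000000 bytes, i.e. roughly 11T
```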
Example: 1,000 devices, each with 100 measurement points, 100,000 series in total, INT32 type, sampling frequency 1Hz (once per second), stored for 1 year, with 3 replicas.

- Complete calculation formula: 1000 devices * 100 measurement points * 12 bytes per data point * 86400 seconds per day * 365 days per year * 3 replicas / 10 compression ratio = 11T
- Simplified calculation formula: 1000 * 100 * 12 * 86400 * 365 * 3 / 10 = 11T

### Storage Configuration

If the number of time series exceeds 10,000,000, or the query load is high, it is recommended to use SSDs.

## Network (Network card)

If the write throughput does not exceed 10 million points/second, configure a 1Gbps network card. When the write throughput exceeds 10 million points/second, a 10Gbps network card is required.

| **Write throughput (data points per second)** | **NIC rate** |
| --------------------------------------------- | ------------ |
| <10 million                                   | 1Gbps        |
| >=10 million                                  | 10Gbps       |

## Other instructions

IoTDB supports second-level cluster scale-out, and no data migration is required when nodes are added. You therefore do not need to worry about the cluster capacity estimated from existing data being a hard limit: you can add new nodes to the cluster whenever you need to scale up.
\ No newline at end of file
diff --git a/src/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md
new file mode 100644
index 00000000..539d03b0
--- /dev/null
+++ b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md
@@ -0,0 +1,191 @@
# System Requirements

## Disk Array

### Configuration Suggestions

IoTDB imposes no strict requirements on disk array configuration. It is recommended to use multiple disk arrays to store IoTDB data, so that writes can go to multiple disk arrays concurrently. Refer to the following suggestions:

1. Physical environment
   - System disk: it is advised to use two disks as Raid1, sized only for the space occupied by the operating system itself; no system disk space needs to be reserved for IoTDB.
   - Data disks:
     - Raid is recommended to protect the data on the disks.
     - It is recommended to provide multiple disks (1-6) or disk groups for IoTDB. (Creating a single disk array over all disks is not recommended, as it limits IoTDB's maximum performance.)
2. Virtual environment
   - It is advised to mount multiple hard disks (1-6).

### Configuration Example

- Example 1: Four 3.5-inch hard disks

Only a few hard disks are installed on the server, so configure Raid5 directly.
The recommended configuration is as follows:

| **Use classification** | **Raid type** | **Disk number** | **Redundancy** | **Available capacity** |
| ---------------------- | ------------- | --------------- | -------------- | ---------------------- |
| system/data disk       | RAID5         | 4               | 1              | 3                      |

- Example 2: Twelve 3.5-inch hard disks

The server is configured with twelve 3.5-inch disks.
Two disks are recommended as a Raid1 system disk group. The remaining ten disks can be divided into two Raid5 groups; each group of five disks provides the capacity of four.
The recommended configuration is as follows:

| **Use classification** | **Raid type** | **Disk number** | **Redundancy** | **Available capacity** |
| ---------------------- | ------------- | --------------- | -------------- | ---------------------- |
| system disk            | RAID1         | 2               | 1              | 1                      |
| data disk              | RAID5         | 5               | 1              | 4                      |
| data disk              | RAID5         | 5               | 1              | 4                      |

- Example 3: Twenty-four 2.5-inch disks

The server is configured with 24 2.5-inch disks.
Two disks are recommended as a Raid1 system disk group. The remaining 22 disks can be divided into three Raid5 groups; each group of seven disks provides the capacity of six. The one remaining disk can be left idle or used to store write-ahead logs.
The recommended configuration is as follows:

| **Use classification** | **Raid type** | **Disk number** | **Redundancy** | **Available capacity** |
| ---------------------- | ------------- | --------------- | -------------- | ---------------------- |
| system disk            | RAID1         | 2               | 1              | 1                      |
| data disk              | RAID5         | 7               | 1              | 6                      |
| data disk              | RAID5         | 7               | 1              | 6                      |
| data disk              | RAID5         | 7               | 1              | 6                      |
| data disk              | NoRaid        | 1               | 0              | 1                      |

## Operating System

### Version Requirements

IoTDB supports operating systems such as Linux, Windows, and MacOS. The enterprise edition additionally supports domestic CPUs such as Loongson, Phytium, and Kunpeng, as well as domestic server operating systems such as NeoKylin, KylinOS, UOS, and Linx.

### Disk Partition

- The default standard partition mode is recommended; LVM extension and hard disk encryption are not recommended.
- The system disk needs only the space used by the operating system; no space needs to be reserved for IoTDB.
- Each disk group corresponds to only one partition. Data disks (multiple disk groups, each corresponding to one raid) need no additional partitioning; all space is used by IoTDB.

The following table lists the recommended disk partitioning method:
| Disk classification | Disk set     | Drive  | Capacity                          | File system type |
| ------------------- | ------------ | ------ | --------------------------------- | ---------------- |
| System disk         | Disk group 0 | /boot  | 1GB                               | Default          |
| System disk         | Disk group 0 | /      | Remaining space of the disk group | Default          |
| Data disk           | Disk group 1 | /data1 | Full space of disk group 1        | Default          |
| Data disk           | Disk group 2 | /data2 | Full space of disk group 2        | Default          |
| ...                 | ...          | ...    | ...                               | ...              |
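After partitioning, the resulting layout can be verified with standard Linux tools before installing IoTDB, for example:

```Bash
# Block devices, partitions, and mount points
lsblk
# Capacity and usage of the mounted data file systems
df -h /data1 /data2
```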
### Network Configuration

1. Disable the firewall

```Bash
# View firewall status
systemctl status firewalld
# Disable the firewall
systemctl stop firewalld
# Disable the firewall permanently
systemctl disable firewalld
```

2. Ensure that the required ports are not occupied

(1) Check the ports occupied by the cluster: In the default cluster configuration, ConfigNode occupies ports 10710 and 10720, and DataNode occupies ports 6667, 10730, 10740, 10750, 10760, 9090, 9190, and 3000. Ensure that these ports are not occupied. Check as follows:

```Bash
lsof -i:6667 or netstat -tunp | grep 6667
lsof -i:10710 or netstat -tunp | grep 10710
lsof -i:10720 or netstat -tunp | grep 10720
# If the command produces output, the port is occupied.
```

(2) Check the port occupied by the cluster deployment tool: When using the cluster management tool opskit to install and deploy the cluster, enable the SSH remote connection service configuration and open port 22.

```Bash
yum install openssh-server # Install the ssh service
systemctl start sshd # Enable port 22
```

3. Ensure that the servers can connect to each other

### Other Configuration

1. Disable the system swap memory

```Bash
echo "vm.swappiness = 0">> /etc/sysctl.conf
# The swapoff -a and swapon -a commands are executed together to dump the data in swap back to memory and empty the swap.
# Do not omit the swappiness setting and only execute swapoff -a; otherwise, swap automatically opens again after a restart, making the operation invalid.
swapoff -a && swapon -a
# Make the configuration take effect without restarting.
sysctl -p
# Check memory allocation; swap is expected to be 0
free -m
```

2. Set the maximum number of open files to 65535 to avoid "too many open files" errors.

```Bash
# View the current limit
ulimit -n
# Temporary change
ulimit -n 65535
# Permanent change
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf
# Check after exiting the current terminal session; 65535 is expected
ulimit -n
```

## Software Dependence

Install the Java runtime environment (Java version >= 1.8) and make sure the JDK environment variables are set. (For V1.3.2.2 and later, JDK17 is recommended; in some scenarios the performance of earlier JDK versions is degraded, and DataNodes cannot be stopped.)

```Bash
# The following example installs JDK-17 on centos7:
tar -zxvf jdk-17_linux-x64_bin.tar   # Decompress the JDK file
vim ~/.bashrc                        # Configure the JDK environment
# Add the JDK environment variables:
export JAVA_HOME=/usr/lib/jvm/jdk-17.0.9
export PATH=$JAVA_HOME/bin:$PATH
source ~/.bashrc                     # Make the configuration take effect
java -version                        # Check the JDK environment
```
\ No newline at end of file
diff --git a/src/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md b/src/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md
new file mode 100644
index 00000000..57cad838
--- /dev/null
+++ b/src/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md
@@ -0,0 +1,42 @@
# Obtain TimechoDB

## How to obtain TimechoDB

The enterprise edition installation package can be obtained through a product trial application, or by directly contacting your Timecho sales contact.
## Installation Package Structure

The directory structure after unpacking the installation package is as follows:

| **Catalogue**     | **Type** | **Explanation**                                              |
| :---------------: | -------- | ------------------------------------------------------------ |
| activation        | folder   | Directory of activation files, including the generated machine code and the enterprise activation code obtained from the business team (generated only after starting the ConfigNode to obtain the activation code) |
| conf              | folder   | Configuration file directory, including configuration files such as ConfigNode, DataNode, JMX, and logback |
| data              | folder   | Default data file directory, containing the data files of ConfigNode and DataNode (generated only after the program is started) |
| lib               | folder   | IoTDB executable library file directory                      |
| licenses          | folder   | Open source community certificate file directory             |
| logs              | folder   | Default log file directory, containing the log files of ConfigNode and DataNode (generated only after the program is started) |
| sbin              | folder   | Main script directory, including start, stop, and other scripts |
| tools             | folder   | Directory of system peripheral tools                         |
| ext               | folder   | Related files for pipe, trigger, and UDF plugins (created by the user when needed) |
| LICENSE           | file     | Certificate                                                  |
| NOTICE            | file     | Notice                                                       |
| README_ZH\.md     | file     | Chinese README in Markdown format                            |
| README\.md        | file     | Instructions for use                                         |
| RELEASE_NOTES\.md | file     | Version notes                                                |
diff --git a/src/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md
new file mode 100644
index 00000000..4e9a50a1
--- /dev/null
+++ b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md
@@ -0,0 +1,680 @@
# Monitoring Panel Deployment

The IoTDB monitoring panel is one of the supporting tools of the IoTDB Enterprise Edition. It is designed to monitor IoTDB and its operating system, covering operating system resource monitoring, IoTDB performance monitoring, and hundreds of kernel monitoring indicators, helping users watch the health of the cluster and carry out cluster optimization and operation. This article takes a common 3C3D cluster (3 ConfigNodes and 3 DataNodes) as an example to describe how to enable the system monitoring module in an IoTDB instance and visualize the monitoring indicators with Prometheus + Grafana.

## Installation Preparation

1. Install IoTDB: IoTDB Enterprise Edition V1.0 or above needs to be installed first; it can be obtained from business or technical support
2. Obtain the IoTDB monitoring panel installation package: the monitoring panel is based on the enterprise edition of the IoTDB database and can be obtained from business or technical support

## Installation Steps

### Step 1: IoTDB enables monitoring indicator collection

1. Open the monitoring configuration items. The configuration items related to monitoring in IoTDB are disabled by default. Before deploying the monitoring panel, open the relevant configuration items (note that the service must be restarted after enabling the monitoring configuration).
| **Configuration**                  | Located in the configuration file | **Description**                                              |
| :--------------------------------- | :-------------------------------- | :----------------------------------------------------------- |
| cn_metric_reporter_list            | conf/iotdb-system.properties      | Uncomment the configuration item and set the value to PROMETHEUS |
| cn_metric_level                    | conf/iotdb-system.properties      | Uncomment the configuration item and set the value to IMPORTANT |
| cn_metric_prometheus_reporter_port | conf/iotdb-system.properties      | Uncomment the configuration item and keep the default 9091; another port may be used as long as it does not conflict with ports already in use |
| dn_metric_reporter_list            | conf/iotdb-system.properties      | Uncomment the configuration item and set the value to PROMETHEUS |
| dn_metric_level                    | conf/iotdb-system.properties      | Uncomment the configuration item and set the value to IMPORTANT |
| dn_metric_prometheus_reporter_port | conf/iotdb-system.properties      | Uncomment the configuration item and keep the default 9092; another port may be used as long as it does not conflict with ports already in use |

Taking the 3C3D cluster as an example, the monitoring configuration that needs to be modified is as follows:

| Node IP     | Host Name | Cluster Role | Configuration File Path      | Configuration                                                |
| ----------- | --------- | ------------ | ---------------------------- | ------------------------------------------------------------ |
| 192.168.1.3 | iotdb-1   | confignode   | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS cn_metric_level=IMPORTANT cn_metric_prometheus_reporter_port=9091 |
| 192.168.1.4 | iotdb-2   | confignode   | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS cn_metric_level=IMPORTANT cn_metric_prometheus_reporter_port=9091 |
| 192.168.1.5 | iotdb-3   | confignode   | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS cn_metric_level=IMPORTANT cn_metric_prometheus_reporter_port=9091 |
| 192.168.1.3 | iotdb-1   | datanode     | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS dn_metric_level=IMPORTANT dn_metric_prometheus_reporter_port=9092 |
| 192.168.1.4 | iotdb-2   | datanode     | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS dn_metric_level=IMPORTANT dn_metric_prometheus_reporter_port=9092 |
| 192.168.1.5 | iotdb-3   | datanode     | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS dn_metric_level=IMPORTANT dn_metric_prometheus_reporter_port=9092 |

2. Restart all nodes. After modifying the monitoring configuration on the three nodes, restart the confignode and datanode services on all of them:

```Bash
./sbin/stop-standalone.sh #Stop confignode and datanode first
./sbin/start-confignode.sh -d #Start confignode
./sbin/start-datanode.sh -d #Start datanode
```

3. After restarting, confirm the running status of each node through the client. If the status is Running, the configuration is successful:

![](https://alioss.timecho.com/docs/img/%E5%90%AF%E5%8A%A8.PNG)

### Step 2: Install and configure Prometheus

> Taking Prometheus installed on server 192.168.1.3 as an example.

1. Download the Prometheus installation package (V2.30.3 or above is required); it can be downloaded from the Prometheus official website (https://prometheus.io/docs/introduction/first_steps/)
2. Unzip the installation package and enter the unzipped folder:

```Shell
tar xvfz prometheus-*.tar.gz
cd prometheus-*
```

3. Modify the configuration file prometheus.yml as follows:
   1. Add a confignode task to collect monitoring data from the ConfigNodes
   2. Add a datanode task to collect monitoring data from the DataNodes

```YAML
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "confignode"
    static_configs:
      - targets: ["iotdb-1:9091","iotdb-2:9091","iotdb-3:9091"]
    honor_labels: true
  - job_name: "datanode"
    static_configs:
      - targets: ["iotdb-1:9092","iotdb-2:9092","iotdb-3:9092"]
    honor_labels: true
```

4. Start Prometheus. The default retention time for Prometheus monitoring data is 15 days; in production environments it is recommended to raise it to 180 days or more so that historical monitoring data can be tracked for longer. The startup command is as follows:

```Shell
./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=180d
```

5. Confirm the startup is successful. Open http://192.168.1.3:9090 in a browser to access Prometheus, and click the Targets page under Status. When all the states are Up, the configuration is successful and the components are connected.
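Besides the web page, the scrape targets can also be checked from the command line. A quick test against the host names and metric ports configured above (the Prometheus reporter exposes a standard `/metrics` endpoint):

```Bash
# ConfigNode metrics (port 9091) and DataNode metrics (port 9092)
curl -s http://iotdb-1:9091/metrics | head -n 5
curl -s http://iotdb-1:9092/metrics | head -n 5
```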
+ + +
6. Clicking the left link in Targets will take you to the web monitoring page of the corresponding node:

![](https://alioss.timecho.com/docs/img/%E8%8A%82%E7%82%B9%E7%9B%91%E6%8E%A7.png)

### Step 3: Install Grafana and configure the data source

> Taking Grafana installed on server 192.168.1.3 as an example.

1. Download the Grafana installation package (version 8.4.2 or higher is required); it can be downloaded from the Grafana official website (https://grafana.com/grafana/download)
2. Unzip and enter the corresponding folder

```Shell
tar -zxvf grafana-*.tar.gz
cd grafana-*
```

3. Start Grafana:

```Shell
./bin/grafana-server web
```

4. Log in to Grafana. Open http://192.168.1.3:3000 (or the modified port) in a browser; the default initial username and password are both admin.

5. Configure the data source. Find Data sources in Connections, add a new data source, and set the Data Source to Prometheus

![](https://alioss.timecho.com/docs/img/%E6%B7%BB%E5%8A%A0%E9%85%8D%E7%BD%AE.png)

When configuring the Data Source, pay attention to the URL where Prometheus is located. After configuring it, click Save & Test; a "Data source is working" prompt indicates successful configuration

![](https://alioss.timecho.com/docs/img/%E9%85%8D%E7%BD%AE%E6%88%90%E5%8A%9F.png)

### Step 4: Import IoTDB Grafana Dashboards

1. Enter Grafana and select Dashboards:

   ![](https://alioss.timecho.com/docs/img/%E9%9D%A2%E6%9D%BF%E9%80%89%E6%8B%A9.png)

2. Click the Import button on the right side

   ![](https://alioss.timecho.com/docs/img/Import%E6%8C%89%E9%92%AE.png)

3. Import the Dashboard by uploading its JSON file

   ![](https://alioss.timecho.com/docs/img/%E5%AF%BC%E5%85%A5Dashboard.png)

4. Select the JSON file of one of the panels in the IoTDB monitoring panel package, using the Apache IoTDB ConfigNode Dashboard as an example (see the installation preparation section of this article for the monitoring panel installation package):

   ![](https://alioss.timecho.com/docs/img/%E9%80%89%E6%8B%A9%E9%9D%A2%E6%9D%BF.png)

5. Select Prometheus as the data source and click Import

   ![](https://alioss.timecho.com/docs/img/%E9%80%89%E6%8B%A9%E6%95%B0%E6%8D%AE%E6%BA%90.png)

6. Afterwards, you can see the imported Apache IoTDB ConfigNode Dashboard monitoring panel

   ![](https://alioss.timecho.com/docs/img/%E9%9D%A2%E6%9D%BF.png)

7. Similarly, you can import the Apache IoTDB DataNode Dashboard, Apache Performance Overview Dashboard, and Apache System Overview Dashboard, and see the following monitoring panels:
+ + + +
8. At this point, all IoTDB monitoring panels have been imported, and monitoring information can now be viewed at any time.

   ![](https://alioss.timecho.com/docs/img/%E9%9D%A2%E6%9D%BF%E6%B1%87%E6%80%BB.png)

## Appendix, Detailed Explanation of Monitoring Indicators

### System Dashboard

This panel displays the current usage of system CPU, memory, disk, and network resources, as well as part of the JVM status.

#### CPU

- CPU Core: number of CPU cores
- CPU Load:
  - System CPU Load: the average CPU load and busyness of the entire system during the sampling time
  - Process CPU Load: the proportion of CPU occupied by the IoTDB process during the sampling time
- CPU Time Per Minute: the total CPU time of all processes in the system per minute

#### Memory

- System Memory: the current usage of system memory.
  - Committed vm size: the size of virtual memory allocated by the operating system to running processes.
  - Total physical memory: the total amount of available physical memory in the system.
  - Used physical memory: the total amount of memory already used by the system, including the memory actually used by processes and the memory occupied by operating system buffers/caches.
- System Swap Memory: swap space memory usage.
- Process Memory: the memory usage of the IoTDB process.
  - Max Memory: the maximum amount of memory the IoTDB process can request from the operating system. (The allocated memory size is configured in the datanode-env / confignode-env configuration files)
  - Total Memory: the total amount of memory the IoTDB process has currently requested from the operating system.
  - Used Memory: the total amount of memory currently used by the IoTDB process.

#### Disk

- Disk Space:
  - Total disk space: the maximum disk space IoTDB can use.
  - Used disk space: the disk space already used by IoTDB.
- Log Number Per Minute: the average number of logs at each level of IoTDB per minute during the sampling time.
- File Count: the number of IoTDB-related files
  - all: all files
  - TsFile: number of TsFiles
  - seq: number of sequential TsFiles
  - unseq: number of unsequence TsFiles
  - wal: number of WAL files
  - cross-temp: number of cross-space merge temp files
  - inner-seq-temp: number of merge temp files in sequential space
  - inner-unseq-temp: number of merge temp files in unsequential space
  - mods: number of tombstone files
- Open File Count: number of file handles opened by the system
- File Size: the size of IoTDB-related files. Each sub-item corresponds to the size of the corresponding files.
- Disk I/O Busy Rate: equivalent to the %util indicator in iostat, it reflects to some extent how busy the disk is. Each sub-item is the indicator for one disk.
- Disk I/O Throughput: the average I/O throughput of each disk in the system over a period of time. Each sub-item is the indicator for one disk.
- Disk I/O Ops: equivalent to the four indicators r/s, w/s, rrqm/s, and wrqm/s in iostat, the number of I/O operations a disk performs per second. Read and write refer to the number of single I/O operations; because of the scheduling algorithms of block devices, multiple adjacent I/Os may in some cases be merged into one, so merge-read and merge-write refer to the number of I/Os merged into a single I/O.
- Disk I/O Avg Time: equivalent to await in iostat, the average latency of each I/O request, recorded separately for read and write requests.
- Disk I/O Avg Size: equivalent to avgrq-sz in iostat, it reflects the size of each I/O request, recorded separately for read and write requests.
- Disk I/O Avg Queue Size: equivalent to avgqu-sz in iostat, the average length of the I/O request queue.
- I/O System Call Rate: the frequency of the process's read and write system calls, similar to IOPS.
- I/O Throughput: the I/O throughput of the process, divided into two categories: actual-read/write and attempt-read/write. Actual read and actual write refer to the number of bytes for which the process actually causes block device I/O, excluding the parts handled by the Page Cache.

#### JVM

- GC Time Percentage: the proportion of time the node JVM spent on GC within the past one-minute time window
- GC Allocated/Promoted Size Detail: the average size of objects promoted to the old generation per minute by the node JVM, as well as the size of objects newly allocated in the young generation, the old generation, and non-generational space
- GC Data Size Detail: the size of long-lived objects in the node JVM and the maximum allowed value per generation
- Heap Memory: JVM heap memory usage.
  - Maximum heap memory: the maximum available heap memory size for the JVM.
  - Committed heap memory: the size of heap memory committed by the JVM.
  - Used heap memory: the size of heap memory already used by the JVM.
  - PS Eden Space: the size of the PS Young area.
  - PS Old Space: the size of the PS Old area.
  - PS Survivor Space: the size of the PS Survivor area.
  - ...(CMS/G1/ZGC, etc.)
- Off Heap Memory: off-heap memory usage.
  - direct memory: off-heap direct memory.
  - mapped memory: off-heap mapped memory.
- GC Number Per Minute: the average number of garbage collections per minute performed by the node JVM, including YGC and FGC
- GC Time Per Minute: the average time the node JVM spends on garbage collection per minute, including YGC and FGC
- GC Number Per Minute Detail: the average number of garbage collections per minute by the node JVM broken down by cause, including YGC and FGC
- GC Time Per Minute Detail: the average time per minute the node JVM spends on garbage collection broken down by cause, including YGC and FGC
- Time Consumed Of Compilation Per Minute: the total time the JVM spends compiling per minute
- The Number of Class:
  - loaded: the number of classes currently loaded by the JVM
  - unloaded: the number of classes unloaded by the JVM since system startup
- The Number of Java Thread: the current number of live threads in IoTDB. Each sub-item is the number of threads in each state.

#### Network

eno refers to the network card connected to the public network, and lo refers to the virtual (loopback) network card.
- Net Speed: the send and receive speed of the network card
- Receive/Transmit Data Size: the size of data packets sent or received by the network card, counted from system restart
- Packet Speed: the speed at which the network card sends and receives packets; one RPC request can correspond to one or more packets
- Connection Num: the current number of socket connections of the selected process (IoTDB only has TCP)

### Performance Overview Dashboard

#### Cluster Overview

- Total CPU Core: total CPU cores of the cluster machines
- DataNode CPU Load: CPU usage of each DataNode in the cluster
- Disk
  - Total Disk Space: total disk size of the cluster machines
  - DataNode Disk Usage: the disk usage rate of each DataNode in the cluster
- Total Timeseries: the total number of time series managed by the cluster (including replicas); the actual number of time series must be calculated in conjunction with the number of metadata replicas
- Cluster: the number of ConfigNodes and DataNodes in the cluster
- Up Time: how long the cluster has been up
- Total Write Point Per Second: the total number of writes per second in the cluster (including replicas); the actual total must be analyzed in conjunction with the number of data replicas
- Memory
  - Total System Memory: total system memory of the cluster machines
  - Total Swap Memory: total swap memory of the cluster machines
  - DataNode Process Memory Usage: memory usage of each DataNode in the cluster
- Total File Number: total number of files managed by the cluster
- Cluster System Overview: overview of the cluster machines, including average DataNode memory usage and average machine disk usage
- Total DataBase: the total number of databases managed by the cluster (including replicas)
- Total DataRegion: the total number of DataRegions managed by the cluster
- Total SchemaRegion: the total number of SchemaRegions managed by the cluster

#### Node Overview

- CPU Core: the number of CPU cores of the machine where the node runs
- Disk Space: the disk size of the machine where the node runs
- Timeseries: the number of time series managed by the machine where the node runs (including replicas)
- System Overview: system overview of the machine where the node runs, including CPU load, process memory usage ratio, and disk usage ratio
- Write Point Per Second: the write speed per second of the machine where the node runs (including replicas)
- System Memory: the system memory size of the machine where the node runs
- Swap Memory: the swap memory size of the machine where the node runs
- File Number: the number of files managed by the node

#### Performance

- Session Idle Time: the total idle time and total busy time of the node's session connections
- Client Connection: the client connection status of the node, including the total number of connections and the number of active connections
- Time Consumed Of Operation: the time consumed by each type of node operation, including average and P99
- Average Time Consumed Of Interface: the average time consumed by each Thrift interface of the node
- P99 Time Consumed Of Interface: the P99 time consumed by each Thrift interface of the node
- Task Number: the number of system tasks of the node
- Average Time Consumed of Task: the average time consumed by each system task of the node
- P99 Time Consumed of Task: the P99 time consumed by each system task of the node
- Operation Per
Second: The number of operations per second for a node +- Mainstream Process + - Operation Per Second Of Stage: The number of operations per second for each stage of the node's main process + - Average Time Consumed Of Stage: The average time consumption of each stage in the main process of a node + - P99 Time Consumed Of Stage: P99 time consumption for each stage of the node's main process +- Schedule Stage + - OPS Of Schedule: The number of operations per second in each sub stage of the node schedule stage + - Average Time Consumed Of Schedule Stage:The average time consumption of each sub stage in the node schedule stage + - P99 Time Consumed Of Schedule Stage: P99 time consumption for each sub stage of the schedule stage of the node +- Local Schedule Sub Stages + - OPS Of Local Schedule Stage: The number of operations per second in each sub stage of the local schedule node + - Average Time Consumed Of Local Schedule Stage: The average time consumption of each sub stage in the local schedule stage of the node + - P99 Time Consumed Of Local Schedule Stage: P99 time consumption for each sub stage of the local schedule stage of the node +- Storage Stage + - OPS Of Storage Stage: The number of operations per second in each sub stage of the node storage stage + - Average Time Consumed Of Storage Stage: Average time consumption of each sub stage in the node storage stage + - P99 Time Consumed Of Storage Stage: P99 time consumption for each sub stage of node storage stage +- Engine Stage + - OPS Of Engine Stage: The number of operations per second in each sub stage of the node engine stage + - Average Time Consumed Of Engine Stage: The average time consumption of each sub stage in the engine stage of a node + - P99 Time Consumed Of Engine Stage: P99 time consumption of each sub stage in the node engine stage + +#### System + +- CPU Load: CPU load of nodes +- CPU Time Per Minute: The CPU time per minute of a node, with the maximum value related to the number of CPU cores +- GC Time Per Minute:The average GC time per minute for nodes, including YGC and FGC +- Heap Memory: Node's heap memory usage +- Off Heap Memory: Non heap memory usage of nodes +- The Number Of Java Thread: Number of Java threads on nodes +- File Count:Number of files managed by nodes +- File Size: Node management file size situation +- Log Number Per Minute: Different types of logs per minute for nodes + +### ConfigNode Dashboard + +This panel displays the performance of all management nodes in the cluster, including partitioning, node information, and client connection statistics. 
#### Node Overview

- Database Count: the number of databases of the node
- Region
  - DataRegion Count: the number of DataRegions of the node
  - DataRegion Current Status: the status of the DataRegions of the node
  - SchemaRegion Count: the number of SchemaRegions of the node
  - SchemaRegion Current Status: the status of the SchemaRegions of the node
- System Memory: the system memory size of the node
- Swap Memory: the swap memory size of the node
- ConfigNodes: the running status of the ConfigNodes in the cluster where the node is located
- DataNodes: the DataNodes of the cluster where the node is located
- System Overview: system overview of the node, including system memory, disk usage, process memory, and CPU load

#### NodeInfo

- Node Count: the number of nodes in the cluster where the node is located, including ConfigNodes and DataNodes
- ConfigNode Status: the status of the ConfigNodes in the cluster where the node is located
- DataNode Status: the status of the DataNodes in the cluster where the node is located
- SchemaRegion Distribution: the distribution of SchemaRegions in the cluster where the node is located
- SchemaRegionGroup Leader Distribution: the distribution of leaders in the SchemaRegionGroups of the cluster where the node is located
- DataRegion Distribution: the distribution of DataRegions in the cluster where the node is located
- DataRegionGroup Leader Distribution: the distribution of leaders in the DataRegionGroups of the cluster where the node is located

#### Protocol

- Client Count
  - Active Client Num: the number of active clients in each thread pool of the node
  - Idle Client Num: the number of idle clients in each thread pool of the node
  - Borrowed Client Count: the number of borrowed clients in each thread pool of the node
  - Created Client Count: the number of created clients in each thread pool of the node
  - Destroyed Client Count: the number of destroyed clients in each thread pool of the node
- Client Time
  - Client Mean Active Time: the average active time of clients in each thread pool of the node
  - Client Mean Borrow Wait Time: the average borrow waiting time of clients in each thread pool of the node
  - Client Mean Idle Time: the average idle time of clients in each thread pool of the node

#### Partition Table

- SchemaRegionGroup Count: the number of SchemaRegionGroups in the Databases of the cluster where the node is located
- DataRegionGroup Count: the number of DataRegionGroups in the Databases of the cluster where the node is located
- SeriesSlot Count: the number of SeriesSlots in the Databases of the cluster where the node is located
- TimeSlot Count: the number of TimeSlots in the Databases of the cluster where the node is located
- DataRegion Status: the DataRegion status of the cluster where the node is located
- SchemaRegion Status: the SchemaRegion status of the cluster where the node is located

#### Consensus

- Ratis Stage Time: the time consumed by each stage of the node's Ratis
- Write Log Entry: the time consumed to write a log by the node's Ratis
- Remote / Local Write Time: the time consumed by remote and local writes of the node's Ratis
- Remote / Local Write QPS: the remote and local write QPS of the node's Ratis
- RatisConsensus Memory: the memory usage of the node's Ratis consensus protocol

### DataNode Dashboard

This panel displays the monitoring status of all data nodes in the cluster, including write time, query time, number of stored files, and so on.
+ +#### Node Overview + +- The Number Of Entity: Entity situation of node management +- Write Point Per Second: The write speed per second of the node +- Memory Usage: The memory usage of the node, including the memory usage of various parts of IoT Consensus, the total memory usage of SchemaRegion, and the memory usage of various databases. + +#### Protocol + +- Node Operation Time Consumption + - The Time Consumed Of Operation (avg): The average time spent on various operations of a node + - The Time Consumed Of Operation (50%): The median time spent on various operations of a node + - The Time Consumed Of Operation (99%): P99 time consumption for various operations of nodes +- Thrift Statistics + - The QPS Of Interface: QPS of various Thrift interfaces of nodes + - The Avg Time Consumed Of Interface: The average time consumption of each Thrift interface of a node + - Thrift Connection: The number of Thrfit connections of each type of node + - Thrift Active Thread: The number of active Thrift connections for each type of node +- Client Statistics + - Active Client Num: The number of active clients in each thread pool of a node + - Idle Client Num: The number of idle clients in each thread pool of a node + - Borrowed Client Count:Number of borrowed clients for each thread pool of a node + - Created Client Count: Number of created clients for each thread pool of the node + - Destroyed Client Count: The number of destroyed clients in each thread pool of the node + - Client Mean Active Time: The average active time of clients in each thread pool of a node + - Client Mean Borrow Wait Time: The average borrowing waiting time of clients in each thread pool of a node + - Client Mean Idle Time: The average idle time of clients in each thread pool of a node + +#### Storage Engine + +- File Count: Number of files of various types managed by nodes +- File Size: Node management of various types of file sizes +- TsFile + - TsFile Total Size In Each Level: The total size of TsFile files at each level of node management + - TsFile Count In Each Level: Number of TsFile files at each level of node management + - Avg TsFile Size In Each Level: The average size of TsFile files at each level of node management +- Task Number: Number of Tasks for Nodes +- The Time Consumed of Task: The time consumption of tasks for nodes +- Compaction + - Compaction Read And Write Per Second: The merge read and write speed of nodes per second + - Compaction Number Per Minute: The number of merged nodes per minute + - Compaction Process Chunk Status: The number of Chunks in different states merged by nodes + - Compacted Point Num Per Minute: The number of merged nodes per minute + +#### Write Performance + +- Write Cost(avg): Average node write time, including writing wal and memtable +- Write Cost(50%): Median node write time, including writing wal and memtable +- Write Cost(99%): P99 for node write time, including writing wal and memtable +- WAL + - WAL File Size: Total size of WAL files managed by nodes + - WAL File Num:Number of WAL files managed by nodes + - WAL Nodes Num: Number of WAL nodes managed by nodes + - Make Checkpoint Costs: The time required to create various types of CheckPoints for nodes + - WAL Serialize Total Cost: Total time spent on node WAL serialization + - Data Region Mem Cost: Memory usage of different DataRegions of nodes, total memory usage of DataRegions of the current instance, and total memory usage of DataRegions of the current cluster + - Serialize One WAL Info Entry Cost: Node serialization 
time for a WAL Info Entry + - Oldest MemTable Ram Cost When Cause Snapshot: MemTable size when node WAL triggers oldest MemTable snapshot + - Oldest MemTable Ram Cost When Cause Flush: MemTable size when node WAL triggers oldest MemTable flush + - Effective Info Ratio Of WALNode: The effective information ratio of different WALNodes of nodes + - WAL Buffer + - WAL Buffer Cost: Node WAL flush SyncBuffer takes time, including both synchronous and asynchronous options + - WAL Buffer Used Ratio: The usage rate of the WAL Buffer of the node + - WAL Buffer Entries Count: The number of entries in the WAL Buffer of a node +- Flush Statistics + - Flush MemTable Cost(avg): The total time spent on node Flush and the average time spent on each sub stage + - Flush MemTable Cost(50%): The total time spent on node Flush and the median time spent on each sub stage + - Flush MemTable Cost(99%): The total time spent on node Flush and the P99 time spent on each sub stage + - Flush Sub Task Cost(avg): The average time consumption of each node's Flush subtask, including sorting, encoding, and IO stages + - Flush Sub Task Cost(50%): The median time consumption of each subtask of the Flush node, including sorting, encoding, and IO stages + - Flush Sub Task Cost(99%): The average subtask time P99 for Flush of nodes, including sorting, encoding, and IO stages +- Pending Flush Task Num: The number of Flush tasks in a blocked state for a node +- Pending Flush Sub Task Num: Number of Flush subtasks blocked by nodes +- Tsfile Compression Ratio Of Flushing MemTable: The compression rate of TsFile corresponding to node flashing Memtable +- Flush TsFile Size Of DataRegions: The corresponding TsFile size for each disk flush of nodes in different DataRegions +- Size Of Flushing MemTable: The size of the Memtable for node disk flushing +- Points Num Of Flushing MemTable: The number of points when flashing data in different DataRegions of a node +- Series Num Of Flushing MemTable: The number of time series when flashing Memtables in different DataRegions of a node +- Average Point Num Of Flushing MemChunk: The average number of disk flushing points for node MemChunk + +#### Schema Engine + +- Schema Engine Mode: The metadata engine pattern of nodes +- Schema Consensus Protocol: Node metadata consensus protocol +- Schema Region Number:Number of SchemeRegions managed by nodes +- Schema Region Memory Overview: The amount of memory in the SchemeRegion of a node +- Memory Usgae per SchemaRegion:The average memory usage size of node SchemaRegion +- Cache MNode per SchemaRegion: The number of cache nodes in each SchemeRegion of a node +- MLog Length and Checkpoint: The total length and checkpoint position of the current mlog for each SchemeRegion of the node (valid only for SimpleConsense) +- Buffer MNode per SchemaRegion: The number of buffer nodes in each SchemeRegion of a node +- Activated Template Count per SchemaRegion: The number of activated templates in each SchemeRegion of a node +- Time Series statistics + - Timeseries Count per SchemaRegion: The average number of time series for node SchemaRegion + - Series Type: Number of time series of different types of nodes + - Time Series Number: The total number of time series nodes + - Template Series Number: The total number of template time series for nodes + - Template Series Count per SchemaRegion: The number of sequences created through templates in each SchemeRegion of a node +- IMNode Statistics + - Pinned MNode per SchemaRegion: Number of IMNode nodes with Pinned nodes in 
each SchemeRegion + - Pinned Memory per SchemaRegion: The memory usage size of the IMNode node for Pinned nodes in each SchemeRegion of the node + - Unpinned MNode per SchemaRegion: The number of unpinned IMNode nodes in each SchemeRegion of a node + - Unpinned Memory per SchemaRegion: Memory usage size of unpinned IMNode nodes in each SchemeRegion of the node + - Schema File Memory MNode Number: Number of IMNode nodes with global pinned and unpinned nodes + - Release and Flush MNode Rate: The number of IMNodes that release and flush nodes per second +- Cache Hit Rate: Cache hit rate of nodes +- Release and Flush Thread Number: The current number of active Release and Flush threads on the node +- Time Consumed of Relead and Flush (avg): The average time taken for node triggered cache release and buffer flushing +- Time Consumed of Relead and Flush (99%): P99 time consumption for node triggered cache release and buffer flushing + +#### Query Engine + +- Time Consumption In Each Stage + - The time consumed of query plan stages(avg): The average time spent on node queries at each stage + - The time consumed of query plan stages(50%): Median time spent on node queries at each stage + - The time consumed of query plan stages(99%): P99 time consumption for node query at each stage +- Execution Plan Distribution Time + - The time consumed of plan dispatch stages(avg): The average time spent on node query execution plan distribution + - The time consumed of plan dispatch stages(50%): Median time spent on node query execution plan distribution + - The time consumed of plan dispatch stages(99%): P99 of node query execution plan distribution time +- Execution Plan Execution Time + - The time consumed of query execution stages(avg): The average execution time of node query execution plan + - The time consumed of query execution stages(50%):Median execution time of node query execution plan + - The time consumed of query execution stages(99%): P99 of node query execution plan execution time +- Operator Execution Time + - The time consumed of operator execution stages(avg): The average execution time of node query operators + - The time consumed of operator execution(50%): Median execution time of node query operator + - The time consumed of operator execution(99%): P99 of node query operator execution time +- Aggregation Query Computation Time + - The time consumed of query aggregation(avg): The average computation time for node aggregation queries + - The time consumed of query aggregation(50%): Median computation time for node aggregation queries + - The time consumed of query aggregation(99%): P99 of node aggregation query computation time +- File/Memory Interface Time Consumption + - The time consumed of query scan(avg): The average time spent querying file/memory interfaces for nodes + - The time consumed of query scan(50%): Median time spent querying file/memory interfaces for nodes + - The time consumed of query scan(99%): P99 time consumption for node query file/memory interface +- Number Of Resource Visits + - The usage of query resource(avg): The average number of resource visits for node queries + - The usage of query resource(50%): Median number of resource visits for node queries + - The usage of query resource(99%): P99 for node query resource access quantity +- Data Transmission Time + - The time consumed of query data exchange(avg): The average time spent on node query data transmission + - The time consumed of query data exchange(50%): Median query data transmission time for nodes + - 
The time consumed of query data exchange(99%): P99 for node query data transmission time +- Number Of Data Transfers + - The count of Data Exchange(avg): The average number of data transfers queried by nodes + - The count of Data Exchange: The quantile of the number of data transfers queried by nodes, including the median and P99 +- Task Scheduling Quantity And Time Consumption + - The number of query queue: Node query task scheduling quantity + - The time consumed of query schedule time(avg): The average time spent on scheduling node query tasks + - The time consumed of query schedule time(50%): Median time spent on node query task scheduling + - The time consumed of query schedule time(99%): P99 of node query task scheduling time + +#### Query Interface + +- Load Time Series Metadata + - The time consumed of load timeseries metadata(avg): The average time taken for node queries to load time series metadata + - The time consumed of load timeseries metadata(50%): Median time spent on loading time series metadata for node queries + - The time consumed of load timeseries metadata(99%): P99 time consumption for node query loading time series metadata +- Read Time Series + - The time consumed of read timeseries metadata(avg): The average time taken for node queries to read time series + - The time consumed of read timeseries metadata(50%): The median time taken for node queries to read time series + - The time consumed of read timeseries metadata(99%): P99 time consumption for node query reading time series +- Modify Time Series Metadata + - The time consumed of timeseries metadata modification(avg):The average time taken for node queries to modify time series metadata + - The time consumed of timeseries metadata modification(50%): Median time spent on querying and modifying time series metadata for nodes + - The time consumed of timeseries metadata modification(99%): P99 time consumption for node query and modification of time series metadata +- Load Chunk Metadata List + - The time consumed of load chunk metadata list(avg): The average time it takes for node queries to load Chunk metadata lists + - The time consumed of load chunk metadata list(50%): Median time spent on node query loading Chunk metadata list + - The time consumed of load chunk metadata list(99%): P99 time consumption for node query loading Chunk metadata list +- Modify Chunk Metadata + - The time consumed of chunk metadata modification(avg): The average time it takes for node queries to modify Chunk metadata + - The time consumed of chunk metadata modification(50%): The total number of bits spent on modifying Chunk metadata for node queries + - The time consumed of chunk metadata modification(99%): P99 time consumption for node query and modification of Chunk metadata +- Filter According To Chunk Metadata + - The time consumed of chunk metadata filter(avg): The average time spent on node queries filtering by Chunk metadata + - The time consumed of chunk metadata filter(50%): Median filtering time for node queries based on Chunk metadata + - The time consumed of chunk metadata filter(99%): P99 time consumption for node query filtering based on Chunk metadata +- Constructing Chunk Reader + - The time consumed of construct chunk reader(avg): The average time spent on constructing Chunk Reader for node queries + - The time consumed of construct chunk reader(50%): Median time spent on constructing Chunk Reader for node queries + - The time consumed of construct chunk reader(99%): P99 time consumption for constructing Chunk Reader 
for node queries +- Read Chunk + - The time consumed of read chunk(avg): The average time taken for node queries to read Chunks + - The time consumed of read chunk(50%): Median time spent querying nodes to read Chunks + - The time consumed of read chunk(99%): P99 time spent on querying and reading Chunks for nodes +- Initialize Chunk Reader + - The time consumed of init chunk reader(avg): The average time spent initializing Chunk Reader for node queries + - The time consumed of init chunk reader(50%): Median time spent initializing Chunk Reader for node queries + - The time consumed of init chunk reader(99%):P99 time spent initializing Chunk Reader for node queries +- Constructing TsBlock Through Page Reader + - The time consumed of build tsblock from page reader(avg): The average time it takes for node queries to construct TsBlock through Page Reader + - The time consumed of build tsblock from page reader(50%): The median time spent on constructing TsBlock through Page Reader for node queries + - The time consumed of build tsblock from page reader(99%):Node query using Page Reader to construct TsBlock time-consuming P99 +- Query the construction of TsBlock through Merge Reader + - The time consumed of build tsblock from merge reader(avg): The average time taken for node queries to construct TsBlock through Merge Reader + - The time consumed of build tsblock from merge reader(50%): The median time spent on constructing TsBlock through Merge Reader for node queries + - The time consumed of build tsblock from merge reader(99%): Node query using Merge Reader to construct TsBlock time-consuming P99 + +#### Query Data Exchange + +The data exchange for the query is time-consuming. + +- Obtain TsBlock through source handle + - The time consumed of source handle get tsblock(avg): The average time taken for node queries to obtain TsBlock through source handle + - The time consumed of source handle get tsblock(50%):Node query obtains the median time spent on TsBlock through source handle + - The time consumed of source handle get tsblock(99%): Node query obtains TsBlock time P99 through source handle +- Deserialize TsBlock through source handle + - The time consumed of source handle deserialize tsblock(avg): The average time taken for node queries to deserialize TsBlock through source handle + - The time consumed of source handle deserialize tsblock(50%): The median time taken for node queries to deserialize TsBlock through source handle + - The time consumed of source handle deserialize tsblock(99%): P99 time spent on deserializing TsBlock through source handle for node query +- Send TsBlock through sink handle + - The time consumed of sink handle send tsblock(avg): The average time taken for node queries to send TsBlock through sink handle + - The time consumed of sink handle send tsblock(50%): Node query median time spent sending TsBlock through sink handle + - The time consumed of sink handle send tsblock(99%): Node query sends TsBlock through sink handle with a time consumption of P99 +- Callback data block event + - The time consumed of on acknowledge data block event task(avg): The average time taken for node query callback data block event + - The time consumed of on acknowledge data block event task(50%): Median time spent on node query callback data block event + - The time consumed of on acknowledge data block event task(99%): P99 time consumption for node query callback data block event +- Get Data Block Tasks + - The time consumed of get data block task(avg): The average time taken for 
node queries to obtain data block tasks + - The time consumed of get data block task(50%): The median time taken for node queries to obtain data block tasks + - The time consumed of get data block task(99%): P99 time consumption for node query to obtain data block task + +#### Query Related Resource + +- MppDataExchangeManager:The number of shuffle sink handles and source handles during node queries +- LocalExecutionPlanner: The remaining memory that nodes can allocate to query shards +- FragmentInstanceManager: The query sharding context information and the number of query shards that the node is running +- Coordinator: The number of queries recorded on the node +- MemoryPool Size: Node query related memory pool situation +- MemoryPool Capacity: The size of memory pools related to node queries, including maximum and remaining available values +- DriverScheduler: Number of queue tasks related to node queries + +#### Consensus - IoT Consensus + +- Memory Usage + - IoTConsensus Used Memory: The memory usage of IoT Consumes for nodes, including total memory usage, queue usage, and synchronization usage +- Synchronization Status Between Nodes + - IoTConsensus Sync Index: SyncIndex size for different DataRegions of IoT Consumption nodes + - IoTConsensus Overview:The total synchronization gap and cached request count of IoT consumption for nodes + - IoTConsensus Search Index Rate: The growth rate of writing SearchIndex for different DataRegions of IoT Consumer nodes + - IoTConsensus Safe Index Rate: The growth rate of synchronous SafeIndex for different DataRegions of IoT Consumer nodes + - IoTConsensus LogDispatcher Request Size: The request size for node IoT Consusus to synchronize different DataRegions to other nodes + - Sync Lag: The size of synchronization gap between different DataRegions in IoT Consumption node + - Min Peer Sync Lag: The minimum synchronization gap between different DataRegions and different replicas of node IoT Consumption + - Sync Speed Diff Of Peers: The maximum difference in synchronization from different DataRegions to different replicas for node IoT Consumption + - IoTConsensus LogEntriesFromWAL Rate: The rate at which nodes IoT Consumus obtain logs from WAL for different DataRegions + - IoTConsensus LogEntriesFromQueue Rate: The rate at which nodes IoT Consumes different DataRegions retrieve logs from the queue +- Different Execution Stages Take Time + - The Time Consumed Of Different Stages (avg): The average time spent on different execution stages of node IoT Consumus + - The Time Consumed Of Different Stages (50%): The median time spent on different execution stages of node IoT Consusus + - The Time Consumed Of Different Stages (99%):P99 of the time consumption for different execution stages of node IoT Consusus + +#### Consensus - DataRegion Ratis Consensus + +- Ratis Stage Time: The time consumption of different stages of node Ratis +- Write Log Entry: The time consumption of writing logs at different stages of node Ratis +- Remote / Local Write Time: The time it takes for node Ratis to write locally or remotely +- Remote / Local Write QPS: QPS written by node Ratis locally or remotely +- RatisConsensus Memory:Memory usage of node Ratis + +#### Consensus - SchemaRegion Ratis Consensus + +- Ratis Stage Time: The time consumption of different stages of node Ratis +- Write Log Entry: The time consumption for writing logs at each stage of node Ratis +- Remote / Local Write Time: The time it takes for node Ratis to write locally or remotelyThe time it takes for 
+- Remote / Local Write QPS: Remote and local write QPS of the node's Ratis
+- RatisConsensus Memory: Memory usage of the node's Ratis
\ No newline at end of file
diff --git a/src/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md
new file mode 100644
index 00000000..571f6724
--- /dev/null
+++ b/src/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md
@@ -0,0 +1,244 @@
+
+# Stand-Alone Deployment
+
+This chapter introduces how to start an IoTDB standalone instance, which includes 1 ConfigNode and 1 DataNode (commonly known as 1C1D).
+
+## Note
+
+1. Before installation, ensure that the system is complete by referring to [System Requirements](./Environment-Requirements.md).
+
+2. It is recommended to use `hostname` for IP configuration during deployment, which avoids the problem of a later host-IP change preventing the database from starting. To set the host name, configure /etc/hosts on the target server. For example, if the local IP is 192.168.1.3 and the host name is iotdb-1, you can use the following command to set the server's host name, and then configure IoTDB's `cn_internal_address`, `dn_internal_address`, and `dn_rpc_address` using the host name (a quick resolution check is shown after these notes).
+
+   ```shell
+   echo "192.168.1.3 iotdb-1" >> /etc/hosts
+   ```
+
+3. Some parameters cannot be modified after the first startup. Please refer to the "Parameter Configuration" section below for settings.
+
+4. Whether on Linux or Windows, ensure that the IoTDB installation path contains no spaces or Chinese characters to avoid software exceptions.
+
+5. Please note that when installing and deploying IoTDB (including activating and using the software), the same user must be used for all operations. You can:
+
+   - Use the root user (recommended): using root avoids permission issues.
+   - Use a fixed non-root user:
+     - Use the same user for all operations: ensure the same user is used for start, activation, stop, and other operations; do not switch users.
+     - Avoid sudo: sudo executes commands with root privileges, which may cause permission confusion or security issues.
+
+6. It is recommended to deploy a monitoring panel, which tracks important operational indicators so you can keep an eye on database status at any time. The monitoring panel can be obtained by contacting the business department; for deployment steps, see [Monitoring Panel Deployment](./Monitoring-panel-deployment.md).
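+Before starting any service, you can verify that the hosts entry from note 2 resolves as expected (192.168.1.3 and iotdb-1 are the example values used above):
+
+```shell
+# Expected output: 192.168.1.3   iotdb-1
+getent hosts iotdb-1
+```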
+
+## Installation Steps
+
+### 1、Unzip the installation package and enter the installation directory
+
+```Plain
+unzip timechodb-{version}-bin.zip
+cd timechodb-{version}-bin
+```
+
+### 2、Parameter Configuration
+
+#### Memory Configuration
+
+- conf/confignode-env.sh (or .bat)
+
+  | **Configuration** | **Description** | **Default** | **Recommended value** | **Note** |
+  | :---------------: | :-------------: | :---------: | :-------------------: | :------: |
+  | MEMORY_SIZE | The total amount of memory the IoTDB ConfigNode can use | empty | Set as needed; the system allocates memory based on this value | Takes effect after restarting the service |
+
+- conf/datanode-env.sh (or .bat)
+
+  | **Configuration** | **Description** | **Default** | **Recommended value** | **Note** |
+  | :---------------: | :-------------: | :---------: | :-------------------: | :------: |
+  | MEMORY_SIZE | The total amount of memory the IoTDB DataNode can use | empty | Set as needed; the system allocates memory based on this value | Takes effect after restarting the service |
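+As a sketch, giving the DataNode 8 GB of memory means setting the variable in conf/datanode-env.sh before startup (8G is an illustrative value; size it to your host, and check the comments in that file for the accepted format):
+
+```shell
+# conf/datanode-env.sh: total memory the DataNode may use
+MEMORY_SIZE=8G
+```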
+
+#### Function Configuration
+
+The parameters that actually take effect are in the file conf/iotdb-system.properties. The following parameters must be set before the first start; the full parameter list can be viewed in the same file.
+
+Cluster function configuration
+
+| **Configuration** | **Description** | **Default** | **Recommended value** | **Note** |
+| :---------------: | :-------------: | :---------: | :-------------------: | :------: |
+| cluster_name | Cluster name | defaultCluster | Set as needed; keep the default if there are no special requirements | Cannot be modified after initial startup |
+| schema_replication_factor | Number of metadata replicas; set to 1 for the standalone version | 1 | 1 | Default 1; cannot be modified after the first startup |
+| data_replication_factor | Number of data replicas; set to 1 for the standalone version | 1 | 1 | Default 1; cannot be modified after the first startup |
+
+ConfigNode Configuration
+
+| **Configuration** | **Description** | **Default** | **Recommended value** | **Note** |
+| :---------------: | :-------------: | :---------: | :-------------------: | :------: |
+| cn_internal_address | The address used by the ConfigNode for communication within the cluster | 127.0.0.1 | The IPv4 address or host name of the server; host name recommended | Cannot be modified after initial startup |
+| cn_internal_port | The port used by the ConfigNode for communication within the cluster | 10710 | 10710 | Cannot be modified after initial startup |
+| cn_consensus_port | The port used for ConfigNode replica group consensus protocol communication | 10720 | 10720 | Cannot be modified after initial startup |
+| cn_seed_config_node | The address of the ConfigNode the node connects to when registering to join the cluster, i.e. cn_internal_address:cn_internal_port | 127.0.0.1:10710 | cn_internal_address:cn_internal_port | Cannot be modified after initial startup |
+
+DataNode Configuration
+
+| **Configuration** | **Description** | **Default** | **Recommended value** | **Note** |
+| :---------------- | :-------------- | :---------- | :-------------------- | :------- |
+| dn_rpc_address | The address of the client RPC service | 0.0.0.0 | The IPv4 address or host name of the server; host name recommended | Takes effect after restarting the service |
+| dn_rpc_port | The port of the client RPC service | 6667 | 6667 | Takes effect after restarting the service |
+| dn_internal_address | The address used by the DataNode for communication within the cluster | 127.0.0.1 | The IPv4 address or host name of the server; host name recommended | Cannot be modified after initial startup |
+| dn_internal_port | The port used by the DataNode for communication within the cluster | 10730 | 10730 | Cannot be modified after initial startup |
+| dn_mpp_data_exchange_port | The port used by the DataNode to receive data streams | 10740 | 10740 | Cannot be modified after initial startup |
+| dn_data_region_consensus_port | The port used by the DataNode for data replica consensus protocol communication | 10750 | 10750 | Cannot be modified after initial startup |
+| dn_schema_region_consensus_port | The port used by the DataNode for metadata replica consensus protocol communication | 10760 | 10760 | Cannot be modified after initial startup |
+| dn_seed_config_node | The ConfigNode address the node connects to when registering to join the cluster, i.e. cn_internal_address:cn_internal_port | 127.0.0.1:10710 | cn_internal_address:cn_internal_port | Cannot be modified after initial startup |
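+Putting the tables above together, a minimal conf/iotdb-system.properties for a 1C1D instance on host iotdb-1 might look as follows (a sketch only; the host name is the example value from the notes, and every port keeps its default):
+
+```properties
+cluster_name=defaultCluster
+schema_replication_factor=1
+data_replication_factor=1
+cn_internal_address=iotdb-1
+cn_internal_port=10710
+cn_consensus_port=10720
+cn_seed_config_node=iotdb-1:10710
+dn_rpc_address=iotdb-1
+dn_rpc_port=6667
+dn_internal_address=iotdb-1
+dn_internal_port=10730
+dn_mpp_data_exchange_port=10740
+dn_data_region_consensus_port=10750
+dn_schema_region_consensus_port=10760
+dn_seed_config_node=iotdb-1:10710
+```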
+
+### 3、Start ConfigNode
+
+Enter the sbin directory of IoTDB and start the ConfigNode:
+
+```shell
+cd sbin
+./start-confignode.sh -d    # The "-d" parameter starts the process in the background
+```
+
+If the startup fails, please refer to [Common Problem](#common-problem).
+
+### 4、Start DataNode
+
+Enter the sbin directory of IoTDB and start the DataNode:
+
+```shell
+cd sbin
+./start-datanode.sh -d    # The "-d" parameter starts the process in the background
+```
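+At this point both services should be running. A quick check (a sketch: `jps` ships with the JDK, and the exact process names may vary slightly by version):
+
+```shell
+# Both a ConfigNode and a DataNode process should be listed
+jps | grep -E "ConfigNode|DataNode"
+```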
+
+### 5、Activate Database
+
+#### Method 1: Activation by license file
+
+- After starting the ConfigNode and DataNode, enter the activation folder and copy the system_info file to the Timecho staff
+- Receive the license file returned by the staff
+- Place the license file in the activation folder of the corresponding node
+
+#### Method 2: Activation by script
+
+- Obtain the machine code required for activation; enter the IoTDB CLI (./start-cli.sh -sql_dialect table or start-cli.bat -sql_dialect table) and execute the following:
+  - Note: this is temporarily not supported when sql_dialect is table
+
+```shell
+show system info
+```
+
+- The following information is displayed; please copy the machine code (i.e. the green string) to the Timecho staff:
+
+```sql
++--------------------------------------------------------------+
+|                                                    SystemInfo|
++--------------------------------------------------------------+
+|                                          01-TE5NLES4-UDDWCMYE|
++--------------------------------------------------------------+
+Total line number = 1
+It costs 0.030s
+```
+
+- Paste the activation code returned by the staff into the CLI as shown below
+  - Note: the activation code must be wrapped in `'` symbols before and after, as shown:
+
+```sql
+IoTDB> activate '01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA==='
+```
+
+### 6、Verify Activation
+
+When the "ClusterActivation Status" field is displayed as Activated, activation has succeeded.
+
+![](https://alioss.timecho.com/docs/img/%E5%8D%95%E6%9C%BA-%E9%AA%8C%E8%AF%81.png)
+
+## Common Problem
+
+1. Multiple prompts indicating activation failure during deployment
+
+   - Use the `ls -al` command to check whether the owner of the installation package root directory is the current user.
+   - Check whether the owner of all files in the `./activation` directory is the current user.
+
+2. ConfigNode fails to start
+
+   Step 1: Check the startup log for parameters that cannot be modified after the first startup.
+
+   Step 2: Check the startup log for any other abnormality. If the log shows anything abnormal, contact Timecho technical support for advice.
+
+   Step 3: If this is the first deployment, or if the data can be deleted, you can also clean up the environment as follows, redeploy, and restart.
+
+   Step 4: Clean up the environment:
+
+   a. Terminate all ConfigNode and DataNode processes.
+
+   ```Bash
+   # 1. Stop the ConfigNode and DataNode services
+   sbin/stop-standalone.sh
+
+   # 2. Check for any remaining processes
+   jps
+   # Or
+   ps -ef|grep iotdb
+
+   # 3. If any processes remain, kill them manually (replace <pid> with the process id)
+   kill -9 <pid>
+   # If you are sure there is only one IoTDB instance on the machine, you can clean up residual processes with:
+   ps -ef|grep iotdb|grep -v grep|tr -s ' ' ' ' |cut -d ' ' -f2|xargs kill -9
+   ```
+
+   b. Delete the data and logs directories.
+
+   Explanation: deleting the data directory is necessary; deleting the logs directory only keeps the logs clean and is optional.
+
+   ```Bash
+   cd /data/iotdb
+   rm -rf data logs
+   ```
+
+## Appendix
+
+### Introduction to ConfigNode Parameters
+
+| Parameter | Description | Is it required |
+| :-------- | :---------------------------------------------- | :------------- |
+| -d | Start in daemon mode, running in the background | No |
+
+### Introduction to DataNode Parameters
+
+| Abbreviation | Description | Is it required |
+| :----------- | :----------------------------------------------------------- | :------------- |
+| -v | Show version information | No |
+| -f | Run the script in the foreground, do not put it in the background | No |
+| -d | Start in daemon mode, i.e. run in the background | No |
+| -p | Specify a file to store the process ID for process management | No |
+| -c | Specify the path to the configuration file folder; the script will load the configuration file from there | No |
+| -g | Print detailed garbage collection (GC) information | No |
+| -H | Specify the path of the Java heap dump file, used when the JVM runs out of memory | No |
+| -E | Specify the path of the JVM error log file | No |
+| -D | Define system properties, in the format key=value | No |
+| -X | Pass -XX parameters directly to the JVM | No |
+| -h | Help instruction | No |
+
diff --git a/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md
new file mode 100644
index 00000000..f44f729b
--- /dev/null
+++ b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Cluster-Deployment_timecho.md
@@ -0,0 +1,362 @@
+
+# 集群版安装部署
+
+本小节描述如何手动部署包括3个ConfigNode和3个DataNode的实例,即通常所说的3C3D集群。
+
+## 注意事项
+
+1. 安装前请确认系统已参照[系统配置](../Deployment-and-Maintenance/Environment-Requirements.md)准备完成。
+
+2. 推荐使用`hostname`进行IP配置,可避免后期修改主机ip导致数据库无法启动的问题。设置hostname需要在服务器上配置`/etc/hosts`,如本机ip是11.101.17.224,hostname是iotdb-1,则可以使用以下命令设置服务器的 hostname,并使用hostname配置IoTDB的`cn_internal_address`、`dn_internal_address`。
+
+   ```shell
+   echo "11.101.17.224 iotdb-1" >> /etc/hosts
+   ```
+
+3. 有些参数首次启动后不能修改,请参考下方的[参数配置](#参数配置)章节来进行设置。
+
+4. 无论是在linux还是windows中,请确保IoTDB的安装路径中不含空格和中文,避免软件运行异常。
+
+5. 请注意,安装部署(包括激活和使用软件)IoTDB时,您可以:
+
+- 使用 root 用户(推荐):可以避免权限等问题。
+
+- 使用固定的非 root 用户:
+
+  - 使用同一用户操作:确保在启动、激活、停止等操作均保持使用同一用户,不要切换用户。
+
+  - 避免使用 sudo:使用 sudo 命令会以 root 用户权限执行命令,可能会引起权限混淆或安全问题。
+
+6. 推荐部署监控面板,可以对重要运行指标进行监控,随时掌握数据库运行状态,监控面板可以联系商务获取,部署监控面板步骤可以参考:[监控面板部署](./Monitoring-panel-deployment.md)
+
+## 准备步骤
+
+1. 准备IoTDB数据库安装包:timechodb-{version}-bin.zip(安装包获取见:[链接](./IoTDB-Package_timecho.md))
+2. 按环境要求配置好操作系统环境(系统环境配置见:[链接](./Environment-Requirements.md))
+
+## 安装步骤
+
+假设现在有3台linux服务器,IP地址和服务角色分配如下:
+
+| 节点ip | 主机名 | 服务 |
+| ------------- | ------- | -------------------- |
+| 11.101.17.224 | iotdb-1 | ConfigNode、DataNode |
+| 11.101.17.225 | iotdb-2 | ConfigNode、DataNode |
+| 11.101.17.226 | iotdb-3 | ConfigNode、DataNode |
+
+### 设置主机名
+
+在3台机器上分别配置主机名,设置主机名需要在目标服务器上配置/etc/hosts,使用如下命令:
+
+```shell
+echo "11.101.17.224 iotdb-1" >> /etc/hosts
+echo "11.101.17.225 iotdb-2" >> /etc/hosts
+echo "11.101.17.226 iotdb-3" >> /etc/hosts
+```
+
+### 参数配置
+
+解压安装包并进入安装目录
+
+```shell
+unzip timechodb-{version}-bin.zip
+cd timechodb-{version}-bin
+```
+
+#### 环境脚本配置
+
+- ./conf/confignode-env.sh配置
+
+| **配置项** | **说明** | **默认值** | **推荐值** | 备注 |
+| :---------- | :------------------------------------- | :--------- | :----------------------------------------------- | :----------- |
+| MEMORY_SIZE | IoTDB ConfigNode节点可以使用的内存总量 | 空 | 可按需填写,填写后系统会根据填写的数值来分配内存 | 重启服务生效 |
+
+- ./conf/datanode-env.sh配置
+
+| **配置项** | **说明** | **默认值** | **推荐值** | 备注 |
+| :---------- | :----------------------------------- | :--------- | :----------------------------------------------- | :----------- |
+| MEMORY_SIZE | IoTDB DataNode节点可以使用的内存总量 | 空 | 可按需填写,填写后系统会根据填写的数值来分配内存 | 重启服务生效 |
+
+#### 通用配置(./conf/iotdb-system.properties)
+
+- 集群配置
+
+| 配置项 | 说明 | 11.101.17.224 | 11.101.17.225 | 11.101.17.226 |
+| ------------------------- | ---------------------------------------- | -------------- | -------------- | -------------- |
+| cluster_name | 集群名称 | defaultCluster | defaultCluster | defaultCluster |
+| schema_replication_factor | 元数据副本数,DataNode数量不应少于此数目 | 3 | 3 | 3 |
+| data_replication_factor | 数据副本数,DataNode数量不应少于此数目 | 2 | 2 | 2 |
+
+#### ConfigNode 配置
+
+| 配置项 | 说明 | 默认 | 推荐值 | 11.101.17.224 | 11.101.17.225 | 11.101.17.226 | 备注 |
+| ------------------- | ------------------------------------------------------------ | --------------- | ------------------------------------------------------- | ------------- | ------------- | ------------- | ------------------ |
+| cn_internal_address | ConfigNode在集群内部通讯使用的地址 | 127.0.0.1 | 所在服务器的IPV4地址或hostname,推荐使用hostname | iotdb-1 | iotdb-2 | iotdb-3 | 首次启动后不能修改 |
+| cn_internal_port | ConfigNode在集群内部通讯使用的端口 | 10710 | 10710 | 10710 | 10710 | 10710 | 首次启动后不能修改 |
+| cn_consensus_port | ConfigNode副本组共识协议通信使用的端口 | 10720 | 10720 | 10720 | 10720 | 10720 | 首次启动后不能修改 |
+| cn_seed_config_node | 节点注册加入集群时连接的ConfigNode 的地址,cn_internal_address:cn_internal_port | 127.0.0.1:10710 | 第一个ConfigNode的cn_internal_address:cn_internal_port | iotdb-1:10710 | iotdb-1:10710 | iotdb-1:10710 | 首次启动后不能修改 |
+
+#### DataNode 配置
+
+| 配置项 | 说明 | 默认 | 推荐值 | 11.101.17.224 | 11.101.17.225 | 11.101.17.226 | 备注 |
+| ------------------------------- | ------------------------------------------------------------ | --------------- | ------------------------------------------------------- | ------------- | ------------- | ------------- | ------------------ |
+| dn_rpc_address | 客户端 RPC 服务的地址 | 0.0.0.0 | 0.0.0.0 | 0.0.0.0 | 0.0.0.0 | 0.0.0.0 | 重启服务生效 |
+| dn_rpc_port | 客户端 RPC 服务的端口 | 6667 | 6667 | 6667 | 6667 | 6667 | 重启服务生效 |
+| dn_internal_address | DataNode在集群内部通讯使用的地址 | 127.0.0.1 | 所在服务器的IPV4地址或hostname,推荐使用hostname | iotdb-1 | iotdb-2 | iotdb-3 | 首次启动后不能修改 |
+| dn_internal_port | DataNode在集群内部通信使用的端口 | 10730 | 10730 | 10730 | 10730 | 10730 | 首次启动后不能修改 |
+| dn_mpp_data_exchange_port | DataNode用于接收数据流使用的端口 | 10740 | 10740 | 10740 | 10740 | 10740 | 首次启动后不能修改 |
+| dn_data_region_consensus_port | DataNode用于数据副本共识协议通信使用的端口 | 10750 | 10750 | 10750 | 10750 | 10750 | 首次启动后不能修改 |
+| dn_schema_region_consensus_port | DataNode用于元数据副本共识协议通信使用的端口 | 10760 | 10760 | 10760 | 10760 | 10760 | 首次启动后不能修改 |
+| dn_seed_config_node | 节点注册加入集群时连接的ConfigNode地址,即cn_internal_address:cn_internal_port | 127.0.0.1:10710 | 第一个ConfigNode的cn_internal_address:cn_internal_port | iotdb-1:10710 | iotdb-1:10710 | iotdb-1:10710 | 首次启动后不能修改 |
+
+> ❗️注意:VSCode Remote等编辑器无自动保存配置功能,请确保修改的文件被持久化保存,否则配置项无法生效
+
+### 启动ConfigNode节点
+
+先启动第一个iotdb-1的confignode,保证种子confignode节点先启动,然后依次启动第2和第3个confignode节点
+
+```shell
+cd sbin
+./start-confignode.sh -d #“-d”参数将在后台进行启动
+```
+
+如果启动失败,请参考下[常见问题](#常见问题)
+
+### 启动DataNode 节点
+
+分别进入iotdb的sbin目录下,依次启动3个datanode节点:
+
+```shell
+cd sbin
+./start-datanode.sh -d #-d参数将在后台进行启动
+```
+
+### 激活数据库
+
+#### 方式一:激活文件拷贝激活
+
+- 依次启动3个ConfigNode、DataNode节点后,进入每台机器各自的activation文件夹,分别拷贝每台机器的system_info文件给天谋工作人员;
+- 工作人员将返回每个ConfigNode、DataNode节点的license文件,这里会返回3个license文件;
+- 将3个license文件分别放入对应的ConfigNode节点的activation文件夹下;
+
+#### 方式二:激活脚本激活
+
+- 依次获取3台机器的机器码,进入到IoTDB树模型的CLI中(./start-cli.sh -sql_dialect table / start-cli.bat -sql_dialect table),执行以下内容:
+  - 注:当 sql_dialect 为 table 时,暂时不支持使用
+
+```shell
+show system info
+```
+
+- 显示如下信息,这里显示的是1台机器的机器码:
+
+```shell
++--------------------------------------------------------------+
+|                                                    SystemInfo|
++--------------------------------------------------------------+
+|01-TE5NLES4-UDDWCMYE,01-GG5NLES4-XXDWCMYE,01-FF5NLES4-WWWWCMYE|
++--------------------------------------------------------------+
+Total line number = 1
+It costs 0.030s
+```
+
+- 其他2个节点依次进入到IoTDB树模型的CLI中,执行语句后将获取的3台机器的机器码都复制给天谋工作人员
+- 工作人员会返回3段激活码,正常是与提供的3个机器码的顺序对应的,请分别将各自的激活码粘贴到CLI中,如下提示:
+  - 注:激活码前后需要用`'`符号进行标注,如下所示
+
+```shell
+ IoTDB> activate '01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA===,01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA===,01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA==='
+```
+
+### 验证激活
+
+当看到“Result”字段状态显示为success表示激活成功
+
+![](https://alioss.timecho.com/docs/img/%E9%9B%86%E7%BE%A4-%E9%AA%8C%E8%AF%81.png)
+
+## 节点维护步骤
+
+### ConfigNode节点维护
+
+ConfigNode节点维护分为ConfigNode添加和移除两种操作,有两个常见使用场景:
+
+- 集群扩展:如集群中只有1个ConfigNode时,希望增加ConfigNode以提升ConfigNode节点高可用性,则可以添加2个ConfigNode,使得集群中有3个ConfigNode。
+- 集群故障恢复:1个ConfigNode所在机器发生故障,使得该ConfigNode无法正常运行,此时可以移除该ConfigNode,然后添加一个新的ConfigNode进入集群。
+
+> ❗️注意,在完成ConfigNode节点维护后,需要保证集群中有1或者3个正常运行的ConfigNode。2个ConfigNode不具备高可用性,超过3个ConfigNode会导致性能损失。
+
+#### 添加ConfigNode节点
+
+脚本命令:
+
+```shell
+# Linux / MacOS
+# 首先切换到IoTDB根目录
+sbin/start-confignode.sh
+
+# Windows
+# 首先切换到IoTDB根目录
+sbin/start-confignode.bat
+```
+
+#### 移除ConfigNode节点
+
+首先通过CLI连接集群,通过`show confignodes`确认想要移除ConfigNode的内部地址与端口号:
+
+```shell
+IoTDB> show confignodes
++------+-------+---------------+------------+--------+
+|NodeID| Status|InternalAddress|InternalPort|    Role|
++------+-------+---------------+------------+--------+
+|     0|Running|      127.0.0.1|       10710|  Leader|
+|     1|Running|      127.0.0.1|       10711|Follower|
+|     2|Running|      127.0.0.1|       10712|Follower|
++------+-------+---------------+------------+--------+
+Total line number = 3
+It costs 0.030s
+```
+
+然后使用脚本将ConfigNode移除。脚本命令:
+
+```Bash
+# Linux / MacOS
+sbin/remove-confignode.sh [confignode_id]
+或
+./sbin/remove-confignode.sh [cn_internal_address:cn_internal_port]
+
+#Windows
+sbin/remove-confignode.bat [confignode_id]
+或
+./sbin/remove-confignode.bat [cn_internal_address:cn_internal_port]
+```
+
+### DataNode节点维护
+
+DataNode节点维护有两个常见场景:
+
+- 集群扩容:出于集群能力扩容等目的,添加新的DataNode进入集群
+- 集群故障恢复:一个DataNode所在机器出现故障,使得该DataNode无法正常运行,此时可以移除该DataNode,并添加新的DataNode进入集群
+
+> ❗️注意,为了使集群能正常工作,在DataNode节点维护过程中以及维护完成后,正常运行的DataNode总数不得少于数据副本数(通常为2),也不得少于元数据副本数(通常为3)。
+
+#### 添加DataNode节点
+
+脚本命令:
+
+```Bash
+# Linux / MacOS
+# 首先切换到IoTDB根目录
+sbin/start-datanode.sh
+
+#Windows
+# 首先切换到IoTDB根目录
+sbin/start-datanode.bat
+```
+
+说明:在添加DataNode后,随着新的写入到来(以及旧数据过期,如果设置了TTL),集群负载会逐渐向新的DataNode均衡,最终在所有节点上达到存算资源的均衡。
+
+#### 移除DataNode节点
+
+首先通过CLI连接集群,通过`show datanodes`确认想要移除的DataNode的RPC地址与端口号:
+
+```Bash
+IoTDB> show datanodes
++------+-------+----------+-------+-------------+---------------+
+|NodeID| Status|RpcAddress|RpcPort|DataRegionNum|SchemaRegionNum|
++------+-------+----------+-------+-------------+---------------+
+|     1|Running|   0.0.0.0|   6667|            0|              0|
+|     2|Running|   0.0.0.0|   6668|            1|              1|
+|     3|Running|   0.0.0.0|   6669|            1|              0|
++------+-------+----------+-------+-------------+---------------+
+Total line number = 3
+It costs 0.110s
+```
+
+然后使用脚本将DataNode移除。脚本命令:
+
+```Bash
+# Linux / MacOS
+sbin/remove-datanode.sh [dn_rpc_address:dn_rpc_port]
+
+#Windows
+sbin/remove-datanode.bat [dn_rpc_address:dn_rpc_port]
+```
+
+## 常见问题
+
+1. 部署过程中多次提示激活失败
+   - 使用 `ls -al` 命令:使用 `ls -al` 命令检查安装包根目录的所有者信息是否为当前用户。
+   - 检查激活目录:检查 `./activation` 目录下的所有文件,所有者信息是否为当前用户。
+2. Confignode节点启动失败
+   - 步骤 1: 请查看启动日志,检查是否修改了某些首次启动后不可改的参数。
+   - 步骤 2: 请查看启动日志,检查是否出现其他异常。日志中若存在异常现象,请联系天谋技术支持人员咨询解决方案。
+   - 步骤 3: 如果是首次部署或者数据可删除,也可按下述步骤清理环境,重新部署后,再次启动。
+   - 清理环境:
+
+     1. 结束所有 ConfigNode 和 DataNode 进程。
+     ```Bash
+     # 1. 停止 ConfigNode 和 DataNode 服务
+     sbin/stop-standalone.sh
+
+     # 2. 检查是否还有进程残留
+     jps
+     # 或者
+     ps -ef|grep iotdb
+
+     # 3. 如果有进程残留,则手动kill(将 <pid> 替换为残留进程号)
+     kill -9 <pid>
+     # 如果确定机器上仅有1个iotdb,可以使用下面命令清理残留进程
+     ps -ef|grep iotdb|grep -v grep|tr -s ' ' ' ' |cut -d ' ' -f2|xargs kill -9
+     ```
+
+     2. 删除 data 和 logs 目录。
+        - 说明:删除 data 目录是必要的,删除 logs 目录是为了纯净日志,非必需。
+     ```shell
+     cd /data/iotdb
+     rm -rf data logs
+     ```
+
+## 附录
+
+### Confignode节点参数介绍
+
+| 参数 | 描述 | 是否为必填项 |
+| :--- | :------------------------------- | :----------- |
+| -d | 以守护进程模式启动,即在后台运行 | 否 |
+
+### Datanode节点参数介绍
+
+| 缩写 | 描述 | 是否为必填项 |
+| :--- | :--------------------------------------------- | :----------- |
+| -v | 显示版本信息 | 否 |
+| -f | 在前台运行脚本,不将其放到后台 | 否 |
+| -d | 以守护进程模式启动,即在后台运行 | 否 |
+| -p | 指定一个文件来存放进程ID,用于进程管理 | 否 |
+| -c | 指定配置文件夹的路径,脚本会从这里加载配置文件 | 否 |
+| -g | 打印垃圾回收(GC)的详细信息 | 否 |
+| -H | 指定Java堆转储文件的路径,当JVM内存溢出时使用 | 否 |
+| -E | 指定JVM错误日志文件的路径 | 否 |
+| -D | 定义系统属性,格式为 key=value | 否 |
+| -X | 直接传递 -XX 参数给 JVM | 否 |
+| -h | 帮助指令 | 否 |
+
diff --git a/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md
new file mode 100644
index 00000000..17e09aa0
--- /dev/null
+++ b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Database-Resources.md
@@ -0,0 +1,193 @@
+
+# 资源规划
+## CPU
+
+| 序列数(采集频率<=1Hz) | CPU | 节点数(单机) | 节点数(双活) | 节点数(分布式) |
+| :---: | :---: | :---: | :---: | :---: |
+| 10W以内 | 2核-4核 | 1 | 2 | 3 |
+| 30W以内 | 4核-8核 | 1 | 2 | 3 |
+| 50W以内 | 8核-16核 | 1 | 2 | 3 |
+| 100W以内 | 16核-32核 | 1 | 2 | 3 |
+| 200w以内 | 32核-48核 | 1 | 2 | 3 |
+| 1000w以内 | 48核 | 1 | 2 | 请联系天谋商务咨询 |
+| 1000w以上 | 请联系天谋商务咨询 | | | |
+
+## 内存
+
+| 序列数(采集频率<=1Hz) | 内存 | 节点数(单机) | 节点数(双活) | 节点数(分布式) |
+| :---: | :---: | :---: | :---: | :---: |
+| 10W以内 | 4G-8G | 1 | 2 | 3 |
+| 30W以内 | 12G-32G | 1 | 2 | 3 |
+| 50W以内 | 24G-48G | 1 | 2 | 3 |
+| 100W以内 | 32G-96G | 1 | 2 | 3 |
+| 200w以内 | 64G-128G | 1 | 2 | 3 |
+| 1000w以内 | 128G | 1 | 2 | 请联系天谋商务咨询 |
+| 1000w以上 | 请联系天谋商务咨询 | | | |
+
+## 存储(磁盘)
+### 存储空间
+计算公式:测点数量 * 采样频率(Hz)* 每个数据点大小(Byte,不同数据类型不一样,见下表)
+
+数据点大小计算表
+
+| 数据类型 | 时间戳(字节) | 值(字节) | 数据点总大小(字节) |
+| :---: | :---: | :---: | :---: |
+| 开关量(Boolean) | 8 | 1 | 9 |
+| 整型(INT32)/ 单精度浮点数(FLOAT) | 8 | 4 | 12 |
+| 长整型(INT64)/ 双精度浮点数(DOUBLE) | 8 | 8 | 16 |
+| 字符串(TEXT) | 8 | 平均为a | 8+a |
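+按上面的公式,也可以用一行 shell 快速估算所需空间。下面的示意取下一节示例中的数值(1000 设备 × 100 测点、INT32、1Hz、保存 1 年、3 副本、压缩比 10),仅供验算:
+
+```shell
+# 测点数 * 每数据点字节数 * 秒/年 * 副本数 / 压缩比,结果约 11TB
+echo $(( 1000 * 100 * 12 * 86400 * 365 * 3 / 10 )) bytes
+```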
+
+示例:1000设备,每个设备100测点,共 100000 序列,INT32 类型。采样频率1Hz(每秒一次),存储1年,3副本。
+- 完整计算公式:1000设备 * 100测点 * 12字节每数据点 * 86400秒每天 * 365天每年 * 3副本 / 10压缩比 = 11T
+- 简版计算公式:1000 * 100 * 12 * 86400 * 365 * 3 / 10 = 11T
+### 存储配置
+1000w 点位以上或查询负载较大,推荐配置 SSD。
+## 网络(网卡)
+在写入吞吐不超过1000万点/秒时,需配置千兆网卡;当写入吞吐超过 1000万点/秒时,需配置万兆网卡。
+
+| **写入吞吐(数据点/秒)** | **网卡速率** |
+| ------------------- | ------------- |
+| <1000万 | 1Gbps(千兆) |
+| >=1000万 | 10Gbps(万兆) |
+
+## 其他说明
+IoTDB 具有集群秒级扩容能力,扩容节点数据可不迁移,因此您无需担心按现有数据情况估算的集群能力有限,未来您可在需要扩容时为集群加入新的节点。
\ No newline at end of file
diff --git a/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md
new file mode 100644
index 00000000..99c5b14c
--- /dev/null
+++ b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Environment-Requirements.md
@@ -0,0 +1,205 @@
+
+# 系统配置
+
+## 磁盘阵列
+
+### 配置建议
+
+IoTDB对磁盘阵列配置没有严格运行要求,推荐使用多个磁盘阵列存储IoTDB的数据,以达到多个磁盘阵列并发写入的目标,配置可参考以下建议:
+
+1. 物理环境
+   - 系统盘:建议使用2块磁盘做Raid1,仅考虑操作系统自身所占空间即可,可以不为IoTDB预留系统盘空间
+   - 数据盘:建议做Raid,在磁盘维度进行数据保护;建议为IoTDB提供多块磁盘(1-6块左右)或磁盘组(不建议将所有磁盘做成一个磁盘阵列,会影响 IoTDB的性能上限)
+2. 虚拟环境
+   - 建议挂载多块硬盘(1-6块左右)
+
+### 配置示例
+
+- 示例1,4块3.5英寸硬盘
+
+因服务器安装的硬盘较少,直接做Raid5即可,无需其他配置。
+
+推荐配置如下:
+
+| **使用分类** | **Raid类型** | **硬盘数量** | **冗余** | **可用容量** |
+| ----------- | -------- | -------- | --------- | -------- |
+| 系统/数据盘 | RAID5 | 4 | 允许坏1块 | 3 |
+
+- 示例2,12块3.5英寸硬盘
+
+服务器配置12块3.5英寸盘。
+
+前2块盘推荐Raid1作系统盘,2组数据盘可分为2组Raid5,每组5块盘实际可用4块。
+
+推荐配置如下:
+
+| **使用分类** | **Raid类型** | **硬盘数量** | **冗余** | **可用容量** |
+| -------- | -------- | -------- | --------- | -------- |
+| 系统盘 | RAID1 | 2 | 允许坏1块 | 1 |
+| 数据盘 | RAID5 | 5 | 允许坏1块 | 4 |
+| 数据盘 | RAID5 | 5 | 允许坏1块 | 4 |
+
+- 示例3,24块2.5英寸盘
+
+服务器配置24块2.5英寸盘。
+
+前2块盘推荐Raid1作系统盘,后面可分为3组Raid5,每组7块盘实际可用6块。剩余一块可闲置或存储写前日志使用。
+
+推荐配置如下:
+
+| **使用分类** | **Raid类型** | **硬盘数量** | **冗余** | **可用容量** |
+| -------- | -------- | -------- | --------- | -------- |
+| 系统盘 | RAID1 | 2 | 允许坏1块 | 1 |
+| 数据盘 | RAID5 | 7 | 允许坏1块 | 6 |
+| 数据盘 | RAID5 | 7 | 允许坏1块 | 6 |
+| 数据盘 | RAID5 | 7 | 允许坏1块 | 6 |
+| 数据盘 | NoRaid | 1 | 损坏丢失 | 1 |
+
+## 操作系统
+
+### 版本要求
+
+IoTDB支持Linux、Windows、MacOS等操作系统,同时企业版支持龙芯、飞腾、鲲鹏等国产 CPU,支持中标麒麟、银河麒麟、统信、凝思等国产服务器操作系统。
+
+### 硬盘分区
+
+- 建议使用默认的标准分区方式,不推荐LVM扩展和硬盘加密。
+- 系统盘只需满足操作系统的使用空间即可,不需要为IoTDB预留空间。
+- 每个硬盘组只对应一个分区即可,数据盘(里面有多个磁盘组,对应raid)不用再额外分区,所有空间给IoTDB使用。
+
+建议的磁盘分区方式如下表所示。
+
+| 硬盘分类 | 磁盘组 | 对应盘符 | 大小 | 文件系统类型 |
+| :---: | :---: | :---: | :---: | :---: |
+| 系统盘 | 磁盘组0 | /boot | 1GB | 默认 |
+| 系统盘 | 磁盘组0 | / | 磁盘组剩余全部空间 | 默认 |
+| 数据盘 | 磁盘组1 | /data1 | 磁盘组1全部空间 | 默认 |
+| 数据盘 | 磁盘组2 | /data2 | 磁盘组2全部空间 | 默认 |
+| …… | …… | …… | …… | …… |
+ +### 网络配置 + +1. 关闭防火墙 + +```Bash +# 查看防火墙 +systemctl status firewalld +# 关闭防火墙 +systemctl stop firewalld +# 永久关闭防火墙 +systemctl disable firewalld +``` + +2. 保证所需端口不被占用 + +(1)集群占用端口的检查:在集群默认配置中,ConfigNode 会占用端口 10710 和 10720,DataNode 会占用端口 6667、10730、10740、10750 、10760、9090、9190、3000请确保这些端口未被占用。检查方式如下: + +```Bash +lsof -i:6667 或 netstat -tunp | grep 6667 +lsof -i:10710 或 netstat -tunp | grep 10710 +lsof -i:10720 或 netstat -tunp | grep 10720 +#如果命令有输出,则表示该端口已被占用。 +``` + +(2)集群部署工具占用端口的检查:使用集群管理工具opskit安装部署集群时,需打开SSH远程连接服务配置,并开放22号端口。 + +```Bash +yum install openssh-server #安装ssh服务 +systemctl start sshd #启用22号端口 +``` + +3. 保证服务器之间的网络相互连通 + +### 其他配置 + +1. 关闭系统 swap 内存 + +```Bash +echo "vm.swappiness = 0">> /etc/sysctl.conf +# 一起执行 swapoff -a 和 swapon -a 命令是为了将 swap 里的数据转储回内存,并清空 swap 里的数据。 +# 不可省略 swappiness 设置而只执行 swapoff -a;否则,重启后 swap 会再次自动打开,使得操作失效。 +swapoff -a && swapon -a +# 在不重启的情况下使配置生效。 +sysctl -p +# 检查内存分配,预期 swap 为 0 +free -m +``` + +2. 设置系统最大打开文件数为 65535,以避免出现 "太多的打开文件 "的错误。 + +```Bash +#查看当前限制 +ulimit -n +# 临时修改 +ulimit -n 65535 +# 永久修改 +echo "* soft nofile 65535" >> /etc/security/limits.conf +echo "* hard nofile 65535" >> /etc/security/limits.conf +#退出当前终端会话后查看,预期显示65535 +ulimit -n +``` + +## 软件依赖 + +安装 Java 运行环境 ,Java 版本 >= 1.8,请确保已设置 jdk 环境变量。(V1.3.2.2 及之上版本推荐直接部署JDK17,老版本JDK部分场景下性能有问题,且datanode会出现stop不掉的问题) + +```Bash + #下面以在centos7,使用JDK-17安装为例: + tar -zxvf jdk-17_linux-x64_bin.tar #解压JDK文件 + Vim ~/.bashrc #配置JDK环境 + { export JAVA_HOME=/usr/lib/jvm/jdk-17.0.9 + export PATH=$JAVA_HOME/bin:$PATH + } #添加JDK环境变量 + source ~/.bashrc #配置环境生效 + java -version #检查JDK环境 +``` \ No newline at end of file diff --git a/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md new file mode 100644 index 00000000..6c66c7fb --- /dev/null +++ b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/IoTDB-Package_timecho.md @@ -0,0 +1,45 @@ + +# 安装包获取 +## 获取方式 + +企业版安装包可通过产品试用申请,或直接联系与您对接的工作人员获取。 + +## 安装包结构 + +安装包解压后目录结构如下: + +| **目录** | **类型** | **说明** | +| :--------------- | :------- | :----------------------------------------------------------- | +| activation | 文件夹 | 激活文件所在目录,包括生成的机器码以及从天谋工作人员获取的企业版激活码(启动ConfigNode后才会生成该目录,即可获取激活码) | +| conf | 文件夹 | 配置文件目录,包含 ConfigNode、DataNode、JMX 和 logback 等配置文件 | +| data | 文件夹 | 默认的数据文件目录,包含 ConfigNode 和 DataNode 的数据文件。(启动程序后才会生成该目录) | +| lib | 文件夹 | 库文件目录 | +| licenses | 文件夹 | 开源协议证书文件目录 | +| logs | 文件夹 | 默认的日志文件目录,包含 ConfigNode 和 DataNode 的日志文件(启动程序后才会生成该目录) | +| sbin | 文件夹 | 主要脚本目录,包含数据库启、停等脚本 | +| tools | 文件夹 | 工具目录 | +| ext | 文件夹 | pipe,trigger,udf插件的相关文件 | +| LICENSE | 文件 | 开源许可证文件 | +| NOTICE | 文件 | 开源声明文件 | +| README_ZH.md | 文件 | 使用说明(中文版) | +| README.md | 文件 | 使用说明(英文版) | +| RELEASE_NOTES.md | 文件 | 版本说明 | diff --git a/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md new file mode 100644 index 00000000..c7fba837 --- /dev/null +++ b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Monitoring-panel-deployment.md @@ -0,0 +1,682 @@ + +# 监控面板部署 + +IoTDB配套监控面板是IoTDB企业版配套工具之一。它旨在解决IoTDB及其所在操作系统的监控问题,主要包括:操作系统资源监控、IoTDB性能监控,及上百项内核监控指标,从而帮助用户监控集群健康状态,并进行集群调优和运维。本文将以常见的3C3D集群(3个Confignode和3个Datanode)为例,为您介绍如何在IoTDB的实例中开启系统监控模块,并且使用Prometheus + Grafana的方式完成对系统监控指标的可视化。 + +## 安装准备 + +1. 安装 IoTDB:需先安装IoTDB V1.0 版本及以上企业版,您可联系商务或技术支持获取 +2. 
获取 IoTDB 监控面板安装包:基于企业版 IoTDB 的数据库监控面板,您可联系商务或技术支持获取 + +## 安装步骤 + +### 步骤一:IoTDB开启监控指标采集 + +1. 打开监控配置项。IoTDB中监控有关的配置项默认是关闭的,在部署监控面板前,您需要打开相关配置项(注意开启监控配置后需要重启服务)。 + +| 配置项 | 所在配置文件 | 配置说明 | +| :--------------------------------- | :------------------------------- | :----------------------------------------------------------- | +| cn_metric_reporter_list | conf/iotdb-system.properties | 将配置项取消注释,值设置为PROMETHEUS | +| cn_metric_level | conf/iotdb-system.properties | 将配置项取消注释,值设置为IMPORTANT | +| cn_metric_prometheus_reporter_port | conf/iotdb-system.properties | 将配置项取消注释,可保持默认设置9091,如设置其他端口,不与其他端口冲突即可 | +| dn_metric_reporter_list | conf/iotdb-system.properties | 将配置项取消注释,值设置为PROMETHEUS | +| dn_metric_level | conf/iotdb-system.properties | 将配置项取消注释,值设置为IMPORTANT | +| dn_metric_prometheus_reporter_port | conf/iotdb-system.properties | 将配置项取消注释,可默认设置为9092,如设置其他端口,不与其他端口冲突即可 | + +以3C3D集群为例,需要修改的监控配置如下: + +| 节点ip | 主机名 | 集群角色 | 配置文件路径 | 配置项 | +| ----------- | ------- | ---------- | -------------------------------- | ------------------------------------------------------------ | +| 192.168.1.3 | iotdb-1 | confignode | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS cn_metric_level=IMPORTANT cn_metric_prometheus_reporter_port=9091 | +| 192.168.1.4 | iotdb-2 | confignode | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS cn_metric_level=IMPORTANT cn_metric_prometheus_reporter_port=9091 | +| 192.168.1.5 | iotdb-3 | confignode | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS cn_metric_level=IMPORTANT cn_metric_prometheus_reporter_port=9091 | +| 192.168.1.3 | iotdb-1 | datanode | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS dn_metric_level=IMPORTANT dn_metric_prometheus_reporter_port=9092 | +| 192.168.1.4 | iotdb-2 | datanode | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS dn_metric_level=IMPORTANT dn_metric_prometheus_reporter_port=9092 | +| 192.168.1.5 | iotdb-3 | datanode | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS dn_metric_level=IMPORTANT dn_metric_prometheus_reporter_port=9092 | + +2. 重启所有节点。修改3个节点的监控指标配置后,可重新启动所有节点的confignode和datanode服务: + +```shell +./sbin/stop-standalone.sh #先停止confignode和datanode +./sbin/start-confignode.sh -d #启动confignode +./sbin/start-datanode.sh -d #启动datanode +``` + +3. 重启后,通过客户端确认各节点的运行状态,若状态都为Running,则为配置成功: + +![](https://alioss.timecho.com/docs/img/%E5%90%AF%E5%8A%A8.PNG) + +### 步骤二:安装、配置Prometheus + +> 此处以prometheus安装在服务器192.168.1.3为例。 + +1. 下载 Prometheus 安装包,要求安装 V2.30.3 版本及以上,可前往 Prometheus 官网下载(https://prometheus.io/docs/introduction/first_steps/) +2. 解压安装包,进入解压后的文件夹: + +```Shell +tar xvfz prometheus-*.tar.gz +cd prometheus-* +``` + +3. 修改配置。修改配置文件prometheus.yml如下 + 1. 新增confignode任务收集ConfigNode的监控数据 + 2. 新增datanode任务收集DataNode的监控数据 + +```shell +global: + scrape_interval: 15s + evaluation_interval: 15s +scrape_configs: + - job_name: "prometheus" + static_configs: + - targets: ["localhost:9090"] + - job_name: "confignode" + static_configs: + - targets: ["iotdb-1:9091","iotdb-2:9091","iotdb-3:9091"] + honor_labels: true + - job_name: "datanode" + static_configs: + - targets: ["iotdb-1:9092","iotdb-2:9092","iotdb-3:9092"] + honor_labels: true +``` + +4. 启动Prometheus。Prometheus 监控数据的默认过期时间为15天,在生产环境中,建议将其调整为180天以上,以对更长时间的历史监控数据进行追踪,启动命令如下所示: + +```Shell +./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=180d +``` + +5. 确认启动成功。在浏览器中输入 http://192.168.1.3:9090,进入Prometheus,点击进入Status下的Target界面,当看到State均为Up时表示配置成功并已经联通。 + +
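+若 Target 的 State 未显示为 Up,可先校验 prometheus.yml 的语法再排查网络(promtool 随 Prometheus 安装包一同发布,以下命令为示意):
+
+```shell
+# 在 Prometheus 解压目录内执行,输出 SUCCESS 则说明配置文件语法正确
+./promtool check config prometheus.yml
+```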
+ + + +6. 点击Targets中左侧链接可以跳转到网页监控,查看相应节点的监控信息: + +![](https://alioss.timecho.com/docs/img/%E8%8A%82%E7%82%B9%E7%9B%91%E6%8E%A7.png) + +### 步骤三:安装grafana并配置数据源 + +> 此处以Grafana安装在服务器192.168.1.3为例。 + +1. 下载 Grafana 安装包,要求安装 V8.4.2 版本及以上,可以前往Grafana官网下载(https://grafana.com/grafana/download) +2. 解压并进入对应文件夹 + +```Shell +tar -zxvf grafana-*.tar.gz +cd grafana-* +``` + +3. 启动Grafana: + +```Shell +./bin/grafana-server web +``` + +4. 登录Grafana。在浏览器中输入 http://192.168.1.3:3000(或修改后的端口),进入Grafana,默认初始用户名和密码均为 admin。 + +5. 配置数据源。在Connections中找到Data sources,新增一个data source并配置Data Source为Prometheus + +![](https://alioss.timecho.com/docs/img/%E6%B7%BB%E5%8A%A0%E9%85%8D%E7%BD%AE.png) + +在配置Data Source时注意Prometheus所在的URL,配置好后点击Save & Test 出现 Data source is working 提示则为配置成功 + +![](https://alioss.timecho.com/docs/img/%E9%85%8D%E7%BD%AE%E6%88%90%E5%8A%9F.png) + +### 步骤四:导入IoTDB Grafana看板 + +1. 进入Grafana,选择Dashboards: + + ![](https://alioss.timecho.com/docs/img/%E9%9D%A2%E6%9D%BF%E9%80%89%E6%8B%A9.png) + +2. 点击右侧 Import 按钮 + + ![](https://alioss.timecho.com/docs/img/Import%E6%8C%89%E9%92%AE.png) + +3. 使用upload json file的方式导入Dashboard + + ![](https://alioss.timecho.com/docs/img/%E5%AF%BC%E5%85%A5Dashboard.png) + +4. 选择IoTDB监控面板中其中一个面板的json文件,这里以选择 Apache IoTDB ConfigNode Dashboard为例(监控面板安装包获取参见本文【安装准备】): + + ![](https://alioss.timecho.com/docs/img/%E9%80%89%E6%8B%A9%E9%9D%A2%E6%9D%BF.png) + +5. 选择数据源为Prometheus,然后点击Import + + ![](https://alioss.timecho.com/docs/img/%E9%80%89%E6%8B%A9%E6%95%B0%E6%8D%AE%E6%BA%90.png) + +6. 之后就可以看到导入的Apache IoTDB ConfigNode Dashboard监控面板 + + ![](https://alioss.timecho.com/docs/img/%E9%9D%A2%E6%9D%BF.png) + +7. 同样地,我们可以导入Apache IoTDB DataNode Dashboard、Apache Performance Overview Dashboard、Apache System Overview Dashboard,可看到如下的监控面板: + +
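As an optional alternative to the UI flow, the data source from Step 3 and the dashboard JSON files from this step can also be registered through Grafana's HTTP API. A sketch, assuming the default admin/admin credentials and the addresses used above; the JSON file name is a placeholder for whichever dashboard file you are importing:

```shell
# Create the Prometheus data source (equivalent to Step 3).
curl -s -X POST http://admin:admin@192.168.1.3:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://192.168.1.3:9090","access":"proxy"}'

# Import one dashboard (equivalent to items 3-5 above); if the exported JSON
# carries an "id" field, set it to null before posting.
curl -s -X POST http://admin:admin@192.168.1.3:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d "{\"dashboard\": $(cat iotdb-confignode-dashboard.json), \"overwrite\": true}"
```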
8. At this point, all IoTDB monitoring dashboards have been imported, and the monitoring information can now be viewed at any time.

   ![](https://alioss.timecho.com/docs/img/%E9%9D%A2%E6%9D%BF%E6%B1%87%E6%80%BB.png)

## Appendix: Monitoring Metrics in Detail

### System Dashboard

This dashboard shows the usage of the system's CPU, memory, disk, and network resources, as well as part of the JVM status.

#### CPU

- CPU Core: number of CPU cores
- CPU Load:
  - System CPU Load: average CPU load and busyness of the whole system within the sampling period
  - Process CPU Load: proportion of CPU occupied by the IoTDB process within the sampling period
- CPU Time Per Minute: total CPU time of all processes in the system per minute

#### Memory

- System Memory: current memory usage of the system.
  - Commited vm size: size of the virtual memory the operating system has allocated to running processes.
  - Total physical memory: total amount of physical memory available on the system.
  - Used physical memory: total amount of memory already used, including the memory actually used by processes and the memory occupied by the operating system's buffers/cache.
- System Swap Memory: amount of swap space in use.
- Process Memory: memory usage of the IoTDB process.
  - Max Memory: maximum amount of memory the IoTDB process can request from the operating system (the memory size configured in the datanode-env/confignode-env configuration files).
  - Total Memory: total amount of memory the IoTDB process has currently requested from the operating system.
  - Used Memory: total amount of memory currently used by the IoTDB process.

#### Disk

- Disk Space:
  - Total disk space: maximum disk space available to IoTDB.
  - Used disk space: disk space already used by IoTDB.
- Log Number Per Minute: average number of IoTDB log entries of each level per minute within the sampling period.
- File Count: number of IoTDB-related files
  - all: number of all files
  - TsFile: number of TsFiles
  - seq: number of sequence TsFiles
  - unseq: number of unsequence TsFiles
  - wal: number of WAL files
  - cross-temp: number of temp files from cross-space compaction
  - inner-seq-temp: number of temp files from inner-space compaction in the sequence space
  - inner-unseq-temp: number of temp files from inner-space compaction in the unsequence space
  - mods: number of tombstone (mods) files
- Open File Count: number of file handles opened by the system
- File Size: size of IoTDB-related files; each sub-item is the size of the corresponding file type.
- Disk I/O Busy Rate: equivalent to the %util metric in iostat; reflects, to some extent, how busy the disk is. Each sub-item corresponds to one disk.
- Disk I/O Throughput: average I/O throughput of each disk in the system over a period of time. Each sub-item corresponds to one disk.
- Disk I/O Ops: equivalent to the r/s, w/s, rrqm/s, and wrqm/s metrics in iostat, i.e. the number of I/O operations the disk performs per second. read and write refer to the number of single I/O operations the disk executes; since block devices have scheduling algorithms, several adjacent I/Os can in some cases be merged into one, and merged-read and merged-write refer to the number of such merged I/O operations.
- Disk I/O Avg Time: equivalent to await in iostat, i.e. the average latency of each I/O request. Read and write requests are recorded separately.
- Disk I/O Avg Size: equivalent to avgrq-sz in iostat; reflects the average size of each I/O request. Read and write requests are recorded separately.
- Disk I/O Avg Queue Size: equivalent to avgqu-sz in iostat, i.e. the average length of the I/O request queue.
- I/O System Call Rate: frequency of read/write system calls issued by the process, similar to IOPS.
- I/O Throughput: I/O throughput of the process, split into actual_read/write and attempt_read/write. actual read and actual write refer to the number of bytes that actually cause block-device I/O, excluding the part handled by the page cache.

#### JVM

- GC Time Percentage: proportion of time the node's JVM spent on GC within the past one-minute window
- GC Allocated/Promoted Size Detail: average size of objects promoted to the old generation per minute by the node's JVM, plus the size of newly requested objects in the young generation, the old generation, and non-generational space
- GC Data Size Detail: size of long-lived objects in the node's JVM and the maximum size allowed for the corresponding generation
- Heap Memory: JVM heap memory usage.
  - Maximum heap memory: maximum heap memory available to the JVM.
  - Committed heap memory: committed heap memory size of the JVM.
  - Used heap memory: heap memory already used by the JVM.
  - PS Eden Space: size of the PS young space.
  - PS Old Space: size of the PS old space.
  - PS Survivor Space: size of the PS survivor space.
  - ... (CMS/G1/ZGC, etc.)
- Off Heap Memory: off-heap memory usage.
  - direct memory: off-heap direct memory.
  - mapped memory: off-heap mapped memory.
- GC Number Per Minute: average number of garbage collections per minute on the node's JVM, including YGC and FGC
- GC Time Per Minute: average time spent on garbage collection per minute on the node's JVM, including YGC and FGC
- GC Number Per Minute Detail: average number of garbage collections per minute on the node's JVM, broken down by cause, including YGC and FGC
- GC Time Per Minute Detail: average time spent on garbage collection per minute on the node's JVM, broken down by cause, including YGC and FGC
- Time Consumed Of Compilation Per Minute: total time the JVM spends on compilation per minute
- The Number of Class:
  - loaded: number of classes currently loaded by the JVM
  - unloaded: number of classes unloaded by the JVM since system startup
- The Number of Java Thread: number of currently live threads in IoTDB; each sub-item is the number of threads in the corresponding state.

#### Network

eno refers to the network card connected to the public network; lo is the loopback (virtual) interface.

- Net Speed: send and receive speed of the network card
- Receive/Transmit Data Size: size of the data packets sent or received by the network card, counted since the last system restart
- Packet Speed: send and receive speed of the network card in packets; one RPC request can correspond to one or more packets
- Connection Num: number of socket connections of the selected process (IoTDB uses TCP only)
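Every panel described in this appendix is drawn from the series scraped by Prometheus. To inspect the raw data behind a panel, you can query Prometheus directly; the built-in `up` metric is shown below, while the exact IoTDB metric names vary by version and should be taken from a node's `/metrics` output rather than assumed:

```shell
# List the metric names a DataNode currently exposes (strip comments and labels).
curl -s http://iotdb-1:9092/metrics | grep -v '^#' | awk -F'[{ ]' '{print $1}' | sort -u | head

# Evaluate a PromQL expression over the scraped data ("up" is built into Prometheus).
curl -s -G 'http://192.168.1.3:9090/api/v1/query' --data-urlencode 'query=up{job="datanode"}'
```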
### Performance Overview Dashboard

#### Cluster Overview

- Total CPU Core: total number of CPU cores of the cluster machines
- DataNode CPU Load: CPU usage of each DataNode in the cluster
- Disk
  - Total Disk Space: total disk size of the cluster machines
  - DataNode Disk Usage: disk usage rate of each DataNode in the cluster
- Total Timeseries: total number of time series managed by the cluster (replicas included); the actual number of time series must be calculated together with the number of schema replicas
- Cluster: number of ConfigNodes and DataNodes in the cluster
- Up Time: time elapsed since the cluster started
- Total Write Point Per Second: total number of points written per second in the cluster (replicas included); the actual total must be analyzed together with the number of data replicas
- Memory
  - Total System Memory: total system memory of the cluster machines
  - Total Swap Memory: total swap memory of the cluster machines
  - DataNode Process Memory Usage: memory usage of each DataNode in the cluster
- Total File Number: total number of files managed by the cluster
- Cluster System Overview: overview of the cluster machines, including average DataNode memory usage and average machine disk usage
- Total DataBase: total number of databases managed by the cluster (replicas included)
- Total DataRegion: total number of DataRegions managed by the cluster
- Total SchemaRegion: total number of SchemaRegions managed by the cluster

#### Node Overview

- CPU Core: number of CPU cores of the machine where the node resides
- Disk Space: disk size of the machine where the node resides
- Timeseries: number of time series managed by the machine where the node resides (replicas included)
- System Overview: system overview of the machine where the node resides, including CPU load, process memory usage ratio, and disk usage ratio
- Write Point Per Second: write speed per second of the machine where the node resides (replicas included)
- System Memory: system memory size of the machine where the node resides
- Swap Memory: swap memory size of the machine where the node resides
- File Number: number of files managed by the node

#### Performance

- Session Idle Time: total idle time and total busy time of the node's session connections
- Client Connection: client connections of the node, including the total number and the number of active connections
- Time Consumed Of Operation: time consumed by each type of operation on the node, including the average and P99
- Average Time Consumed Of Interface: average time consumed by each thrift interface of the node
- P99 Time Consumed Of Interface: P99 time consumed by each thrift interface of the node
- Task Number: number of each type of system task on the node
- Average Time Consumed of Task: average time consumed by each type of system task on the node
- P99 Time Consumed of Task: P99 time consumed by each type of system task on the node
- Operation Per Second: number of operations per second on the node
- Main process
  - Operation Per Second Of Stage: operations per second of each stage of the node's main process
  - Average Time Consumed Of Stage: average time consumed by each stage of the node's main process
  - P99 Time Consumed Of Stage: P99 time consumed by each stage of the node's main process
- Schedule stage
  - OPS Of Schedule: operations per second of each sub-stage of the node's schedule stage
  - Average Time Consumed Of Schedule Stage: average time consumed by each sub-stage of the node's schedule stage
  - P99 Time Consumed Of Schedule Stage: P99 time consumed by each sub-stage of the node's schedule stage
- Local Schedule sub-stages
  - OPS Of Local Schedule Stage: operations per second of each sub-stage of the node's local schedule stage
  - Average Time Consumed Of Local Schedule Stage: average time consumed by each sub-stage of the node's local schedule stage
  - P99 Time Consumed Of Local Schedule Stage: P99 time consumed by each sub-stage of the node's local schedule stage
- Storage stage
  - OPS Of Storage Stage: operations per second of each sub-stage of the node's storage stage
  - Average Time Consumed Of Storage Stage: average time consumed by each sub-stage of the node's storage stage
  - P99 Time Consumed Of Storage Stage: P99 time consumed by each sub-stage of the node's storage stage
- Engine stage
  - OPS Of Engine Stage: operations per second of each sub-stage of the node's engine stage
  - Average Time Consumed Of Engine Stage: average time consumed by each sub-stage of the node's engine stage
  - P99 Time Consumed Of Engine Stage: P99 time consumed by each sub-stage of the node's engine stage

#### System

- CPU Load: CPU load of the node
- CPU Time Per Minute: CPU time of the node per minute; the maximum value is related to the number of CPU cores
- GC Time Per Minute: average time the node spends on GC per minute, including YGC and FGC
- Heap Memory: heap memory usage of the node
- Off Heap Memory: off-heap memory usage of the node
- The Number Of Java Thread: number of Java threads on the node
- File Count: number of files managed by the node
- File Size: size of the files managed by the node
- Log Number Per Minute: number of log entries of each type per minute on the node
### ConfigNode Dashboard

This dashboard shows the status of all management nodes in the cluster, including partitions, node information, and client connection statistics.

#### Node Overview

- Database Count: number of databases on the node
- Region
  - DataRegion Count: number of DataRegions on the node
  - DataRegion Current Status: status of the DataRegions on the node
  - SchemaRegion Count: number of SchemaRegions on the node
  - SchemaRegion Current Status: status of the SchemaRegions on the node
- System Memory: system memory size of the node
- Swap Memory: swap memory size of the node
- ConfigNodes: running status of the ConfigNodes in the cluster where the node resides
- DataNodes: status of the DataNodes in the cluster where the node resides
- System Overview: system overview of the node, including system memory, disk usage, process memory, and CPU load

#### NodeInfo

- Node Count: number of nodes in the cluster where the node resides, including ConfigNodes and DataNodes
- ConfigNode Status: status of the ConfigNodes in the cluster where the node resides
- DataNode Status: status of the DataNodes in the cluster where the node resides
- SchemaRegion Distribution: distribution of the SchemaRegions in the cluster where the node resides
- SchemaRegionGroup Leader Distribution: leader distribution of the SchemaRegionGroups in the cluster where the node resides
- DataRegion Distribution: distribution of the DataRegions in the cluster where the node resides
- DataRegionGroup Leader Distribution: leader distribution of the DataRegionGroups in the cluster where the node resides

#### Protocol

- Client count statistics
  - Active Client Num: number of active clients in each thread pool of the node
  - Idle Client Num: number of idle clients in each thread pool of the node
  - Borrowed Client Count: number of borrowed clients in each thread pool of the node
  - Created Client Count: number of created clients in each thread pool of the node
  - Destroyed Client Count: number of destroyed clients in each thread pool of the node
- Client time statistics
  - Client Mean Active Time: average active time of the clients in each thread pool of the node
  - Client Mean Borrow Wait Time: average borrow wait time of the clients in each thread pool of the node
  - Client Mean Idle Time: average idle time of the clients in each thread pool of the node

#### Partition Table

- SchemaRegionGroup Count: number of SchemaRegionGroups of the databases in the cluster where the node resides
- DataRegionGroup Count: number of DataRegionGroups of the databases in the cluster where the node resides
- SeriesSlot Count: number of SeriesSlots of the databases in the cluster where the node resides
- TimeSlot Count: number of TimeSlots of the databases in the cluster where the node resides
- DataRegion Status: status of the DataRegions in the cluster where the node resides
- SchemaRegion Status: status of the SchemaRegions in the cluster where the node resides

#### Consensus

- Ratis Stage Time: time consumed by each Ratis stage on the node
- Write Log Entry: time consumed by Ratis log writes on the node
- Remote / Local Write Time: time consumed by remote and local Ratis writes on the node
- Remote / Local Write QPS: QPS of remote and local Ratis writes on the node
- RatisConsensus Memory: memory usage of the Ratis consensus protocol on the node

### DataNode Dashboard

This dashboard shows the monitoring status of all data nodes in the cluster, including write latency, query latency, the number of storage files, and so on.

#### Node Overview

- The Number Of Entity: entities managed by the node
- Write Point Per Second: write speed per second of the node
- Memory Usage: memory usage of the node, including the memory occupied by each part of IoT Consensus, the total memory of SchemaRegion, and the memory of each database.

#### Protocol

- Node operation latency
  - The Time Consumed Of Operation (avg): average time consumed by each operation on the node
  - The Time Consumed Of Operation (50%): median time consumed by each operation on the node
  - The Time Consumed Of Operation (99%): P99 time consumed by each operation on the node
- Thrift statistics
  - The QPS Of Interface: QPS of each Thrift interface of the node
  - The Avg Time Consumed Of Interface: average time consumed by each Thrift interface of the node
  - Thrift Connection: number of Thrift connections of each type on the node
  - Thrift Active Thread: number of active Thrift connections of each type on the node
- Client statistics
  - Active Client Num: number of active clients in each thread pool of the node
  - Idle Client Num: number of idle clients in each thread pool of the node
  - Borrowed Client Count: number of borrowed clients in each thread pool of the node
  - Created Client Count: number of created clients in each thread pool of the node
  - Destroyed Client Count: number of destroyed clients in each thread pool of the node
  - Client Mean Active Time: average active time of the clients in each thread pool of the node
  - Client Mean Borrow Wait Time: average borrow wait time of the clients in each thread pool of the node
  - Client Mean Idle Time: average idle time of the clients in each thread pool of the node

#### Storage Engine

- File Count: number of files of each type managed by the node
- File Size: size of the files of each type managed by the node
- TsFile
  - TsFile Total Size In Each Level: total size of the TsFiles of each level managed by the node
  - TsFile Count In Each Level: number of TsFiles of each level managed by the node
  - Avg TsFile Size In Each Level: average size of the TsFiles of each level managed by the node
- Task Number: number of tasks on the node
- The Time Consumed of Task: time consumed by the tasks on the node
- Compaction
  - Compaction Read And Write Per Second: compaction read/write speed per second on the node
  - Compaction Number Per Minute: number of compactions per minute on the node
  - Compaction Process Chunk Status: number of chunks in different states during compaction on the node
  - Compacted Point Num Per Minute: number of points compacted per minute on the node

#### Write Performance

- Write Cost(avg): average write latency of the node, including writing WAL and memtable
- Write Cost(50%): median write latency of the node, including writing WAL and memtable
- Write Cost(99%): P99 write latency of the node, including writing WAL and memtable
- WAL
  - WAL File Size: total size of the WAL files managed by the node
  - WAL File Num: number of WAL files managed by the node
  - WAL Nodes Num: number of WAL nodes managed by the node
  - Make Checkpoint Costs: time consumed by creating each type of checkpoint on the node
  - WAL Serialize Total Cost: total WAL serialization time on the node
  - Data Region Mem Cost: memory usage of different DataRegions on the node, the total DataRegion memory usage of the current instance, and the total DataRegion memory usage of the current cluster
  - Serialize One WAL Info Entry Cost: time for the node to serialize one WAL info entry
  - Oldest MemTable Ram Cost When Cause Snapshot: memtable size when the WAL triggers a snapshot of the oldest memtable on the node
  - Oldest MemTable Ram Cost When Cause Flush: memtable size when the WAL triggers a flush of the oldest memtable on the node
  - Effective Info Ratio Of WALNode: effective information ratio of the different WAL nodes on the node
  - WAL Buffer
    - WAL Buffer Cost: time for the node's WAL to flush the SyncBuffer, both synchronous and asynchronous
    - WAL Buffer Used Ratio: usage ratio of the node's WAL buffer
    - WAL Buffer Entries Count: number of entries in the node's WAL buffer
- Flush statistics
  - Flush MemTable Cost(avg): average of the total flush time and of each flush sub-stage on the node
  - Flush MemTable Cost(50%): median of the total flush time and of each flush sub-stage on the node
  - Flush MemTable Cost(99%): P99 of the total flush time and of each flush sub-stage on the node
  - Flush Sub Task Cost(avg): average time of the node's flush sub-tasks, including the sort, encoding, and I/O stages
  - Flush Sub Task Cost(50%): median time of each flush sub-task on the node, including the sort, encoding, and I/O stages
  - Flush Sub Task Cost(99%): P99 time of the node's flush sub-tasks, including the sort, encoding, and I/O stages
- Pending Flush Task Num: number of flush tasks in the blocked state on the node
- Pending Flush Sub Task Num: number of blocked flush sub-tasks on the node
- Tsfile Compression Ratio Of Flushing MemTable: TsFile compression ratio corresponding to memtable flushes on the node
- Flush TsFile Size Of DataRegions: TsFile size of each flush, per DataRegion, on the node
- Size Of Flushing MemTable: size of the memtables being flushed on the node
- Points Num Of Flushing MemTable: number of points flushed per DataRegion on the node
- Series Num Of Flushing MemTable: number of time series in the memtables being flushed, per DataRegion, on the node
- Average Point Num Of Flushing MemChunk: average number of points per flushed MemChunk on the node

#### Schema Engine

- Schema Engine Mode: schema engine mode of the node
- Schema Consensus Protocol: schema consensus protocol of the node
- Schema Region Number: number of SchemaRegions managed by the node
- Schema Region Memory Overview: memory overview of the node's SchemaRegions
- Memory Usgae per SchemaRegion: average memory usage per SchemaRegion on the node
- Cache MNode per SchemaRegion: number of cache nodes in each SchemaRegion on the node
- MLog Length and Checkpoint: total length and checkpoint position of the current mlog of each SchemaRegion on the node (valid for SimpleConsensus only)
- Buffer MNode per SchemaRegion: number of buffer nodes in each SchemaRegion on the node
- Activated Template Count per SchemaRegion: number of activated templates in each SchemaRegion on the node
- Time series statistics
  - Timeseries Count per SchemaRegion: average number of time series per SchemaRegion on the node
  - Series Type: number of time series of each type on the node
  - Time Series Number: total number of time series on the node
  - Template Series Number: total number of template time series on the node
  - Template Series Count per SchemaRegion: number of series created through templates in each SchemaRegion on the node
- IMNode statistics
  - Pinned MNode per SchemaRegion: number of pinned IMNodes in each SchemaRegion on the node
  - Pinned Memory per SchemaRegion: memory usage of the pinned IMNodes in each SchemaRegion on the node
  - Unpinned MNode per SchemaRegion: number of unpinned IMNodes in each SchemaRegion on the node
  - Unpinned Memory per SchemaRegion: memory usage of the unpinned IMNodes in each SchemaRegion on the node
  - Schema File Memory MNode Number: number of globally pinned and unpinned IMNodes on the node
  - Release and Flush MNode Rate: number of IMNodes released and flushed per second on the node
- Cache Hit Rate: cache hit rate of the node
- Release and Flush Thread Number: number of currently active release and flush threads on the node
- Time Consumed of Relead and Flush (avg): average time of triggering cache release and buffer flush on the node
- Time Consumed of Relead and Flush (99%): P99 time of triggering cache release and buffer flush on the node

#### Query Engine

- Latency of each stage
  - The time consumed of query plan stages(avg): average latency of each query stage on the node
  - The time consumed of query plan stages(50%): median latency of each query stage on the node
  - The time consumed of query plan stages(99%): P99 latency of each query stage on the node
- Plan dispatch latency
  - The time consumed of plan dispatch stages(avg): average latency of dispatching query plans on the node
  - The time consumed of plan dispatch stages(50%): median latency of dispatching query plans on the node
  - The time consumed of plan dispatch stages(99%): P99 latency of dispatching query plans on the node
- Plan execution latency
  - The time consumed of query execution stages(avg): average latency of executing query plans on the node
  - The time consumed of query execution stages(50%): median latency of executing query plans on the node
  - The time consumed of query execution stages(99%): P99 latency of executing query plans on the node
- Operator execution latency
  - The time consumed of operator execution stages(avg): average latency of query operator execution on the node
  - The time consumed of operator execution(50%): median latency of query operator execution on the node
  - The time consumed of operator execution(99%): P99 latency of query operator execution on the node
- Aggregation computation latency
  - The time consumed of query aggregation(avg): average latency of aggregation computation on the node
  - The time consumed of query aggregation(50%): median latency of aggregation computation on the node
  - The time consumed of query aggregation(99%): P99 latency of aggregation computation on the node
- File/memory interface latency
  - The time consumed of query scan(avg): average latency of the file/memory interfaces of queries on the node
  - The time consumed of query scan(50%): median latency of the file/memory interfaces of queries on the node
  - The time consumed of query scan(99%): P99 latency of the file/memory interfaces of queries on the node
- Number of resource accesses
  - The usage of query resource(avg): average number of resource accesses of queries on the node
  - The usage of query resource(50%): median number of resource accesses of queries on the node
  - The usage of query resource(99%): P99 number of resource accesses of queries on the node
- Data exchange latency
  - The time consumed of query data exchange(avg): average data exchange latency of queries on the node
  - The time consumed of query data exchange(50%): median data exchange latency of queries on the node
  - The time consumed of query data exchange(99%): P99 data exchange latency of queries on the node
- Data exchange count
  - The count of Data Exchange(avg): average number of data exchanges of queries on the node
  - The count of Data Exchange: quantiles of the number of data exchanges of queries on the node, including the median and P99
- Task scheduling count and latency
  - The number of query queue: number of scheduled query tasks on the node
  - The time consumed of query schedule time(avg): average query task scheduling latency on the node
  - The time consumed of query schedule time(50%): median query task scheduling latency on the node
  - The time consumed of query schedule time(99%): P99 query task scheduling latency on the node

#### Query Interface

- Load timeseries metadata
  - The time consumed of load timeseries metadata(avg): average latency of loading timeseries metadata for queries on the node
  - The time consumed of load timeseries metadata(50%): median latency of loading timeseries metadata for queries on the node
  - The time consumed of load timeseries metadata(99%): P99 latency of loading timeseries metadata for queries on the node
- Read timeseries
  - The time consumed of read timeseries metadata(avg): average latency of reading timeseries for queries on the node
  - The time consumed of read timeseries metadata(50%): median latency of reading timeseries for queries on the node
  - The time consumed of read timeseries metadata(99%): P99 latency of reading timeseries for queries on the node
- Modify timeseries metadata
  - The time consumed of timeseries metadata modification(avg): average latency of modifying timeseries metadata for queries on the node
  - The time consumed of timeseries metadata modification(50%): median latency of modifying timeseries metadata for queries on the node
  - The time consumed of timeseries metadata modification(99%): P99 latency of modifying timeseries metadata for queries on the node
- Load chunk metadata list
  - The time consumed of load chunk metadata list(avg): average latency of loading the chunk metadata list for queries on the node
  - The time consumed of load chunk metadata list(50%): median latency of loading the chunk metadata list for queries on the node
  - The time consumed of load chunk metadata list(99%): P99 latency of loading the chunk metadata list for queries on the node
- Modify chunk metadata
  - The time consumed of chunk metadata modification(avg): average latency of modifying chunk metadata for queries on the node
  - The time consumed of chunk metadata modification(50%): median latency of modifying chunk metadata for queries on the node
  - The time consumed of chunk metadata modification(99%): P99 latency of modifying chunk metadata for queries on the node
- Filter by chunk metadata
  - The time consumed of chunk metadata filter(avg): average latency of filtering by chunk metadata for queries on the node
  - The time consumed of chunk metadata filter(50%): median latency of filtering by chunk metadata for queries on the node
  - The time consumed of chunk metadata filter(99%): P99 latency of filtering by chunk metadata for queries on the node
- Construct chunk reader
  - The time consumed of construct chunk reader(avg): average latency of constructing a chunk reader for queries on the node
  - The time consumed of construct chunk reader(50%): median latency of constructing a chunk reader for queries on the node
  - The time consumed of construct chunk reader(99%): P99 latency of constructing a chunk reader for queries on the node
- Read chunk
  - The time consumed of read chunk(avg): average latency of reading a chunk for queries on the node
  - The time consumed of read chunk(50%): median latency of reading a chunk for queries on the node
  - The time consumed of read chunk(99%): P99 latency of reading a chunk for queries on the node
- Init chunk reader
  - The time consumed of init chunk reader(avg): average latency of initializing a chunk reader for queries on the node
  - The time consumed of init chunk reader(50%): median latency of initializing a chunk reader for queries on the node
  - The time consumed of init chunk reader(99%): P99 latency of initializing a chunk reader for queries on the node
- Build TsBlock from page reader
  - The time consumed of build tsblock from page reader(avg): average latency of building a TsBlock through the page reader for queries on the node
  - The time consumed of build tsblock from page reader(50%): median latency of building a TsBlock through the page reader for queries on the node
  - The time consumed of build tsblock from page reader(99%): P99 latency of building a TsBlock through the page reader for queries on the node
- Build TsBlock from merge reader
  - The time consumed of build tsblock from merge reader(avg): average latency of building a TsBlock through the merge reader for queries on the node
  - The time consumed of build tsblock from merge reader(50%): median latency of building a TsBlock through the merge reader for queries on the node
  - The time consumed of build tsblock from merge reader(99%): P99 latency of building a TsBlock through the merge reader for queries on the node

#### Query Data Exchange

Data exchange latency of queries.

- Get TsBlock through the source handle
  - The time consumed of source handle get tsblock(avg): average latency for queries on the node to get a TsBlock through the source handle
  - The time consumed of source handle get tsblock(50%): median latency for queries on the node to get a TsBlock through the source handle
  - The time consumed of source handle get tsblock(99%): P99 latency for queries on the node to get a TsBlock through the source handle
- Deserialize TsBlock through the source handle
  - The time consumed of source handle deserialize tsblock(avg): average latency for queries on the node to deserialize a TsBlock through the source handle
  - The time consumed of source handle deserialize tsblock(50%): median latency for queries on the node to deserialize a TsBlock through the source handle
  - The time consumed of source handle deserialize tsblock(99%): P99 latency for queries on the node to deserialize a TsBlock through the source handle
- Send TsBlock through the sink handle
  - The time consumed of sink handle send tsblock(avg): average latency for queries on the node to send a TsBlock through the sink handle
  - The time consumed of sink handle send tsblock(50%): median latency for queries on the node to send a TsBlock through the sink handle
  - The time consumed of sink handle send tsblock(99%): P99 latency for queries on the node to send a TsBlock through the sink handle
- Acknowledge data block event callback
  - The time consumed of on acknowledge data block event task(avg): average latency of the data block event callback for queries on the node
  - The time consumed of on acknowledge data block event task(50%): median latency of the data block event callback for queries on the node
  - The time consumed of on acknowledge data block event task(99%): P99 latency of the data block event callback for queries on the node
- Get data block task
  - The time consumed of get data block task(avg): average latency of getting a data block task for queries on the node
  - The time consumed of get data block task(50%): median latency of getting a data block task for queries on the node
  - The time consumed of get data block task(99%): P99 latency of getting a data block task for queries on the node

#### Query Related Resource

- MppDataExchangeManager: number of shuffle sink handles and source handles during queries on the node
- LocalExecutionPlanner: remaining memory the node can allocate to query fragments
- FragmentInstanceManager: context information and number of the query fragments running on the node
- Coordinator: number of queries recorded on the node
- MemoryPool Size: status of the query-related memory pools on the node
- MemoryPool Capacity: size of the query-related memory pools on the node, including the maximum and the remaining available values
- DriverScheduler: number of queued query tasks on the node

#### Consensus - IoT Consensus

- Memory usage
  - IoTConsensus Used Memory: memory usage of IoT Consensus on the node, including total memory usage, queue memory usage, and sync memory usage
- Synchronization status among nodes
  - IoTConsensus Sync Index: SyncIndex size of the different DataRegions of IoT Consensus on the node
  - IoTConsensus Overview: total sync gap and number of cached requests of IoT Consensus on the node
  - IoTConsensus Search Index Rate: growth rate of the write SearchIndex of the different DataRegions of IoT Consensus on the node
  - IoTConsensus Safe Index Rate: growth rate of the sync SafeIndex of the different DataRegions of IoT Consensus on the node
  - IoTConsensus LogDispatcher Request Size: size of the requests that the different DataRegions of IoT Consensus on the node synchronize to other nodes
  - Sync Lag: sync gap size of the different DataRegions of IoT Consensus on the node
  - Min Peer Sync Lag: minimum sync gap of the different DataRegions of IoT Consensus on the node towards their replicas
  - Sync Speed Diff Of Peers: maximum gap in sync progress among replicas for the different DataRegions of IoT Consensus on the node
  - IoTConsensus LogEntriesFromWAL Rate: rate at which the different DataRegions of IoT Consensus on the node fetch log entries from the WAL
  - IoTConsensus LogEntriesFromQueue Rate: rate at which the different DataRegions of IoT Consensus on the node fetch log entries from the queue
- Latency of different execution stages
  - The Time Consumed Of Different Stages (avg): average latency of the different execution stages of IoT Consensus on the node
  - The Time Consumed Of Different Stages (50%): median latency of the different execution stages of IoT Consensus on the node
  - The Time Consumed Of Different Stages (99%): P99 latency of the different execution stages of IoT Consensus on the node

#### Consensus - DataRegion Ratis Consensus

- Ratis Stage Time: latency of the different Ratis stages on the node
- Write Log Entry: latency of the different stages of Ratis log writes on the node
- Remote / Local Write Time: latency of local and remote Ratis writes on the node
- Remote / Local Write QPS: QPS of local and remote Ratis writes on the node
- RatisConsensus Memory: memory usage of Ratis on the node

#### Consensus - SchemaRegion Ratis Consensus

- Ratis Stage Time: latency of the different Ratis stages on the node
- Write Log Entry: latency of each stage of Ratis log writes on the node
- Remote / Local Write Time: latency of local and remote Ratis writes on the node
- Remote / Local Write QPS: QPS of local and remote Ratis writes on the node
- RatisConsensus Memory: memory usage of Ratis on the node
\ No newline at end of file
diff --git a/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md
new file mode 100644
index 00000000..ff5411a1
--- /dev/null
+++ b/src/zh/UserGuide/Master/Table/Deployment-and-Maintenance/Stand-Alone-Deployment_timecho.md
@@ -0,0 +1,217 @@

# Stand-Alone Deployment

This chapter describes how to start a single-node IoTDB instance, which includes 1 ConfigNode and 1 DataNode (commonly known as 1C1D).

## Note

1. Before installation, ensure that the system has been prepared by referring to [Environment Requirements](../Deployment-and-Maintenance/Environment-Requirements.md).
2. It is recommended to use `hostname` for IP configuration, which avoids the problem of the database failing to start after the host IP is modified later. To set the hostname, configure `/etc/hosts` on the server. For example, if the local IP is 192.168.1.3 and the hostname is iotdb-1, you can use the following command to set the server's hostname, and then use the hostname to configure IoTDB's `cn_internal_address` and `dn_internal_address`.

   ```shell
   echo "192.168.1.3 iotdb-1" >> /etc/hosts
   ```

3. Some parameters cannot be modified after the first startup. Please refer to the [Parameter Configuration](#2-parameter-configuration) section below for settings.
4. Whether on Linux or Windows, ensure that the IoTDB installation path contains no spaces or Chinese characters, to avoid software exceptions.
5. Please note that when installing and deploying IoTDB (including activating and using the software), you can:
   - Use the root user (recommended): this avoids permission-related issues.
   - Use a fixed non-root user:
     - Use the same user for all operations: ensure the same user is used for start, activation, stop, and other operations; do not switch users.
     - Avoid sudo: sudo runs commands with root privileges, which may cause permission confusion or security issues.
6. It is recommended to deploy a monitoring panel, which monitors important operational metrics and keeps track of the database status at any time. The panel can be obtained by contacting the support team; the deployment steps are described in [Monitoring Panel Deployment](../Deployment-and-Maintenance/Monitoring-panel-deployment.md).

## Installation Steps

### 1. Extract the Installation Package and Enter the Installation Directory

```Plain
unzip timechodb-{version}-bin.zip
cd timechodb-{version}-bin
```

### 2. Parameter Configuration

#### Memory Configuration

- conf/confignode-env.sh (or .bat)

| **Configuration** | **Description** | **Default** | **Recommended** | Note |
| :---------- | :------------------------------------- | :--------- | :----------------------------------------------- | :----------- |
| MEMORY_SIZE | Total memory that the IoTDB ConfigNode can use | empty | Fill in as needed; the system will allocate memory according to the value | Takes effect after a service restart |

- conf/datanode-env.sh (or .bat)

| **Configuration** | **Description** | **Default** | **Recommended** | Note |
| :---------- | :----------------------------------- | :--------- | :----------------------------------------------- | :----------- |
| MEMORY_SIZE | Total memory that the IoTDB DataNode can use | empty | Fill in as needed; the system will allocate memory according to the value | Takes effect after a service restart |

#### Function Configuration

The parameters that actually take effect are in the file conf/iotdb-system.properties. The following parameters need to be set before startup; the full parameter list can be found in the conf/iotdb-system.properties.template file.

Cluster-level configuration

| **Configuration** | **Description** | **Default** | **Recommended** | Note |
| :------------------------ | :------------------------------- | :------------- | :----------------------------------------------- | :------------------------ |
| cluster_name | Cluster name | defaultCluster | Set as needed; keep the default if there are no special requirements | Cannot be modified after the first startup |
| schema_replication_factor | Number of schema replicas; set to 1 for the stand-alone edition | 1 | 1 | Defaults to 1; cannot be modified after the first startup |
| data_replication_factor | Number of data replicas; set to 1 for the stand-alone edition | 1 | 1 | Defaults to 1; cannot be modified after the first startup |

ConfigNode configuration

| **Configuration** | **Description** | **Default** | **Recommended** | **Note** |
| :------------------ | :----------------------------------------------------------- | :-------------- | :----------------------------------------------- | :----------------- |
| cn_internal_address | Address used by the ConfigNode for internal cluster communication | 127.0.0.1 | The server's IPV4 address or hostname; hostname is recommended | Cannot be modified after the first startup |
| cn_internal_port | Port used by the ConfigNode for internal cluster communication | 10710 | 10710 | Cannot be modified after the first startup |
| cn_consensus_port | Port used by the ConfigNode replica group consensus protocol | 10720 | 10720 | Cannot be modified after the first startup |
| cn_seed_config_node | Address of the ConfigNode that a node connects to when registering to join the cluster, i.e. cn_internal_address:cn_internal_port | 127.0.0.1:10710 | cn_internal_address:cn_internal_port | Cannot be modified after the first startup |

DataNode configuration

| **Configuration** | **Description** | **Default** | **Recommended** | **Note** |
| :------------------------------ | :----------------------------------------------------------- | :-------------- | :----------------------------------------------- | :----------------- |
| dn_rpc_address | Address of the client RPC service | 0.0.0.0 | 0.0.0.0 | Takes effect after a service restart |
| dn_rpc_port | Port of the client RPC service | 6667 | 6667 | Takes effect after a service restart |
| dn_internal_address | Address used by the DataNode for internal cluster communication | 127.0.0.1 | The server's IPV4 address or hostname; hostname is recommended | Cannot be modified after the first startup |
| dn_internal_port | Port used by the DataNode for internal cluster communication | 10730 | 10730 | Cannot be modified after the first startup |
| dn_mpp_data_exchange_port | Port used by the DataNode to receive data streams | 10740 | 10740 | Cannot be modified after the first startup |
| dn_data_region_consensus_port | Port used by the DataNode for data replica consensus protocol communication | 10750 | 10750 | Cannot be modified after the first startup |
| dn_schema_region_consensus_port | Port used by the DataNode for schema replica consensus protocol communication | 10760 | 10760 | Cannot be modified after the first startup |
| dn_seed_config_node | Address of the ConfigNode that a node connects to when registering to join the cluster, i.e. cn_internal_address:cn_internal_port | 127.0.0.1:10710 | cn_internal_address:cn_internal_port | Cannot be modified after the first startup |
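Putting the tables above together: for a single-node instance on a server whose hostname is iotdb-1 (as in the Note section), a minimal working configuration might look as follows. This is an illustrative sketch, not a complete file; all keys come from the tables above, everything else can stay at its default, and MEMORY_SIZE, if it needs pinning, is set separately in conf/confignode-env.sh and conf/datanode-env.sh:

```properties
# conf/iotdb-system.properties — illustrative 1C1D settings (host iotdb-1)
cluster_name=defaultCluster
schema_replication_factor=1
data_replication_factor=1

# ConfigNode
cn_internal_address=iotdb-1
cn_internal_port=10710
cn_consensus_port=10720
cn_seed_config_node=iotdb-1:10710

# DataNode
dn_rpc_address=0.0.0.0
dn_rpc_port=6667
dn_internal_address=iotdb-1
dn_internal_port=10730
dn_mpp_data_exchange_port=10740
dn_data_region_consensus_port=10750
dn_schema_region_consensus_port=10760
dn_seed_config_node=iotdb-1:10710
```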
### 3. Start the ConfigNode

Enter the sbin directory of IoTDB and start the ConfigNode:

```shell
./sbin/start-confignode.sh -d  # the "-d" flag starts the process in the background
```

If the startup fails, please refer to the [FAQ](#faq) below.

### 4. Start the DataNode

Enter the sbin directory of IoTDB and start the DataNode:

```shell
./sbin/start-datanode.sh -d  # the "-d" flag starts the process in the background
```

### 5. Activate the Database

#### Option 1: File Activation

- After starting the ConfigNode and DataNode, enter the activation folder and copy the system_info file to the Timecho staff
- Receive the license file returned by the staff
- Place the license file in the activation folder of the corresponding node

#### Option 2: Command Activation

- Get the machine code required for activation by entering the IoTDB CLI (`./start-cli.sh -sql_dialect table` / `start-cli.bat -sql_dialect table`) and executing the following:
  - Note: this is temporarily not supported when sql_dialect is table

```shell
show system info
```

- The following information is displayed; copy the machine code (the green string) to the Timecho staff:

```sql
+--------------------------------------------------------------+
|                                                    SystemInfo|
+--------------------------------------------------------------+
|                                          01-TE5NLES4-UDDWCMYE|
+--------------------------------------------------------------+
Total line number = 1
It costs 0.030s
```

- Enter the activation code returned by the staff into the CLI as follows
  - Note: the activation code must be wrapped in `'` quotes, as shown

```sql
IoTDB> activate '01-D4EYQGPZ-EAUJJODW-NUKRDR6F-TUQS3B75-EDZFLK3A-6BOKJFFZ-ALDHOMN7-NB2E4BHI-7ZKGFVK6-GCIFXA4T-UG3XJTTD-SHJV6F2P-Q27B4OMJ-R47ZDIM3-UUASUXG2-OQXGVZCO-MMYKICZU-TWFQYYAO-ZOAGOKJA-NYHQTA5U-EWAR4EP5-MRC6R2CI-PKUTKRCT-7UDGRH3F-7BYV4P5D-6KKIA==='
```

### 6. Verify Activation

When the "ClusterActivationStatus" field shows ACTIVATED, the activation was successful.

![](https://alioss.timecho.com/docs/img/%E5%8D%95%E6%9C%BA-%E9%AA%8C%E8%AF%81.png)

## FAQ

1. Activation fails repeatedly during deployment
   - Use the `ls -al` command: check that the owner of the installation package's root directory is the current user.
   - Check the activation directory: check all files under `./activation` and confirm their owner is the current user.
2. The ConfigNode fails to start
   - Step 1: Check the startup log for parameters that cannot be modified after the first startup.
   - Step 2: Check the startup log for other anomalies; if any exist, contact Timecho technical support for advice.
   - Step 3: If this is the first deployment, or if the data can be deleted, you can also clean up the environment as follows, redeploy, and start again.
   - Clean up the environment:
     1. Terminate all ConfigNode and DataNode processes.
        ```Bash
        # 1. Stop the ConfigNode and DataNode services
        sbin/stop-standalone.sh

        # 2. Check for remaining processes
        jps
        # or
        ps -ef|grep iotdb

        # 3. If there are remaining processes, kill them manually
        kill -9 <pid>
        # If you are sure there is only one IoTDB instance on the machine,
        # you can use the following command to clean up residual processes
        ps -ef|grep iotdb|grep -v grep|tr -s ' ' ' ' |cut -d ' ' -f2|xargs kill -9
        ```
     2. Delete the data and logs directories.
        - Note: deleting the data directory is necessary; deleting the logs directory only keeps the logs clean and is optional.
        ```shell
        cd /data/iotdb
        rm -rf data logs
        ```

## Appendix

### ConfigNode Parameters

| Parameter | Description | Required |
| :--- | :------------------------------- | :----------- |
| -d | Start in daemon mode, i.e. run in the background | No |

### DataNode Parameters

| Parameter | Description | Required |
| :--- | :--------------------------------------------- | :----------- |
| -v | Show version information | No |
| -f | Run the script in the foreground instead of the background | No |
| -d | Start in daemon mode, i.e. run in the background | No |
| -p | Specify a file to store the process ID, for process management | No |
| -c | Specify the path of the configuration folder; the script loads configuration files from there | No |
| -g | Print detailed garbage collection (GC) information | No |
| -H | Specify the path of the Java heap dump file, used when the JVM runs out of memory | No |
| -E | Specify the path of the JVM error log file | No |
| -D | Define system properties, in the format key=value | No |
| -X | Pass -XX options directly to the JVM | No |
| -h | Help | No |
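As a usage sketch combining several of the flags above (the paths are illustrative placeholders):

```shell
# Start a DataNode as a daemon, record its PID for process management,
# and load configuration from an explicit conf folder.
./sbin/start-datanode.sh -d -p /tmp/iotdb-datanode.pid -c /data/iotdb/conf
```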