
Commit

Final pieces of controller clustering documentation. Fixes #929
plorenz committed Feb 7, 2025
1 parent c022517 commit cada138
Showing 3 changed files with 177 additions and 0 deletions.
5 changes: 5 additions & 0 deletions docusaurus/docs/reference/ha/overview.md
@@ -131,3 +131,8 @@ The following limitations currently apply:

Improving routing is an ongoing focus for the OpenZiti project.
Issues related to routing improvements can be found on the [Routing Project Board](https://github.com/orgs/openziti/projects/13/views/1).

## Quickstart

The quickstart supports running in clustered mode; see
[this guide](https://github.com/openziti/ziti/blob/main/doc/ha/quickstart.md) for more information.
90 changes: 90 additions & 0 deletions docusaurus/docs/reference/ha/routers.md
@@ -0,0 +1,90 @@
---
sidebar_label: Routers
sidebar_position: 40
---

# Routers in Controller HA

There are only a few differences in how routers work in an HA cluster.

## Configuration

When enrolling routers, the JWT for a new router contains the list of
controllers in the cluster. When the router is enrolled, its controller endpoints
configuration file is initialized with that list.

This means that manually configuring the controllers for a router should
no longer be required.

### Endpoints File

The router stores the currently known controllers in an endpoints configuration
file.

Note that:

* The endpoints file will be written whenever the router is notified of changes
to the controller cluster.
* The file is only read at router startup.
* The file is not monitored, so changes made by administrators while the router
is running won't take effect until the router is restarted, and may be
overwritten by the router before it is restarted. Make sure the router is
stopped before manually editing the file.
* The endpoints file is only generated by enrollment and when the endpoints
change. For an existing configuration with the controllers specified in the
router config, if the endpoints never change, the endpoints file will never
be generated.
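
As a rough illustration, the file holds the list of controller endpoints the router
currently knows about. The exact schema is managed by the router and may differ; this
sketch only assumes a simple YAML list:

```yaml
# Illustrative sketch only: the router writes and maintains this file itself.
# Do not hand-edit it while the router is running.
endpoints:
  - tls:ctrl1.ziti.example.com:1280
  - tls:ctrl2.ziti.example.com:1280
  - tls:ctrl3.ziti.example.com:1280
```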

#### Location

By default, the endpoints file will be named `endpoints` and will be placed
in the same directory as the router config file.

However, the file location can be customized using a config file setting.

```yaml
ctrl:
  endpoints:
    - tls:ctrl1.ziti.example.com:1280
  endpointsFile: /var/run/ziti/endpoints.yaml
```
### Manual Controller Configuration

Instead of specifying a single controller, multiple controllers can be specified
in the router configuration.

```yaml
ctrl:
  endpoints:
    - tls:ctrl1.ziti.example.com:1280
    - tls:ctrl2.ziti.example.com:1280
    - tls:ctrl3.ziti.example.com:1280
```

If the controller cluster changes, it will notify routers of the updated
controller endpoints.
## Router Data Model

The router receives a stripped down version of the controller data model.

While the router data model can be disabled on the controller using a config
setting in standalone mode, it is required for controller clusters, so that
setting will be ignored.

The data model on the router is periodically snapshotted, so it doesn't need to
be fully restored from a controller on every restart. The location and frequency
of snapshotting can be
[configured using the db and dbSaveIntervalSeconds properties](../configuration/router#edge).
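
A minimal sketch of how those properties might look in the router config, assuming
illustrative values (the path and interval below are examples, not documented defaults):

```yaml
edge:
  # where the router's data model snapshot is stored (illustrative path)
  db: /var/lib/ziti/router-data-model.db
  # how often, in seconds, the in-memory data model is snapshotted to disk
  dbSaveIntervalSeconds: 30
```
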
## Controller Selection

When creating [circuits](/learn/core-concepts/security/SessionsAndConnections.md#data-plane),
routers will choose the most responsive controller, based on latency. Network operators will
want to keep an eye on controllers to make sure they can keep up with the circuit creation
load they receive.

When managing terminators, routers will try to talk directly to the current
cluster leader, since updates have to go through the leader.
82 changes: 82 additions & 0 deletions docusaurus/docs/reference/ha/topology.md
@@ -0,0 +1,82 @@
---
sidebar_label: Topology
sidebar_position: 60
---

# Controller Topology

This document discusses cluster size and member placement.

## Number of Controllers

### Management

The first consideration is how many controllers the network should be able to lose without losing
functionality. A cluster of size N needs (N/2) + 1 voting members connected to be able
to accept model updates, such as provisioning identities, adding or changing services, and updating policies.

Since a two-node cluster will lose some functionality if either node becomes unavailable, a minimum
of 3 nodes is recommended.
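
For reference, working out the majority requirement for small clusters:

| Voting members | Quorum required | Voting members that can be lost |
|----------------|-----------------|---------------------------------|
| 1              | 1               | 0                               |
| 2              | 2               | 0                               |
| 3              | 2               | 1                               |
| 5              | 3               | 2                               |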

### Clients

The functionality that controllers provide to clients doesn't require any specific number of controllers.
A network manager will want to scale the number of controllers based on client demand and may want to
place additional controllers geographically close to clusters of clients for better performance.

## Voting vs Non-Voting Members

Because every model update must be approved by a quorum of voting members, adding a large number of voting
members can add a lot of latency to model changes. A three-node cluster in the same data center would
likely need a few tens of milliseconds. A cluster with a quorum spanning a single continent might take a hundred
milliseconds, and one that had to traverse large portions of the globe might take half a second.

If the network has enough voting members to meet availability needs, then additional controllers added
for performance reasons should be added as non-voting members.

Additionally, having a quorum of controllers be geographically close will reduce latency without necessarily
reducing availability.

### Example

**Requirements**

1. The network should be able to withstand the loss of one voting member.
1. Controllers should exist in the US, EU and Asia, with two in each region.

To be able to lose one voting member, we need three voting nodes, with six nodes total.

We should place two voting members in the same region, but in different availability zones/data centers.
The third voting member should be in a different region. The rest of the controllers should be non-voting.

**Proposed Layout**

So, using AWS regions, the network might have:

* One in us-east-1 (voting)
* One in us-west-2 (voting)
* One in eu-west-3 (voting)
* One in eu-south-1 (non-voting)
* One in ap-southeast-4 (non-voting)
* One in ap-south-2 (non-voting)

Assuming the leader is one of us-east-1 or us-west-2, model updates will only need to be acknowledged by
one other relatively close node before being accepted. All other controllers will receive the updates as well,
but updates won't be gated on communications with all of them.

**Alternate**

For even faster updates at the cost of an extra controller, two voting controllers could be placed close together
in the eastern US: one in us-east-1 and one in us-east-2. The third voting member could be in the EU. Updates would
now only need to be approved by two very close controllers. If one of them went down, updates would slow down, since
they would need to be acknowledged over longer latencies, but they would still work.

* One in us-east-1 (voting)
* One in us-east-2 (voting)
* One in us-west-2 (non-voting)
* One in eu-west-3 (voting)
* One in eu-south-1 (non-voting)
* One in ap-southeast-4 (non-voting)
* One in ap-south-2 (non-voting)

