From 9e397dd9b7d70cfce308dcf39e18a01bb3e7b4ad Mon Sep 17 00:00:00 2001
From: Paul Lorenz
Date: Fri, 31 Jan 2025 18:17:40 -0500
Subject: [PATCH] Add Controller HA reference material. Fixes #929

---
 docusaurus/docs/reference/ha/routers.md  | 90 ++++++++++++++++++++++++
 docusaurus/docs/reference/ha/topology.md | 81 +++++++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 docusaurus/docs/reference/ha/routers.md
 create mode 100644 docusaurus/docs/reference/ha/topology.md

diff --git a/docusaurus/docs/reference/ha/routers.md b/docusaurus/docs/reference/ha/routers.md
new file mode 100644
index 00000000..f9a7b8bc
--- /dev/null
+++ b/docusaurus/docs/reference/ha/routers.md
@@ -0,0 +1,90 @@
+---
+sidebar_label: Routers
+sidebar_position: 40
+---
+
+# Routers in Controller HA
+
+There are only a few differences in how routers work in an HA cluster.
+
+## Configuration
+
+When enrolling a router, the JWT for the new router contains the list of
+controllers. When the router is enrolled, the controller endpoints
+configuration file is initialized with that list.
+
+This means that manually configuring the controllers for a router should
+no longer be required.
+
+### Endpoints File
+
+The router stores the currently known controllers in an endpoints
+configuration file.
+
+Note that:
+
+* The endpoints file will be written whenever the router is notified of changes
+  to the controller cluster.
+* The file is only read at router startup.
+* The file is not monitored, so changes made by administrators while the router
+  is running won't take effect until the router is restarted, and may be
+  overwritten by the router before it is restarted. Make sure the router is
+  stopped before manually editing the file.
+* The endpoints file is only generated by enrollment and when the endpoints
+  change. For an existing configuration with the controllers specified in the
+  router config, if the endpoints never change, the endpoints file will never
+  be generated.
+
+#### Location
+
+By default the endpoints file will be named `endpoints` and will be placed
+in the same directory as the router config file.
+
+However, the file name and location can be customized using a config file
+setting.
+
+```yaml
+ctrl:
+  endpoints:
+    - tls:ctrl1.ziti.example.com:1280
+  endpointsFile: /var/run/ziti/endpoints.yaml
+```
+
+### Manual Controller Configuration
+
+Instead of specifying a single controller, multiple controllers can be
+specified in the router configuration.
+
+```yaml
+ctrl:
+  endpoints:
+    - tls:ctrl1.ziti.example.com:1280
+    - tls:ctrl2.ziti.example.com:1280
+    - tls:ctrl3.ziti.example.com:1280
+```
+
+If the controller cluster changes, it will notify routers of the updated
+controller endpoints.
+
+## Router Data Model
+
+The router receives a stripped-down version of the controller data model.
+
+While the router data model can be disabled on the controller using a config
+setting in standalone mode, it is required for controller clusters, so that
+setting will be ignored.
+
+The data model on the router is periodically snapshotted, so it doesn't need to
+be fully restored from a controller on every restart.
+
+The location and frequency of snapshotting can be
+[configured using the db and dbSaveIntervalSeconds properties](../configuration/router#edge).
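+
+As a minimal sketch, assuming these properties live under the `edge` section of
+the router config, as the linked configuration reference suggests, the snapshot
+settings might look like this (the path and interval are illustrative values,
+not defaults):
+
+```yaml
+edge:
+  # hypothetical location for the router data model snapshot
+  db: /var/lib/ziti/router-data-model.db
+  # how often, in seconds, to snapshot the in-memory data model
+  dbSaveIntervalSeconds: 30
+```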
+
+## Controller Selection
+
+When creating [circuits](/learn/core-concepts/security/SessionsAndConnections.md#data-plane),
+routers will choose the most responsive controller, based on latency. Network
+operators will want to monitor controllers to make sure they can keep up with
+the circuit creation load they receive.
+
+When managing terminators, routers will try to talk directly to the current
+cluster leader, since updates have to go through the leader.
diff --git a/docusaurus/docs/reference/ha/topology.md b/docusaurus/docs/reference/ha/topology.md
new file mode 100644
index 00000000..57934a08
--- /dev/null
+++ b/docusaurus/docs/reference/ha/topology.md
@@ -0,0 +1,81 @@
+---
+sidebar_label: Topology
+sidebar_position: 60
+---
+
+# Controller Topology
+
+This document discusses how many controllers a network might need and how to
+place them geographically.
+
+## Number of Controllers
+
+### Management
+
+The first consideration is how many controllers the network should be able to
+lose without losing functionality. A cluster of size N needs (N/2) + 1
+controllers (using integer division) active and connected to be able to accept
+model updates, such as provisioning identities, adding or changing services,
+and updating policies.
+
+Since a two-node cluster will lose some functionality if either node becomes
+unavailable, a minimum of 3 nodes is recommended.
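+
+As a quick reference, the quorum math works out as follows. This table is
+illustrative, derived directly from the (N/2) + 1 formula above:
+
+| Cluster size | Quorum needed | Tolerable failures |
+|--------------|---------------|--------------------|
+| 1            | 1             | 0                  |
+| 2            | 2             | 0                  |
+| 3            | 2             | 1                  |
+| 4            | 3             | 1                  |
+| 5            | 3             | 2                  |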
+
+### Clients
+
+The functionality that controllers provide to clients doesn't require any
+specific number of controllers. A network manager will want to scale the number
+of controllers based on client demand and may want to place additional
+controllers geographically close to clusters of clients for better performance.
+
+## Voting vs Non-Voting Members
+
+Because every model update must be approved by a quorum of voting members,
+adding a large number of voting members can add a lot of latency to model
+changes.
+
+If more controllers are desired to scale out to meet client needs, only as many
+controllers as are needed to meet the availability requirements for management
+should be made voting members.
+
+Additionally, keeping a quorum of voting members geographically close together
+will reduce latency without necessarily reducing availability.
+
+### Example
+
+**Requirements**
+
+1. The network should be able to withstand the loss of 1 voting member.
+1. Controllers should exist in the US, EU and Asia, with 2 in each region.
+
+To be able to lose one voting member, we need 3 voting nodes, with 6 nodes
+total.
+
+We should place 2 voting members in the same region, but in different
+availability zones/data centers. The third voting member should be in a
+different region. The rest of the controllers should be non-voting.
+
+**Proposed Layout**
+
+So, using AWS regions, we might have:
+
+* 1 in us-east-1 (voting)
+* 1 in us-west-2 (voting)
+* 1 in eu-west-3 (voting)
+* 1 in eu-south-1 (non-voting)
+* 1 in ap-southeast-4 (non-voting)
+* 1 in ap-south-2 (non-voting)
+
+Assuming the leader is one of us-east-1 or us-west-2, a model update only needs
+to be acknowledged by one relatively close node before being accepted. All
+other controllers will receive the updates as well, but updates won't be gated
+on communication with all of them.
+
+**Alternate**
+
+For even faster updates, at the cost of an extra controller, two voting members
+could be in us-east, one in us-east-1 and one in us-east-2. The third voting
+member could be in the EU. Updates would then only need to be approved by two
+very close controllers. If one of them went down, updates would slow down,
+since they would need to travel over higher-latency links, but they would
+still work.
+
+* 1 in us-east-1 (voting)
+* 1 in us-east-2 (voting)
+* 1 in us-west-2 (non-voting)
+* 1 in eu-west-3 (voting)
+* 1 in eu-south-1 (non-voting)
+* 1 in ap-southeast-4 (non-voting)
+* 1 in ap-south-2 (non-voting)