WIP PCR on Cloud

cockroachdb · Jan 24, 2025 · fd5db34 · fd5db34
1 parent d2e120e
commit fd5db34

Showing 5 changed files with 316 additions and 6 deletions.
diff --git a/src/current/_includes/v24.3/sidebar-data/cloud-deployments.json b/src/current/_includes/v24.3/sidebar-data/cloud-deployments.json
@@ -551,6 +551,12 @@
                     }
                 ]
               },
+              {
+                "title": "Physical Cluster Replication",
+                "urls": [
+                    "/cockroachcloud/physical-cluster-replication.html"
+                ]
+              },
               {
                 "title": "Billing Management",
                 "urls": [

diff --git a/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json b/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json
@@ -92,6 +92,12 @@
                 ]
               }
             ]
+          },
+          {
+            "title": "Physical Cluster Replication on Cloud clusters",
+            "urls": [
+              "/cockroachcloud/physical-cluster-replication.html"
+            ]
           }
         ]
       }

diff --git a/src/current/cockroachcloud/physical-cluster-replication.md b/src/current/cockroachcloud/physical-cluster-replication.md
@@ -0,0 +1,236 @@
+---
+title: Physical Cluster Replication
+summary: Set up physical cluster replication (PCR) in a Cloud deployment.
+toc: true
+---
+
+{{site.data.alerts.callout_info}}
+{% include feature-phases/preview.md %}
+{{site.data.alerts.end}}
+
+CockroachDB **physical cluster replication (PCR)** continuously sends all data at the byte level from a _primary_ cluster to an independent _standby_ cluster. Existing data and ongoing changes on the active primary cluster, which is serving application data, replicate asynchronously to the passive standby cluster.
+
+In a disaster recovery scenario, you can [fail over](#step-4-fail-over-to-the-standby-cluster) from the unavailable primary cluster to the standby cluster. This will stop the replication stream, reset the standby cluster to a point in time where all ingested data is consistent, and mark the standby as ready to accept application traffic.
+
+
+{% comment %}Add short use case / benefits here, or decide whether it's ok to link to the PCR overview page {% endcomment %}
+{% comment %}Remember to add to the feature preview page {% endcomment %}
+
+## Set up PCR in CockroachDB {{ site.data.products.advanced }}
+
+In this guide, you'll use the {{ site.data.products.cloud }} API to set up PCR from a primary cluster to a standby cluster, monitor the PCR stream, and fail over from the primary to the standby cluster.
+
+{{site.data.alerts.callout_info}}
+PCR is supported on CockroachDB {{ site.data.products.advanced }} and CockroachDB self-hosted clusters. For a guide to setting up PCR in CockroachDB self-hosted, refer to the [Set Up Physical Cluster Replication]({% link {{ site.current_cloud_version }}/set-up-physical-cluster-replication.md %}) tutorial.
+{{site.data.alerts.end}}
+
+### Before you begin
+
+You'll need the following:
+
+{% comment %}Add links{% endcomment %}
+- [Cloud API Access]({% link cockroachcloud/managing-access.md %}#api-access).
+
+    To set up and manage PCR on CockroachDB {{ site.data.products.advanced }} clusters, you'll use the `'https://cockroachlabs.cloud/api/v1/replication-streams'` endpoint. Access to the `replication-streams` endpoint requires a valid CockroachDB {{ site.data.products.cloud }} [service account]({% link cockroachcloud/managing-access.md %}#manage-service-accounts) with the correct permissions.
+
+    The following table describes the roles required for each `replication-streams` endpoint method:
+
+    Method | Required roles | Description
+    -------+----------------+------------
+    `POST` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator) | Create a PCR stream.
+    `GET` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator), [Cluster Operator]({% link cockroachcloud/authorization.md %}#cluster-operator), [Cluster Developer]({% link cockroachcloud/authorization.md %}#cluster-developer) | Retrieve information for the PCR stream.
+    `PATCH` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator) | Update the PCR stream to fail over.
+
+    {{site.data.alerts.callout_success}}
+    We recommend creating service accounts with the [principle of least privilege](https://wikipedia.org/wiki/Principle_of_least_privilege), and giving each application that accesses the API its own service account and API key. This allows fine-grained access to the cluster and PCR streams.
+    {{site.data.alerts.end}}
+
+- Read the [Configuration behavior](#configuration-behavior) and [Known limitations](#known-limitations) sections.
+
+### Configuration behavior
+
+To set up PCR successfully:
+
+- Clusters must be in the same cloud (AWS, GCP, or Azure).
+- Clusters must be single region (but it is possible to have multiple availability zones per cluster).
+- The primary and standby clusters in AWS and Azure must be in different regions.
+- The primary and standby clusters in GCP can be in the same region, but must not have overlapping CIDR ranges.
+- Clusters can have different node topology and hardware configurations. For disaster recovery purposes (failover and redirecting application traffic to a standby), we recommend configuring the primary and standby clusters with similar hardware.
+
+### Step 1. Create the clusters
+
+To use PCR, set `support_physical_cluster_replication` to `true` when you create each cluster. This setting indicates that the cluster should use an architecture that supports PCR. For cluster cloud and region details, refer to [Configuration behavior](#configuration-behavior).
+
+1. Send a `POST` request to create the primary cluster:
+
+    {% include_cached copy-clipboard.html %}
+    ~~~ shell
+    curl --location --request POST 'https://cockroachlabs.cloud/api/v1/clusters' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"name": "primary_cluster_name", "provider": "AWS", "spec": {"dedicated": {"cockroachVersion": "v24.3", "hardware": {"disk_iops": 0, "machine_spec": {"num_virtual_cpus": 4}, "storage_gib": 16}, "region_nodes": {"us-east-1": 3}, "support_physical_cluster_replication": true}}}'
+    ~~~
+
+    For details on the cluster specifications, refer to [Create a cluster]({% link cockroachcloud/cloud-api.md %}#create-a-cluster). Ensure that you also include:
+    - `api_secret_key`: your API secret key.
+    - `support_physical_cluster_replication` set to `true`.
+
+1. Send a `POST` request to create the standby cluster:
+
+    {% include_cached copy-clipboard.html %}
+    ~~~ shell
+    curl --location --request POST 'https://cockroachlabs.cloud/api/v1/clusters' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"name": "standby_cluster_name", "provider": "AWS", "spec": {"dedicated": {"cockroachVersion": "v24.3", "hardware": {"disk_iops": 0, "machine_spec": {"num_virtual_cpus": 4}, "storage_gib": 16}, "region_nodes": {"us-east-2": 3}, "support_physical_cluster_replication": true}}}'
+    ~~~
+
+    If you're creating clusters in AWS or Azure, you must ensure the primary and standby clusters are in different regions.
+
+{{site.data.alerts.callout_success}}
+We recommend [enabling Prometheus metrics export]({% link cockroachcloud/export-metrics.md %}) on your cluster before starting a PCR stream. For details on metrics to track, refer to [Monitor the PCR stream](#step-3-monitor-the-pcr-stream).
+{{site.data.alerts.end}}
+
+### Step 2. Start the PCR stream
+
+With the primary and standby clusters created, you can now start the PCR stream.
+
+{% comment %}To edit{% endcomment %}
+It is possible to write to both clusters before starting PCR. However, we recommend keeping the standby cluster empty (that is, not writing to it) before starting PCR. Because the standby cluster must be empty when PCR starts, CockroachDB {{ site.data.products.cloud }} takes a full cluster backup of the standby cluster, wipes it, and then starts replication.
+
+With the Cloud API, run the following command to start the PCR stream. You can find the cluster IDs in the cluster creation output:
+
+{% include_cached copy-clipboard.html %}
+~~~ shell
+curl --location --request POST 'https://cockroachlabs.cloud/api/v1/replication-streams' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"sourceClusterId": "primary_cluster_id","targetClusterId": "standby_cluster_id"}'
+~~~
+
+Replace:
+
+- `api_secret_key` with your API secret key.
+- `primary_cluster_id` with the cluster ID returned after creating the primary cluster.
+- `standby_cluster_id` with the cluster ID returned after creating the standby cluster.
+
+Once you have started PCR, the CockroachDB {{ site.data.products.cloud }} standby cluster cannot accept reads or writes. Therefore, the Cloud Console and SQL shell are unavailable for the standby cluster until failover.
+
+{% comment %}Add note to self-hosted docs that read from standby available there, but not in cloud{% endcomment %}
+
+~~~ json
+{
+    "id": "7487d7a6-868b-4c6f-aa60-cc306cc525fe",
+    "status": "STARTING",
+    "source_cluster_id": "ad1e8630-729a-40f3-87e4-9f72eb3347a0",
+    "target_cluster_id": "e64b56c3-09f2-42ee-9f5a-d9b13985a897",
+    "created_at": "2025-01-13T16:41:20.467781Z",
+    "retained_time": null,
+    "replicated_time": null,
+    "failover_at": null,
+    "activation_at": null
+}
+~~~
+
+To start PCR between clusters, CockroachDB {{ site.data.products.cloud }} sets up VPC peering between the clusters and validates the connectivity. As a result, it may take around 5 minutes to initialize the PCR job, during which the status will be `STARTING`.
+
+### Step 3. Monitor the PCR stream
+
+To monitor the current status of the PCR stream, send a `GET` request to the `/v1/replication-streams` endpoint along with the primary cluster ID, the standby cluster ID, or the ID of the PCR stream:
+
+{% include_cached copy-clipboard.html %}
+~~~ shell
+curl --location --request GET "https://cockroachlabs.cloud/api/v1/replication-streams?cluster_id=e64b56c3-09f2-42ee-9f5a-d9b13985a897" --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' 
+~~~
+
+This will return:
+
+~~~json
+{
+  "replication_streams": [
+    {
+      "id": "7487d7a6-868b-4c6f-aa60-cc306cc525fe",
+      "status": "REPLICATING",
+      "source_cluster_id": "ad1e8630-729a-40f3-87e4-9f72eb3347a0",
+      "target_cluster_id": "e64b56c3-09f2-42ee-9f5a-d9b13985a897",
+      "created_at": "2025-01-13T16:41:20.467781Z",
+      "retained_time": "2025-01-13T19:35:14.472670Z",
+      "replicated_time": "2025-01-13T19:46:15Z",
+      "failover_at": null,
+      "activation_at": null
+    }
+  ]
+}
+~~~
+
+- `id`: The ID of the PCR stream.
+- `status`: The status of the PCR stream. For descriptions, refer to [Status](#status).
+- `source_cluster_id`: The cluster ID of the primary cluster.
+- `target_cluster_id`: The cluster ID of the standby cluster.
+- `created_at`: The timestamp when the PCR stream was started.
+- `retained_time`: The timestamp indicating the lower bound that the PCR stream can fail over to. The tracked replicated time and the advancing [protected timestamp]({% link {{ site.current_cloud_version }}/architecture/storage-layer.md %}#protected-timestamps) allow PCR to also track the retained time. Therefore, the failover window for a PCR job falls between the retained time and the replicated time.
+- `replicated_time`: The latest time at which the standby cluster has consistent data.
+- `failover_at`: The requested timestamp for failover. If you used `"status":"FAILING_OVER"` to initiate the failover and omitted `failover_at`, the failover time will default to the latest consistent replicated time. For more details, refer to [Fail over to the standby cluster](#step-4-fail-over-to-the-standby-cluster).
+- `activation_at`: The CockroachDB system time at which failover is finalized, which could be different from the time that failover was requested. This field will return a response when the PCR stream is in [`COMPLETED` status](#status).
+
+
+{% comment  %}Add statuses and properties (maybe link to api docs here){% endcomment %}
+
+#### Status
+
+Status | Description
+-------+------------
+`STARTING` | The PCR stream is starting by setting up the VPC peering connection between clusters and validating the connectivity.
+`REPLICATING` | The PCR stream will complete an initial scan and then continue the ongoing replication between the primary and standby clusters.
+`FAILING_OVER` | The failover has been initiated from the primary to the standby cluster.
+`COMPLETED` | The failover is complete and the standby cluster is now independent from the primary cluster.
+
+#### Metrics
+
+For continual monitoring of PCR, track the following metrics with [Prometheus]({% link cockroachcloud/export-metrics.md %}): 
+
+- `physical_replication.logical_bytes`: The logical bytes (the sum of all keys and values) ingested by all PCR jobs.
+- `physical_replication.sst_bytes`: The SST bytes (compressed) sent to the KV layer by all PCR jobs.
+- `physical_replication.replicated_time_seconds`: The replicated time of the physical replication stream in seconds since the Unix epoch.
+
+### Step 4. Fail over to the standby cluster
+
+Failing over from the primary cluster to the standby cluster will stop the PCR stream, reset the standby cluster to a point in time where all ingested data is consistent, and mark the standby as ready to accept application traffic. You can schedule the failover to:
+
+- The latest consistent time.
+- A time in the past, no earlier than the `retained_time`.
+- A time up to one hour in the future.
+
+To specify a timestamp, send a `PATCH` request to the `/v1/replication-streams` endpoint along with the primary cluster ID, the standby cluster ID, or the ID of the PCR stream. Include the `failover_at` timestamp and the `"status": "FAILING_OVER"` field:
+
+{% include_cached copy-clipboard.html %}
+~~~ shell
+curl --location --request PATCH "https://cockroachlabs.cloud/api/v1/replication-streams/7487d7a6-868b-4c6f-aa60-cc306cc525fe" --header "Authorization: Bearer api_secret_key" --header 'Content-Type:application/json' --data '{"status": "FAILING_OVER", "failover_at": "2025-01-13T19:35:14.472670Z"}'
+~~~
+
+To fail over to the latest consistent time, you only need to include `"status": "FAILING_OVER"` in your request with one of the cluster IDs or the PCR stream ID:
+
+{% include_cached copy-clipboard.html %}
+~~~ shell
+curl --location --request PATCH "https://cockroachlabs.cloud/api/v1/replication-streams/7487d7a6-868b-4c6f-aa60-cc306cc525fe" --header "Authorization: Bearer api_secret_key" --header 'Content-Type:application/json' --data '{"status": "FAILING_OVER"}'
+~~~
+~~~json
+{
+  "id": "f1ea3b5a-5b9b-4101-9db4-780734554b14",
+  "status": "FAILING_OVER",
+  "source_cluster_id": "560cc798-7881-4a5b-baf0-321354a42c19",
+  "target_cluster_id": "24a9d47c-418c-4abc-940b-d2f665d95715",
+  "created_at": "2025-01-23T16:54:15.926325Z",
+  "retained_time": null,
+  "replicated_time": null,
+  "failover_at": null,
+  "activation_at": null
+}
+~~~
+
+After the failover is complete, both clusters can receive traffic and operate as separate clusters. It is necessary to redirect application traffic manually.
+
+{{site.data.alerts.callout_info}}
+PCR operates at the cluster level, which means that the job also replicates all system tables. Because the standby cluster is a copy of the primary cluster, users who need to access the standby cluster after failover should use the user roles from the primary cluster. PCR overwrites all previous system tables on the standby cluster.
+{{site.data.alerts.end}}
+
+### Fail back to the primary cluster
+
+To fail back from the standby to the primary cluster, start another PCR stream with the standby cluster as the `sourceClusterId` and the original primary cluster as the `targetClusterId`. 
+
+## Known limitations
+
+- PCR on CockroachDB {{ site.data.products.cloud }} clusters cannot be started with existing clusters if the cluster was created without the `"support_physical_cluster_replication": true` parameter. 
+- Failing back to an original primary cluster by replicating only the difference in data between the promoted standby and the original primary cluster is not supported. As a result, failing back to the original primary cluster involves starting a new PCR stream, including an initial scan.
+- Reading from the standby cluster during PCR is not supported.
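
The failover request described in Step 4 of `physical-cluster-replication.md` could be prepared with a small helper script. The following is a minimal sketch, not part of this commit: the function names `ts_to_int`, `in_failover_window`, and `build_failover_payload` are hypothetical, and the script assumes every timestamp is in UTC in the `YYYY-MM-DDTHH:MM:SS[.fraction]Z` form shown in the example responses. The caller supplies the upper bound of the window (for example, the `replicated_time`, or a time up to one hour in the future), because the script does not query the API itself.

```shell
#!/bin/sh
# Hypothetical helpers for preparing a PCR failover request.
# Assumption: all timestamps are UTC, formatted YYYY-MM-DDTHH:MM:SS[.fraction]Z.

# Reduce a timestamp to a 14-digit integer (YYYYMMDDHHMMSS). Fractional
# seconds are dropped so that values with and without them compare consistently.
ts_to_int() {
  printf '%s\n' "$1" | cut -c1-19 | tr -cd '0-9'
}

# Succeed if the requested failover time ($1) is no earlier than the
# lower bound ($2, e.g. retained_time) and no later than the upper bound ($3).
in_failover_window() {
  fw_at=$(ts_to_int "$1")
  fw_lower=$(ts_to_int "$2")
  fw_upper=$(ts_to_int "$3")
  [ "$fw_at" -ge "$fw_lower" ] && [ "$fw_at" -le "$fw_upper" ]
}

# Print the JSON body for the PATCH request. With no argument, failover_at
# is omitted, which requests failover to the latest consistent time.
build_failover_payload() {
  if [ -n "${1:-}" ]; then
    printf '{"status": "FAILING_OVER", "failover_at": "%s"}' "$1"
  else
    printf '{"status": "FAILING_OVER"}'
  fi
}
```

For example, `build_failover_payload "2025-01-13T19:35:14.472670Z"` prints the same JSON body that the timestamped `PATCH` request in Step 4 passes to `--data`, and `in_failover_window` can be run first against the `retained_time` and `replicated_time` values returned by the `GET` request in Step 3.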
diff --git a/src/current/v25.1/cockroachdb-feature-availability.md b/src/current/v25.1/cockroachdb-feature-availability.md
@@ -47,6 +47,10 @@ Any feature made available in a phase prior to GA is provided without any warran
 **The following features are in preview** and are subject to change. To share feedback and/or issues, contact [Support](https://support.cockroachlabs.com/hc).
 {{site.data.alerts.end}}
 
+### Physical cluster replication (PCR) on CockroachDB Advanced
+
+[PCR on CockroachDB {{ site.data.products.advanced }}]({% link cockroachcloud/physical-cluster-replication.md %}) is in preview. PCR continuously sends all data at the byte level from a primary cluster to an independent standby cluster. Existing data and ongoing changes on the active primary cluster, which is serving application data, replicate asynchronously to the passive standby cluster.
+
 ### Triggers
 
 [Triggers]({% link {{ page.version.version }}/triggers.md %}) are in Preview. A trigger executes a function when one or more specified SQL operations is performed on a table. Triggers respond to data changes by adding logic within the database, rather than in an application. They can be used to modify data before it is inserted, maintain data consistency across rows or tables, or record an update to a row.
@@ -92,6 +92,12 @@
                     ]
                   }
                 ]
+              },
+              {
+                "title": "Physical Cluster Replication on Cloud clusters",
+                "urls": [
+                  "/cockroachcloud/physical-cluster-replication.html"
+                ]
               }
             ]
           }