diff --git a/src/current/_includes/v24.3/sidebar-data/cloud-deployments.json b/src/current/_includes/v24.3/sidebar-data/cloud-deployments.json index 5615c443dcb..c69eb911d63 100644 --- a/src/current/_includes/v24.3/sidebar-data/cloud-deployments.json +++ b/src/current/_includes/v24.3/sidebar-data/cloud-deployments.json @@ -551,6 +551,12 @@ } ] }, + { + "title": "Physical Cluster Replication", + "urls": [ + "/cockroachcloud/physical-cluster-replication.html" + ] + }, { "title": "Billing Management", "urls": [ diff --git a/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json b/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json index 925e3e2aca9..3057bb6e4a8 100644 --- a/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json +++ b/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json @@ -92,6 +92,12 @@ ] } ] + }, + { + "title": "Physical Cluster Replication on Cloud clusters", + "urls": [ + "/cockroachcloud/physical-cluster-replication.html" + ] } ] } diff --git a/src/current/cockroachcloud/physical-cluster-replication.md b/src/current/cockroachcloud/physical-cluster-replication.md new file mode 100644 index 00000000000..acda9c86358 --- /dev/null +++ b/src/current/cockroachcloud/physical-cluster-replication.md @@ -0,0 +1,236 @@ +--- +title: Physical Cluster Replication +summary: Set up physical cluster replication (PCR) in a Cloud deployment. +toc: true +--- + +{{site.data.alerts.callout_info}} +{% include feature-phases/preview.md %} +{{site.data.alerts.end}} + +CockroachDB **physical cluster replication (PCR)** continuously sends all data at the byte level from a _primary_ cluster to an independent _standby_ cluster. Existing data and ongoing changes on the active primary cluster, which is serving application data, replicate asynchronously to the passive standby cluster. 
+ +In a disaster recovery scenario, you can [fail over](#step-4-fail-over-to-the-standby-cluster) from the unavailable primary cluster to the standby cluster. This will stop the replication stream, reset the standby cluster to a point in time where all ingested data is consistent, and mark the standby as ready to accept application traffic. + + +{% comment %}Add short use case / benefits here, or decide whether it's ok to link to the PCR overview page {% endcomment %} +{% comment %}Remember to add to the feature preview page {% endcomment %} + +## Set up PCR in CockroachDB {{ site.data.products.advanced }} + +In this guide, you'll use the {{ site.data.products.cloud }} API to set up PCR from a primary cluster to a standby cluster, monitor the PCR stream, and fail over from the primary to the standby cluster. + +{{site.data.alerts.callout_info}} +PCR is supported on CockroachDB {{ site.data.products.advanced }} and CockroachDB self-hosted clusters. For a guide to setting up PCR on CockroachDB self-hosted clusters, refer to the [Set Up Physical Cluster Replication]({% link {{ site.current_cloud_version }}/set-up-physical-cluster-replication.md %}) tutorial. +{{site.data.alerts.end}} + +### Before you begin + +You'll need the following: + +{% comment %}Add links{% endcomment %} +- [Cloud API Access]({% link cockroachcloud/managing-access.md %}#api-access). + + To set up and manage PCR on CockroachDB {{ site.data.products.advanced }} clusters, you'll use the `'https://cockroachlabs.cloud/api/v1/replication-streams'` endpoint. Access to the `replication-streams` endpoint requires a valid CockroachDB {{ site.data.products.cloud }} [service account]({% link cockroachcloud/managing-access.md %}#manage-service-accounts) with the correct permissions. 
+ + The following table describes the roles required for each `replication-streams` endpoint method: + + Method | Required roles | Description + -------+----------------+------------ + `POST` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator) | Create a PCR stream. + `GET` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator), [Cluster Operator]({% link cockroachcloud/authorization.md %}#cluster-operator), [Cluster Developer]({% link cockroachcloud/authorization.md %}#cluster-developer) | Retrieve information for the PCR stream. + `PATCH` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator) | Update the PCR stream to fail over. + + {{site.data.alerts.callout_success}} + We recommend creating service accounts with the [principle of least privilege](https://wikipedia.org/wiki/Principle_of_least_privilege), and giving each application that accesses the API its own service account and API key. This allows fine-grained access to the cluster and PCR streams. + {{site.data.alerts.end}} + +- Read the [Configuration behavior](#configuration-behavior) and [Known limitations](#known-limitations) sections. + +### Configuration behavior + +To set up PCR successfully: + +- Clusters must be in the same cloud (AWS, GCP, or Azure). +- Clusters must be single region (but it is possible to have multiple availability zones per cluster). +- The primary and standby cluster in AWS and Azure must be in different regions. +- The primary and standby cluster in GCP can be in the same region, but must not have overlapping CIDR ranges. +- Clusters can have different node topology and hardware configurations. For disaster recovery purposes (failover and redirecting application traffic to a standby), we recommend configuring the primary and standby clusters with similar hardware. + +### Step 1. 
Create the clusters + +To use PCR, it is necessary to set `support_physical_cluster_replication` to `true`, which indicates that a cluster should start using an architecture that supports PCR. For cluster cloud and region details, refer to [Configuration behavior](#configuration-behavior). + +1. Send a `POST` request to create the primary cluster: + + {% include_cached copy-clipboard.html %} + ~~~ shell + curl --location --request POST 'https://cockroachlabs.cloud/api/v1/clusters' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"name": "primary_cluster_name", "provider": "AWS", "spec": {"dedicated": {"cockroachVersion": "v24.3", "hardware": {"disk_iops": 0, "machine_spec": {"num_virtual_cpus": 4}, "storage_gib": 16}, "region_nodes": {"us-east-1": 3}, "support_physical_cluster_replication": true}}}' + ~~~ + + For details on the cluster specifications, refer to [Create a cluster]({% link cockroachcloud/cloud-api.md %}#create-a-cluster). Ensure that you also include: + - `api_secret_key`: your API secret key. + - `support_physical_cluster_replication` set to `true`. + +1. Send a `POST` request to create the standby cluster: + + {% include_cached copy-clipboard.html %} + ~~~ shell + curl --location --request POST 'https://cockroachlabs.cloud/api/v1/clusters' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"name": "standby_cluster_name", "provider": "AWS", "spec": {"dedicated": {"cockroachVersion": "v24.3", "hardware": {"disk_iops": 0, "machine_spec": {"num_virtual_cpus": 4}, "storage_gib": 16}, "region_nodes": {"us-east-2": 3}, "support_physical_cluster_replication": true}}}' + ~~~ + + If you're creating clusters in AWS or Azure, you must ensure the primary and standby clusters are in different regions. 
+ +{{site.data.alerts.callout_success}} +We recommend [enabling Prometheus metrics export]({% link cockroachcloud/export-metrics.md %}) on your cluster before starting a PCR stream. For details on metrics to track, refer to [Monitor the PCR stream](#step-3-monitor-the-pcr-stream). +{{site.data.alerts.end}} + +### Step 2. Start the PCR stream + +With the primary and standby clusters created, you can now start the PCR stream. + +{% comment %}To edit{% endcomment %} +It is possible to write to both clusters before starting PCR. However, we recommend keeping the standby empty (i.e., not writing to the standby) prior to starting PCR, because the standby cluster must be empty when PCR starts. Upon starting PCR, CockroachDB {{ site.data.products.cloud }} will take a full cluster backup of the standby, wipe the standby cluster, and start replication. + +With the Cloud API, run the following command to start the PCR stream. You can find the cluster IDs in the cluster creation output: + +{% include_cached copy-clipboard.html %} +~~~ shell +curl --location --request POST 'https://cockroachlabs.cloud/api/v1/replication-streams' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"sourceClusterId": "primary_cluster_id","targetClusterId": "standby_cluster_id"}' +~~~ + +Replace: + +- `api_secret_key` with your API secret key. +- `primary_cluster_id` with the cluster ID returned after creating the primary cluster. +- `standby_cluster_id` with the cluster ID returned after creating the standby cluster. + +Once you have started PCR, the CockroachDB {{ site.data.products.cloud }} standby cluster cannot accept reads or writes, so the Cloud Console and SQL shell will be unavailable until failover. 
+ +{% comment %}Add note to self-hosted docs that read from standby available there, but not in cloud{% endcomment %} + +~~~ json +{ + "id": "7487d7a6-868b-4c6f-aa60-cc306cc525fe", + "status": "STARTING", + "source_cluster_id": "ad1e8630-729a-40f3-87e4-9f72eb3347a0", + "target_cluster_id": "e64b56c3-09f2-42ee-9f5a-d9b13985a897", + "created_at": "2025-01-13T16:41:20.467781Z", + "retained_time": null, + "replicated_time": null, + "failover_at": null, + "activation_at": null +} +~~~ + +To start PCR between clusters, CockroachDB {{ site.data.products.cloud }} sets up VPC peering between clusters and validates the connectivity. As a result, it may take around 5 minutes to initialize the PCR job, during which the status will be `STARTING`. + +### Step 3. Monitor the PCR stream + +To monitor the current status of the PCR stream, send a `GET` request to the `/v1/replication-streams` endpoint along with the primary cluster, standby cluster, or the ID of the PCR stream: + +{% include_cached copy-clipboard.html %} +~~~ shell +curl --location --request GET "https://cockroachlabs.cloud/api/v1/replication-streams?cluster_id=e64b56c3-09f2-42ee-9f5a-d9b13985a897" --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' +~~~ + +This will return: + +~~~json +{ + "replication_streams": [ + { + "id": "7487d7a6-868b-4c6f-aa60-cc306cc525fe", + "status": "REPLICATING", + "source_cluster_id": "ad1e8630-729a-40f3-87e4-9f72eb3347a0", + "target_cluster_id": "e64b56c3-09f2-42ee-9f5a-d9b13985a897", + "created_at": "2025-01-13T16:41:20.467781Z", + "retained_time": "2025-01-13T19:35:14.472670Z", + "replicated_time": "2025-01-13T19:46:15Z", + "failover_at": null, + "activation_at": null + } + ] +} +~~~ + +- `id`: The ID of the PCR stream. +- `status`: The status of the PCR stream. For descriptions, refer to [Status](#status). +- `source_cluster_id`: The cluster ID of the primary cluster. +- `target_cluster_id`: The cluster ID of the standby cluster. 
+- `created_at`: The timestamp when the PCR stream was started. +- `retained_time`: The timestamp indicating the lower bound that the PCR stream can fail over to. The tracked replicated time and the advancing [protected timestamp]({% link {{ site.current_cloud_version }}/architecture/storage-layer.md %}#protected-timestamps) allow PCR to also track retained time. Therefore, the failover window for a PCR job falls between the retained time and the replicated time. +- `replicated_time`: The latest time at which the standby cluster has consistent data. +- `failover_at`: The requested timestamp for failover. If you used `"status":"FAILING_OVER"` to initiate the failover and omitted `failover_at`, the failover time will default to the latest consistent replicated time. For more details, refer to [Fail over to the standby cluster](#step-4-fail-over-to-the-standby-cluster). +- `activation_at`: The CockroachDB system time at which failover is finalized, which could be different from the time that failover was requested. This field returns a value once the PCR stream is in [`COMPLETED` status](#status). + + +{% comment %}Add statuses and properties (maybe link to api docs here){% endcomment %} + +#### Status + +Status | Description +-------+------------ +`STARTING` | The PCR stream is starting by setting up the VPC peering connection between clusters and validating the connectivity. +`REPLICATING` | The PCR stream will complete an initial scan and then continue the ongoing replication between the primary and standby clusters. +`FAILING_OVER` | The failover has been initiated from the primary to the standby cluster. +`COMPLETED` | The failover is complete and the standby cluster is now independent from the primary cluster. 
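
Because a stream stays in `STARTING` for several minutes before moving to `REPLICATING`, it can be convenient to poll the status from a script rather than re-running the `GET` request by hand. The following is a minimal sketch, not an official tool. It assumes `jq` is installed and that the API secret key is exported as `API_SECRET_KEY`; the function names, the `API_SECRET_KEY` variable name, the attempt count, and the polling interval are all illustrative choices rather than part of the Cloud API.

```shell
#!/bin/sh
# Sketch: poll a PCR stream until it reaches a target status.
# Assumptions (illustrative, not part of the Cloud API): the function names,
# the API_SECRET_KEY environment variable, and the default attempts/interval.

API_URL="https://cockroachlabs.cloud/api/v1/replication-streams"

# Print the status of the first PCR stream associated with a cluster ID,
# using the same GET request and response shape shown in this guide.
get_stream_status() {
  curl --silent --fail --location --request GET \
    "${API_URL}?cluster_id=$1" \
    --header "Authorization: Bearer ${API_SECRET_KEY}" \
    | jq -r '.replication_streams[0].status'
}

# wait_for_status CLUSTER_ID TARGET_STATUS [ATTEMPTS] [INTERVAL_SECONDS]
# Returns 0 once the stream reports TARGET_STATUS, or 1 after ATTEMPTS polls.
wait_for_status() {
  wfs_cluster_id="$1"
  wfs_target="$2"
  wfs_attempts="${3:-30}"
  wfs_interval="${4:-10}"
  wfs_i=1
  while [ "$wfs_i" -le "$wfs_attempts" ]; do
    # Treat a failed status lookup as UNKNOWN and keep polling.
    wfs_status="$(get_stream_status "$wfs_cluster_id")" || wfs_status="UNKNOWN"
    echo "attempt ${wfs_i}: status=${wfs_status}"
    if [ "$wfs_status" = "$wfs_target" ]; then
      return 0
    fi
    # Sleep between polls, but not after the final attempt.
    if [ "$wfs_i" -lt "$wfs_attempts" ]; then
      sleep "$wfs_interval"
    fi
    wfs_i=$((wfs_i + 1))
  done
  return 1
}
```

For example, after starting the stream you might run `wait_for_status "standby_cluster_id" REPLICATING`, and after requesting a failover, `wait_for_status "standby_cluster_id" COMPLETED`. Keeping the HTTP call in its own function makes it easy to swap in a different lookup (for instance, by stream ID) without touching the polling loop.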
+ +#### Metrics + +For continual monitoring of PCR, track the following metrics with [Prometheus]({% link cockroachcloud/export-metrics.md %}): + +- `physical_replication.logical_bytes`: The logical bytes (the sum of all keys and values) ingested by all PCR jobs. +- `physical_replication.sst_bytes`: The SST bytes (compressed) sent to the KV layer by all PCR jobs. +- `physical_replication.replicated_time_seconds`: The replicated time of the physical replication stream in seconds since the Unix epoch. + +### Step 4. Fail over to the standby cluster + +Failing over from the primary cluster to the standby cluster will stop the PCR stream, reset the standby cluster to a point in time where all ingested data is consistent, and mark the standby as ready to accept application traffic. You can schedule the failover to: + +- The latest consistent time. +- A time in the past, no earlier than the `retained_time`. +- A time up to one hour in the future. + +To specify a timestamp, send a `PATCH` request to the `/v1/replication-streams` endpoint along with the primary cluster, standby cluster, or the ID of the PCR stream. 
Include the `failover_at` timestamp and the `"status": "FAILING_OVER"` field: + +{% include_cached copy-clipboard.html %} +~~~ shell +curl --location --request PATCH "https://cockroachlabs.cloud/api/v1/replication-streams/7487d7a6-868b-4c6f-aa60-cc306cc525fe" --header "Authorization: Bearer api_secret_key" --header 'Content-Type:application/json' --data '{"status": "FAILING_OVER", "failover_at": "2025-01-13T19:35:14.472670Z"}' +~~~ + +To fail over to the latest consistent time, you only need to include `"status": "FAILING_OVER"` in your request with one of the cluster IDs or PCR stream ID: + +{% include_cached copy-clipboard.html %} +~~~ shell +curl --location --request PATCH "https://cockroachlabs.cloud/api/v1/replication-streams/7487d7a6-868b-4c6f-aa60-cc306cc525fe" --header "Authorization: Bearer api_secret_key" --header 'Content-Type:application/json' --data '{"status": "FAILING_OVER"}' +~~~ +~~~json +{ + "id": "f1ea3b5a-5b9b-4101-9db4-780734554b14", + "status": "FAILING_OVER", + "source_cluster_id": "560cc798-7881-4a5b-baf0-321354a42c19", + "target_cluster_id": "24a9d47c-418c-4abc-940b-d2f665d95715", + "created_at": "2025-01-23T16:54:15.926325Z", + "retained_time": null, + "replicated_time": null, + "failover_at": null, + "activation_at": null +} +~~~ + +After the failover is complete, both clusters can receive traffic and operate as separate clusters. It is necessary to redirect application traffic manually. + +{{site.data.alerts.callout_info}} +PCR is on the cluster level, which means that the job also replicates all system tables. Users that need to access the standby cluster after failover should use the user roles for the primary cluster, because the standby cluster is a copy of the primary cluster. PCR overwrites all previous system tables on the standby cluster. 
+{{site.data.alerts.end}} + +### Fail back to the primary cluster + +To fail back from the standby to the primary cluster, start another PCR stream with the standby cluster as the `sourceClusterId` and the original primary cluster as the `targetClusterId`. + +## Known limitations + +- PCR on CockroachDB {{ site.data.products.cloud }} clusters cannot be started with existing clusters if the cluster was created without the `"support_physical_cluster_replication": true` parameter. +- Failing back to an original primary cluster by replicating only the difference in data between the promoted standby and the original primary cluster is not supported. As a result, failing back to the original primary cluster involves starting a new PCR stream, including an initial scan. +- Reading from the standby cluster during PCR is not supported. diff --git a/src/current/v25.1/cockroachdb-feature-availability.md b/src/current/v25.1/cockroachdb-feature-availability.md index b35004d2613..3be3e6291bf 100644 --- a/src/current/v25.1/cockroachdb-feature-availability.md +++ b/src/current/v25.1/cockroachdb-feature-availability.md @@ -47,6 +47,10 @@ Any feature made available in a phase prior to GA is provided without any warran **The following features are in preview** and are subject to change. To share feedback and/or issues, contact [Support](https://support.cockroachlabs.com/hc). {{site.data.alerts.end}} +### Physical cluster replication (PCR) on CockroachDB Advanced + +[PCR on CockroachDB {{ site.data.products.advanced }}]({% link cockroachcloud/physical-cluster-replication.md %}) is in preview. PCR continuously sends all data at the byte level from a primary cluster to an independent standby cluster. Existing data and ongoing changes on the active primary cluster, which is serving application data, replicate asynchronously to the passive standby cluster. + ### Triggers [Triggers]({% link {{ page.version.version }}/triggers.md %}) are in Preview. 
A trigger executes a function when one or more specified SQL operations is performed on a table. Triggers respond to data changes by adding logic within the database, rather than in an application. They can be used to modify data before it is inserted, maintain data consistency across rows or tables, or record an update to a row. diff --git a/src/current/v25.1/physical-cluster-replication-overview.md b/src/current/v25.1/physical-cluster-replication-overview.md index 28ebd1a0f89..e543a0a79c2 100644 --- a/src/current/v25.1/physical-cluster-replication-overview.md +++ b/src/current/v25.1/physical-cluster-replication-overview.md @@ -30,12 +30,70 @@ You can use PCR in a disaster recovery plan to: ## Features -- **Asynchronous byte-level replication**: When you initiate a replication stream, it will replicate byte-for-byte all of the primary cluster's existing user data and associated metadata to the standby cluster asynchronously. From then on, it will continuously replicate the primary cluster's data and metadata to the standby cluster. PCR will automatically replicate changes related to operations such as [schema changes]({% link {{ page.version.version }}/online-schema-changes.md %}), user and [privilege]({% link {{ page.version.version }}/security-reference/authorization.md %}#managing-privileges) modifications, and [zone configuration]({% link {{ page.version.version }}/show-zone-configurations.md %}) updates without any manual work. -- **Transactional consistency**: You can fail over to the standby cluster at the [`LATEST` timestamp]({% link {{ page.version.version }}/failover-replication.md %}#fail-over-to-the-most-recent-replicated-time) or a [point of time]({% link {{ page.version.version }}/failover-replication.md %}#fail-over-to-a-point-in-time) in the past or the future. When the failover process completes, the standby cluster will be in a transactionally consistent state as of the point in time you specified. 
-- **Maintained/improved RPO and RTO**: Depending on workload and deployment configuration, [replication lag]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) between the primary and standby is generally in the tens-of-seconds range. The failover process from the primary cluster to the standby should typically happen within five minutes when completing a failover to the latest replicated time using `LATEST`. -- **Failover to a timestamp in the past or the future**: In the case of logical disasters or mistakes, you can [fail over]({% link {{ page.version.version }}/failover-replication.md %}) from the primary to the standby cluster to a timestamp in the past. This means that you can return the standby to a timestamp before the mistake was replicated to the standby. You can also configure the [`WITH RETENTION`]({% link {{ page.version.version }}/alter-virtual-cluster.md %}#set-a-retention-window) option to control how far in the past you can fail over to. Furthermore, you can plan a failover by specifying a timestamp in the future. -- **Read from standby cluster**: You can configure PCR to allow read queries on the standby cluster. For more details, refer to [Start a PCR stream with read from standby]({% link {{ page.version.version }}/create-virtual-cluster.md %}#start-a-pcr-stream-with-read-from-standby). -- **Monitoring**: To monitor the replication's initial progress, current status, and performance, you can use metrics available in the [DB Console]({% link {{ page.version.version }}/ui-overview.md %}) and [Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}). For more details, refer to [Physical Cluster Replication Monitoring]({% link {{ page.version.version }}/physical-cluster-replication-monitoring.md %}). +PCR is available on CockroachDB self-hosted and CockroachDB Cloud {{ site.data.products.advanced }} clusters. 
Review the following table for differences in feature availability:
+
+<table>
+  <thead>
+    <tr>
+      <th></th>
+      <th>{{ site.data.products.advanced }} clusters</th>
+      <th>Self-hosted clusters</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Feature phase availability</td>
+      <td>Preview</td>
+      <td>Generally available</td>
+    </tr>
+    <tr>
+      <td>Asynchronous byte-level replication</td>
+      <td colspan="2">When you initiate a replication stream, it will replicate byte-for-byte all of the primary cluster's existing user data and associated metadata to the standby cluster asynchronously. From then on, it will continuously replicate the primary cluster's data and metadata to the standby cluster. PCR will automatically replicate changes related to operations such as schema changes, user and privilege modifications, and zone configuration updates without any manual work.</td>
+    </tr>
+    <tr>
+      <td>Transactional consistency</td>
+      <td>Use the Cloud API to fail over to the standby cluster at the latest consistent timestamp or a point of time in the past or the future. When the failover process completes, the standby cluster will be in a transactionally consistent state as of the specified timestamp.</td>
+      <td>Use the LATEST timestamp or a point of time in the past or the future to fail over to the standby cluster. When the failover process completes, the standby cluster will be in a transactionally consistent state as of the specified timestamp.</td>
+    </tr>
+    <tr>
+      <td>Maintained/improved RPO and RTO</td>
+      <td colspan="2">Depending on workload and deployment configuration, replication lag between the primary and standby is generally in the tens-of-seconds range. The failover process from the primary cluster to the standby should typically happen within five minutes when completing a failover to the latest replicated time.</td>
+    </tr>
+    <tr>
+      <td>Read from standby cluster</td>
+      <td>Not supported</td>
+      <td>Configure PCR to allow read queries on the standby cluster. For more details, refer to Start a PCR stream with read from standby.</td>
+    </tr>
+    <tr>
+      <td>Fail over to a timestamp in the past or the future</td>
+      <td>
+        <ul>
+          <li>Fail over from the primary to the standby cluster to a timestamp in the past, so that you can return the standby to a timestamp before a logical mistake was replicated to the standby.</li>
+          <li>Plan a failover by specifying a timestamp up to one hour in the future.</li>
+        </ul>
+      </td>
+      <td>
+        <ul>
+          <li>Fail over from the primary to the standby cluster to a timestamp in the past, so that you can return the standby to a timestamp before a logical mistake was replicated to the standby.</li>
+          <li>Plan a failover by specifying a timestamp in the future.</li>
+          <li>Configure the WITH RETENTION option to control how far in the past you can fail over to.</li>
+        </ul>
+      </td>
+    </tr>
+    <tr>
+      <td>Monitoring</td>
+      <td>Use metrics available in Prometheus and the status with the Cloud API. For more details, refer to Monitor the PCR stream.</td>
+      <td>To monitor the stream's initial progress, current status, and performance, use metrics available in the DB Console and Prometheus. For more details, refer to Physical Cluster Replication Monitoring.</td>
+    </tr>
+  </tbody>
+</table>
+
{{site.data.alerts.callout_info}} [Failing over to a timestamp in the past]({% link {{ page.version.version }}/failover-replication.md %}#fail-over-to-a-point-in-time) involves reverting data on the standby cluster. As a result, this type of failover takes longer to complete than failover to the [latest replicated time]({% link {{ page.version.version }}/failover-replication.md %}#fail-over-to-the-most-recent-replicated-time). The increase in failover time will correlate to how much data you are reverting from the standby. For more detail, refer to the [Technical Overview]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) page for PCR.