Add docs for PCR on Cloud in Preview #19320

Closed
wants to merge 1 commit into from

@kathancox kathancox commented Jan 24, 2025

Fixes DOC-10050

PR adds docs for PCR on Advanced clusters in Preview phase.

  • Adds a setup tutorial page with short technical reference to cover "replication lag" and "retained time" (without the VC mentions in the self-hosted docs).
  • Updates the self-hosted PCR docs to call out Advanced availability and to direct Cloud users to the correct docs.
  • Updates the general PCR overview page, turning the Features section into a comparison table of the features supported in Advanced vs. self-hosted.
  • Adds PCR on Advanced in the Preview section of the feature availability page.
  • Includes the new PCR on Advanced page under the Cross-cluster Replication nav item, and also under Cloud Deployments.

Rendered Preview

Cloud PCR page
General PCR Overview page updates

netlify bot commented Jan 24, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit 8ed31f0
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-api-docs/deploys/6797e4c54840f20008c7b12e

netlify bot commented Jan 24, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit 8ed31f0
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/6797e4c504140c0008ab998b

netlify bot commented Jan 24, 2025

Netlify Preview

Name Link
🔨 Latest commit 8ed31f0
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/6797e4c534d2c80008313be4
😎 Deploy Preview https://deploy-preview-19320--cockroachdb-docs.netlify.app

@kathancox kathancox force-pushed the test-pcr-cloud-location branch from fd5db34 to ac6fef7 Compare January 27, 2025 19:36
@kathancox kathancox force-pushed the test-pcr-cloud-location branch from ac6fef7 to 8ed31f0 Compare January 27, 2025 19:55
@kathancox kathancox marked this pull request as ready for review January 27, 2025 19:57

@alicia-l2 alicia-l2 left a comment

overall LGTM, had some comments

@@ -551,6 +551,12 @@
}
]
},
{
"title": "Physical Cluster Replication",

do we want to add, "on CockroachDB Cloud?"


The following describes the required roles for the `replication-streams` endpoint methods:

Method | Required roles | Description

this is super clear! thanks :)

It's worth noting that the POST requires the role on both clusters. The GET and PATCH only require permissions on one of the two clusters.
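The rule in this comment can be written down as a small check. This is only an illustrative sketch: the function name and the two boolean parameters are hypothetical, and it models nothing beyond what the comment states (POST needs the role on both clusters, GET and PATCH need it on at least one).

```python
def can_call_replication_streams(method: str,
                                 role_on_primary: bool,
                                 role_on_standby: bool) -> bool:
    """Return True if the caller's roles satisfy the rule quoted above.

    Hypothetical helper: the booleans stand in for "has the required role
    on this cluster"; the actual role names are not modeled here.
    """
    method = method.upper()
    if method == "POST":
        # Starting a stream touches both clusters, so both roles are needed.
        return role_on_primary and role_on_standby
    if method in ("GET", "PATCH"):
        # Reading or updating a stream needs the role on either cluster.
        return role_on_primary or role_on_standby
    raise ValueError(f"unsupported method: {method}")
```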

- Clusters must be in the same cloud (AWS, GCP, or Azure).
- Clusters must be single [region]({% link cockroachcloud/regions.md %}) (multiple availability zones per cluster are supported).
- The primary and standby cluster in AWS and Azure must be in different regions.
- The primary and standby cluster in GCP can be in the same region, but must not have overlapping CIDR ranges.
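The placement rules in this list can be checked mechanically before creating clusters. The following is a minimal sketch, not part of the docs under review: `ClusterPlacement` and `placement_problems` are hypothetical names, the CIDR values in the usage note are made up, and the overlap test relies on Python's standard `ipaddress` module.

```python
from dataclasses import dataclass
import ipaddress
from typing import List, Optional


@dataclass
class ClusterPlacement:
    provider: str                      # "AWS", "GCP", or "AZURE"
    region: str                        # single region per cluster
    cidr_range: Optional[str] = None   # only consulted for GCP


def placement_problems(primary: ClusterPlacement,
                       standby: ClusterPlacement) -> List[str]:
    """Return the listed rules that a primary/standby pair violates."""
    problems = []
    p, s = primary.provider.upper(), standby.provider.upper()
    if p != s:
        problems.append("clusters must be in the same cloud")
        return problems
    if p in ("AWS", "AZURE"):
        if primary.region == standby.region:
            problems.append("primary and standby must be in different regions")
    elif p == "GCP":
        if primary.cidr_range is None or standby.cidr_range is None:
            problems.append("CIDR ranges are needed to check for overlap")
        else:
            a = ipaddress.ip_network(primary.cidr_range)
            b = ipaddress.ip_network(standby.cidr_range)
            if a.overlaps(b):
                problems.append("CIDR ranges must not overlap")
    else:
        problems.append(f"unsupported provider: {primary.provider}")
    return problems
```

For example, two GCP clusters in the same region with ranges `172.28.0.0/16` and `172.29.0.0/16` produce an empty list, while two AWS clusters that share a region produce one problem.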

@davidwding is the CIDR range part of the public cloud API? I feel like it was an undocumented field last I checked.

I see it on the public API. See spec.dedicated.cidr_range under the "schema" tab: https://www.cockroachlabs.com/docs/api/cloud/v1#post-/api/v1/clusters

- Clusters must be single [region]({% link cockroachcloud/regions.md %}) (multiple availability zones per cluster are supported).
- The primary and standby cluster in AWS and Azure must be in different regions.
- The primary and standby cluster in GCP can be in the same region, but must not have overlapping CIDR ranges.
- Clusters can have different [node topology]({% link cockroachcloud/plan-your-cluster-advanced.md %}#cluster-topology) and [hardware configurations]({% link cockroachcloud/plan-your-cluster-advanced.md %}#cluster-sizing-and-scaling). For disaster recovery purposes (failover and redirecting application traffic to a standby), we recommend configuring the primary and standby clusters with similar hardware.

For disaster recovery purposes (failover and redirecting application traffic to a standby), we recommend configuring the primary and standby clusters with similar hardware.

Instead of calling out the DR specific purpose, can we say something along the lines of "To avoid hitting performance constraints, we recommend configuring the primary and standby clusters with similar node topology and hardware" ?


### Step 1. Create the clusters

To use PCR, you must create the **standby** cluster with the `support_physical_cluster_replication` field set to `true`, which indicates that the cluster should use an architecture that supports PCR. For details on supported cloud provider and region setups, refer to [Configuration](#configuration).

Do we want to call out that it's highly recommended to start your primary cluster with the `support_physical_cluster_replication` field set to `true`, but that you can still start PCR from an existing cluster if you must?

We should also note that an existing cluster that was started without the support_physical_cluster_replication flag can be the source of a PCR stream, but never the target.

curl --location --request POST 'https://cockroachlabs.cloud/api/v1/clusters' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"name": "standby_cluster_name", "provider": "AWS", "spec": {"dedicated": {"cockroachVersion": "v24.3", "hardware": {"disk_iops": 0, "machine_spec": {"num_virtual_cpus": 4}, "storage_gib": 16}, "region_nodes": {"us-east-2": 3}, "support_physical_cluster_replication": true}}}'
~~~
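The request body in the command above is easier to read and adapt when built as a structure. This sketch reproduces only the fields that appear in that curl example; the function name and its parameters are hypothetical, the defaults copy the example values, and nothing is sent over the network.

```python
import json


def standby_cluster_payload(name: str, provider: str, region: str,
                            node_count: int, vcpus: int = 4,
                            storage_gib: int = 16,
                            version: str = "v24.3") -> dict:
    """Build the cluster-creation body shown in the curl example."""
    return {
        "name": name,
        "provider": provider,
        "spec": {
            "dedicated": {
                "cockroachVersion": version,
                "hardware": {
                    "disk_iops": 0,
                    "machine_spec": {"num_virtual_cpus": vcpus},
                    "storage_gib": storage_gib,
                },
                "region_nodes": {region: node_count},
                # The field that marks the cluster as PCR-capable.
                "support_physical_cluster_replication": True,
            }
        },
    }


payload = standby_cluster_payload("standby_cluster_name", "AWS", "us-east-2", 3)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to `--data` in place of the inline string.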

If you're creating clusters in AWS or Azure, you must start the primary and standby clusters in different regions.

If we're going to say this, we should also call out the GCP overlapping CIDR range thing

### Step 2. Start the PCR stream

{{site.data.alerts.callout_info}}
The standby cluster must be empty when PCR starts. Although it is possible to write to both clusters before initiating the PCR stream, we recommend keeping the standby empty, that is, not writing to it before starting PCR. When you initiate the PCR stream, CockroachDB {{ site.data.products.cloud }} will take a full cluster backup of the standby cluster, delete all data from the standby, and then start the PCR stream.

Should we start this section with: "We recommend starting with an empty standby cluster when starting PCR. When you start the PCR stream....then start the PCR stream." and then add "this is to ensure that the standby will be fully consistent with the primary during PCR"

curl --location --request PATCH "https://cockroachlabs.cloud/api/v1/replication-streams/7487d7a6-868b-4c6f-aa60-cc306cc525fe" --header "Authorization: Bearer api_secret_key" --header 'Content-Type:application/json' --data '{"status": "FAILING_OVER", "failover_at": "2025-01-13T19:35:14.472670Z"}'
~~~
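The two failover request bodies this hunk describes (one with an explicit `failover_at` timestamp, one with only the status) differ in a single field. A minimal sketch of building either body follows; `failover_body` is a hypothetical helper, and the timestamp format is inferred from the example value above rather than from an API specification.

```python
from datetime import datetime, timezone
from typing import Optional


def failover_body(failover_at: Optional[datetime] = None) -> dict:
    """Build the PATCH body for a failover request.

    With no timestamp, only the status is sent, which the hunk describes
    as failing over to the latest consistent time.
    """
    body = {"status": "FAILING_OVER"}
    if failover_at is not None:
        if failover_at.tzinfo is None:
            raise ValueError("failover_at must be timezone-aware")
        # Normalize to UTC and match the format of the example timestamp.
        utc = failover_at.astimezone(timezone.utc)
        body["failover_at"] = utc.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return body
```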

To fail over to the latest consistent time, you only need to include `"status": "FAILING_OVER"` in your request, along with either one of the cluster IDs or the PCR stream ID:

small nit- can we swap the order of this to match the above bullet points?
i.e., latest time first, then pick a time second

After the failover is complete, both clusters can receive traffic and operate as separate clusters. It is necessary to redirect application traffic manually.

{{site.data.alerts.callout_info}}
PCR is on the cluster level, which means that the job also replicates all system tables. Users that need to access the standby cluster after failover should use the user roles for the primary cluster, because the standby cluster is a copy of the primary cluster. PCR overwrites all previous system tables on the standby cluster.

PCR replicates on the cluster level?


The tracked replicated time and the advancing protected timestamp provide the PCR job with enough information to track _retained time_, which is a timestamp in the past indicating the lower bound that the PCR stream could fail over to. Therefore, the _failover window_ for a PCR stream falls between the retained time and the replicated time.

<img src="{{ 'images/v25.1/failover.svg' | relative_url }}" alt="Timeline showing how the failover window is between the retained time and replicated time." style="border:0px solid #eee;width:100%" />
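The failover window described here is an interval check. The sketch below assumes both bounds are inclusive, which the text does not state, and `in_failover_window` is a hypothetical name used only for illustration.

```python
from datetime import datetime, timezone


def in_failover_window(candidate: datetime,
                       retained_time: datetime,
                       replicated_time: datetime) -> bool:
    """Check whether a requested failover time falls inside the window.

    Assumption: the window is closed on both ends, from the retained time
    (lower bound) to the replicated time (upper bound).
    """
    if retained_time > replicated_time:
        raise ValueError("retained time cannot be after replicated time")
    return retained_time <= candidate <= replicated_time
```

A timestamp earlier than the retained time or later than the replicated time is outside the window, so it would not be a valid failover target under this reading.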

i like this a lot! Would it also make sense to add it to the self-hosted PCR technical overview page?


### Fail back to the primary cluster

To fail back from the standby to the primary cluster, start another PCR stream with the standby cluster as the `sourceClusterId` and the original primary cluster as the `targetClusterId`.
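Failback as described here amounts to swapping the two cluster roles in a new stream request. The sketch below uses only the `sourceClusterId` and `targetClusterId` field names from the sentence above; the helper name and the placeholder IDs are hypothetical, and whether the request accepts other fields is not covered.

```python
def failback_stream_body(original_primary_id: str,
                         original_standby_id: str) -> dict:
    """Build a stream-creation body with the cluster roles reversed."""
    return {
        # The former standby now holds the current data, so it is the source.
        "sourceClusterId": original_standby_id,
        # The original primary receives the replicated data.
        "targetClusterId": original_primary_id,
    }


body = failback_stream_body("primary-cluster-id", "standby-cluster-id")
```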

Is it worth noting that we attempt to perform fast failback if possible, and fall back to regular failback if not?

@kathancox
Contributor Author

Closing this PR as we're pushing back the launch of this feature.

@kathancox kathancox closed this Feb 5, 2025
3 participants