Add docs for PCR on Cloud in Preview #19320
overall LGTM, had some comments
@@ -551,6 +551,12 @@
}
]
},
{
"title": "Physical Cluster Replication",
Do we want to add "on CockroachDB Cloud"?
The following describes the required roles for the `replication-streams` endpoint methods:

Method | Required roles | Description
this is super clear! thanks :)
It's worth noting that the POST requires the role on both clusters. The GET and PATCH only require permissions on one of the two clusters.
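The access rule in this comment can be sketched in a few lines of Python. This is only an illustration: `caller_is_authorized` is a hypothetical helper, and the two booleans stand in for "the caller holds the required role on that cluster". The actual role names belong to the table under review and are not reproduced here.

```python
def caller_is_authorized(method: str, role_on_primary: bool, role_on_standby: bool) -> bool:
    """Hypothetical check mirroring the review comment, not the Cloud API's own logic."""
    method = method.upper()
    if method == "POST":
        # Starting a stream needs the role on both clusters.
        return role_on_primary and role_on_standby
    if method in ("GET", "PATCH"):
        # Reading or updating a stream needs the role on only one of the two clusters.
        return role_on_primary or role_on_standby
    raise ValueError(f"unsupported method: {method}")
```

If the docs adopt this wording, the table could state the both-clusters requirement on the POST row directly.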
- Clusters must be in the same cloud (AWS, GCP, or Azure).
- Clusters must be single [region]({% link cockroachcloud/regions.md %}) (multiple availability zones per cluster is supported).
- The primary and standby cluster in AWS and Azure must be in different regions.
- The primary and standby cluster in GCP can be in the same region, but must not have overlapping CIDR ranges.
@davidwding is the CIDR range part of the public cloud API? I feel like it was an undocumented field last I checked.
I see it on the public API. See `spec.dedicated.cidr_range` under the "schema" tab: https://www.cockroachlabs.com/docs/api/cloud/v1#post-/api/v1/clusters
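For readers who want to verify the GCP constraint before creating clusters, a minimal sketch using Python's standard `ipaddress` module follows. The function name and the example ranges in the usage note are hypothetical; the only thing taken from this thread is that each cluster has a CIDR range and that the two ranges must not overlap.

```python
import ipaddress


def cidr_ranges_overlap(primary_cidr: str, standby_cidr: str) -> bool:
    """Return True if the two CIDR ranges share at least one address."""
    # ip_network raises ValueError for malformed input or set host bits.
    primary = ipaddress.ip_network(primary_cidr)
    standby = ipaddress.ip_network(standby_cidr)
    return primary.overlaps(standby)
```

For example, `cidr_ranges_overlap("172.28.0.0/14", "172.30.0.0/16")` reports an overlap because the second range sits inside the first, while `"172.28.0.0/14"` and `"172.32.0.0/14"` are disjoint.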
- Clusters must be single [region]({% link cockroachcloud/regions.md %}) (multiple availability zones per cluster is supported).
- The primary and standby cluster in AWS and Azure must be in different regions.
- The primary and standby cluster in GCP can be in the same region, but must not have overlapping CIDR ranges.
- Clusters can have different [node topology]({% link cockroachcloud/plan-your-cluster-advanced.md %}#cluster-topology) and [hardware configurations]({% link cockroachcloud/plan-your-cluster-advanced.md %}#cluster-sizing-and-scaling). For disaster recovery purposes (failover and redirecting application traffic to a standby), we recommend configuring the primary and standby clusters with similar hardware.
For disaster recovery purposes (failover and redirecting application traffic to a standby), we recommend configuring the primary and standby clusters with similar hardware.
Instead of calling out the DR-specific purpose, can we say something along the lines of "To avoid hitting performance constraints, we recommend configuring the primary and standby clusters with similar node topology and hardware"?
### Step 1. Create the clusters

To use PCR, it is necessary to set the **standby** cluster with the `support_physical_cluster_replication` field to `true`, which indicates that a cluster should start using an architecture that supports PCR. For details on supported cluster cloud provider and region setup, refer to [Configuration](#configuration).
Do we want to call out that it's highly recommended that you start your primary cluster with the `support_physical_cluster_replication` field set to `true`, but that you can still start PCR from an existing cluster if you must?
We should also note that an existing cluster that was started without the `support_physical_cluster_replication` flag can be the source of a PCR stream, but never the target.
curl --location --request POST 'https://cockroachlabs.cloud/api/v1/clusters' --header "Authorization: Bearer api_secret_key" --header 'Content-Type: application/json' --data '{"name": "standby_cluster_name", "provider": "AWS", "spec": {"dedicated": {"cockroachVersion": "v24.3", "hardware": {"disk_iops": 0, "machine_spec": {"num_virtual_cpus": 4}, "storage_gib": 16}, "region_nodes": {"us-east-2": 3}, "support_physical_cluster_replication": true}}}'
~~~

If you're creating clusters in AWS or Azure, you must start the primary and standby clusters in different regions.
If we're going to say this, we should also call out the GCP overlapping CIDR range thing
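To make the provider-specific region rules in this thread concrete, here is a rough Python sketch of a pre-flight check. `ClusterPlan`, `pcr_setup_problems`, and the provider string `"AZURE"` are assumptions made for illustration; only the rules themselves (same cloud, single region, different regions on AWS and Azure, same region allowed on GCP) come from the docs under review. The GCP CIDR-overlap rule is deliberately left out of this function.

```python
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    provider: str        # e.g. "AWS", "GCP", or "AZURE" (spelling assumed)
    regions: list[str]   # regions the cluster would be created in


def pcr_setup_problems(primary: ClusterPlan, standby: ClusterPlan) -> list[str]:
    """Return human-readable rule violations; an empty list means none were found."""
    problems = []
    if primary.provider != standby.provider:
        problems.append("clusters must be in the same cloud")
    for label, cluster in (("primary", primary), ("standby", standby)):
        if len(cluster.regions) != 1:
            problems.append(f"{label} cluster must be single region")
    if problems:
        return problems  # the region comparison below needs the rules above to hold
    if primary.provider in ("AWS", "AZURE") and primary.regions == standby.regions:
        problems.append("primary and standby must be in different regions on AWS and Azure")
    return problems
```

A GCP pair in the same region passes this check, which is why the CIDR caveat needs to be called out alongside the AWS and Azure sentence.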
### Step 2. Start the PCR stream

{{site.data.alerts.callout_info}}
The standby cluster must be empty upon starting PCR. It is possible to write to both clusters before initiating the PCR stream, however, we recommend keeping the standby empty. That is, not writing to the standby prior to starting PCR. When you initiate the PCR stream, CockroachDB {{ site.data.products.cloud }} will take a full cluster backup of the standby cluster, delete all data from the standby, and then start the PCR stream.
Should we start this section with: "We recommend starting with an empty standby cluster when starting PCR. When you start the PCR stream....then start the PCR stream." and then add "this is to ensure that the standby will be fully consistent with the primary during PCR"
curl --location --request PATCH "https://cockroachlabs.cloud/api/v1/replication-streams/7487d7a6-868b-4c6f-aa60-cc306cc525fe" --header "Authorization: Bearer api_secret_key" --header 'Content-Type:application/json' --data '{"status": "FAILING_OVER", "failover_at": "2025-01-13T19:35:14.472670Z"}'
~~~

To fail over to the latest consistent time, you only need to include `"status": "FAILING_OVER"` in your request with one of the cluster IDs or PCR stream ID:
Small nit: can we swap the order of this to match the above bullet points?
I.e., latest time first, then pick a time second.
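The two failover variants differ only in whether `failover_at` is present in the PATCH body. A small Python sketch of building that body, in the order suggested here (latest consistent time first, specific time second), might look like the following. The helper name is hypothetical; the field names and timestamp format are copied from the curl example in the diff.

```python
from datetime import datetime, timezone
from typing import Optional


def failover_request_body(failover_at: Optional[datetime] = None) -> dict:
    """Build the PATCH body; omit failover_at to fail over to the latest consistent time."""
    body = {"status": "FAILING_OVER"}
    if failover_at is not None:
        if failover_at.tzinfo is None:
            raise ValueError("failover_at must be timezone-aware")
        # Normalize to UTC and use the trailing-Z form shown in the docs example.
        stamp = failover_at.astimezone(timezone.utc).isoformat(timespec="microseconds")
        body["failover_at"] = stamp.replace("+00:00", "Z")
    return body
```

Calling `failover_request_body()` with no argument covers the latest-consistent-time case, and passing a timezone-aware `datetime` covers the point-in-time case.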
After the failover is complete, both clusters can receive traffic and operate as separate clusters. It is necessary to redirect application traffic manually.

{{site.data.alerts.callout_info}}
PCR is on the cluster level, which means that the job also replicates all system tables. Users that need to access the standby cluster after failover should use the user roles for the primary cluster, because the standby cluster is a copy of the primary cluster. PCR overwrites all previous system tables on the standby cluster.
PCR replicates on the cluster level?
The tracked replicated time and the advancing protected timestamp provide the PCR job with enough information to track _retained time_, which is a timestamp in the past indicating the lower bound that the PCR stream could fail over to. Therefore, the _failover window_ for a PCR stream falls between the retained time and the replicated time.

<img src="{{ 'images/v25.1/failover.svg' | relative_url }}" alt="Timeline showing how the failover window is between the retained time and replicated time." style="border:0px solid #eee;width:100%" />
I like this a lot! Would it also make sense to add it to the self-hosted PCR technical overview page?
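The failover window in the diagram reduces to a simple range check, sketched below in Python. The function name is hypothetical, and treating both bounds as inclusive is an assumption; the docs text only says the window falls between the retained time and the replicated time.

```python
from datetime import datetime, timezone


def in_failover_window(requested: datetime, retained_time: datetime, replicated_time: datetime) -> bool:
    """Return True if the requested failover time lies within the failover window.

    Assumes timezone-aware datetimes and inclusive bounds.
    """
    if retained_time > replicated_time:
        raise ValueError("retained_time must not be later than replicated_time")
    return retained_time <= requested <= replicated_time
```

A timestamp earlier than the retained time, or later than the replicated time, falls outside the window and would not be a valid failover target under this reading.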
### Fail back to the primary cluster

To fail back from the standby to the primary cluster, start another PCR stream with the standby cluster as the `sourceClusterId` and the original primary cluster as the `targetClusterId`.
Is it worth noting that we attempt to perform fast failback if possible, and fall back to regular failback if not?
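The failback step amounts to swapping the two roles of the original stream. A minimal Python sketch, assuming the request body is a flat object with the `sourceClusterId` and `targetClusterId` fields named in the diff (the exact body shape is not confirmed in this thread, and the cluster IDs in the usage note are placeholders):

```python
def failback_stream_request(original_stream: dict) -> dict:
    """Build a new stream request with the original source and target swapped."""
    return {
        # The former standby becomes the source of the failback stream.
        "sourceClusterId": original_stream["targetClusterId"],
        # The original primary becomes the target.
        "targetClusterId": original_stream["sourceClusterId"],
    }
```

For an original stream from `primary-id` to `standby-id`, this yields a request that replicates from `standby-id` back to `primary-id`. Whether the service then performs a fast or a regular failback is decided on its side and is not modeled here.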
Closing this PR as we're pushing back the launch of this feature.
Fixes DOC-10050
PR adds docs for PCR on Advanced clusters in Preview phase.
Cross-cluster Replication nav item, and also under Cloud Deployments.
Rendered Preview
- Cloud PCR page
- General PCR Overview page updates