LDR overview and monitoring pages #19085
Left some comments, LGTM
@@ -0,0 +1,8 @@
Field | Response
Is this missing the "description" field in LDR responses?
yes it is, I am adding. Thanks for catching.
## Recommended LDR metrics to track

- Replication latency: The commit-to-commit replication latency. A _commit_ is when the LDR job either adds a row to the [dead letter queue (DLQ)]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) or applies a row successfully to the destination cluster.
"This is tracked from when a row is committed on the source cluster, to when it is "committed" on the target cluster. A commit on the target cluster is when"....
could we add this sentence in after "The commit-to-commit replication latency"
For more details, refer to the LDR [Known limitations]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}#known-limitations).

When you run LDR in `immediate` mode, you cannot replicate a table with [SQL constraints]({% link {{ page.version.version }}/constraints.md %}). In `validated` mode, SQL constraints **must** match.
SQL constraints --> it should be "you cannot replicate a table with foreign key constraints". Other constraints are fine.
LGTM added some small comments. thank you!
Maintain [high availability]({% link {{ page.version.version }}/data-resilience.md %}#high-availability) with a two-datacenter topology. You can run bidirectional LDR to ensure [data resilience]({% link {{ page.version.version }}/data-resilience.md %}) in your deployment, particularly in datacenter or region failures. Both clusters can receive application reads and writes with low, single-region write latency. In a datacenter or cluster outage, you can redirect application traffic to the surviving cluster with [low downtime]({% link {{ page.version.version }}/data-resilience.md %}#high-availability). In the following diagram, the clusters are deployed in US East and West to provide low latency for that region. The two LDR jobs ensure that the tables on both clusters will reach eventual consistency.

<image src="{{ 'images/v24.3/east-west-region.svg' | relative_url }}" alt="Diagram showing bidirectional LDR from cluster A to B and back again from cluster B to A." style="width:50%" />
{% include_cached copy-clipboard.html %}
~~~ sql
CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name}
is this WITH option included in the setup page too?
I need to add this in a few places
looks great! left mostly nits
### Achieve high availability and single-region write latency in two-datacenter deployments

Maintain [high availability]({% link {{ page.version.version }}/data-resilience.md %}#high-availability) and resilience to region failures with a two-datacenter topology. You can run bidirectional LDR to ensure [data resilience]({% link {{ page.version.version }}/data-resilience.md %}) in your deployment, particularly in datacenter or region failures. If you set up two single-region clusters, in LDR, both clusters can receive application reads and writes with low, single-region write latency. Then, in a datacenter, region, or cluster outage, you can redirect application traffic to the surviving cluster with [low downtime]({% link {{ page.version.version }}/data-resilience.md %}#high-availability). In the following diagram, the two single-region clusters are deployed in US East and West to provide low latency for that region. The two LDR jobs ensure that the tables on both clusters will reach eventual consistency.
@alicia-l2 are we planning to publicly document the pros and cons of using our MR feature suite vs LDR for the first use case? as well as zone cfgs + execution locality backup/cdc vs ldr for the second use case?
We can link to blog posts describing this for the first use case.
For second use case, @kathancox maybe we can try to get more specific about hardware/cluster specific isolation?
maybe above both use cases we should add a note saying that we consider this a tool that is an alternative deployment option to our native Raft/MR architecture
Should we just link to the Data Resilience page? I think that covers some of this, particularly when it comes to MR feature comparison. I'm adding a note linking to that now.
Anything further in terms of execution locality + cdc comps, we should probably create another docs issue for that.
yeah, i think we should revisit how we publicly document the pros and cons of LDR vs PCR vs CRDB replication, but i don't think we need to block this PR.
OK yes, I think it's something that @alicia-l2 and I are working on gradually. We recently published this page https://www.cockroachlabs.com/docs/dev/data-resilience and created a new top-level section, so I think building out this area is a good idea.
### Responses

{% include {{ page.version.version }}/ldr/show-logical-replication-responses.md %}
ugh, i may have to backport some changes to this table. this is fine to merge as is.
OK
- `logical_replication.replicated_time_seconds`
- `logical_replication.events_ingested`
- `logical_replication.events_dlqed`
while we're here, you could add the .scanning_ranges and .lagging_ranges metrics. added on friday cockroachdb/cockroach@9eb6c8b
Thanks, I have added these to this label list
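As a rough illustration of tracking the metrics in this list, the sketch below filters Prometheus-style exposition text for the three metric names quoted above. It assumes the metrics are exposed with dots replaced by underscores, which is how CockroachDB's Prometheus endpoint usually renders metric names. The sample text, its values, and the function names are hypothetical and not taken from this PR.

```python
from typing import Dict, Iterable

# Metric names exactly as listed in the diff excerpt above.
LDR_METRICS = (
    "logical_replication.replicated_time_seconds",
    "logical_replication.events_ingested",
    "logical_replication.events_dlqed",
)

# Hypothetical sample of Prometheus text exposition output (values invented).
SAMPLE = """\
# HELP logical_replication_events_ingested (hypothetical help text)
# TYPE logical_replication_events_ingested counter
logical_replication_events_ingested{node_id="1"} 1200
logical_replication_events_dlqed{node_id="1"} 3
logical_replication_replicated_time_seconds{node_id="1"} 1.73e+09
sql_select_count{node_id="1"} 42
"""


def to_prometheus_name(metric: str) -> str:
    """Assumption: the endpoint replaces dots in metric names with underscores."""
    return metric.replace(".", "_")


def extract_ldr_metrics(
    exposition_text: str, metrics: Iterable[str] = LDR_METRICS
) -> Dict[str, float]:
    """Return {dotted metric name: value} for the requested metrics.

    If a metric appears on several lines (for example, one per label set),
    the last value seen wins; a real collector would keep the labels.
    """
    wanted = {to_prometheus_name(m): m for m in metrics}
    found: Dict[str, float] = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if name in wanted and rest:
            found[wanted[name]] = float(rest[0])
    return found


if __name__ == "__main__":
    for metric, value in sorted(extract_ldr_metrics(SAMPLE).items()):
        print(metric, value)
```

A growing `events_dlqed` count next to a stalled `replicated_time_seconds` would be the kind of signal worth alerting on, though thresholds depend on the workload.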
LGTM modulo that i strongly suggest if at all possible switching from scare quote "committed" to "applied" and then defining what "applied" means (see my comment) since COMMIT is already super duper in use and we should not be redefining it, it will only lead to tears IMO
@@ -0,0 +1 @@
There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level observability will be at the job level. One table in one LDR job will allow for table-level observability.
lol i was just checking if this was an include
`job_id` | The job's ID. Use with [`CANCEL JOB`]({% link {{ page.version.version }}/cancel-job.md %}), [`PAUSE JOB`]({% link {{ page.version.version }}/pause-job.md %}), [`RESUME JOB`]({% link {{ page.version.version }}/resume-job.md %}), [`SHOW JOB`]({% link {{ page.version.version }}/show-jobs.md %}).
`status` | Status of the job: `running`, `paused`, `canceled`. {% comment %}check these{% endcomment %}
`targets` | The fully qualified name of the table(s) that are part of the LDR job.
`replicated_time` | The latest timestamp at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. `replicated_time` is updated periodically (every 30s). {% comment %}To confirm this line is accurate{% endcomment %}
timestamp could link to our TIMESTAMP type docs
done
## Recommended LDR metrics to track

- Replication latency: The commit-to-commit replication latency, which is tracked from when a row is committed on the source cluster, to when it is "committed" on the destination cluster. A _commit_ is when the LDR job either adds a row to the [dead letter queue (DLQ)]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) or applies a row successfully to the destination cluster.
I don't like the scare quotes around "committed" and i don't think overloading the term commit is a good idea
COMMIT is already very well defined semantically in SQL world and in our docs
i'd actually reverse this and say "to when it is applied on the destination cluster" where "applied" means either:
- COMMIT to target table, OR
- inserted to DLQ (which is also a COMMIT)
pls ignore for now if there isn't time for this but i think it's a serious problem if we are sprinkling "committed" to mean "LDR's special notion of committed" around LDR docs when we really mean "applied" per the definition i'm using above which doesn't cause the same confusion IMO
Got it, so I updated here where I could to "apply". Ofc the metric name has "commit" in the syntax and the same is true for at least one changefeed metric.
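To make the "applied" definition from this thread concrete, here is a minimal sketch of replication latency measured from the source commit to the point where the row is applied, where "applied" means either a COMMIT to the target table or an insert into the DLQ. The class, field, and function names are hypothetical, and timestamps are plain seconds purely for illustration; this is not how LDR computes its metric internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicatedRow:
    """Hypothetical record of one replicated row (timestamps in seconds)."""

    source_commit_ts: float                 # row committed on the source cluster
    dest_commit_ts: Optional[float] = None  # COMMIT to the destination table, if any
    dlq_insert_ts: Optional[float] = None   # insert into the DLQ, if any


def applied_ts(row: ReplicatedRow) -> Optional[float]:
    """A row is 'applied' once it is committed to the target table OR sent to the DLQ."""
    if row.dest_commit_ts is not None:
        return row.dest_commit_ts
    return row.dlq_insert_ts


def replication_latency(row: ReplicatedRow) -> Optional[float]:
    """Latency from source commit to apply; None while the row is still in flight."""
    ts = applied_ts(row)
    return None if ts is None else ts - row.source_commit_ts


if __name__ == "__main__":
    print(replication_latency(ReplicatedRow(100.0, dest_commit_ts=100.5)))
    print(replication_latency(ReplicatedRow(100.0, dlq_insert_ts=102.0)))
    print(replication_latency(ReplicatedRow(100.0)))
```

Under this definition a row that lands in the DLQ still counts toward latency, which matches the reviewer's point that both outcomes end the measurement.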
### Jobs page

On the **Jobs** page, select:
same as above re: recommend linking to jobs page docs for we who live the memento lifestyle
Done!
TFTRs!
This PR adds the Monitoring page and General Overview page for the LDR section.
Preview
Overview page
Monitoring page
This PR is 2 of 4(?) for LDR docs. It contains:
- Overview: General description of LDR and its features/use cases
- Monitoring LDR: This is a page that describes what is available for monitoring and links out to the relevant docs material
Reviewers: You'll see a lot of comments for follow-up PRs to connect pages with links and such.
Content pages to come:
- CREATE/SHOW sql ref
- Technical Overview
- LDR Metrics Dashboard docs
PR 1: #19043