LDR overview and monitoring pages #19085

Merged
merged 11 commits into from
Nov 14, 2024

@kathancox kathancox commented Oct 31, 2024

This PR adds the Monitoring page and General Overview page for the LDR section.

Preview

Overview page
Monitoring page

This PR is 2 of 4(?) for LDR docs. It contains:

- Overview: General description of LDR and its features/use cases
- Monitoring LDR: This is a page that describes what is available for monitoring and links out to the relevant docs material

Reviewers: You'll see a lot of comments for follow-up PRs to connect pages with links and such.

Content pages to come:
- CREATE/SHOW sql ref
- Technical Overview
- LDR Metrics Dashboard docs

PR 1 #19043

netlify bot commented Oct 31, 2024

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name | Link
-----|-----
🔨 Latest commit | e869cad
🔍 Latest deploy log | https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/67364717bd88aa0008800852

netlify bot commented Oct 31, 2024

Deploy Preview for cockroachdb-api-docs canceled.

Name | Link
-----|-----
🔨 Latest commit | e869cad
🔍 Latest deploy log | https://app.netlify.com/sites/cockroachdb-api-docs/deploys/673647173e98c50008e0ff69

netlify bot commented Oct 31, 2024

Netlify Preview

Name | Link
-----|-----
🔨 Latest commit | e869cad
🔍 Latest deploy log | https://app.netlify.com/sites/cockroachdb-docs/deploys/673647171a3286000824eba9
😎 Deploy Preview | https://deploy-preview-19085--cockroachdb-docs.netlify.app

@kathancox kathancox force-pushed the ldr-overview-monitoring branch 3 times, most recently from cf4b5b4 to 62a9a70 Compare November 5, 2024 19:50
@kathancox kathancox changed the title WIP: LDR overview and monitoring pages LDR overview and monitoring pages Nov 5, 2024
@kathancox kathancox force-pushed the ldr-overview-monitoring branch from 62a9a70 to fdd7d51 Compare November 5, 2024 20:04

@alicia-l2 alicia-l2 left a comment

Left some comments, LGTM

@@ -0,0 +1,8 @@
Field | Response

Is this missing the "description" field in LDR responses?

Contributor Author

yes it is, I am adding. Thanks for catching.


## Recommended LDR metrics to track

- Replication latency: The commit-to-commit replication latency. A _commit_ is when the LDR job either adds a row to the [dead letter queue (DLQ)]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) or applies a row successfully to the destination cluster.

"This is tracked from when a row is committed on the source cluster, to when it is "committed" on the target cluster. A commit on the target cluster is when"....

could we add this sentence in after "The commit-to-commit replication latency"


For more details, refer to the LDR [Known limitations]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}#known-limitations).

When you run LDR in `immediate` mode, you cannot replicate a table with [SQL constraints]({% link {{ page.version.version }}/constraints.md %}). In `validated` mode, SQL constraints **must** match.

SQL constraints --> it should be "you cannot replicate a table with foreign key constraints". Other constraints are fine.

@kathancox kathancox force-pushed the ldr-overview-monitoring branch 2 times, most recently from 200f88e to fea64d0 Compare November 5, 2024 21:47
@kathancox kathancox marked this pull request as ready for review November 5, 2024 21:48
@kathancox kathancox marked this pull request as draft November 7, 2024 16:11
@kathancox kathancox force-pushed the ldr-overview-monitoring branch from fea64d0 to 31d37a6 Compare November 7, 2024 19:10
@kathancox kathancox marked this pull request as ready for review November 7, 2024 19:19

@alicia-l2 alicia-l2 left a comment

LGTM added some small comments. thank you!

Maintain [high availability]({% link {{ page.version.version }}/data-resilience.md %}#high-availability) with a two-datacenter topology. You can run bidirectional LDR to ensure [data resilience]({% link {{ page.version.version }}/data-resilience.md %}) in your deployment, particularly in datacenter or region failures. Both clusters can receive application reads and writes with low, single-region write latency. In a datacenter or cluster outage, you can redirect application traffic to the surviving cluster with [low downtime]({% link {{ page.version.version }}/data-resilience.md %}#high-availability). In the following diagram, the clusters are deployed in US East and West to provide low latency for that region. The two LDR jobs ensure that the tables on both clusters will reach eventual consistency.

<image src="{{ 'images/v24.3/east-west-region.svg' | relative_url }}" alt="Diagram showing bidirectional LDR from cluster A to B and back again from cluster B to A." style="width:50%" />

small nit -- on top of the arrows can we write LDR stream #1 and LDR stream #2


{% include_cached copy-clipboard.html %}
~~~ sql
CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name}

is this WITH option included in the setup page too?

Contributor Author

I need to add this in a few places

@kathancox kathancox force-pushed the ldr-overview-monitoring branch from 31d37a6 to 7c1ba99 Compare November 8, 2024 20:44

@msbutler msbutler left a comment

looks great! left mostly nits

### Achieve high availability and single-region write latency in two-datacenter deployments

Maintain [high availability]({% link {{ page.version.version }}/data-resilience.md %}#high-availability) and resilience to region failures with a two-datacenter topology. You can run bidirectional LDR to ensure [data resilience]({% link {{ page.version.version }}/data-resilience.md %}) in your deployment, particularly in datacenter or region failures. If you set up two single-region clusters, in LDR, both clusters can receive application reads and writes with low, single-region write latency. Then, in a datacenter, region, or cluster outage, you can redirect application traffic to the surviving cluster with [low downtime]({% link {{ page.version.version }}/data-resilience.md %}#high-availability). In the following diagram, the two single-region clusters are deployed in US East and West to provide low latency for that region. The two LDR jobs ensure that the tables on both clusters will reach eventual consistency.

@alicia-l2 are we planning to publicly document the pros and cons of using our MR feature suite vs LDR for the first use case? as well as zone cfgs + execution locality backup/cdc vs ldr for the second use case?

We can link to blog posts describing this for the first use case.
For second use case, @kathancox maybe we can try to get more specific about hardware/cluster specific isolation?

maybe above both use cases we should add a note saying that we consider this a tool and that it is an alternative deployment option to our native Raft/MR architecture

Contributor Author

Should we just link to the Data Resilience page? I think that covers some of this, particularly when it comes to MR feature comparison. I'm adding a note linking to that now.

Anything further in terms of execution locality + cdc comps, we should probably create another docs issue for that.

@msbutler msbutler Nov 13, 2024

yeah, i think we should revisit how we publicly document the pros and cons of LDR vs PCR vs CRDB replication, but i don't think we need to block this PR.

Contributor Author

OK yes, I think it's something that @alicia-l2 and I are working on gradually. We recently published this page https://www.cockroachlabs.com/docs/dev/data-resilience and created a new top-level section, so I think building out this area is a good idea.

### Responses

{% include {{ page.version.version }}/ldr/show-logical-replication-responses.md %}

ugh, i may have to backport some changes to this table. this is fine to merge as is.

Contributor Author

OK

- `logical_replication.replicated_time_seconds`
- `logical_replication.events_ingested`
- `logical_replication.events_dlqed`

while we're here, you could add the .scanning_ranges and .lagging_ranges metrics. added on friday cockroachdb/cockroach@9eb6c8b

Contributor Author

Thanks, I have added these to this label list

@kathancox kathancox force-pushed the ldr-overview-monitoring branch from 7c1ba99 to f1df57f Compare November 12, 2024 15:40
@kathancox kathancox requested a review from msbutler November 12, 2024 15:50
@kathancox kathancox requested a review from rmloveland November 13, 2024 15:21

@rmloveland rmloveland left a comment

LGTM modulo that i strongly suggest if at all possible switching from scare quote "committed" to "applied" and then defining what "applied" means (see my comment) since COMMIT is already super duper in use and we should not be redefining it, it will only lead to tears IMO

@@ -0,0 +1 @@
There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level observability will be at the job level. One table in one LDR job will allow for table-level observability.

lol i was just checking if this was an include

`job_id` | The job's ID. Use with [`CANCEL JOB`]({% link {{ page.version.version }}/cancel-job.md %}), [`PAUSE JOB`]({% link {{ page.version.version }}/pause-job.md %}), [`RESUME JOB`]({% link {{ page.version.version }}/resume-job.md %}), [`SHOW JOB`]({% link {{ page.version.version }}/show-jobs.md %}).
`status` | Status of the job `running`, `paused`, `canceled`. {% comment %}check these{% endcomment %}
`targets` | The fully qualified name of the table(s) that are part of the LDR job.
`replicated_time` | The latest timestamp at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. `replicated_time` is updated periodically (every 30s). {% comment %}To confirm this line is accurate{% endcomment %}

timestamp could link to our TIMESTAMP type docs

Contributor Author

done


## Recommended LDR metrics to track

- Replication latency: The commit-to-commit replication latency, which is tracked from when a row is committed on the source cluster, to when it is "committed" on the destination cluster. A _commit_ is when the LDR job either adds a row to the [dead letter queue (DLQ)]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) or applies a row successfully to the destination cluster.

I don't like the scare quotes around "committed" and i don't think overloading the term commit is a good idea

COMMIT is already very well defined semantically in SQL world and in our docs

i'd actually reverse this and say "to when it is applied on the destination cluster" where "applied" means either:

  • COMMIT to target table, OR
  • inserted to DLQ (which is also a COMMIT)

pls ignore for now if there isn't time for this but i think it's a serious problem if we are sprinkling "committed" to mean "LDR's special notion of committed" around LDR docs when we really mean "applied" per the definition i'm using above which doesn't cause the same confusion IMO
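
To make the proposed definition of "applied" concrete, here is a minimal Python sketch. It is not LDR code: the names `Outcome`, `ReplicatedRow`, and `replication_latency` are hypothetical and exist only for illustration. It shows that the latency is measured the same way whether the row lands in the destination table or in the DLQ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Outcome(Enum):
    """How a row was applied on the destination (illustrative names only)."""
    WRITTEN_TO_TABLE = "written_to_table"  # row committed to the destination table
    SENT_TO_DLQ = "sent_to_dlq"            # row inserted into the dead letter queue


@dataclass(frozen=True)
class ReplicatedRow:
    source_commit_time: datetime  # when the row committed on the source cluster
    applied_time: datetime        # when the destination finished handling the row
    outcome: Outcome


def replication_latency(row: ReplicatedRow) -> timedelta:
    """Commit-to-apply latency; the formula does not depend on the outcome."""
    return row.applied_time - row.source_commit_time


# Two rows committed on the source at the same moment, applied differently.
t0 = datetime(2024, 11, 13, 12, 0, 0)
written = ReplicatedRow(t0, t0 + timedelta(seconds=2), Outcome.WRITTEN_TO_TABLE)
dlqed = ReplicatedRow(t0, t0 + timedelta(seconds=5), Outcome.SENT_TO_DLQ)

print(replication_latency(written).total_seconds())
print(replication_latency(dlqed).total_seconds())
```

Both rows count as "applied" under this definition, so neither needs a second meaning of commit.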

Contributor Author

Got it, so I updated here where I could to "apply". Ofc the metric name has "commit" in the syntax and the same is true for at least one changefeed metric.

### Jobs page

On the **Jobs** page, select:

same as above re: recommend linking to jobs page docs for we who live the memento lifestyle

Contributor Author

Done!

@kathancox kathancox force-pushed the ldr-overview-monitoring branch 2 times, most recently from 0f7daee to 2e29c5a Compare November 14, 2024 18:42
@kathancox kathancox force-pushed the ldr-overview-monitoring branch from 2e29c5a to e869cad Compare November 14, 2024 18:53
@kathancox kathancox merged commit 36c94c6 into main Nov 14, 2024
6 checks passed
@kathancox kathancox deleted the ldr-overview-monitoring branch November 14, 2024 19:00
@kathancox
Contributor Author

TFTRs!

4 participants