Add more docs on WAL failover (#19052)
* Add more docs on WAL failover

Fixes DOC-11199

Summary of changes:

- Add a new 'WAL Failover' page to the 'Self-Hosted Deployments' section

- Update `cockroach start` docs to note that it has the basic info, but
  to see the 'WAL Failover Playbook' for more detailed instructions

- Mark WAL failover as GA (aka no longer in Preview)
rmloveland authored Nov 15, 2024
1 parent 8a71f2f commit 0d4e85a
Showing 11 changed files with 520 additions and 37 deletions.
@@ -550,7 +550,13 @@
"urls": [
"/${VERSION}/ui-key-visualizer.html"
]
}
},
{
"title": "WAL Failover",
"urls": [
"/${VERSION}/wal-failover.html"
]
}
]
},
{
9 changes: 9 additions & 0 deletions src/current/_includes/v24.3/wal-failover-intro.md
@@ -0,0 +1,9 @@
On a CockroachDB [node]({% link {{ page.version.version }}/architecture/overview.md %}#node) with [multiple stores]({% link {{ page.version.version }}/cockroach-start.md %}#store), you can mitigate some effects of [disk stalls]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls) by configuring the node to fail over each store's [write-ahead log (WAL)]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log) to another store's data directory using the `--wal-failover` flag to [`cockroach start`]({% link {{ page.version.version }}/cockroach-start.md %}#enable-wal-failover) or the `COCKROACH_WAL_FAILOVER` environment variable.
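
For illustration, a start command that enables WAL failover might look like the following sketch. The store paths, the `among-stores` mode, and the elided flags are placeholders and assumptions, not taken from this commit:

~~~ shell
# Sketch: enable WAL failover among this node's stores at startup.
# Store paths and the remaining flags are placeholders.
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --wal-failover=among-stores \
  <other required flags>

# The same setting, using the environment variable instead of the flag:
COCKROACH_WAL_FAILOVER=among-stores cockroach start --store=path=/mnt/data1 --store=path=/mnt/data2 <other required flags>
~~~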

Failing over the WAL may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage. For example, if the node's primary store is stalled, and the node can't read from or write to it, the node can still write to the WAL on another store. This can allow the node to continue to service requests during momentary unavailability of the underlying storage device.

When WAL failover is enabled, CockroachDB will take the following actions:

- At node startup, each store is assigned another store to be its failover destination.
- CockroachDB will begin monitoring the latency of all WAL writes. If latency to the WAL exceeds the value of the [cluster setting `storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node will attempt to write WAL entries to a secondary store's volume.
- CockroachDB will update the [store status endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) at `/_status/stores` so you can monitor the store's status, as shown in the example after this list.
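
For example, assuming an insecure local node serving the DB Console HTTP interface on the default port 8080 (an assumption, not part of this commit), you can query the endpoint directly:

~~~ shell
# Sketch: inspect per-store status on one node.
curl http://localhost:8080/_status/stores
~~~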
9 changes: 9 additions & 0 deletions src/current/_includes/v24.3/wal-failover-log-config.md
@@ -0,0 +1,9 @@
{% include_cached copy-clipboard.html %}
~~~ yaml
file-defaults:
  # Disable the small synchronous write buffer; required when `buffering` is enabled.
  buffered-writes: false
  # Asynchronous buffering of log writes.
  buffering:
    # Flush buffered entries at least once per second.
    max-staleness: 1s
    # Flush once the buffer reaches 256 KiB.
    flush-trigger-size: 256KiB
    # Upper bound on the amount of buffered log data.
    max-buffer-size: 50MiB
~~~
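
As a usage sketch, the configuration above could be saved to a file and passed to the node at startup; the `--log-config-file` flag and the file name shown here are assumptions, not part of this include:

~~~ shell
# Sketch: start a node with the logging configuration above saved to logs.yaml.
cockroach start --log-config-file=logs.yaml <other required flags>
~~~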
12 changes: 12 additions & 0 deletions src/current/_includes/v24.3/wal-failover-metrics.md
@@ -0,0 +1,12 @@
You can monitor WAL failover occurrences using the following metrics:

- `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
- `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
- `storage.wal.failover.switch.count`: The number of times WAL writing has switched from the primary to the secondary store, and vice versa.

`storage.wal.failover.secondary.duration` is the primary metric to monitor. Expect this metric to be `0` unless a WAL failover occurs. If a failover does occur, the rate at which this metric increases indicates the health of the primary store.
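
As a quick manual check, you can scrape a node's Prometheus endpoint directly; the port, the insecure mode, and the dot-to-underscore metric naming on `/_status/vars` are assumptions, not taken from this commit:

~~~ shell
# Sketch: spot-check the WAL failover metrics on one node, assuming an insecure
# local node serving HTTP on the default port 8080.
curl -s http://localhost:8080/_status/vars | grep wal_failover
~~~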

You can access these metrics via the following methods:

- The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
- By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
3 changes: 3 additions & 0 deletions src/current/_includes/v24.3/wal-failover-side-disk.md
@@ -0,0 +1,3 @@
- Size = minimum 25 GiB
- IOPS = 1/10th of the IOPS of the disk for the "user data" store
- Bandwidth = 1/10th of the bandwidth of the disk for the "user data" store
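
For example, assuming a hypothetical "user data" store on a disk provisioned with 7,500 IOPS and 250 MiB/s of bandwidth, the side disk would need at least 25 GiB of capacity, roughly 750 IOPS, and roughly 25 MiB/s of bandwidth. These numbers are illustrative only and are not taken from this commit.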
[Two files in this commit could not be displayed in the diff view.]
4 changes: 4 additions & 0 deletions src/current/v24.3/cluster-setup-troubleshooting.md
@@ -420,6 +420,10 @@ Different filesystems may treat the ballast file differently. Make sure to test
A _disk stall_ is any disk operation that does not terminate in a reasonable amount of time. This usually manifests as write-related system calls such as [`fsync(2)`](https://man7.org/linux/man-pages/man2/fdatasync.2.html) (aka `fdatasync`) taking a lot longer than expected (e.g., more than 20 seconds). The mitigation in almost all cases is to [restart the node]({% link {{ page.version.version }}/cockroach-start.md %}) with the stalled disk. CockroachDB's internal disk stall monitoring will attempt to shut down a node when it sees a disk stall that lasts longer than 20 seconds. At that point the node should be restarted by your [orchestration system]({% link {{ page.version.version }}/recommended-production-settings.md %}#orchestration-kubernetes).
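
If you suspect a stall, one generic way to observe device-level latency is with `iostat` from the sysstat package; this is a sketch for a Linux host and is not part of this commit:

~~~ shell
# Sketch: watch extended per-device I/O statistics once per second; sustained,
# very high write latencies on a store's device are consistent with a stall.
iostat -dxm 1
~~~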
{{site.data.alerts.callout_success}}
In cloud environments, transient disk stalls are common, often lasting on the order of several seconds. If you deploy a CockroachDB {{ site.data.products.core }} cluster in the cloud, we strongly recommend enabling [WAL failover]({% link {{ page.version.version }}/cockroach-start.md %}#write-ahead-log-wal-failover).
{{site.data.alerts.end}}
Symptoms of disk stalls include:
- Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
38 changes: 8 additions & 30 deletions src/current/v24.3/cockroach-start.md
@@ -232,22 +232,18 @@ Field | Description
<a name="fields-ballast-size"></a> `ballast-size` | Configure the size of the automatically created emergency ballast file. Accepts the same value formats as the [`size` field](#store-size). For more details, see [Automatic ballast files]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#automatic-ballast-files).<br><br>To disable automatic ballast file creation, set the value to `0`:<br><br>`--store=path=/mnt/ssd01,ballast-size=0`
<a name="store-provisioned-rate"></a> `provisioned-rate` | A mapping of a store name to a bandwidth limit, expressed in bytes per second. This constrains the bandwidth used for [admission control]({% link {{ page.version.version }}/admission-control.md %}) for operations on the store. The disk name is separated from the bandwidth value by a colon (`:`). A value of `0` (the default) represents unlimited bandwidth. For example: <br /><br />`--store=provisioned-rate=disk-name=/mnt/ssd01:200`<br /><br />**Default:** 0<br /><br />If the bandwidth value is omitted, bandwidth is limited to the value of the [`kv.store.admission.provisioned_bandwidth` cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#settings). <strong>Modify this setting only in consultation with your <a href="https://support.cockroachlabs.com/hc/en-us">support team</a>.</strong>

#### Write Ahead Log (WAL) Failover
#### Write Ahead Log (WAL) failover

On a CockroachDB [node]({% link {{ page.version.version }}/architecture/overview.md %}#node) with [multiple stores](#store), you can mitigate some effects of [disk stalls]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls) by configuring the node to failover each store's [write-ahead log (WAL)]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log) to another store's data directory using the `--wal-failover` flag.

Failing over the WAL may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage. For example, if the node's primary store is stalled, and the node can't read from or write to it, the node can still write to the WAL on another store. This can give the node a chance to eventually catch up once the disk stall has been resolved.

When WAL failover is enabled, CockroachDB will take the following actions:

- At node startup, each store is assigned another store to be its failover destination.
- CockroachDB will begin monitoring the latency of all WAL writes. If latency to the WAL exceeds the value of the [cluster setting `storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node will attempt to write WAL entries to a secondary store's volume.
- CockroachDB will update the [store status endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) at `/_status/stores` so you can monitor the store's status.
{% include {{ page.version.version }}/wal-failover-intro.md %}

{{site.data.alerts.callout_info}}
{% include feature-phases/preview.md %}
{{site.data.alerts.end}}

This page has basic instructions for enabling, disabling, and monitoring WAL failover.

For more detailed instructions showing how to use, test, and monitor WAL failover, as well as descriptions of how WAL failover works in multi-store configurations, see [WAL Failover]({% link {{ page.version.version }}/wal-failover.md %}).

##### Enable WAL failover

To enable WAL failover, you must take one of the following actions:
@@ -265,14 +261,7 @@ Therefore, if you enable WAL failover and log to local disks, you must also upda
1. When `buffering` is enabled, `buffered-writes` must be explicitly disabled as shown in the following example. This is necessary because `buffered-writes` does not provide true asynchronous disk access, but rather a small buffer. If the small buffer fills up, it can cause internal routines performing logging operations to hang. This will in turn cause internal routines doing other important work to hang, potentially affecting cluster stability.
1. The recommended logging configuration for using file-based logging with WAL failover is as follows:

~~~
file-defaults:
  buffered-writes: false
  buffering:
    max-staleness: 1s
    flush-trigger-size: 256KiB
    max-buffer-size: 50MiB
~~~
{% include {{ page.version.version }}/wal-failover-log-config.md %}

As an alternative to logging to local disks, you can configure [remote log sinks]({% link {{page.version.version}}/logging-use-cases.md %}#network-logging) that are not correlated with the availability of your cluster's local disks. However, this will make troubleshooting using [`cockroach debug zip`]({% link {{ page.version.version}}/cockroach-debug-zip.md %}) more difficult, since the output of that command will not include the (remotely stored) log files.

@@ -285,18 +274,7 @@ To disable WAL failover, you must [restart the node]({% link {{ page.version.ver

##### Monitor WAL failover

You can monitor if WAL failover occurs using the following metrics:
- `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
- `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
- `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, you probably care about how long it remains non-zero because it provides an indication of the health of the primary store.
You can access these metrics via the following methods:
- The [Custom Chart Debug Page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
- By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
{% include {{ page.version.version }}/wal-failover-metrics.md %}

### Logging

6 changes: 0 additions & 6 deletions src/current/v24.3/cockroachdb-feature-availability.md
@@ -260,12 +260,6 @@ The [`EXPERIMENTAL CHANGEFEED FOR`]({% link {{ page.version.version }}/changefee

The multiple active portals feature of the Postgres wire protocol (pgwire) is available, with limitations. For more information, see [Multiple active portals]({% link {{ page.version.version }}/postgresql-compatibility.md %}#multiple-active-portals).

### Write Ahead Log (WAL) Failover

When a CockroachDB [node]({% link {{ page.version.version }}/architecture/overview.md %}#node) is configured to run with [multiple stores]({% link {{ page.version.version }}/cockroach-start.md %}#store), you can mitigate some effects of [disk stalls]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls) by configuring the node to failover each store's [write-ahead log (WAL)]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log) to another store's data directory.

For more information, see [Write Ahead Log (WAL) Failover]({% link {{ page.version.version }}/cockroach-start.md %}#write-ahead-log-wal-failover).

### Super regions

[Super regions]({% link {{ page.version.version }}/multiregion-overview.md %}#super-regions) allow you to define a set of database regions such that schema objects will have all of their replicas stored _only_ in regions that are members of the super region. The primary use case for super regions is data domiciling.