Add WAL corruption information and fix relrefs (#5618)
* Add WAL corruption draft and fix relrefs

* Rephrase some descriptive info

* Correct default in Static config

* Tweak wording in the descriptive text

* Update text around the error messaging

* Some minor word changes

* Switch xref links and comment out doc reference

* Remove empty line

* Small wording changes

* Remove doc reference syntax

* Add more info and reformat layout

* Small update to wording
clayton-cornell authored Oct 27, 2023
1 parent 6acb80f commit 4e5318b
Showing 2 changed files with 44 additions and 7 deletions.
49 changes: 43 additions & 6 deletions docs/sources/shared/wal-data-retention.md
@@ -27,7 +27,7 @@
for remote writing. If that data has not yet been pushed to the remote
endpoint, it is lost.

This behavior dictates the data retention for the `prometheus.remote_write`
component. It also means that it is impossible to directly correlate data
component. It also means that it's impossible to directly correlate data
retention to the data age itself, as the truncation logic works on
_segments_, not the samples themselves. This makes data retention less
predictable when the component receives an inconsistent rate of data.
@@ -48,7 +48,7 @@
removed. The `max_keepalive_time` or `max_wal_time` controls the maximum age of
samples that can be kept in the WAL. Samples older than
`max_keepalive_time` are forcibly removed.
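
In Flow mode, these limits are set in the `wal` block of `prometheus.remote_write`. The
following is a minimal sketch, assuming the `truncate_frequency`, `min_keepalive_time`, and
`max_keepalive_time` arguments described in the [WAL block] documentation; the endpoint URL
and the values shown are illustrative, not recommendations.

```river
prometheus.remote_write "default" {
  endpoint {
    // Illustrative endpoint URL.
    url = "https://prometheus.example.com/api/v1/write"
  }

  wal {
    // How often the truncation loop runs.
    truncate_frequency = "2h"
    // Samples newer than this are never removed by truncation.
    min_keepalive_time = "5m"
    // Samples older than this are forcibly removed by truncation.
    max_keepalive_time = "8h"
  }
}
```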

### In cases of `remote_write` outages
### Extended `remote_write` outages
When the remote write endpoint is unreachable over a period of time, the most
recent successfully sent timestamp is not updated. The
`min_keepalive_time` and `max_keepalive_time` arguments control the age range
@@ -57,15 +57,15 @@
of data kept in the WAL.
If the remote write outage is longer than the `max_keepalive_time` parameter,
then the WAL is truncated, and the oldest data is lost.

### In cases of intermittent `remote_write` outages
### Intermittent `remote_write` outages
If the remote write endpoint is intermittently reachable, the most recent
successfully sent timestamp is updated whenever the connection is successful.
A successful connection updates the series' comparison with
`min_keepalive_time` and triggers a truncation on the next `truncate_frequency`
interval, which checkpoints two thirds (rounded down to the nearest integer) of
the segments written since the previous truncation. For example, if nine
segments have been written since the previous truncation, the next truncation
checkpoints six of them.

### In cases of falling behind
### Falling behind
If the queue shards cannot flush data quickly enough to keep up with the most
recent data buffered in the WAL, the component is said to be 'falling behind'.
@@ -74,5 +74,42 @@
If the component falls behind more than one third of the data written since the
last truncate interval, it is possible for the truncate loop to checkpoint data
before it is pushed to the remote_write endpoint.
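
One common mitigation is to let the sending queue drain the WAL faster. The sketch below is an
assumption-laden example for Flow mode: it assumes the endpoint's `queue_config` block and its
`capacity`, `max_shards`, and `max_samples_per_send` arguments from the `prometheus.remote_write`
component reference, and the values are illustrative and must be tuned against what the remote
endpoint can absorb.

```river
prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write"

    queue_config {
      // Number of samples buffered per shard before reads from the WAL block.
      capacity = 10000
      // Upper bound on the number of shards; more shards can drain the WAL faster.
      max_shards = 50
      // Maximum number of samples sent in a single request.
      max_samples_per_send = 2000
    }
  }
}
```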

[WAL block]: {{< relref "../flow/reference/components/prometheus.remote_write.md/#wal-block" >}}
[metrics config]: {{< relref "../static/configuration/metrics-config.md" >}}
### WAL corruption

WAL corruption can occur when Grafana Agent unexpectedly stops while the latest WAL segments
are still being written to disk. For example, the host computer might suffer a disk failure
and crash before you can stop Grafana Agent and other running services. When you restart Grafana
Agent, it verifies the WAL and removes any corrupt segments it finds. Sometimes this repair
is unsuccessful, and you must manually delete the corrupted WAL to continue.

If the WAL becomes corrupted, Grafana Agent writes error messages such as
`err="failed to find segment for index"` to the log file.

{{% admonition type="note" %}}
Deleting a WAL segment or a WAL file permanently deletes the stored WAL data.
{{% /admonition %}}

To delete the corrupted WAL:

1. [Stop][] Grafana Agent.
1. Find and delete the contents of the `wal` directory.

   By default, the `wal` directory is a subdirectory of the `data-agent` directory located in the
   Grafana Agent working directory. The WAL data directory may differ from the default depending
   on the [wal_directory][] setting in your Static configuration file or the path specified by
   the Flow [command line flag][run] `--storage-path`.

   {{% admonition type="note" %}}
   There is one `wal` directory per:

   * Metrics instance running in Static mode
   * `prometheus.remote_write` component running in Flow mode
   {{% /admonition %}}

1. [Start][Stop] Grafana Agent and verify that the WAL is working correctly.

[WAL block]: /docs/agent/<AGENT_VERSION>/flow/reference/components/prometheus.remote_write#wal-block
[metrics config]: /docs/agent/<AGENT_VERSION>/static/configuration/metrics-config
[Stop]: /docs/agent/<AGENT_VERSION>/flow/setup/start-agent
[wal_directory]: /docs/agent/<AGENT_VERSION>/static/configuration/metrics-config
[run]: /docs/agent/<AGENT_VERSION>/flow/reference/cli/run
2 changes: 1 addition & 1 deletion docs/sources/static/configuration/metrics-config.md
@@ -31,7 +31,7 @@
define one instance.
# The Grafana Agent assumes that all folders within wal_directory are managed by
# the agent itself. This means if you are using a PVC, you must point
# wal_directory to a subdirectory of the PVC mount.
[wal_directory: <string> | default = ""]
[wal_directory: <string> | default = "data-agent/"]

# Configures how long ago an abandoned (not associated with an instance) WAL
# may be written to before being eligible to be deleted
