Add WAL corruption information and fix relrefs (#5618)
* Add WAL corruption draft and fix relrefs

* Rephrase some descriptive info

* Correct default in Static config

* Tweak wording in the descriptive text

* Update text around the error messaging

* Some minor word changes

* Switch xref links and comment out doc reference

* Remove empty line

* Small wording changes

* Remove doc reference syntax

* Add more info and reformat layout

* Small update to wording
clayton-cornell authored Oct 27, 2023
1 parent 6acb80f commit 4e5318b
Showing 2 changed files with 44 additions and 7 deletions.
49 changes: 43 additions & 6 deletions docs/sources/shared/wal-data-retention.md
@@ -27,7 +27,7 @@
for remote writing. If that data has not yet been pushed to the remote
endpoint, it is lost.

This behavior dictates the data retention for the `prometheus.remote_write`
component. It also means that it is impossible to directly correlate data
component. It also means that it's impossible to directly correlate data
retention to the data age itself, as the truncation logic works on
_segments_, not the samples themselves. This makes data retention less
predictable when the component receives an inconsistent rate of data.
@@ -48,7 +48,7 @@
removed. The `max_keepalive_time` or `max_wal_time` controls the maximum age of
samples that can be kept in the WAL. Samples older than
`max_keepalive_time` are forcibly removed.
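
In Flow mode, these limits are set in the `wal` block of `prometheus.remote_write`. The
following is a minimal sketch, assuming the `truncate_frequency`, `min_keepalive_time`, and
`max_keepalive_time` arguments described in the [WAL block] documentation; the endpoint URL
and the values shown are illustrative, not recommendations.

```river
prometheus.remote_write "default" {
  endpoint {
    // Illustrative endpoint URL.
    url = "https://prometheus.example.com/api/v1/write"
  }

  wal {
    // How often the truncation loop runs.
    truncate_frequency = "2h"
    // Samples newer than this are never removed by truncation.
    min_keepalive_time = "5m"
    // Samples older than this are forcibly removed by truncation.
    max_keepalive_time = "8h"
  }
}
```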

### In cases of `remote_write` outages
### Extended `remote_write` outages
When the remote write endpoint is unreachable over a period of time, the most
recent successfully sent timestamp is not updated. The
`min_keepalive_time` and `max_keepalive_time` arguments control the age range
@@ -57,15 +57,15 @@
of data kept in the WAL.
If the remote write outage is longer than the `max_keepalive_time` parameter,
then the WAL is truncated, and the oldest data is lost.

### In cases of intermittent `remote_write` outages
### Intermittent `remote_write` outages
If the remote write endpoint is intermittently reachable, the most recent
successfully sent timestamp is updated whenever the connection is successful.
A successful connection updates the series' comparison with
`min_keepalive_time` and triggers a truncation on the next `truncate_frequency`
interval, which checkpoints two thirds (rounded down to the nearest integer) of
the segments written since the previous truncation. For example, if nine
segments have been written since the previous truncation, the next truncation
checkpoints six of them.

### In cases of falling behind
### Falling behind
If the queue shards cannot flush data quickly enough to keep up with the most
recent data buffered in the WAL, the component is said to be 'falling behind'.
@@ -74,5 +74,42 @@
If the component falls behind more than one third of the data written since the
last truncate interval, it is possible for the truncate loop to checkpoint data
before it is pushed to the remote_write endpoint.
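
One common mitigation is to let the sending queue drain the WAL faster. The sketch below is an
assumption-laden example for Flow mode: it assumes the endpoint's `queue_config` block and its
`capacity`, `max_shards`, and `max_samples_per_send` arguments from the `prometheus.remote_write`
component reference, and the values are illustrative and must be tuned against what the remote
endpoint can absorb.

```river
prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write"

    queue_config {
      // Number of samples buffered per shard before reads from the WAL block.
      capacity = 10000
      // Upper bound on the number of shards; more shards can drain the WAL faster.
      max_shards = 50
      // Maximum number of samples sent in a single request.
      max_samples_per_send = 2000
    }
  }
}
```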

[WAL block]: {{< relref "../flow/reference/components/prometheus.remote_write.md/#wal-block" >}}
[metrics config]: {{< relref "../static/configuration/metrics-config.md" >}}
### WAL corruption

WAL corruption can occur when Grafana Agent unexpectedly stops while the latest WAL segments
are still being written to disk. For example, the host computer might suffer a disk failure
and crash before you can stop Grafana Agent and other running services. When you restart Grafana
Agent, it verifies the WAL and removes any corrupt segments it finds. Sometimes this repair
is unsuccessful, and you must manually delete the corrupted WAL to continue.

If the WAL becomes corrupted, Grafana Agent writes error messages such as
`err="failed to find segment for index"` to the log file.

{{% admonition type="note" %}}
Deleting a WAL segment or a WAL file permanently deletes the stored WAL data.
{{% /admonition %}}

To delete the corrupted WAL:

1. [Stop][] Grafana Agent.
1. Find and delete the contents of the `wal` directory.

   By default, the `wal` directory is a subdirectory of the `data-agent` directory located in the
   Grafana Agent working directory. The WAL data directory may differ from the default depending
   on the [wal_directory][] setting in your Static configuration file or the path specified by
   the Flow [command line flag][run] `--storage-path`.

   {{% admonition type="note" %}}
   There is one `wal` directory per:

   * Metrics instance running in Static mode
   * `prometheus.remote_write` component running in Flow mode
   {{% /admonition %}}

1. [Start][Stop] Grafana Agent and verify that the WAL is working correctly.

[WAL block]: /docs/agent/<AGENT_VERSION>/flow/reference/components/prometheus.remote_write#wal-block
[metrics config]: /docs/agent/<AGENT_VERSION>/static/configuration/metrics-config
[Stop]: /docs/agent/<AGENT_VERSION>/flow/setup/start-agent
[wal_directory]: /docs/agent/<AGENT_VERSION>/static/configuration/metrics-config
[run]: /docs/agent/<AGENT_VERSION>/flow/reference/cli/run
2 changes: 1 addition & 1 deletion docs/sources/static/configuration/metrics-config.md
@@ -31,7 +31,7 @@
define one instance.
# The Grafana Agent assumes that all folders within wal_directory are managed by
# the agent itself. This means if you are using a PVC, you must point
# wal_directory to a subdirectory of the PVC mount.
[wal_directory: <string> | default = ""]
[wal_directory: <string> | default = "data-agent/"]

# Configures how long ago an abandoned (not associated with an instance) WAL
# may be written to before being eligible to be deleted
