diff --git a/docs/sources/shared/wal-data-retention.md b/docs/sources/shared/wal-data-retention.md
index 5ff90ae03301..cb5c1f5a0e7a 100644
--- a/docs/sources/shared/wal-data-retention.md
+++ b/docs/sources/shared/wal-data-retention.md
@@ -27,7 +27,7 @@ for remote writing. If that data has not yet been pushed to the remote
 endpoint, it is lost.
 
 This behavior dictates the data retention for the `prometheus.remote_write`
-component. It also means that it is impossible to directly correlate data
+component. It also means that it's impossible to directly correlate data
 retention directly to the data age itself, as the truncation logic works on
 _segments_, not the samples themselves. This makes data retention less
 predictable when the component receives a non-consistent rate of data.
@@ -48,7 +48,7 @@ removed. The `max_keepalive_time` or `max_wal_time` controls the maximum age
 of samples that can be kept in the WAL. Samples older than
 `max_keepalive_time` are forcibly removed.
 
-### In cases of `remote_write` outages
+### Extended `remote_write` outages
 
 When the remote write endpoint is unreachable over a period of time, the
 most recent successfully sent timestamp is not updated. The
 `min_keepalive_time` and `max_keepalive_time` arguments control the age range
@@ -57,7 +57,7 @@ of data kept in the WAL.
 If the remote write outage is longer than the `max_keepalive_time` parameter,
 then the WAL is truncated, and the oldest data is lost.
 
-### In cases of intermittent `remote_write` outages
+### Intermittent `remote_write` outages
 
 If the remote write endpoint is intermittently reachable, the most recent
 successfully sent timestamp is updated whenever the connection is successful.
 A successful connection updates the series' comparison with
@@ -65,7 +65,7 @@ A successful connection updates the series' comparison with
 interval which checkpoints two thirds of the segments (rounded down to the
 nearest integer) written since the previous truncation.
 
-### In cases of falling behind
+### Falling behind
 
 If the queue shards cannot flush data quickly enough to keep up-to-date
 with the most recent data buffered in the WAL, we say that the component is
 'falling behind'.
@@ -74,5 +74,42 @@ If the component falls behind more than one third of the data written since
 the last truncate interval, it is possible for the truncate loop to checkpoint
 data before being pushed to the remote_write endpoint.
 
-[WAL block]: {{< relref "../flow/reference/components/prometheus.remote_write.md/#wal-block" >}}
-[metrics config]: {{< relref "../static/configuration/metrics-config.md" >}}
+### WAL corruption
+
+WAL corruption can occur when Grafana Agent unexpectedly stops while the latest WAL segments
+are still being written to disk. For example, the host computer suffers a disk failure
+and crashes before you can stop Grafana Agent and other running services. When you restart Grafana
+Agent, the Agent verifies the WAL, removing any corrupt segments it finds. Sometimes, this repair
+is unsuccessful, and you must manually delete the corrupted WAL to continue.
+
+If the WAL becomes corrupted, Grafana Agent writes error messages such as
+`err="failed to find segment for index"` to the log file.
+
+{{% admonition type="note" %}}
+Deleting a WAL segment or a WAL file permanently deletes the stored WAL data.
+{{% /admonition %}}
+
+To delete the corrupted WAL:
+
+1. [Stop][] Grafana Agent.
+1. Find and delete the contents of the `wal` directory.
+
+   By default, the `wal` directory is a subdirectory of the `data-agent` directory
+   located in the Grafana Agent working directory. The WAL data directory may differ
+   from the default depending on the [wal_directory][] setting in your Static configuration
+   file or the path specified by the Flow [command line flag][run] `--storage-path`.
+
+   {{% admonition type="note" %}}
+   There is one `wal` directory per:
+
+   * Metrics instance running in Static mode
+   * `prometheus.remote_write` component running in Flow mode
+   {{% /admonition %}}
+
+1. [Start][Stop] Grafana Agent and verify that the WAL is working correctly.
+
+[WAL block]: /docs/agent//flow/reference/components/prometheus.remote_write#wal-block
+[metrics config]: /docs/agent//static/configuration/metrics-config
+[Stop]: /docs/agent//flow/setup/start-agent
+[wal_directory]: /docs/agent//static/configuration/metrics-config
+[run]: /docs/agent//flow/reference/cli/run
diff --git a/docs/sources/static/configuration/metrics-config.md b/docs/sources/static/configuration/metrics-config.md
index 70926e003791..d5cb9a91f41e 100644
--- a/docs/sources/static/configuration/metrics-config.md
+++ b/docs/sources/static/configuration/metrics-config.md
@@ -31,7 +31,7 @@ define one instance.
 # The Grafana Agent assumes that all folders within wal_directory are managed by
 # the agent itself. This means if you are using a PVC, you must point
 # wal_directory to a subdirectory of the PVC mount.
-[wal_directory: <string> | default = ""]
+[wal_directory: <string> | default = "data-agent/"]
 
 # Configures how long ago an abandoned (not associated with an instance) WAL
 # may be written to before being eligible to be deleted
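To make the retention arguments discussed in `wal-data-retention.md` concrete, here is a minimal sketch of a Flow-mode `prometheus.remote_write` component with an explicit `wal` block. The endpoint URL and the duration values are illustrative placeholders, not recommendations; see the [WAL block] reference for the authoritative argument list and defaults.

```river
prometheus.remote_write "default" {
  endpoint {
    // Placeholder URL; point this at your real remote_write endpoint.
    url = "https://prometheus.example.com/api/v1/write"
  }

  wal {
    // How often the truncation loop described above runs.
    truncate_frequency = "2h"
    // Samples newer than this are never removed during truncation.
    min_keepalive_time = "5m"
    // Samples older than this are forcibly removed, even if they have
    // not yet been sent to the remote endpoint.
    max_keepalive_time = "8h"
  }
}
```

Raising `max_keepalive_time` keeps data available longer during an extended outage at the cost of more disk usage; lowering it bounds disk usage but makes data loss during long outages more likely.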
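The corrupted-WAL cleanup steps added above can be scripted roughly as shown below. This is a sketch that assumes a systemd-managed installation whose service is named `grafana-agent`; the WAL path is a hypothetical example, so substitute the location derived from your `wal_directory` setting (Static mode) or `--storage-path` flag (Flow mode).

```shell
# 1. Stop Grafana Agent so no new WAL segments are written.
sudo systemctl stop grafana-agent

# 2. Delete the contents of the wal directory.
#    The path below is hypothetical; use the wal directory on your host.
sudo rm -rf /var/lib/grafana-agent/data-agent/wal/*

# 3. Start Grafana Agent again and check the logs for WAL errors.
sudo systemctl start grafana-agent
sudo journalctl -u grafana-agent -f | grep -i wal
```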