Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Add TA-131639 Fixes DOC-11251
* Address nicktrav feedback (1)
* Address florence-crl feedback (1)
* Update with nicktrav feedback (2)
* Fix link
* Add check script to repo
* Apply suggestions from code review Co-authored-by: Matt Linville (he/him) <[email protected]>
* Update usage example
* Update Detection section per additional review
* Apply final suggestions
* Update src/current/advisories/a131639.md

---------

Co-authored-by: mikeCRL <[email protected]>
Co-authored-by: Mike Lewis <[email protected]>
Co-authored-by: Matt Linville (he/him) <[email protected]>
1 parent
678019d
commit 7b18875
Showing
2 changed files
with
199 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
--- | ||
title: Technical Advisory 131639 | ||
advisory: A-131639 | ||
summary: During a sustained period of disk slowness in the presence of lease transfers, it is possible for some writes in a transaction that straddle multiple ranges to be lost. | ||
toc: true | ||
affected_versions: v22.2, v23.1.0 to v23.1.26, v23.2.0 to v23.2.10, v24.1.0 | ||
advisory_date: 2024-10-08 | ||
docs_area: releases | ||
--- | ||
|
||
Publication date: {{ page.advisory_date | date: "%B %e, %Y" }} | ||
|
||
## Description | ||
|
||
In the affected versions listed below (v22.2 through v24.1.0), in circumstances where there is a sustained period of disk slowness in the presence of lease transfers, it is possible for some writes in a transaction that straddle multiple ranges to be lost. In cases where these lost writes are not also present on one or more secondary indexes that could allow for reconstructing or repairing the row, these writes could be irrecoverably lost. | ||
|
||
However, in certain scenarios, these writes may be fully recoverable. For example, lost writes on a secondary index can be fully repaired from the primary index, and a primary index can be fully repaired if there exists a set of secondary indexes that fully covers the primary index columns. The lost writes may also be partially recoverable, for example if only some of the lost writes from a primary index are contained in secondary indexes. | ||
|
||
In detail, since v22.2, when a lease for a range is transferred, the new lease starts out as an expiration-based lease and is then promoted to an epoch-based lease. In situations where the block device backing a store is experiencing intermittent slowness over an extended period of time, the expiration time for the lease transferred to the node with the slow stores can move back in time during this promotion. The lease expiration is said to have "regressed". This lease expiration regression can result in a brief period of time where another node can claim the lease, and the range can have two leaseholders. | ||
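
The regression described above amounts to the promoted lease having an effective expiration earlier than the expiration it replaces. The following is a minimal shell sketch of that comparison only, assuming both expirations are available as integer nanosecond timestamps. The function name and its arguments are hypothetical and are not CockroachDB internals or part of the check script.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only.
# Succeeds (exit status 0) when the promoted lease's effective expiration
# is earlier than the expiration of the lease it replaced.
lease_expiration_regressed() {
  local prev_exp_ns="$1"   # expiration of the expiration-based lease
  local new_exp_ns="$2"    # effective expiration after promotion to epoch-based
  (( new_exp_ns < prev_exp_ns ))
}

# Example values are made up: the new expiration is 6 seconds earlier.
if lease_expiration_regressed 1725194096000000000 1725194090000000000; then
  echo "regressed"
else
  echo "ok"
fi
```

An equal or later expiration is not treated as a regression in this sketch.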
|
||
Transactions will typically be routed from gateway nodes to the true leaseholder (on the node with the healthy store), and write intents will be written on this true leaseholder. On the node with the slow store, however, these write intents are still queued up in the range's Raft log and have not been applied. If async intent resolution (the means by which intents are committed to durable records, once a transaction commits) is subsequently triggered on this slow node before it has caught up on its Raft log, intent resolution becomes a no-op instead of marking the intent as committed. A future transaction that interacts with the row may then come across this intent, and if the committed transaction's record has been garbage collected at that point, it may incorrectly determine that the intent belongs to an aborted transaction. This would then cause the intent to be cleaned up, effectively dropping the committed write. | ||
|
||
This bug is rare in that it requires intermittent slowness of a block device backing a store, coupled with lease transfers onto the store with the intermittently slow block device, followed by async intent resolution on the slow node before it has caught up on its Raft log. Writes must touch more than one range. Single-range writes, including those that benefit from the One-Phase-Commit (1PC) fast path, are not affected. | ||
|
||
The following CockroachDB versions are impacted by this bug: | ||
|
||
- v22.2 | ||
- v23.1.0 - v23.1.26 | ||
- v23.2.0 - v23.2.10 | ||
- v24.1.0 (subsequent patch versions are unaffected) | ||
|
||
Versions prior to v23.1 are no longer eligible for [maintenance support]({% link releases/release-support-policy.md %}), and this issue will not be addressed in those versions. | ||
|
||
## Statement | ||
|
||
[In #123442](https://github.com/cockroachdb/cockroach/commit/6dd54b46cc56b7d2b302e0d5ec1509658a1c86f7), we resolved an issue with CockroachDB in the expiration-to-epoch lease promotion transition process, where a lease's effective expiration could be allowed to regress, resulting in two nodes believing they are the leaseholder for a range. | ||
|
||
The patch has been applied to maintenance releases of CockroachDB: | ||
|
||
- [v23.1.27]({% link releases/v23.1.md%}#v23-1-27) | ||
- [v23.2.11]({% link releases/v23.2.md%}#v23-2-11) | ||
- [v24.1.1]({% link releases/v24.1.md%}#v24-1-1) | ||
|
||
This issue is tracked publicly in [#131639](https://github.com/cockroachdb/cockroach/issues/131639). | ||
|
||
## Mitigation | ||
|
||
Users are encouraged to upgrade to [v23.1.27]({% link releases/v23.1.md%}#v23-1-27), [v23.2.11]({% link releases/v23.2.md%}#v23-2-11), [v24.1.1]({% link releases/v24.1.md%}#v24-1-1), or a later version that includes the patch. | ||
|
||
### Detection via logs | ||
|
||
Users can run the script [a131639_check.sh](a131639_check.sh) to determine whether any lease expiration regressions are evident from their logs. The script must be run from within a decompressed [debug.zip]({% link v24.3/cockroach-debug-zip.md %}) directory. | ||
|
||
As described previously, a lease expiration regression is only one of the two race conditions needed to encounter a lost write, with the other being a stale read during asynchronous intent resolution on the range that had the lease expiration regression. Consequently, this script does not provide a complete method of detection for a lost write. Running this script may result in a limited number of false positives. For complete confirmation, this script should be combined with the method described in [Detection via an audit of primary and secondary indexes](#detection-via-an-audit-of-primary-and-secondary-indexes). | ||
|
||
Users should not run the script on a production system. Instead, users should move the `debug.zip` file elsewhere for analysis and run the script there. | ||
|
||
The script requires the following dependencies: | ||
|
||
- [ripgrep](https://github.com/BurntSushi/ripgrep#installation) | ||
- [DuckDB](https://duckdb.org/docs/installation) | ||
|
||
Example usage and output: | ||
|
||
~~~ | ||
➜ pwd | ||
/path/of/unzipped/debug | ||
➜ /path/to/script/a131639_check.sh | ||
Searching for liveness epoch increments ... | ||
Searching for slow lease promotions ... | ||
Querying the logs for lease expiration regressions ... | ||
❌ Lease expiration regressions found. Symptoms detected. | ||
symptoms | ||
node 123 observed possible symptoms at 2024-09-01 12:34:56 on ranges 123456 | ||
~~~ | ||
|
||
### Detection via an audit of primary and secondary indexes | ||
|
||
If the detection script finds symptoms of a potential lease expiration regression and you suspect that a particular row may have been subject to a lost write, you can query the primary and secondary indexes to confirm that the row exists in both places. | ||
|
||
For example, if you suspect that a row in `my_table` with primary key `pk=1` and secondary index keys `a=2` and `b=3` has a lost write, you can run the following queries to confirm: | ||
|
||
{% include_cached copy-clipboard.html %} | ||
~~~ sql | ||
SELECT count(*) FROM my_table@my_table_pkey AS OF SYSTEM TIME '-10s' WHERE pk = 1; | ||
SELECT count(*) FROM my_table@my_table_a_idx AS OF SYSTEM TIME '-10s' WHERE pk = 1 AND a = 2; | ||
SELECT count(*) FROM my_table@my_table_b_idx AS OF SYSTEM TIME '-10s' WHERE pk = 1 AND b = 3; | ||
~~~ | ||
|
||
If any of these queries return different values for `count`, that likely indicates a lost write. | ||
|
||
The `AS OF SYSTEM TIME` clauses are added to minimize the impact on foreground traffic. | ||
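
If you collect the counts from the three queries above in a shell session, the comparison can be sketched as follows. This is an illustrative helper under stated assumptions, not part of the advisory's tooling: the function name is hypothetical, and it only compares numbers you pass in. It does not connect to a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares row counts gathered from the primary index
# and each secondary index for one suspect row.
# Prints "consistent" and succeeds when all counts match; otherwise prints
# "mismatch" and returns a non-zero status.
index_counts_agree() {
  local first="$1"
  shift
  local c
  for c in "$@"; do
    if [[ "$c" != "$first" ]]; then
      echo "mismatch"
      return 1
    fi
  done
  echo "consistent"
}

# Example with made-up counts: the primary index returns 1 row,
# but the index on `a` returns 0.
index_counts_agree 1 0 1 || true
```

A mismatch reported by this sketch has the same meaning as differing `count` values above: a likely, not confirmed, lost write.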
|
||
If you are unsure which rows may have been affected but suspect that your cluster experienced lost writes, please contact the [support team](https://support.cockroachlabs.com). | ||
|
||
### Repair rows with lost writes | ||
|
||
If you experienced lost writes, it is possible to repair rows by re-inserting data. This will generally be easier if you know all the column values that need to be re-inserted. Please contact the [support team](https://support.cockroachlabs.com) for assistance with repairing rows with lost writes. | ||
|
||
## Impact | ||
|
||
In circumstances where there is a sustained period of disk slowness in the presence of lease transfers, it is possible for some writes in a transaction that straddle multiple ranges to be lost. In cases where these lost writes are not also present on one or more secondary indexes that could allow for reconstructing or repairing the row, these writes could be irrecoverably lost. | ||
|
||
Please reach out to the [support team](https://support.cockroachlabs.com) if more information or assistance is needed. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
#!/usr/bin/env bash | ||
# Run from within an unzipped debug.zip directory. | ||
set -euo pipefail | ||
|
||
# Check if ripgrep is installed. | ||
if ! command -v rg &> /dev/null; then | ||
echo "Error: ripgrep is not installed. Please install following the instructions in https://github.com/BurntSushi/ripgrep#installation" | ||
exit 1 | ||
fi | ||
|
||
# Check if duckdb is installed. | ||
if ! command -v duckdb &> /dev/null; then | ||
echo "Error: duckdb is not installed. Please install following the instructions in https://duckdb.org/docs/installation" | ||
exit 1 | ||
fi | ||
|
||
# Grep the debug.zip for the two queries. | ||
echo "Searching for liveness epoch increments ..." | ||
echo "timestamp,remote_node_id,epoch" > logs_epoch_increment.csv | ||
rg -I -r '$1,$2,$3' 'I([0-9]+ [0-9:.]+).*incremented n(.*) liveness epoch to (.*)' nodes >> logs_epoch_increment.csv || true | ||
if [[ $(wc -l <logs_epoch_increment.csv) -eq 1 ]]; then | ||
printf "\n✅ No liveness epoch increments found. Symptoms not detected.\n" | ||
exit 0 | ||
fi | ||
|
||
echo "Searching for slow lease promotions ..." | ||
echo "timestamp,node_id,range_id,range_span,lease_epoch,lease_proposed,prev_expiration" > logs_slow_lease_promo.csv | ||
rg -I -r '$1,$2,$3,$4,$5,$6,$7' 'W([0-9]+ [0-9:.]+).*,n(\d+),.*,r(\d+)/\d+:(.*),raft\].*client traffic may have been delayed.*epo=(\d+).*pro=(\d+\.\d+).*prev.*exp=(\d+\.\d+).*Request.*' nodes >> logs_slow_lease_promo.csv || true | ||
if [[ $(wc -l <logs_slow_lease_promo.csv) -eq 1 ]]; then | ||
printf "\n✅ No slow lease promotions found. Symptoms not detected.\n" | ||
exit 0 | ||
fi | ||
|
||
# Query the logs using duckdb. | ||
echo "Querying the logs for lease expiration regressions ..." | ||
cat <<EOF > query.sql | ||
SELECT | ||
'node ' | ||
|| node_id | ||
|| ' observed possible symptoms at ' | ||
|| lease_proposed_min | ||
|| ' on ranges ' | ||
|| string_agg(range_id, ', ') as symptoms | ||
FROM | ||
( | ||
WITH | ||
lease_acqs AS ( | ||
SELECT | ||
strptime(timestamp, '%y%m%d %H:%M:%S.%n')::timestamp_ns as timestamp, | ||
node_id, | ||
range_id, | ||
range_span, | ||
lease_epoch, | ||
timezone('UTC', to_timestamp(lease_proposed)) AS lease_proposed, | ||
timezone('UTC', to_timestamp(prev_expiration)) AS prev_expiration | ||
FROM 'logs_slow_lease_promo.csv' | ||
), | ||
epoch_incs AS ( | ||
SELECT | ||
strptime(timestamp, '%y%m%d %H:%M:%S.%n')::timestamp_ns as timestamp, | ||
remote_node_id, | ||
epoch | ||
FROM 'logs_epoch_increment.csv' | ||
) | ||
SELECT *, date_trunc('minute', lease_proposed) AS lease_proposed_min | ||
FROM lease_acqs JOIN epoch_incs | ||
-- where the leaseholder's liveness record had its epoch incremented | ||
ON lease_acqs.node_id = epoch_incs.remote_node_id | ||
-- and the lease acquisition used an epoch that is less than the one | ||
-- that was incremented to | ||
AND lease_acqs.lease_epoch < epoch_incs.epoch | ||
-- and the lease acquisition began before the epoch increment | ||
AND lease_acqs.lease_proposed < epoch_incs.timestamp | ||
-- and the lease acquisition completed after the epoch increment | ||
AND lease_acqs.timestamp > epoch_incs.timestamp | ||
-- and the previous lease's expiration was after the epoch increment | ||
AND lease_acqs.prev_expiration > epoch_incs.timestamp | ||
) | ||
GROUP BY node_id, lease_proposed_min | ||
ORDER BY lease_proposed_min | ||
EOF | ||
duckdb -list < query.sql > logs_lease_regressions.txt | ||
|
||
# Output result. | ||
if [ -s logs_lease_regressions.txt ]; then | ||
printf "\n❌ Lease expiration regressions found. Symptoms detected.\n\n" | ||
cat logs_lease_regressions.txt | ||
exit 1 | ||
else | ||
printf "\n✅ No lease expiration regressions found. Symptoms not detected.\n" | ||
exit 0 | ||
fi |
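
For readers reviewing the DuckDB query in the script above, its join conditions can be read as a predicate over one slow lease promotion record and one liveness epoch increment record for the same node. The sketch below restates those four conditions in plain bash for a single pair of records. It is an illustration only: the function and argument names are hypothetical, all times are simplified to integer Unix timestamps, and the node ID match is assumed to have been done already.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the four timing and epoch conditions of the
# DuckDB join for one (lease promotion, epoch increment) pair.
# Succeeds (exit status 0) when the pair matches all conditions.
is_possible_regression() {
  local lease_epoch="$1" lease_proposed="$2" lease_logged="$3" prev_expiration="$4"
  local inc_epoch="$5" inc_time="$6"

  # The lease used a liveness epoch older than the one incremented to.
  (( lease_epoch < inc_epoch )) || return 1
  # The lease acquisition began before the epoch increment.
  (( lease_proposed < inc_time )) || return 1
  # The lease acquisition completed (was logged) after the epoch increment.
  (( lease_logged > inc_time )) || return 1
  # The previous lease's expiration was after the epoch increment.
  (( prev_expiration > inc_time )) || return 1
  return 0
}

# Example with made-up values: epoch 5 lease proposed at t=100, logged at
# t=130, previous expiration t=125; epoch incremented to 6 at t=120.
is_possible_regression 5 100 130 125 6 120 && echo "possible regression"
```

As with the script itself, a match indicates a symptom worth investigating and does not confirm a lost write.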