Add TA-131639 (#18981)
* Add TA-131639

Fixes DOC-11251

* Address nicktrav feedback (1)

* Address florence-crl feedback (1)

* Update with nicktrav feedback (2)

* Fix link

* Add check script to repo

* Apply suggestions from code review

Co-authored-by: Matt Linville (he/him) <[email protected]>

* Update usage example

* Update Detection section per additional review

* Apply final suggestions

* Update src/current/advisories/a131639.md

---------

Co-authored-by: mikeCRL <[email protected]>
Co-authored-by: Mike Lewis <[email protected]>
Co-authored-by: Matt Linville (he/him) <[email protected]>
4 people authored Oct 8, 2024
1 parent 678019d commit 7b18875
Showing 2 changed files with 199 additions and 0 deletions.
107 changes: 107 additions & 0 deletions src/current/advisories/a131639.md
@@ -0,0 +1,107 @@
---
title: Technical Advisory 131639
advisory: A-131639
summary: During a sustained period of disk slowness in the presence of lease transfers, it is possible for some writes in a transaction that straddle multiple ranges to be lost.
toc: true
affected_versions: v22.2, v23.1.0 to v23.1.26, v23.2.0 to v23.2.10, v24.1.0
advisory_date: 2024-10-08
docs_area: releases
---

Publication date: {{ page.advisory_date | date: "%B %e, %Y" }}

## Description

In v22.2 through v24.1.0 (see the affected versions listed below), when a sustained period of disk slowness coincides with lease transfers, it is possible for some writes in a transaction that straddle multiple ranges to be lost. If these lost writes are not also present in one or more secondary indexes that would allow the row to be reconstructed or repaired, they could be irrecoverably lost.

However, in certain scenarios, these writes may be fully recoverable. For example, lost writes on a secondary index could be fully repaired from the primary index, and a primary index could be fully repaired if a set of secondary indexes fully covers the primary index columns. The lost writes may also be only partially recoverable, for example if only some of the lost writes from a primary index are contained in secondary indexes.

In detail, since v22.2, when a lease for a range is transferred, the transfer first creates an expiration-based lease, which is then promoted to an epoch-based lease. In situations where the block device backing a store is experiencing intermittent slowness over an extended period of time, the expiration time for the lease transferred to the node with the slow stores can move back in time during this promotion. The lease expiration is said to have "regressed". This lease expiration regression can result in a brief period during which another node can claim the lease, so that the range has two leaseholders.

While transactions will typically be routed from gateway nodes to the true leaseholder (on the node with the healthy store), and write intents will be written on this true leaseholder, on the node with the slow store these write intents are still queued up in the range's Raft log, and have not been applied. If async intent resolution (the means by which intents are committed to durable records, once a transaction commits) is subsequently triggered on this slow node before it has caught up on its Raft log, intent resolution becomes a no-op, instead of marking the intent as committed. A future transaction that interacts with the row may then come across this intent, and if the committed transaction's record has been garbage collected at that point, it may incorrectly determine that the intent belongs to an aborted transaction. This would then cause the intent to be cleaned up, effectively dropping the committed write.

This bug is rare: it requires intermittent slowness of a block device backing a store, lease transfers onto the store with the intermittently slow block device, and then async intent resolution on the slow node before it has caught up on its Raft log. Writes must also touch more than one range. Single-range writes, including those that benefit from the One-Phase-Commit (1PC) fast path, are not affected.

The following CockroachDB versions are impacted by this bug:

- v22.2
- v23.1.0 - v23.1.26
- v23.2.0 - v23.2.10
- v24.1.0 (subsequent patch versions are unaffected)
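
The ranges above can be checked mechanically. The following is a minimal sketch, not part of the advisory or its detection script: `is_affected` is a hypothetical helper name, and it assumes plain `vMAJOR.MINOR.PATCH` version strings (pre-release tags such as `-beta.1` are not handled).

```shell
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not shipped with CockroachDB).
# Returns 0 if the version is in an affected range listed in this advisory,
# 1 if it is not, and 2 if the string is not of the form vMAJOR.MINOR.PATCH.
is_affected() {
  local v="${1-}"
  v="${v#v}"  # drop the leading "v" if present
  if [[ ! "$v" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)$ ]]; then
    return 2
  fi
  local major="${BASH_REMATCH[1]}"
  local minor="${BASH_REMATCH[2]}"
  local patch="${BASH_REMATCH[3]}"
  case "${major}.${minor}" in
    22.2) return 0 ;;                # every v22.2.x release
    23.1) (( 10#$patch <= 26 )) ;;   # v23.1.0 to v23.1.26
    23.2) (( 10#$patch <= 10 )) ;;   # v23.2.0 to v23.2.10
    24.1) (( 10#$patch == 0 )) ;;    # v24.1.0 only
    *)    return 1 ;;
  esac
}

for version in v23.1.26 v23.1.27 v24.1.0; do
  if is_affected "$version"; then
    echo "$version: affected"
  else
    echo "$version: not affected"
  fi
done
```

The case arms mirror the list entry by entry, so any later change to the affected ranges would need the same change here.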

Versions prior to v23.1 are no longer eligible for [maintenance support]({% link releases/release-support-policy.md %}), and this issue will not be addressed in those versions.

## Statement

[In #123442](https://github.com/cockroachdb/cockroach/commit/6dd54b46cc56b7d2b302e0d5ec1509658a1c86f7), we resolved an issue with CockroachDB in the expiration-to-epoch lease promotion transition process, where a lease's effective expiration could be allowed to regress, resulting in two nodes believing they are the leaseholder for a range.

The patch has been applied to maintenance releases of CockroachDB:

- [v23.1.27]({% link releases/v23.1.md%}#v23-1-27)
- [v23.2.11]({% link releases/v23.2.md%}#v23-2-11)
- [v24.1.1]({% link releases/v24.1.md%}#v24-1-1)

This public issue is tracked by [131639](https://github.com/cockroachdb/cockroach/issues/131639).

## Mitigation

Users are encouraged to upgrade to [v23.1.27]({% link releases/v23.1.md%}#v23-1-27), [v23.2.11]({% link releases/v23.2.md%}#v23-2-11), [v24.1.1]({% link releases/v24.1.md%}#v24-1-1), or a later version that includes the patch.

### Detection via logs

Users can run the script [a131639_check.sh](a131639_check.sh) to determine whether any lease expiration regressions are evident from their logs. The script must be run from within a decompressed [debug.zip]({% link v24.3/cockroach-debug-zip.md %}) directory.

As described previously, a lease expiration regression is only one of the two race conditions needed to encounter a lost write, with the other being a stale read during asynchronous intent resolution on the range that had the lease expiration regression. Consequently, this script does not provide a complete method of detection for a lost write. Running this script may result in a limited number of false positives. For complete confirmation, this script should be combined with the method described in [Detection via an audit of primary and secondary indexes](#detection-via-an-audit-of-primary-and-secondary-indexes).

Users should not run the script on a production system. Instead, users should move the `debug.zip` file elsewhere for analysis and run the script there.

The script requires the following dependencies:

- [ripgrep](https://github.com/BurntSushi/ripgrep#installation)
- [DuckDB](https://duckdb.org/docs/installation)

Example usage and output:

~~~
➜ pwd
/path/of/unzipped/debug
➜ /path/to/script/a131639_check.sh
Searching for liveness epoch increments ...
Searching for slow lease promotions ...
Querying the logs for lease expiration regressions ...
❌ Lease expiration regressions found. Symptoms detected.
symptoms
node 123 observed possible symptoms at 2024-09-01 12:34:56 on ranges 123456
~~~
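
When the check is run from automation, its exit status can be used instead of parsing the output. According to the script source in this commit, it exits with `0` when no symptoms are found and with `1` when symptoms are found or a dependency is missing. The wrapper below is a minimal sketch under that reading; `summarize_check` is a hypothetical name, and the path in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: runs a check command and prints a one-line summary
# based on its exit status, then returns that same status to the caller.
summarize_check() {
  local status=0
  "$@" || status=$?
  case "$status" in
    0) echo "no symptoms detected" ;;
    1) echo "symptoms detected or dependency missing: review the output above" ;;
    *) echo "check failed unexpectedly (exit status $status)" ;;
  esac
  return "$status"
}

# Usage, from inside the unzipped debug directory:
#   summarize_check /path/to/script/a131639_check.sh
```

Any other non-zero status most likely means that one of the script's own commands failed, because the script runs with `set -euo pipefail`.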

### Detection via an audit of primary and secondary indexes

If the detection script finds symptoms of a potential lease expiration regression and you suspect that a particular row may have been subject to a lost write, you can query the primary and secondary indexes to confirm that the row exists in both places.

For example, if you suspect that a row in `my_table` with primary key `pk=1` and secondary index keys `a=2` and `b=3` has a lost write, you can run the following queries to confirm:

{% include_cached copy-clipboard.html %}
~~~ sql
SELECT count(*) FROM my_table@my_table_pkey WHERE pk = 1 AS OF SYSTEM TIME '-10s';
SELECT count(*) FROM my_table@my_table_a_idx WHERE pk = 1 AND a = 2 AS OF SYSTEM TIME '-10s';
SELECT count(*) FROM my_table@my_table_b_idx WHERE pk = 1 AND b = 3 AS OF SYSTEM TIME '-10s';
~~~

If these queries do not all return the same value for `count`, that likely indicates a lost write.

The `AS OF SYSTEM TIME` clauses are added to minimize the impact on foreground traffic.
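
When auditing more than a handful of rows, the comparison of the per-index counts can be scripted. The helper below is a minimal sketch: `counts_agree` is a hypothetical name, and the commented usage assumes that `cockroach sql --format=csv -e` prints a header line followed by the count, with `$URL` standing in for your connection string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (status 0) only if every count passed in is
# identical; any mismatch between index counts returns status 1.
counts_agree() {
  local first="${1-}" c
  for c in "$@"; do
    if [[ "$c" != "$first" ]]; then
      return 1
    fi
  done
  return 0
}

# Usage sketch (assumed invocation; adjust the URL and queries to your schema):
#   pk_count=$(cockroach sql --url "$URL" --format=csv \
#     -e "SELECT count(*) FROM my_table@my_table_pkey WHERE pk = 1 AS OF SYSTEM TIME '-10s'" | tail -n 1)
#   a_count=...   # same pattern against my_table@my_table_a_idx
#   b_count=...   # same pattern against my_table@my_table_b_idx
#   counts_agree "$pk_count" "$a_count" "$b_count" || echo "possible lost write for pk = 1"
```

A mismatch only flags the row for closer inspection. It does not by itself say which index is missing the write.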

If you are unsure which rows may have been affected but suspect that your cluster experienced lost writes, please contact the [support team](https://support.cockroachlabs.com).

### Repair rows with lost writes

If you experienced lost writes, it is possible to repair rows by re-inserting data. This will generally be easier if you know all the column values that need to be re-inserted. Please contact the [support team](https://support.cockroachlabs.com) for assistance with repairing rows with lost writes.

## Impact

In circumstances where there is a sustained period of disk slowness in the presence of lease transfers, it is possible for some writes in a transaction that straddle multiple ranges to be lost. In cases where these lost writes are not also present on one or more secondary indexes that could allow for reconstructing or repairing the row, these writes could be irrecoverably lost.

Please reach out to the [support team](https://support.cockroachlabs.com) if more information or assistance is needed.
92 changes: 92 additions & 0 deletions src/current/advisories/a131639_check.sh
@@ -0,0 +1,92 @@
#!/usr/bin/env bash
# Run from within an unzipped debug.zip directory.
set -euo pipefail

# Check if ripgrep is installed.
if ! command -v rg &> /dev/null; then
  echo "Error: ripgrep is not installed. Please install following the instructions in https://github.com/BurntSushi/ripgrep#installation"
  exit 1
fi

# Check if duckdb is installed.
if ! command -v duckdb &> /dev/null; then
  echo "Error: duckdb is not installed. Please install following the instructions in https://duckdb.org/docs/installation"
  exit 1
fi

# Grep the debug.zip for the two queries.
echo "Searching for liveness epoch increments ..."
echo "timestamp,remote_node_id,epoch" > logs_epoch_increment.csv
rg -I -r '$1,$2,$3' 'I([0-9]+ [0-9:.]+).*incremented n(.*) liveness epoch to (.*)' nodes >> logs_epoch_increment.csv || true
if [[ $(wc -l <logs_epoch_increment.csv) -eq 1 ]]; then
  printf "\n✅ No liveness epoch increments found. Symptoms not detected.\n"
  exit 0
fi

echo "Searching for slow lease promotions ..."
echo "timestamp,node_id,range_id,range_span,lease_epoch,lease_proposed,prev_expiration" > logs_slow_lease_promo.csv
rg -I -r '$1,$2,$3,$4,$5,$6,$7' 'W([0-9]+ [0-9:.]+).*,n(\d+),.*,r(\d+)/\d+:(.*),raft\].*client traffic may have been delayed.*epo=(\d+).*pro=(\d+\.\d+).*prev.*exp=(\d+\.\d+).*Request.*' nodes >> logs_slow_lease_promo.csv || true
if [[ $(wc -l <logs_slow_lease_promo.csv) -eq 1 ]]; then
  printf "\n✅ No slow lease promotions found. Symptoms not detected.\n"
  exit 0
fi

# Query the logs using duckdb.
echo "Querying the logs for lease expiration regressions ..."
cat <<EOF > query.sql
SELECT
  'node '
  || node_id
  || ' observed possible symptoms at '
  || lease_proposed_min
  || ' on ranges '
  || string_agg(range_id, ', ') as symptoms
FROM
  (
    WITH
      lease_acqs AS (
        SELECT
          strptime(timestamp, '%y%m%d %H:%M:%S.%n')::timestamp_ns as timestamp,
          node_id,
          range_id,
          range_span,
          lease_epoch,
          timezone('UTC', to_timestamp(lease_proposed)) AS lease_proposed,
          timezone('UTC', to_timestamp(prev_expiration)) AS prev_expiration
        FROM 'logs_slow_lease_promo.csv'
      ),
      epoch_incs AS (
        SELECT
          strptime(timestamp, '%y%m%d %H:%M:%S.%n')::timestamp_ns as timestamp,
          remote_node_id,
          epoch
        FROM 'logs_epoch_increment.csv'
      )
    SELECT *, date_trunc('minute', lease_proposed) AS lease_proposed_min
    FROM lease_acqs JOIN epoch_incs
      -- where the leaseholder's liveness record had its epoch incremented
      ON lease_acqs.node_id = epoch_incs.remote_node_id
      -- and the lease acquisition used an epoch that is less than the one
      -- that was incremented to
      AND lease_acqs.lease_epoch < epoch_incs.epoch
      -- and the lease acquisition began before the epoch increment
      AND lease_acqs.lease_proposed < epoch_incs.timestamp
      -- and the lease acquisition completed after the epoch increment
      AND lease_acqs.timestamp > epoch_incs.timestamp
      -- and the previous lease's expiration was after the epoch increment
      AND lease_acqs.prev_expiration > epoch_incs.timestamp
  )
GROUP BY node_id, lease_proposed_min
ORDER BY lease_proposed_min
EOF
duckdb -list < query.sql > logs_lease_regressions.txt

# Output result.
if [ -s logs_lease_regressions.txt ]; then
  printf "\n❌ Lease expiration regressions found. Symptoms detected.\n\n"
  cat logs_lease_regressions.txt
  exit 1
else
  printf "\n✅ No lease expiration regressions found. Symptoms not detected.\n"
  exit 0
fi
