
kv: prototype buffered writes #122862

Draft
wants to merge 3 commits into base: master
Conversation

@nvanbenschoten (Member) commented Apr 22, 2024

Informs #72614.

This PR includes a prototype of gateway-buffered transaction writes. This is achieved through the introduction of a new txnInterceptor, the txnWriteBuffer, which buffers writes for a transaction before sending them during commit.

Buffering writes client-side until commit has four main benefits:

  1. It allows for more batching of writes, which can be more efficient. Instead of sending writes one at a time, we can batch them up and send them all at once. This is a win even if writes would otherwise be pipelined through consensus.

  2. It allows for the elimination of redundant writes. If a client writes to the same key multiple times in a transaction, only the last write needs to be written to the key-value layer.

  3. It allows the client to serve read-your-writes locally, which can be much faster and cheaper than sending them to the leaseholder. This is especially true when the leaseholder is not colocated with the client. By serving read-your-writes locally from the gateway, write buffering also avoids the problem of pipeline stalls that can occur when a client reads a pipelined intent write before the write has completed consensus. For details on pipeline stalls, see txnPipeliner.

  4. It allows clients to passively hit the 1-phase commit fast-path, instead of requiring clients to carefully construct "auto-commit" BatchRequests to make use of the optimization. By buffering writes on the client before commit, we avoid immediately disabling the fast-path when the client issues their first write. Then, at commit time, we flush the buffer and will happen to hit the 1-phase commit fast-path if all writes go to the same range.
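The first three benefits can be seen in a toy sketch of the buffering idea. This is an illustrative model only, not the actual txnWriteBuffer interceptor; all names here are hypothetical:

```go
package main

import "fmt"

// writeBuffer is a toy sketch of the idea behind gateway write buffering:
// buffer blind writes client-side, serve read-your-writes from the buffer,
// and flush everything in one batch at commit.
type writeBuffer struct {
	buf map[string]string // latest buffered write per key
}

func newWriteBuffer() *writeBuffer {
	return &writeBuffer{buf: make(map[string]string)}
}

// Put buffers a write. A later Put to the same key overwrites the earlier
// one, which is how redundant writes are eliminated (benefit 2).
func (b *writeBuffer) Put(key, val string) { b.buf[key] = val }

// Get serves read-your-writes locally when possible (benefit 3). The bool
// reports whether the buffer could answer without a remote KV lookup.
func (b *writeBuffer) Get(key string) (string, bool) {
	v, ok := b.buf[key]
	return v, ok
}

// Flush returns the single batch of writes sent at commit time (benefit 1).
func (b *writeBuffer) Flush() map[string]string { return b.buf }

func main() {
	b := newWriteBuffer()
	b.Put("k1", "a")
	b.Put("k1", "b") // redundant write: only "b" reaches the KV layer
	b.Put("k2", "c")
	if v, ok := b.Get("k1"); ok {
		fmt.Println("read-your-writes served locally:", v)
	}
	fmt.Println("distinct writes flushed at commit:", len(b.Flush()))
}
```

The real interceptor must additionally handle sequence numbers, deletes, savepoint rollbacks, and memory limits, which this sketch omits.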

The first three of these benefits apply to sysbench. The fourth does not.

Benchmark results

The immediate motivation for this prototype is to determine the benefit buffered writes could have on sysbench. Specifically, we would like to evaluate the throughput improvement (improvement to system efficiency) that comes from this technique when applied to sysbench's oltp_write_only and oltp_read_write workloads.

Both of these workloads contain the following sequence of statements in a transaction. oltp_read_write also includes 14 read-only SELECT statements beforehand.

UPDATE sbtest1 SET k=k+1 WHERE id=1;
UPDATE sbtest1 SET c='...' WHERE id=2;
DELETE FROM sbtest1 WHERE id=3;
INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
COMMIT;

Before this change, these statements mapped to the following KV batches:

UPDATE sbtest1 SET k=k+1 WHERE id=1;
-- {1 Get}
-- {1 Put}, {1 CPut, 1 Del} in parallel

UPDATE sbtest1 SET c='...' WHERE id=2;
-- {1 Get}
-- {1 Put}

DELETE FROM sbtest1 WHERE id=3;
-- {1 Get}
-- {1 Del}, {1 Del} in parallel

INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
-- {1 QueryIntent, 1 CPut}, {1 QueryIntent, 1 InitPut} in parallel

COMMIT;
-- {1 EndTxn}, {3 QueryIntent}, {3 QueryIntent} in parallel

The first thing this change does is switch some of these read-write operations on secondary indexes over to blind writes. The read portion of these requests was unnecessary and making them write-only helps with buffering.

UPDATE sbtest1 SET k=k+1 WHERE id=1;
-- {1 Get}
-- {1 Put}, {1 Put, 1 Del} in parallel

UPDATE sbtest1 SET c='...' WHERE id=2;
-- {1 Get}
-- {1 Put}

DELETE FROM sbtest1 WHERE id=3;
-- {1 Get}
-- {1 Del}, {1 Del} in parallel

INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
-- {1 QueryIntent, 1 CPut}, {1 QueryIntent, 1 Put} in parallel

COMMIT;
-- {1 EndTxn}, {3 QueryIntent}, {3 QueryIntent} in parallel
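The read dependency that this conversion removes can be seen in a toy model contrasting a conditional put with a blind put. This is a simplified illustration, not CockroachDB's actual CPut/Put semantics:

```go
package main

import (
	"errors"
	"fmt"
)

// kv is a toy single-node "store" used only to contrast the two request
// shapes discussed above.
type kv map[string]string

var errConditionFailed = errors.New("condition failed")

// cput writes val only if the current value matches exp. The read of
// store[key] is the dependency that makes this request hard to buffer.
func cput(store kv, key, val, exp string) error {
	if store[key] != exp {
		return errConditionFailed
	}
	store[key] = val
	return nil
}

// put writes unconditionally: no read, so the write can be buffered
// client-side and deferred to commit time.
func put(store kv, key, val string) { store[key] = val }

func main() {
	store := kv{"idx/3": "old"}
	if err := cput(store, "idx/3", "new", "stale"); err != nil {
		fmt.Println("cput needed the current value and failed:", err)
	}
	put(store, "idx/3", "new") // blind write succeeds regardless of state
	fmt.Println("after blind put:", store["idx/3"])
}
```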

Next, the change adds a buffer in the TxnCoordSender to defer blind writes. Care is taken to allow read-only and read-write requests to use this buffer to avoid remote access, where possible. This has the following effect:

UPDATE sbtest1 SET k=k+1 WHERE id=1;
-- {1 Get}

UPDATE sbtest1 SET c='...' WHERE id=2;
-- {1 Get}

DELETE FROM sbtest1 WHERE id=3;
-- {1 Get}

INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
-- no kv access

COMMIT;
-- {3 Put, 1 EndTxn}, {2 Put, 2 Del} in parallel

Notice both the reduction in sequential KV batches and also the reduction in total KV requests.
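Counting the listings above makes the reduction concrete. The round and request counts below are read directly off the batch listings (parallel batches issued together count as one sequential round); this is arithmetic over the listings, not new measured data:

```go
package main

import "fmt"

func main() {
	// Requests per sequential KV batch round, read off the listings above.
	before := []int{
		1, 3, // UPDATE id=1: {Get}, then {Put} + {CPut, Del}
		1, 1, // UPDATE id=2: {Get}, then {Put}
		1, 2, // DELETE:      {Get}, then {Del} + {Del}
		4, // INSERT: {QueryIntent, CPut} + {QueryIntent, InitPut}
		7, // COMMIT: {EndTxn} + {3 QueryIntent} + {3 QueryIntent}
	}
	after := []int{
		1, 1, 1, // three {Get} rounds; the INSERT touches no KV
		8, // COMMIT: {3 Put, EndTxn} + {2 Put, 2 Del}
	}
	sum := func(xs []int) int {
		n := 0
		for _, x := range xs {
			n += x
		}
		return n
	}
	fmt.Printf("before: %d rounds, %d requests\n", len(before), sum(before))
	fmt.Printf("after:  %d rounds, %d requests\n", len(after), sum(after))
	// before: 8 rounds, 20 requests
	// after:  4 rounds, 11 requests
}
```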

We then test the change on the sysbench/oltp_*/nodes=7/cpu=16 benchmark suite. This benchmark suite runs sysbench against a 7 node, 16 vCPU cluster. We first test with 128 workload threads, and then 200 workload threads.

sysbench/oltp_write_only

We expect the benefit from this change to be more pronounced on this workload, because all statements in the transaction are writes.

[screenshot: oltp_write_only throughput over time, Apr 22 2024]

SET CLUSTER SETTING kv.transaction.buffered_writes.enabled = true @ 23:01:40

When we enable buffered writes with 128 workload threads, throughput jumps from averaging 39,929 qps to 58,053 qps, a 45.4% improvement. CPU remains saturated, so the improvement comes from more efficient transaction processing and a corresponding reduction in average transaction latency.

When we test again with 200 workload threads, throughput jumps from 41,279 qps to 60,750 qps, a 47.2% improvement.

sysbench/oltp_read_write

We expect the benefit from this change to be less pronounced on this workload, because 14 of the 18 statements in the transaction are reads and only 4 are writes. Still, we do see an improvement.

[screenshot: oltp_read_write throughput over time, Apr 22 2024]

SET CLUSTER SETTING kv.transaction.buffered_writes.enabled = true @ 23:01:30

When we enable buffered writes with 128 workload threads, throughput jumps from averaging 60,206 qps to 67,004 qps, an 11.3% improvement. CPU remains saturated, so the improvement comes from more efficient transaction processing and a corresponding reduction in average transaction latency.

When we test again with 200 workload threads, throughput jumps from 61,525 qps to 70,479 qps, a 14.6% improvement.

NOTE: all qps numbers here are reported from sysbench, which counts BEGIN and COMMIT as "queries".
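As a sanity check, the quoted percentages can be recomputed from the raw qps averages reported above:

```go
package main

import "fmt"

func main() {
	// Recompute the quoted improvement percentages from the raw qps
	// averages in the benchmark results above.
	runs := []struct {
		name          string
		before, after float64
	}{
		{"write_only, 128 threads", 39929, 58053},
		{"write_only, 200 threads", 41279, 60750},
		{"read_write, 128 threads", 60206, 67004},
		{"read_write, 200 threads", 61525, 70479},
	}
	for _, r := range runs {
		pct := (r.after/r.before - 1) * 100
		fmt.Printf("%s: %.1f%% improvement\n", r.name, pct)
	}
	// Matches the quoted 45.4%, 47.2%, 11.3%, and 14.6%.
}
```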


blathers-crl bot commented Apr 22, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented: This change is Reviewable

@andy-kimball (Contributor)

Very nice improvements. How much work would it take to production-ize this, do you think?

@nvanbenschoten (Member, Author)

There are some local problems to solve here if we wanted to productionize this, like read-your-writes with DistSQL, interactions with savepoints, and adherence to memory limits. All of that is work, but it is straightforward enough.

The bigger challenge (and the reason we have not done this yet) is that it has implications for pessimistic locking. Intents currently serve as a form of exclusive lock, along with a store of the provisional value. The assumption we're making with this change is that all pessimistic locking for mutations will be lifted to the initial row scans for these statements (the {1 Get} batches). These requests need to reliably acquire exclusive locks. We currently have some support for this, but it's best-effort in a few ways.

First of all, as discussed in #72614 (comment), we support unreplicated key-level locks and we support replicated+pipelined key-level locks. We probably want something in between (cheap but reliable enough): an unreplicated key-level lock that gets promoted to a replicated lock if it is at risk of being lost (see #75456). We would need to acquire these during the Get requests for each mutation and verify their continued existence at some later point.

Secondly, this is also best-effort because only some forms of mutation statements even support "implicit select for update". Currently, some UPDATE statements do, but DELETE statements do not (#50181). These mutations would need to be rethought from a query planning perspective if the resulting KV writes do not reliably serve as a pessimistic lock.

Preserving the current pessimistic locking behavior feels to me like the largest hurdle for productionization of this project.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnWriteBuffer branch from f347ce8 to df32ba2 Compare August 9, 2024 18:32
@tbg tbg force-pushed the nvanbenschoten/txnWriteBuffer branch from df32ba2 to a97820f Compare October 14, 2024 23:11
@tbg (Member) commented Oct 14, 2024

Heads up: I just rebased the branch in anticipation of using it for some benchmarking.

  • df32ba2...a97820f nvanbenschoten/txnWriteBuffer -> nvanbenschoten/txnWriteBuffer (forced update)

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnWriteBuffer branch 3 times, most recently from 8115ba3 to b87b043 Compare December 14, 2024 00:53
@tbg tbg force-pushed the nvanbenschoten/txnWriteBuffer branch from b87b043 to 169a4b5 Compare December 19, 2024 15:22