
kv: prototype buffered writes #122862

Draft
wants to merge 3 commits into base: master
Conversation

@nvanbenschoten (Member) commented Apr 22, 2024

Informs #72614.

This PR includes a prototype of gateway-buffered transaction writes. This is achieved through the introduction of a new txnInterceptor, the txnWriteBuffer, which buffers writes for a transaction before sending them during commit.

Buffering writes client-side until commit has four main benefits:

  1. It allows for more batching of writes, which can be more efficient. Instead of sending writes one at a time, we can batch them up and send them all at once. This is a win even if writes would otherwise be pipelined through consensus.

  2. It allows for the elimination of redundant writes. If a client writes to the same key multiple times in a transaction, only the last write needs to be written to the key-value layer.

  3. It allows the client to serve read-your-writes locally, which can be much faster and cheaper than sending them to the leaseholder. This is especially true when the leaseholder is not colocated with the client. By serving read-your-writes locally from the gateway, write buffering also avoids the problem of pipeline stalls that can occur when a client reads a pipelined intent write before the write has completed consensus. For details on pipeline stalls, see txnPipeliner.

  4. It allows clients to passively hit the 1-phase commit fast-path, instead of requiring clients to carefully construct "auto-commit" BatchRequests to make use of the optimization. By buffering writes on the client before commit, we avoid immediately disabling the fast-path when the client issues their first write. Then, at commit time, we flush the buffer and will happen to hit the 1-phase commit fast-path if all writes go to the same range.
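The first three benefits can be seen in a toy sketch of the buffering idea. This is an illustrative model only, not the actual txnWriteBuffer interceptor; all names here are hypothetical:

```go
package main

import "fmt"

// writeBuffer is a toy sketch of the idea behind gateway write buffering:
// buffer blind writes client-side, serve read-your-writes from the buffer,
// and flush everything in one batch at commit.
type writeBuffer struct {
	buf map[string]string // latest buffered write per key
}

func newWriteBuffer() *writeBuffer {
	return &writeBuffer{buf: make(map[string]string)}
}

// Put buffers a write. A later Put to the same key overwrites the earlier
// one, which is how redundant writes are eliminated (benefit 2).
func (b *writeBuffer) Put(key, val string) { b.buf[key] = val }

// Get serves read-your-writes locally when possible (benefit 3). The bool
// reports whether the buffer could answer without a remote KV lookup.
func (b *writeBuffer) Get(key string) (string, bool) {
	v, ok := b.buf[key]
	return v, ok
}

// Flush returns the single batch of writes sent at commit time (benefit 1).
func (b *writeBuffer) Flush() map[string]string { return b.buf }

func main() {
	b := newWriteBuffer()
	b.Put("k1", "a")
	b.Put("k1", "b") // redundant write: only "b" reaches the KV layer
	b.Put("k2", "c")
	if v, ok := b.Get("k1"); ok {
		fmt.Println("read-your-writes served locally:", v)
	}
	fmt.Println("distinct writes flushed at commit:", len(b.Flush()))
}
```

The real interceptor must additionally handle sequence numbers, deletes, savepoint rollbacks, and memory limits, which this sketch omits.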

The first three of these benefits apply to sysbench. The fourth does not.

Benchmark results

The immediate motivation for this prototype is to determine the benefit buffered writes could have on sysbench. Specifically, we would like to evaluate the throughput improvement (improvement to system efficiency) that comes from this technique when applied to sysbench's oltp_write_only and oltp_read_write workloads.

Both of these workloads contain the following sequence of statements in a transaction. oltp_read_write also includes 14 read-only SELECT statements beforehand.

UPDATE sbtest1 SET k=k+1 WHERE id=1;
UPDATE sbtest1 SET c='...' WHERE id=2;
DELETE FROM sbtest1 WHERE id=3;
INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
COMMIT;

Before this change, these statements mapped to the following KV batches:

UPDATE sbtest1 SET k=k+1 WHERE id=1;
-- {1 Get}
-- {1 Put}, {1 CPut, 1 Del} in parallel

UPDATE sbtest1 SET c='...' WHERE id=2;
-- {1 Get}
-- {1 Put}

DELETE FROM sbtest1 WHERE id=3;
-- {1 Get}
-- {1 Del}, {1 Del} in parallel

INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
-- {1 QueryIntent, 1 CPut}, {1 QueryIntent, 1 InitPut} in parallel

COMMIT;
-- {1 EndTxn}, {3 QueryIntent}, {3 QueryIntent} in parallel

The first thing this change does is switch some of these read-write operations on secondary indexes over to blind writes. The read portion of these requests was unnecessary and making them write-only helps with buffering.

UPDATE sbtest1 SET k=k+1 WHERE id=1;
-- {1 Get}
-- {1 Put}, {1 Put, 1 Del} in parallel

UPDATE sbtest1 SET c='...' WHERE id=2;
-- {1 Get}
-- {1 Put}

DELETE FROM sbtest1 WHERE id=3;
-- {1 Get}
-- {1 Del}, {1 Del} in parallel

INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
-- {1 QueryIntent, 1 CPut}, {1 QueryIntent, 1 Put} in parallel

COMMIT;
-- {1 EndTxn}, {3 QueryIntent}, {3 QueryIntent} in parallel
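The read dependency that this conversion removes can be seen in a toy model contrasting a conditional put with a blind put. This is a simplified illustration, not CockroachDB's actual CPut/Put semantics:

```go
package main

import (
	"errors"
	"fmt"
)

// kv is a toy single-node "store" used only to contrast the two request
// shapes discussed above.
type kv map[string]string

var errConditionFailed = errors.New("condition failed")

// cput writes val only if the current value matches exp. The read of
// store[key] is the dependency that makes this request hard to buffer.
func cput(store kv, key, val, exp string) error {
	if store[key] != exp {
		return errConditionFailed
	}
	store[key] = val
	return nil
}

// put writes unconditionally: no read, so the write can be buffered
// client-side and deferred to commit time.
func put(store kv, key, val string) { store[key] = val }

func main() {
	store := kv{"idx/3": "old"}
	if err := cput(store, "idx/3", "new", "stale"); err != nil {
		fmt.Println("cput needed the current value and failed:", err)
	}
	put(store, "idx/3", "new") // blind write succeeds regardless of state
	fmt.Println("after blind put:", store["idx/3"])
}
```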

Next, the change adds a buffer in the TxnCoordSender to defer blind writes. Care is taken to allow read-only and read-write requests to use this buffer to avoid remote access, where possible. This has the following effect:

UPDATE sbtest1 SET k=k+1 WHERE id=1;
-- {1 Get}

UPDATE sbtest1 SET c='...' WHERE id=2;
-- {1 Get}

DELETE FROM sbtest1 WHERE id=3;
-- {1 Get}

INSERT INTO sbtest1 (id, k, c, pad) VALUES (3, 1, '...', '...');
-- no kv access

COMMIT;
-- {3 Put, 1 EndTxn}, {2 Put, 2 Del} in parallel

Notice both the reduction in sequential KV batches and also the reduction in total KV requests.
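Counting the listings above makes the reduction concrete. The round and request counts below are read directly off the batch listings (parallel batches issued together count as one sequential round); this is arithmetic over the listings, not new measured data:

```go
package main

import "fmt"

func main() {
	// Requests per sequential KV batch round, read off the listings above.
	before := []int{
		1, 3, // UPDATE id=1: {Get}, then {Put} + {CPut, Del}
		1, 1, // UPDATE id=2: {Get}, then {Put}
		1, 2, // DELETE:      {Get}, then {Del} + {Del}
		4, // INSERT: {QueryIntent, CPut} + {QueryIntent, InitPut}
		7, // COMMIT: {EndTxn} + {3 QueryIntent} + {3 QueryIntent}
	}
	after := []int{
		1, 1, 1, // three {Get} rounds; the INSERT touches no KV
		8, // COMMIT: {3 Put, EndTxn} + {2 Put, 2 Del}
	}
	sum := func(xs []int) int {
		n := 0
		for _, x := range xs {
			n += x
		}
		return n
	}
	fmt.Printf("before: %d rounds, %d requests\n", len(before), sum(before))
	fmt.Printf("after:  %d rounds, %d requests\n", len(after), sum(after))
	// before: 8 rounds, 20 requests
	// after:  4 rounds, 11 requests
}
```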

We then test the change on the sysbench/oltp_*/nodes=7/cpu=16 benchmark suite. This benchmark suite runs sysbench against a 7 node, 16 vCPU cluster. We first test with 128 workload threads, and then 200 workload threads.

sysbench/oltp_write_only

We expect the benefit from this change to be more pronounced on this workload, because all statements in the transaction are writes.

[screenshot: oltp_write_only throughput over time, Apr 22 2024]

SET CLUSTER SETTING kv.transaction.buffered_writes.enabled = true @ 23:01:40

When we enable buffered writes with 128 workload threads, throughput jumps from averaging 39,929 qps to 58,053 qps, a 45.4% improvement. CPU remains saturated, so the improvement comes from more efficient transaction processing and a corresponding reduction in average transaction latency.

When we test again with 200 workload threads, throughput jumps from 41,279 qps to 60,750 qps, a 47.2% improvement.

sysbench/oltp_read_write

We expect the benefit from this change to be less pronounced on this workload, because 14 of the 18 statements in the transaction are reads and only 4 are writes. Still, we do see an improvement.

[screenshot: oltp_read_write throughput over time, Apr 22 2024]

SET CLUSTER SETTING kv.transaction.buffered_writes.enabled = true @ 23:01:30

When we enable buffered writes with 128 workload threads, throughput jumps from averaging 60,206 qps to 67,004 qps, an 11.3% improvement. CPU remains saturated, so the improvement comes from more efficient transaction processing and a corresponding reduction in average transaction latency.

When we test again with 200 workload threads, throughput jumps from 61,525 qps to 70,479 qps, a 14.6% improvement.

NOTE: all qps numbers here are reported from sysbench, which counts BEGIN and COMMIT as "queries".
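As a sanity check, the quoted percentages can be recomputed from the raw qps averages reported above:

```go
package main

import "fmt"

func main() {
	// Recompute the quoted improvement percentages from the raw qps
	// averages in the benchmark results above.
	runs := []struct {
		name          string
		before, after float64
	}{
		{"write_only, 128 threads", 39929, 58053},
		{"write_only, 200 threads", 41279, 60750},
		{"read_write, 128 threads", 60206, 67004},
		{"read_write, 200 threads", 61525, 70479},
	}
	for _, r := range runs {
		pct := (r.after/r.before - 1) * 100
		fmt.Printf("%s: %.1f%% improvement\n", r.name, pct)
	}
	// Matches the quoted 45.4%, 47.2%, 11.3%, and 14.6%.
}
```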


blathers-crl bot commented Apr 22, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented: This change is Reviewable

@andy-kimball (Contributor)

Very nice improvements. How much work would it take to production-ize this, do you think?

@nvanbenschoten (Member, Author)

There are some local problems to solve here if we wanted to productionize this, like read-your-writes with DistSQL, interactions with savepoints, and adherence to memory limits. All of that is work, but it is straightforward enough.

The bigger challenge (and the reason we have not done this yet) is that it has implications for pessimistic locking. Intents currently serve as a form of exclusive lock, along with a store of the provisional value. The assumption we're making with this change is that all pessimistic locking for mutations will be lifted to the initial row scans for these statements (the {1 Get} batches). These requests need to reliably acquire exclusive locks. We currently have some support for this, but it's best-effort in a few ways.

First of all, as discussed in #72614 (comment), we support unreplicated key-level locks and we support replicated+pipelined key-level locks. We probably want something in between (cheap but reliable enough): an unreplicated key-level lock that gets promoted to a replicated lock if it is at risk of being lost (see #75456). We would need to acquire these during the Get requests for each mutation and verify their continued existence at some later point.

Secondly, this is also best-effort because only some forms of mutation statements even support "implicit select for update". Currently, some UPDATE statements do, but DELETE statements do not (#50181). These mutations would need to be rethought from a query planning perspective if the resulting KV writes do not reliably serve as a pessimistic lock.

Preserving the current pessimistic locking behavior feels to me like the largest hurdle for productionization of this project.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnWriteBuffer branch from f347ce8 to df32ba2 Compare August 9, 2024 18:32
@tbg tbg force-pushed the nvanbenschoten/txnWriteBuffer branch from df32ba2 to a97820f Compare October 14, 2024 23:11
@tbg (Member) commented Oct 14, 2024

Heads up: I just rebased the branch in anticipation of using it for some benchmarking.

  • df32ba2...a97820f nvanbenschoten/txnWriteBuffer -> nvanbenschoten/txnWriteBuffer (forced update)

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnWriteBuffer branch 3 times, most recently from 8115ba3 to b87b043 Compare December 14, 2024 00:53
@tbg tbg force-pushed the nvanbenschoten/txnWriteBuffer branch from b87b043 to 169a4b5 Compare December 19, 2024 15:22