kv: prototype buffered writes #122862
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Very nice improvements. How much work would it take to productionize this, do you think?
There are some local problems to solve here if we wanted to productionize this, like read-your-writes with DistSQL, interactions with savepoints, and adherence to memory limits. All of that is work, but it is straightforward enough.

The bigger challenge (and the reason we have not done this yet) is that it has implications for pessimistic locking. Intents currently serve as a form of exclusive lock, along with a store of the provisional value. The assumption we're making with this change is that all pessimistic locking for mutations will be lifted to the initial row scans for these statements. This is best-effort in two ways.

First of all, as discussed in #72614 (comment), we support unreplicated key-level locks and we support replicated+pipelined key-level locks. We probably want something in between (cheap but reliable enough): an unreplicated key-level lock that gets promoted to a replicated lock if it is at risk of being lost (see #75456). We would need to acquire these during the Get requests for each mutation and verify their continued existence at some later point.

Secondly, this is also best-effort because only some forms of mutation statements even support "implicit select for update". Currently, some UPDATE statements do, but DELETE statements do not (#50181). These mutations would need to be rethought from a query planning perspective if the resulting KV writes do not reliably serve as a pessimistic lock.

Preserving the current pessimistic locking behavior feels to me like the largest hurdle for productionization of this project.
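The lock-promotion idea above can be sketched in a few lines. This is a hypothetical illustration of the concept only, not CockroachDB's lock-table implementation; all names here are invented for the example.

```go
package main

import "fmt"

// Durability describes how a key-level lock is stored. Hypothetical
// sketch of the idea discussed above.
type Durability int

const (
	Unreplicated Durability = iota // cheap: held only in the leaseholder's memory
	Replicated                     // durable: written through consensus
)

type lock struct {
	key        string
	durability Durability
}

// promote upgrades an unreplicated lock to a replicated one when it is
// at risk of being lost (e.g. before a lease transfer or a memory-limit
// eviction), preserving the pessimistic-locking guarantee.
func (l *lock) promote() {
	if l.durability == Unreplicated {
		l.durability = Replicated
	}
}

func main() {
	l := &lock{key: "a", durability: Unreplicated}
	l.promote()
	fmt.Println(l.durability == Replicated) // true
}
```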
Heads up: I just rebased the branch in anticipation of using it for some benchmarking.
Informs #72614.
This PR includes a prototype of gateway-buffered transaction writes. This is achieved through the introduction of a new `txnInterceptor`, the `txnWriteBuffer`, which buffers writes for a transaction before sending them during commit.

Buffering writes client-side until commit has four main benefits:
1. It allows for more batching of writes, which can be more efficient. Instead of sending writes one at a time, we can batch them up and send them all at once. This is a win even if writes would otherwise be pipelined through consensus.
2. It allows for the elimination of redundant writes. If a client writes to the same key multiple times in a transaction, only the last write needs to be written to the key-value layer.
3. It allows the client to serve read-your-writes locally, which can be much faster and cheaper than sending them to the leaseholder. This is especially true when the leaseholder is not collocated with the client. By serving read-your-writes locally from the gateway, write buffering also avoids the problem of pipeline stalls that can occur when a client reads a pipelined intent write before the write has completed consensus. For details on pipeline stalls, see `txnPipeliner`.
4. It allows clients to passively hit the 1-phase commit fast path, instead of requiring clients to carefully construct "auto-commit" BatchRequests to make use of the optimization. By buffering writes on the client before commit, we avoid immediately disabling the fast path when the client issues their first write. Then, at commit time, we flush the buffer and will happen to hit the 1-phase commit fast path if all writes go to the same range.
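The core mechanism behind these benefits can be sketched with a toy buffer. This is a minimal, hypothetical illustration (all names invented here); the real `txnWriteBuffer` interceptor is considerably more involved.

```go
package main

import "fmt"

// writeBuffer is a hypothetical sketch of gateway-side write buffering.
// It coalesces redundant writes, serves read-your-writes locally, and
// flushes everything in a single batch at commit.
type writeBuffer struct {
	order []string          // keys in first-write order, for a deterministic flush
	vals  map[string]string // last buffered value per key
}

func newWriteBuffer() *writeBuffer {
	return &writeBuffer{vals: make(map[string]string)}
}

// Put buffers a write. A later write to the same key overwrites the
// earlier one, so only the final value reaches the KV layer (benefit 2).
func (b *writeBuffer) Put(key, val string) {
	if _, ok := b.vals[key]; !ok {
		b.order = append(b.order, key)
	}
	b.vals[key] = val
}

// Get serves read-your-writes from the buffer without a round trip to
// the leaseholder (benefit 3). ok is false if the txn never wrote key.
func (b *writeBuffer) Get(key string) (string, bool) {
	v, ok := b.vals[key]
	return v, ok
}

// Flush returns a single batch of writes to send at commit time
// (benefits 1 and 4: one batch, which may hit the 1-phase commit
// fast path if all writes land on the same range).
func (b *writeBuffer) Flush() []string {
	var batch []string
	for _, k := range b.order {
		batch = append(batch, k+"="+b.vals[k])
	}
	return batch
}

func main() {
	b := newWriteBuffer()
	b.Put("a", "1")
	b.Put("b", "2")
	b.Put("a", "3") // redundant write: only a=3 survives
	if v, ok := b.Get("a"); ok {
		fmt.Println("read-your-write:", v) // read-your-write: 3
	}
	fmt.Println(b.Flush()) // [a=3 b=2]
}
```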
The first three of these benefits apply to `sysbench`. The fourth does not.

## Benchmark results
The immediate motivation for this prototype is to determine the benefit buffered writes could have on `sysbench`. Specifically, we would like to evaluate the throughput improvement (improvement to system efficiency) that comes from this technique when applied to `sysbench`'s `oltp_write_only` and `oltp_read_write` workloads.

Both of these workloads contain the following sequence of statements in a transaction. `oltp_read_write` also includes 14 read-only SELECT statements beforehand.

Before this change, these statements mapped to the following KV batches:
The first thing this change does is switch some of these read-write operations on secondary indexes over to blind writes. The read portion of these requests was unnecessary and making them write-only helps with buffering.
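The distinction matters because a read-write request must consult the leaseholder, while a blind write can be buffered at the gateway. A minimal sketch of the two shapes, with invented names standing in for the real KV requests:

```go
package main

import "fmt"

// kvStore stands in for the replicated key-value layer in this sketch.
type kvStore map[string]string

// cput is a read-write request: it verifies the existing value before
// writing, so the read portion forces a trip to the leaseholder and the
// request cannot be deferred in a gateway-side buffer.
func cput(kv kvStore, key, val, expected string) error {
	if kv[key] != expected {
		return fmt.Errorf("condition failed for %q", key)
	}
	kv[key] = val
	return nil
}

// put is a blind write: it writes unconditionally, needs no read, and
// can therefore be buffered until commit.
func put(kv kvStore, key, val string) {
	kv[key] = val
}

func main() {
	kv := kvStore{"idx": "old"}
	// A secondary-index maintenance op that used a conditional write...
	if err := cput(kv, "idx", "new", "old"); err != nil {
		panic(err)
	}
	// ...can become a blind write when the read portion is unnecessary.
	put(kv, "idx", "newer")
	fmt.Println(kv["idx"]) // newer
}
```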
Next, the change adds a buffer in the `TxnCoordSender` to defer blind writes. Care is taken to allow read-only and read-write requests to use this buffer to avoid remote access, where possible. This has the following effect:

Notice both the reduction in sequential KV batches and also the reduction in total KV requests.
We then test the change on the `sysbench/oltp_*/nodes=7/cpu=16` benchmark suite. This benchmark suite runs sysbench against a 7 node, 16 vCPU cluster. We first test with 128 workload threads, and then 200 workload threads.

### sysbench/oltp_write_only
We expect the benefit from this change to be more pronounced on this workload, because all statements in the transaction are writes.
`SET CLUSTER SETTING kv.transaction.buffered_writes.enabled = true` @ 23:01:40

When we enable buffered writes with 128 workload threads, throughput jumps from averaging 39,929 qps to 58,053 qps, a 45.4% improvement. CPU remains saturated, so the improvement comes from more efficient transaction processing and a corresponding reduction in average transaction latency.
When we test again with 200 workload threads, throughput jumps from 41,279 qps to 60,750 qps, a 47.2% improvement.
### sysbench/oltp_read_write
We expect the benefit from this change to be less pronounced on this workload, because 14 of the 18 statements in the transaction are reads and only 4 are writes. Still, we do see an improvement.
`SET CLUSTER SETTING kv.transaction.buffered_writes.enabled = true` @ 23:01:30

When we enable buffered writes with 128 workload threads, throughput jumps from averaging 60,206 qps to 67,004 qps, an 11.3% improvement. CPU remains saturated, so the improvement comes from more efficient transaction processing and a corresponding reduction in average transaction latency.
When we test again with 200 workload threads, throughput jumps from 61,525 qps to 70,479 qps, a 14.6% improvement.
NOTE: all qps numbers here are reported from `sysbench`, which counts BEGIN and COMMIT as "queries".