kvflowcontrol: set kvadmission.flow_control.mode default to apply_to_all #133860
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 0717143 to a835348.
Hmm, I'm not sure why
Force-pushed from 5d0c3dd to adb6161.
Not unexpectedly, some tests are failing. I suspect these tests exhaust tokens and delay admission. We should confirm what's happening.
Force-pushed from adb6161 to 528ce9f.
I can't repro the
Force-pushed from 528ce9f to a5fbaa8.
Closed out the PR which enabled just the metamorphism. If anything fails, it will be equally disruptive, metamorphic or not. So let's just merge both at once, then roll back as necessary. I'll also make an announcement in (internal) Slack.
Reviewed 2 of 3 files at r2, 2 of 2 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)
docs/generated/metrics/metrics.html
line 22 at r5 (raw file):
```html
<tr><td>STORAGE</td><td>admission.admitted.kv-stores.locking-normal-pri</td><td>Number of requests admitted</td><td>Requests</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
<tr><td>STORAGE</td><td>admission.admitted.kv-stores.normal-pri</td><td>Number of requests admitted</td><td>Requests</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
<tr><td>STORAGE</td><td>admission.admitted.kv-stores.user-high-pri</td><td>Number of requests admitted</td><td>Requests</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
```
why these changes to metrics.html?
TYFTR!!
bors r=sumeerbhola
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
docs/generated/metrics/metrics.html
line 22 at r5 (raw file):
Previously, sumeerbhola wrote…
why these changes to metrics.html?
See earlier comment: there is a liveness/reachability analysis, conducted by running some tests, to decide which metrics to generate.
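For intuition, a toy sketch of that idea (illustrative only, not CockroachDB's actual doc-generation tooling; `metricsRegistry` and `generateDocs` are made-up names): if the generated page only contains metrics that some test actually registers, then changing which admission mode runs by default changes which per-priority rows appear in `metrics.html`.

```go
package main

import "fmt"

// metricsRegistry is an illustrative stand-in for a registry that only
// learns about metrics which are actually registered while tests run.
type metricsRegistry struct {
	seen map[string]struct{}
}

func (r *metricsRegistry) register(name string) {
	if r.seen == nil {
		r.seen = map[string]struct{}{}
	}
	r.seen[name] = struct{}{}
}

// generateDocs emits one doc row per metric that was reachable during the
// test run, mimicking a reachability-style analysis: metrics never touched
// by any test produce no row.
func (r *metricsRegistry) generateDocs() []string {
	var rows []string
	for name := range r.seen {
		rows = append(rows, "<tr><td>STORAGE</td><td>"+name+"</td></tr>")
	}
	return rows
}

func main() {
	reg := &metricsRegistry{}
	// If regular-priority admission paths run in tests (as they would with
	// apply_to_all), their per-priority metrics get registered and documented.
	reg.register("admission.admitted.kv-stores.normal-pri")
	fmt.Println(reg.generateDocs())
}
```

Under that assumption, flipping the default would explain why new per-priority rows show up in the generated file.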
Merge conflict.
Going to rebase and try again.
The cluster setting `kvadmission.flow_control.mode` is used to control whether replication admission control applies only to elastic work, or to all work which sets an admission priority.

Prior to cockroachdb#132125, the cluster setting was metamorphic in tests, selecting either value with equal probability. cockroachdb#132125 disabled the metamorphism in order to prevent the newly introduced send queue / pull mode from being run unexpectedly while still under testing.

With testing now in place, re-enable the metamorphism, leaving the default value unchanged.

Resolves: cockroachdb#132364

Release note: None
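As an aside, a minimal sketch of what "metamorphic in tests, selecting either value with equal probability" means in practice (illustrative only; `metamorphicChoice` is a made-up helper, not the actual utility in the codebase): test builds pick one of the candidate values at random, production builds keep the default.

```go
package main

import (
	"fmt"
	"math/rand"
)

// The two kvadmission.flow_control.mode values discussed in this PR.
const (
	applyToElastic = "apply_to_elastic"
	applyToAll     = "apply_to_all"
)

// metamorphicChoice is a hypothetical stand-in for a metamorphic test
// constant: in test builds it picks one of the candidate values with equal
// probability, while in production builds it always returns the default.
func metamorphicChoice(testBuild bool, defaultVal string, others ...string) string {
	if !testBuild {
		return defaultVal
	}
	candidates := append([]string{defaultVal}, others...)
	return candidates[rand.Intn(len(candidates))]
}

func main() {
	// In a test build, roughly half of the runs exercise each mode, so both
	// code paths keep getting coverage without changing the shipped default.
	mode := metamorphicChoice(true, applyToElastic, applyToAll)
	fmt.Println("flow control mode for this test run:", mode)
}
```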
Force-pushed from a5fbaa8 to 58b80bc.
Rebased to resolve merge conflicts. Given we were green before, I'll bors this again as soon as the required checks pass.
With testing now in place, this commit changes the default cluster setting value of `kvadmission.flow_control.mode` from `apply_to_elastic` to `apply_to_all`.

Now, by default, regular work will be subject to replication admission control, whereby a quorum will be allowed to proceed, queuing entries to any non-quorum-required replica when send tokens are exhausted. For more details, see cockroachdb#123509.

Resolves: cockroachdb#133838

Release note (performance improvement): Regular writes are now subject to admission control by default, meaning that non-quorum-required replicas may not be told about new writes from the leader if they are unable to keep up. This brings a large performance improvement in cases where there is a large backlog of replication work towards a subset of node(s), such as node restarts. The setting can be reverted to the <=24.3 default by setting `kvadmission.flow_control.mode` to `apply_to_elastic`.
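To make the queuing behavior concrete, here is a toy sketch of the idea (illustrative only, not the actual kvflowcontrol implementation; `replica`, `dispatch`, and the token accounting are all made up): entries are sent immediately to replicas that still have send tokens, queued on replicas that don't, and the write proceeds once a quorum has accepted it.

```go
package main

import "fmt"

// replica is an illustrative stand-in for a raft follower tracked by the
// leader's flow control machinery.
type replica struct {
	id         int
	sendTokens int      // remaining send tokens, in arbitrary units
	queue      []string // entries queued instead of sent
}

// dispatch sends an entry to every replica that has tokens and queues it on
// the rest, returning whether enough replicas (leader included) accepted it
// to form a quorum.
func dispatch(entry string, cost int, replicas []*replica, quorum int) bool {
	accepted := 1 // the leader itself counts toward quorum
	for _, r := range replicas {
		if r.sendTokens >= cost {
			r.sendTokens -= cost
			accepted++ // sent immediately
		} else {
			// Send tokens exhausted: queue the entry instead of sending,
			// letting this non-quorum-required replica fall behind.
			r.queue = append(r.queue, entry)
		}
	}
	return accepted >= quorum
}

func main() {
	replicas := []*replica{
		{id: 2, sendTokens: 10},
		{id: 3, sendTokens: 0}, // backlogged follower, e.g. a restarting node
	}
	ok := dispatch("put k=v", 5, replicas, 2)
	fmt.Println("quorum reached:", ok, "entries queued on r3:", len(replicas[1].queue))
}
```

The actual implementation is more involved (per-stream tokens and the send queue / pull mode mentioned above); the sketch only shows the quorum-versus-queue decision.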
Force-pushed from 58b80bc to 69d5559.
Pushed a small diff to resolve a flaky, test-only issue on `TestRaftTracing`:

```diff
diff --git a/pkg/kv/kvserver/client_raft_log_queue_test.go b/pkg/kv/kvserver/client_raft_log_queue_test.go
index abaadb9d5bb..10a70e2d288 100644
--- a/pkg/kv/kvserver/client_raft_log_queue_test.go
+++ b/pkg/kv/kvserver/client_raft_log_queue_test.go
@@ -201,8 +201,10 @@ func TestRaftTracing(t *testing.T) {
 	expectedMessages := []string{
 		`replica_proposal_buf.* flushing proposal to Raft`,
 		`replica_proposal_buf.* registering local trace`,
+		// Note that we don't assert that the 1->3 MsgApp goes through, as
+		// the ordering may change between 1->2 and 1->3. It should be
+		// sufficient to just check one of them for tracing.
 		`replica_raft.* 1->2 MsgApp`,
-		`replica_raft.* 1->3 MsgApp`,
 		`replica_raft.* AppendThread->1 MsgStorageAppendResp`,
 		`ack-ing replication success to the client`,
 	}
```

Passes 1k/1k now, where previously it would fail quite quickly:

```
$ dev test pkg/kv/kvserver -v -f TestRaftTracing/lease-type=LeaseEpoch --stress
...
INFO: Found 1 test target...
Target //pkg/kv/kvserver:kvserver_test up-to-date:
  _bazel/bin/pkg/kv/kvserver/kvserver_test_/kvserver_test
INFO: Elapsed time: 308.212s, Critical Path: 43.41s
INFO: 1006 processes: 1 internal, 1005 darwin-sandbox.
INFO: Build completed successfully, 1006 total actions
//pkg/kv/kvserver:kvserver_test PASSED in 4.3s
  Stats over 1000 runs: max = 4.3s, min = 1.1s, avg = 1.3s, dev = 0.2s
```
There have been recent flakes on tests which use a `TestCluster`. These failures have occurred at startup, before much test-specific logic runs. This patch mitigates the symptom by retrying on the error, rather than erroring out. The error looks like:

```
testcluster.go:485: unable to retrieve conf reader: span configs not available
```

It is caused by span configs not being available, which is checked by grabbing the `ConfReader` before forcing the replica queues to process. The root cause of this behavior change should still be determined and this should be considered temporary, although the root cause may be benign.

Informs: cockroachdb#137712
Informs: cockroachdb#137762

Release note: None
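A minimal sketch of the mitigation pattern (illustrative only; the real change retries inside `testcluster.go`, and `retrySoon` here is a made-up stand-in for whatever retry helper is used): keep re-checking until the conf reader becomes available or a deadline passes, instead of failing on the first error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errSpanConfigsNotAvailable = errors.New(
	"unable to retrieve conf reader: span configs not available")

// retrySoon re-invokes fn until it returns nil or the timeout expires,
// mirroring the "retry instead of erroring out" mitigation described above.
func retrySoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	var err error
	for time.Now().Before(deadline) {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return err
}

func main() {
	attempts := 0
	err := retrySoon(time.Second, func() error {
		attempts++
		if attempts < 3 {
			// Simulate span configs not yet being available at startup.
			return errSpanConfigsNotAvailable
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```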
Stacked a commit to resolve the flaky behavior on testcluster startup. Left a TODO to investigate further and linked the known related failures, which were all pre-existing.
bors r=sumeerbhola