From 917c511a2b7f225763e0cb11d7bb59ba0e75dc87 Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Tue, 5 Nov 2024 16:04:59 -0500 Subject: [PATCH 1/8] Recommend DistSender concurrency limit bump Fixes DOC-11652 Summary of changes: - Update 'Performance Recipes' page to note that if you encounter DistSender batch throttling (`distsender.batches.async.throttled` is > 0), consider increasing the value of the `kv.dist_sender.concurrency_limit` cluster setting. --- src/current/v24.2/performance-recipes.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/src/current/v24.2/performance-recipes.md b/src/current/v24.2/performance-recipes.md index ac38eb294e7..29b7ccc35d9 100644 --- a/src/current/v24.2/performance-recipes.md +++ b/src/current/v24.2/performance-recipes.md @@ -30,6 +30,7 @@ This section describes how to use CockroachDB commands and dashboards to identif
  • Querying the crdb_internal.transaction_contention_events table indicates that your transactions have experienced contention.
  • The SQL Statement Contention graph in the [CockroachDB {{ site.data.products.cloud }} Console]({% link cockroachcloud/metrics-sql.md %}#sql-statement-contention) or DB Console is showing spikes over time.
  • The Transaction Restarts graph in the [CockroachDB {{ site.data.products.cloud }} Console]({% link cockroachcloud/metrics-sql.md %}#transaction-restarts) or DB Console is showing spikes in retries over time.
  • + @@ -76,6 +77,11 @@ This section describes how to use CockroachDB commands and dashboards to identif + + + + + ## Solutions @@ -297,6 +303,14 @@ A low percentage of live data can cause statements to scan more data ([MVCC valu Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replication-zones.md %}#gc-ttlseconds) zone configuration of the table as much as possible. +### KV DistSender batches being throttled (performance impact to larger clusters) + +If you see values greater than `0` for the `distsender.batches.async.throttled` metric, consider increasing the [KV layer DistSender]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) concurrency using the `kv.dist_sender.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). + +[XXX](): How much to increase? 6x as in https://github.com/cockroachdb/cockroach/pull/131226 ? What should we say about testing? What is a bad outcome of doing this too much? + +[XXX](): FILE COCKROACH ISSUE TO MAKE THIS CLUSTER SETTING PUBLIC (currently private), if we're gonna document it it should be public + ## See also If you aren't sure whether SQL query performance needs to be improved, see [Identify slow queries]({% link {{ page.version.version }}/query-behavior-troubleshooting.md %}#identify-slow-queries). From 952c717b0848deaa7b3422ac7922021b4401638d Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Mon, 18 Nov 2024 15:00:36 -0500 Subject: [PATCH 2/8] Update with sean- feedback (1) --- src/current/v24.2/cockroach-start.md | 2 +- src/current/v24.2/performance-recipes.md | 6 ++---- 2 files changed, 3 insertions(+), 5 deletions(-) diff --git a/src/current/v24.2/cockroach-start.md b/src/current/v24.2/cockroach-start.md index ea0d9a11c19..c3a16ee6abf 100644 --- a/src/current/v24.2/cockroach-start.md +++ b/src/current/v24.2/cockroach-start.md @@ -64,7 +64,7 @@ Flag | Description `--listening-url-file` | The file to which the node's SQL connection URL will be written as soon as the node is ready to accept connections, in addition to being printed to the [standard output](#standard-output). When `--background` is used, this happens before the process detaches from the terminal.

    This is particularly helpful in identifying the node's port when an unused port is assigned automatically (`--port=0`). `--locality` | Arbitrary key-value pairs that describe the location of the node. Locality might include country, region, availability zone, etc. A `region` tier must be included in order to enable [multi-region capabilities]({% link {{ page.version.version }}/multiregion-overview.md %}). For more details, see [Locality](#locality) below. `--max-disk-temp-storage` | The maximum on-disk storage capacity available to store temporary data for SQL queries that exceed the memory budget (see `--max-sql-memory`). This ensures that JOINs, sorts, and other memory-intensive SQL operations are able to spill intermediate results to disk. This can be a percentage (notated as a decimal or with `%`) or any bytes-based unit (e.g., `.25`, `25%`, `500GB`, `1TB`, `1TiB`).

    Note: If you use the `%` notation, you might need to escape the `%` sign, for instance, while configuring CockroachDB through `systemd` service files. For this reason, it's recommended to use the decimal notation instead. Also, if expressed as a percentage, this value is interpreted relative to the size of the first store. However, the temporary space usage is never counted towards any store usage; therefore, when setting this value, it's important to ensure that the size of this temporary storage plus the size of the first store doesn't exceed the capacity of the storage device.

    The temporary files are located in the path specified by the `--temp-dir` flag, or in the subdirectory of the first store (see `--store`) by default.

    **Default:** `32GiB` -`--max-go-memory` | The maximum soft memory limit for the Go runtime, which influences the behavior of Go's garbage collection. Defaults to `--max-sql-memory x 2.25`, but cannot exceed 90% of the node's available RAM. To disable the soft memory limit, set `--max-go-memory` to `0` (not recommended). +`--max-go-memory` | The maximum soft memory limit for the Go runtime, which influences the behavior of Go's garbage collection. Defaults to `--max-sql-memory x 2.25`, but cannot exceed 90% of the node's available RAM. To disable the soft memory limit, set `--max-go-memory` to `0` (not recommended). `--max-offset` | The maximum allowed clock offset for the cluster. If observed clock offsets exceed this limit, servers will crash to minimize the likelihood of reading inconsistent data. Increasing this value will increase the time to recovery of failures as well as the frequency of uncertainty-based read restarts.

    Nodes can run with different values for `--max-offset`, but only for the purpose of updating the setting across the cluster using a rolling upgrade.

    **Default:** `500ms` `--max-sql-memory` | The maximum in-memory storage capacity available to store temporary data for SQL queries, including prepared queries and intermediate data rows during query execution. This can be a percentage (notated as a decimal or with `%`) or any bytes-based unit; for example:

    `--max-sql-memory=.25`
    `--max-sql-memory=25%`
    `--max-sql-memory=10000000000 ----> 10000000000 bytes`
    `--max-sql-memory=1GB ----> 1000000000 bytes`
    `--max-sql-memory=1GiB ----> 1073741824 bytes`

    The temporary files are located in the path specified by the `--temp-dir` flag, or in the subdirectory of the first store (see `--store`) by default.

    **Note:** If you use the `%` notation, you might need to escape the `%` sign (for instance, while configuring CockroachDB through `systemd` service files). For this reason, it's recommended to use the decimal notation instead.

    **Note:** The sum of `--cache`, `--max-sql-memory`, and `--max-tsdb-memory` should not exceed 75% of the memory available to the `cockroach` process.

    **Default:** `25%`

    The default SQL memory size is suitable for production deployments but can be raised to increase the number of simultaneous client connections the node allows as well as the node's capacity for in-memory processing of rows when using `ORDER BY`, `GROUP BY`, `DISTINCT`, joins, and window functions. For local development clusters with memory-intensive workloads, reduce this value to, for example, `128MiB` to prevent [out-of-memory errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#out-of-memory-oom-crash). `--max-tsdb-memory` | Maximum memory capacity available to store temporary data for use by the time-series database to display metrics in the [DB Console]({% link {{ page.version.version }}/ui-overview.md %}). Consider raising this value if your cluster is comprised of a large number of nodes where individual nodes have very limited memory available (e.g., under `8 GiB`). Insufficient memory capacity for the time-series database can constrain the ability of the DB Console to process the time-series queries used to render metrics for the entire cluster. This capacity constraint does not affect SQL query execution. This flag accepts numbers interpreted as bytes, size suffixes (e.g., `1GB` and `1GiB`) or a percentage of physical memory (e.g., `0.01`).

    **Note:** The sum of `--cache`, `--max-sql-memory`, and `--max-tsdb-memory` should not exceed 75% of the memory available to the `cockroach` process.

    **Default:** `0.01` (i.e., 1%) of physical memory or `64 MiB`, whichever is greater. diff --git a/src/current/v24.2/performance-recipes.md b/src/current/v24.2/performance-recipes.md index 29b7ccc35d9..8cb3ae27a37 100644 --- a/src/current/v24.2/performance-recipes.md +++ b/src/current/v24.2/performance-recipes.md @@ -305,11 +305,9 @@ Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replic ### KV DistSender batches being throttled (performance impact to larger clusters) -If you see values greater than `0` for the `distsender.batches.async.throttled` metric, consider increasing the [KV layer DistSender]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) concurrency using the `kv.dist_sender.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). +If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer DistSender]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and [KV layer Streamer]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#streamer) concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In 24.3, these default values were increased by 6x and 12x, respectively. For versions older than 24.3, increasing the value by 6x and 12x would be a good starting point. -[XXX](): How much to increase? 6x as in https://github.com/cockroachdb/cockroach/pull/131226 ? What should we say about testing? What is a bad outcome of doing this too much? - -[XXX](): FILE COCKROACH ISSUE TO MAKE THIS CLUSTER SETTING PUBLIC (currently private), if we're gonna document it it should be public +Note that changing this setting can increase risk of [out of memory (OOM) errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#out-of-memory-oom-crash) depending on the value of [`cockroach start --max-go-memory`]({% link {{ page.version.version }}/cockroach-start.md %}#flags-max-go-memory) and/or [`GOMEMLIMIT`](https://pkg.go.dev/runtime#hdr-Environment_Variables). ## See also From c9e8705cd25ab41790cc81e8083a2916d96b05a6 Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Mon, 18 Nov 2024 15:05:06 -0500 Subject: [PATCH 3/8] Update with sean- feedback (2) --- src/current/v24.2/performance-recipes.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/current/v24.2/performance-recipes.md b/src/current/v24.2/performance-recipes.md index 8cb3ae27a37..5bc98a43f96 100644 --- a/src/current/v24.2/performance-recipes.md +++ b/src/current/v24.2/performance-recipes.md @@ -307,7 +307,11 @@ Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replic If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer DistSender]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and [KV layer Streamer]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#streamer) concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In 24.3, these default values were increased by 6x and 12x, respectively. For versions older than 24.3, increasing the value by 6x and 12x would be a good starting point. 
-Note that changing this setting can increase risk of [out of memory (OOM) errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#out-of-memory-oom-crash) depending on the value of [`cockroach start --max-go-memory`]({% link {{ page.version.version }}/cockroach-start.md %}#flags-max-go-memory) and/or [`GOMEMLIMIT`](https://pkg.go.dev/runtime#hdr-Environment_Variables). +To validate a successful result, you can increase this value until you see no new throttled requests AND no increase in tail latency (e.g. `p99.999`). + +This does increase the amount of RAM consumption per node to handle the increased concurrency, but it's proportional to the load and an individual flow's memory consumption should not be significant. Bad outcomes include increased tail latency or too much memory consumption with no decrease in the number of throttled requests. + +Changing this setting can increase risk of [out of memory (OOM) errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#out-of-memory-oom-crash) depending on the value of [`cockroach start --max-go-memory`]({% link {{ page.version.version }}/cockroach-start.md %}#flags-max-go-memory) and/or [`GOMEMLIMIT`](https://pkg.go.dev/runtime#hdr-Environment_Variables). ## See also From d9b34a6c6604cc2df5f312ffe6bdc0cf8c05714f Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Mon, 18 Nov 2024 17:09:43 -0500 Subject: [PATCH 4/8] Update with sean- feedback (3) --- src/current/v24.2/performance-recipes.md | 15 --------------- src/current/v24.3/performance-recipes.md | 13 +++++++++++++ 2 files changed, 13 insertions(+), 15 deletions(-) diff --git a/src/current/v24.2/performance-recipes.md b/src/current/v24.2/performance-recipes.md index 5bc98a43f96..e38a3522c2e 100644 --- a/src/current/v24.2/performance-recipes.md +++ b/src/current/v24.2/performance-recipes.md @@ -77,11 +77,6 @@ This section describes how to use CockroachDB commands and dashboards to identif
    • You may be scanning over large numbers of MVCC versions. This is similar to how a full table scan can be slow.
    - -
    • vCPU usage has plateaued (possibly around 70%) on your large cluster (XXX: DEFINE LARGE).
    -
    • KV layer DistSender batches may be getting throttled; check if the distsender.batches.async.throttled metric is greater than 0.
    -
    • Increase the kv.dist_sender.concurrency_limit cluster setting. (XXX: HOW MUCH? 6x as in https://github.com/cockroachdb/cockroach/pull/131226 ?)
    - ## Solutions @@ -303,16 +298,6 @@ A low percentage of live data can cause statements to scan more data ([MVCC valu Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replication-zones.md %}#gc-ttlseconds) zone configuration of the table as much as possible. -### KV DistSender batches being throttled (performance impact to larger clusters) - -If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer DistSender]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and [KV layer Streamer]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#streamer) concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In 24.3, these default values were increased by 6x and 12x, respectively. For versions older than 24.3, increasing the value by 6x and 12x would be a good starting point. - -To validate a successful result, you can increase this value until you see no new throttled requests AND no increase in tail latency (e.g. `p99.999`). - -This does increase the amount of RAM consumption per node to handle the increased concurrency, but it's proportional to the load and an individual flow's memory consumption should not be significant. Bad outcomes include increased tail latency or too much memory consumption with no decrease in the number of throttled requests. - -Changing this setting can increase risk of [out of memory (OOM) errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#out-of-memory-oom-crash) depending on the value of [`cockroach start --max-go-memory`]({% link {{ page.version.version }}/cockroach-start.md %}#flags-max-go-memory) and/or [`GOMEMLIMIT`](https://pkg.go.dev/runtime#hdr-Environment_Variables). - ## See also If you aren't sure whether SQL query performance needs to be improved, see [Identify slow queries]({% link {{ page.version.version }}/query-behavior-troubleshooting.md %}#identify-slow-queries). diff --git a/src/current/v24.3/performance-recipes.md b/src/current/v24.3/performance-recipes.md index 1cd1320c31b..09fafcdce59 100644 --- a/src/current/v24.3/performance-recipes.md +++ b/src/current/v24.3/performance-recipes.md @@ -76,6 +76,11 @@ This section describes how to use CockroachDB commands and dashboards to identif
    • You may be scanning over large numbers of MVCC versions. This is similar to how a full table scan can be slow.
    + +
    • vCPU usage has plateaued (possibly around 70%) on your large cluster.
    +
    • KV layer DistSender batches may be getting throttled; check if the distsender.batches.async.throttled metric is greater than 0.
    + + ## Solutions @@ -297,6 +302,14 @@ A low percentage of live data can cause statements to scan more data ([MVCC valu Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replication-zones.md %}#gc-ttlseconds) zone configuration of the table as much as possible. +### KV DistSender batches being throttled (performance impact to larger clusters) + +If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the value by 6x and 12x would be a good starting point. + +To validate a successful result, you can increase this value until you see no new throttled requests and no increase in tail latency (e.g. `p99.999`). + +This does increase the amount of RAM consumption per node to handle the increased concurrency, but it's proportional to the load and an individual flow's memory consumption should not be significant. Bad outcomes include increased tail latency or too much memory consumption with no decrease in the number of throttled requests. + ## See also If you aren't sure whether SQL query performance needs to be improved, see [Identify slow queries]({% link {{ page.version.version }}/query-behavior-troubleshooting.md %}#identify-slow-queries). From 55061c18eba4bd54dc2589e67ae6357c5a22a37b Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Wed, 20 Nov 2024 14:42:24 -0500 Subject: [PATCH 5/8] Update src/current/v24.3/performance-recipes.md Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com> --- src/current/v24.3/performance-recipes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/current/v24.3/performance-recipes.md b/src/current/v24.3/performance-recipes.md index 09fafcdce59..b6d174c24c2 100644 --- a/src/current/v24.3/performance-recipes.md +++ b/src/current/v24.3/performance-recipes.md @@ -304,7 +304,7 @@ Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replic ### KV DistSender batches being throttled (performance impact to larger clusters) -If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the value by 6x and 12x would be a good starting point. +If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the value by 6x and 12x would be a good starting point. 
To validate a successful result, you can increase this value until you see no new throttled requests and no increase in tail latency (e.g. `p99.999`). From 661734eb6c70f75233002a686f7778aace99e29f Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Wed, 20 Nov 2024 14:42:57 -0500 Subject: [PATCH 6/8] Update src/current/v24.3/performance-recipes.md Co-authored-by: Ryan Kuo <8740013+taroface@users.noreply.github.com> --- src/current/v24.3/performance-recipes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/current/v24.3/performance-recipes.md b/src/current/v24.3/performance-recipes.md index b6d174c24c2..c23fe1d9c58 100644 --- a/src/current/v24.3/performance-recipes.md +++ b/src/current/v24.3/performance-recipes.md @@ -306,7 +306,7 @@ Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replic If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the value by 6x and 12x would be a good starting point. -To validate a successful result, you can increase this value until you see no new throttled requests and no increase in tail latency (e.g. `p99.999`). +To validate a successful result, you can increase this value until you see no new throttled requests and no increase in tail latency (e.g., `p99.999`). This does increase the amount of RAM consumption per node to handle the increased concurrency, but it's proportional to the load and an individual flow's memory consumption should not be significant. Bad outcomes include increased tail latency or too much memory consumption with no decrease in the number of throttled requests. From 82cc2a0a3e6293f0498bc226da6d5e58fe7dc8f0 Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Wed, 20 Nov 2024 14:47:56 -0500 Subject: [PATCH 7/8] Update with taroface feedback (1) --- src/current/v24.3/performance-recipes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/current/v24.3/performance-recipes.md b/src/current/v24.3/performance-recipes.md index 09fafcdce59..ad23bdce890 100644 --- a/src/current/v24.3/performance-recipes.md +++ b/src/current/v24.3/performance-recipes.md @@ -304,11 +304,11 @@ Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replic ### KV DistSender batches being throttled (performance impact to larger clusters) -If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.streamer.concurrency_limit` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}). In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the value by 6x and 12x would be a good starting point. 
+If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.dist_sender.concurrency_limit` and `kv.streamer.concurrency_limit` [cluster settings]({% link {{ page.version.version }}/cluster-settings.md %}), respectively. In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the value by 6x and 12x would be a good starting point. To validate a successful result, you can increase this value until you see no new throttled requests and no increase in tail latency (e.g. `p99.999`). -This does increase the amount of RAM consumption per node to handle the increased concurrency, but it's proportional to the load and an individual flow's memory consumption should not be significant. Bad outcomes include increased tail latency or too much memory consumption with no decrease in the number of throttled requests. +This does increase the amount of RAM consumption per node to handle the increased concurrency, but it's proportional to the load and an individual flow's memory consumption should not be significant. Bad outcomes include increased tail latency or too much memory consumption with no decrease in the number of throttled requests, in which case you should return the settings to their default values. ## See also From 1fdee1404a22075823237038b8da849e5e44ea30 Mon Sep 17 00:00:00 2001 From: Rich Loveland Date: Wed, 20 Nov 2024 15:00:07 -0500 Subject: [PATCH 8/8] Update with taroface feedback (2) --- src/current/v24.3/performance-recipes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/current/v24.3/performance-recipes.md b/src/current/v24.3/performance-recipes.md index e505df0eab3..bd6b083671c 100644 --- a/src/current/v24.3/performance-recipes.md +++ b/src/current/v24.3/performance-recipes.md @@ -304,9 +304,9 @@ Reduce the [`gc.ttlseconds`]({% link {{ page.version.version }}/configure-replic ### KV DistSender batches being throttled (performance impact to larger clusters) -If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.dist_sender.concurrency_limit` and `kv.streamer.concurrency_limit` [cluster settings]({% link {{ page.version.version }}/cluster-settings.md %}), respectively. In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the value by 6x and 12x would be a good starting point. +If you see `distsender.batches.async.throttled` values that aren't zero (or aren't consistently near zero), experiment with increasing the [KV layer `DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender) and `Streamer` concurrency using the `kv.dist_sender.concurrency_limit` and `kv.streamer.concurrency_limit` [cluster settings]({% link {{ page.version.version }}/cluster-settings.md %}), respectively. In v24.3, these default values were increased by 6x and 12x, respectively. For versions prior to v24.3, increasing the values by 6x and 12x would be a good starting point. 
-To validate a successful result, you can increase this value until you see no new throttled requests and no increase in tail latency (e.g., `p99.999`). +To validate a successful result, you can increase the values of these cluster settings until you see no new throttled requests and no increase in tail latency (e.g., `p99.999`). This does increase the amount of RAM consumption per node to handle the increased concurrency, but it's proportional to the load and an individual flow's memory consumption should not be significant. Bad outcomes include increased tail latency or too much memory consumption with no decrease in the number of throttled requests, in which case you should return the settings to their default values.
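
The following SQL sketch illustrates one way to check the throttling metric and adjust the two concurrency settings from a SQL shell. The numeric values shown are placeholders rather than recommendations, and the `crdb_internal.node_metrics` query assumes the metric is exposed there under its dotted name; validate any change against the throttling count and tail latency as described above.

```sql
-- Check whether async DistSender batches are being throttled on this node.
-- (Assumes the metric appears in crdb_internal.node_metrics under this name.)
SELECT name, value
  FROM crdb_internal.node_metrics
 WHERE name = 'distsender.batches.async.throttled';

-- Inspect the current concurrency limits before changing them.
SHOW CLUSTER SETTING kv.dist_sender.concurrency_limit;
SHOW CLUSTER SETTING kv.streamer.concurrency_limit;

-- Placeholder values only: raise each limit to roughly 6x and 12x the value
-- returned above, then re-check throttling and tail latency (e.g., p99.999).
SET CLUSTER SETTING kv.dist_sender.concurrency_limit = 6144;
SET CLUSTER SETTING kv.streamer.concurrency_limit = 12288;

-- If tail latency or memory use regresses without reducing throttling,
-- return both settings to their defaults.
RESET CLUSTER SETTING kv.dist_sender.concurrency_limit;
RESET CLUSTER SETTING kv.streamer.concurrency_limit;
```

Because these settings apply cluster-wide, raise them gradually and monitor per-node memory consumption while doing so.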