(de)compression: reduce memory allocation to improve performance #521

magurotuna · 2024-09-22T16:11:17Z

Motivation

Currently, every time WrapBody::poll_frame is called, a new BytesMut is created with the default capacity, which is effectively 64 bytes. In certain situations this causes a large number of memory allocations, which makes throughput significantly worse.

#520 reports that a manual implementation of the decompression logic using async-compression resolves the performance issue seen with tower_http::decompression. To identify what makes the difference, I captured a flamegraph. Here's the result:

As annotated, tower_http::decompression performs more memory allocation, which is why its performance is degraded.

Solution

To reduce memory allocation, WrapBody now holds a BytesMut as a field, with an initial capacity of 4096 bytes. This buffer is reused as much as possible across multiple poll_frame calls; only when its capacity reaches 0 is another 4096 bytes allocated.

The capacity of 4096 bytes is taken from the default capacity of the internal buffer used in tokio_util::io::ReaderStream: https://docs.rs/tokio-util/0.7.12/src/tokio_util/io/reader_stream.rs.html#8
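To make the effect of the reuse concrete, here is a rough, std-only model of the allocation pattern described above. It is not the actual patch: the names `ScratchBuf`, `BUF_CAPACITY`, `allocations_fresh_per_call`, and `allocations_reused` are hypothetical, and the buffer is simulated with counters rather than a real BytesMut. It assumes that each produced chunk consumes part of the buffer's spare capacity, and that a new 4096-byte block is reserved only once that capacity reaches 0.

```rust
/// Hypothetical constant mirroring the 4096-byte capacity mentioned above.
const BUF_CAPACITY: usize = 4096;

/// Std-only stand-in for the reused scratch buffer (an assumption, not the
/// real `BytesMut`). It tracks only the spare capacity that is left and how
/// many times a fresh block had to be allocated.
struct ScratchBuf {
    remaining: usize,
    allocations: usize,
}

impl ScratchBuf {
    /// The initial block counts as the first allocation.
    fn new() -> Self {
        ScratchBuf { remaining: BUF_CAPACITY, allocations: 1 }
    }

    /// Models one `poll_frame` call that wants to emit up to `want` bytes.
    /// A new block is allocated only when no spare capacity is left.
    /// Returns the number of bytes handed out by this call.
    fn poll(&mut self, want: usize) -> usize {
        if self.remaining == 0 {
            self.remaining = BUF_CAPACITY;
            self.allocations += 1;
        }
        let n = want.min(self.remaining);
        self.remaining -= n;
        n
    }
}

/// Baseline model: a fresh buffer per call means one allocation per call.
fn allocations_fresh_per_call(calls: usize) -> usize {
    calls
}

/// Reuse model: allocations needed for `calls` polls of `chunk` bytes each.
fn allocations_reused(calls: usize, chunk: usize) -> usize {
    let mut buf = ScratchBuf::new();
    for _ in 0..calls {
        buf.poll(chunk);
    }
    buf.allocations
}
```

Under this model, 1000 calls that each emit 64 bytes cost 1000 allocations with a fresh buffer per call (`allocations_fresh_per_call(1000)`), but only 16 with the reused buffer (`allocations_reused(1000, 64)`), because one 4096-byte block serves 64 such calls.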

Performance Improvement

With this optimization, the flamegraph changes as follows; memory allocation is significantly reduced.

Throughput also improves by 10x when measured using Deno's fetch implementation, which internally uses hyper and tower_http::decompression.
This graph shows how long each Deno version takes to handle 2k requests. The rightmost one, named v2.0.0-rc.4-tower-http-patched, has this patch applied. The leftmost one, v1.45.2, does not use tower_http::decompression.

For more information on how this result was measured, please refer to https://github.com/magurotuna/deno_fetch_decompression_throughput

Fixes: #520

seanmonstar

Phenomenal write-up, thanks!

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [tower-http](https://redirect.github.com/tower-rs/tower-http) | dependencies | patch | `0.6.0` -> `0.6.1` |

Release Notes: tower-rs/tower-http [`v0.6.1`](https://redirect.github.com/tower-rs/tower-http/releases/tag/tower-http-0.6.1) ([Compare Source](https://redirect.github.com/tower-rs/tower-http/compare/tower-http-0.6.0...tower-http-0.6.1))

Fixed

- **decompression:** reuse scratch buffer to significantly reduce allocations and improve performance ([#521](https://redirect.github.com/tower-rs/tower-http/pull/521))

New Contributors

- [@magurotuna](https://redirect.github.com/magurotuna) made their first contribution in [https://github.com/tower-rs/tower-http/pull/521](https://redirect.github.com/tower-rs/tower-http/pull/521)

Dav1dde · 2025-01-17T22:07:48Z

I just wanna say amazing work!

I work on Relay, which gets quite a chunk of compressed traffic. I updated tower and axum the other day and noticed about a 20% decrease in tokio runtime busy threads, a 4x decrease in time spent reading (+decompressing) request bodies, a 50% reduction in the average latency and about a 30% reduction in the max and p95 latency.

Now looking through several changelogs and git commits trying to find what caused this, I am relatively sure it was this PR!

This was referenced Sep 22, 2024

perf(ext/fetch): improve decompression throughput by not using tower_http::decompression denoland/deno#25800

Closed

perf(ext/fetch): improve decompression throughput by upgrading tower_http denoland/deno#25806

Merged

seanmonstar approved these changes Sep 23, 2024

seanmonstar merged commit 9fdf0eb into tower-rs:main Sep 23, 2024
11 checks passed

magurotuna deleted the perf-wrap-body-allocation branch September 23, 2024 14:28

markdingram mentioned this pull request Sep 24, 2024

Upgrades Tower to 0.5.1 kube-rs/kube#1589

Merged
