
[testbed] Add batcher performance tests #36206

Merged: 1 commit into open-telemetry:main on Dec 6, 2024

Conversation

@swiatekm (Contributor) commented on Nov 5, 2024

Description

Add basic batching benchmarks. The primary intent of these is to help verify that we're not introducing performance regressions with open-telemetry/opentelemetry-collector#8122.

I've taken the 10k DPS benchmark for logs and run it in different configurations:

  • Batching either enabled or disabled
  • In-memory queue enabled or disabled
  • Using the batch processor at the start of the pipeline or the new exporter batcher
  • All of this either with no processors or with some basic filtering and transformation

I've reduced the input batch size to 10 to better capture the effect of having no batching at the start of the pipeline on processor performance, which is one of the concerns with moving batching to the exporter.
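To make the matrix concrete, here is a rough sketch of the cases as a Go table. The struct and field names are illustrative only; the actual tests are wired up through the testbed helpers rather than this struct.

```go
package sketch

// batcherBenchmark is an illustrative description of one benchmark case; the
// real tests use the testbed package rather than this struct.
type batcherBenchmark struct {
	name       string
	withQueue  bool   // in-memory sending queue enabled on the exporter
	batching   string // "none", "batch_processor", or "exporter_batcher"
	batchSize  int    // 1000 whenever batching is enabled
	processors bool   // basic filter + transform processors in the pipeline
}

// Each entry is run twice: once with processors=false (Log10kDPSNoProcessors)
// and once with processors=true (Log10kDPSWithProcessors).
var cases = []batcherBenchmark{
	{name: "No_batching,_no_queue", batching: "none"},
	{name: "No_batching,_queue", batching: "none", withQueue: true},
	{name: "Batch_size_1000_with_batch_processor,_no_queue", batching: "batch_processor", batchSize: 1000},
	{name: "Batch_size_1000_with_batch_processor,_queue", batching: "batch_processor", batchSize: 1000, withQueue: true},
	{name: "Batch_size_1000_with_exporter_batcher,_no_queue", batching: "exporter_batcher", batchSize: 1000},
	{name: "Batch_size_1000_with_exporter_batcher,_queue", batching: "exporter_batcher", batchSize: 1000, withQueue: true},
}
```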

For now, I'd like to get some comments on whether this is sufficient, or if I should expand my scope to other signal types and/or different parameters for the benchmarks.

Current results

| Test | Result | Duration | CPU Avg% | CPU Max% | RAM Avg MiB | RAM Max MiB | Sent Items | Received Items |
|---|---|---|---|---|---|---|---|---|
| Log10kDPSNoProcessors/No_batching,_no_queue | PASS | 15s | 20.7 | 22.7 | 73 | 103 | 150100 | 150100 |
| Log10kDPSNoProcessors/No_batching,_queue | PASS | 15s | 19.1 | 21.7 | 73 | 103 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_batch_processor,_no_queue | PASS | 15s | 13.1 | 14.3 | 77 | 109 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_batch_processor,_queue | PASS | 15s | 12.7 | 14.0 | 75 | 108 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_exporter_batcher,_no_queue | PASS | 15s | 15.5 | 17.3 | 72 | 101 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_exporter_batcher,_queue | PASS | 15s | 14.3 | 15.7 | 72 | 102 | 150100 | 150100 |
| Log10kDPSWithProcessors/No_batching,_no_queue | PASS | 15s | 21.5 | 23.0 | 72 | 103 | 150100 | 150100 |
| Log10kDPSWithProcessors/No_batching,_queue | PASS | 15s | 22.2 | 23.0 | 72 | 102 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_batch_processor,_no_queue | PASS | 15s | 22.5 | 26.0 | 75 | 107 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_batch_processor,_queue | PASS | 15s | 18.0 | 19.7 | 73 | 104 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_exporter_batcher,_no_queue | PASS | 16s | 18.3 | 20.7 | 75 | 106 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_exporter_batcher,_queue | PASS | 15s | 17.1 | 17.7 | 73 | 102 | 150100 | 150100 |

It looks like the new batcher is a bit less performant if the pipeline doesn't contain any processors, but is in fact faster if processors are present, which is surprising to me. But this does assuage our fears that we'd tank processor performance by moving batching to the end of the pipeline.

Link to tracking issue

Fixes open-telemetry/opentelemetry-collector#10836

@swiatekm (Contributor, Author) commented on Nov 5, 2024

@dmitryax @sfc-gh-sili I'd love some feedback as to whether this is useful for your purposes, and if I should broaden the scope.

@swiatekm force-pushed the test/testbed-batcher branch from 35c4618 to 89f3353 on November 5, 2024 14:07
@jmacd (Contributor) commented on Nov 7, 2024

By the way, I'm not convinced that the testbed is appropriately sophisticated to measure and describe the differences between the two batchers. The amount of work being done is the same, so we expect the same amount of compute resources. There are still a few salient differences between the two forms of batching that would not be teased apart by this benchmark, for example the way that exporter batching has to serialize multiple requests and can't benefit from the queue for concurrency when batches are reduced in size. The effect is on individual request latency, which the testbed doesn't measure.

I'm working on an RFC to describe ideal batching behavior, and there are performance arguments you could test here. When there are processors that do CPU-intensive work, there is likely to be a positive impact from the batch processor because larger batches will perform better due to CPU and memory caches. I.e., if there is compute-intensive work, the batch processor is likely to lead to an optimization because we will invoke the subsequent processors fewer times with the same amount of data.
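As a rough illustration of the "invoke subsequent processors fewer times with the same amount of data" point, a standalone Go microbenchmark along these lines could be used (a sketch only, not testbed code: the per-call overhead is simulated, and cache-locality effects are not modeled):

```go
package sketch

import "testing"

// processBatch simulates one pipeline-stage invocation: a fixed per-call
// overhead plus work proportional to the number of items. Larger batches
// amortize the per-call part over more items.
func processBatch(items []int) int {
	sum := 0
	for i := 0; i < 1000; i++ { // fixed per-invocation overhead
		sum += i
	}
	for _, v := range items { // per-item work
		sum += v * v
	}
	return sum
}

// run pushes `total` items through processBatch in batches of batchSize.
func run(total, batchSize int) int {
	items := make([]int, batchSize)
	sum := 0
	for sent := 0; sent < total; sent += batchSize {
		sum += processBatch(items)
	}
	return sum
}

func BenchmarkBatchSize10(b *testing.B) {
	for i := 0; i < b.N; i++ {
		run(10_000, 10)
	}
}

func BenchmarkBatchSize1000(b *testing.B) {
	for i := 0; i < b.N; i++ {
		run(10_000, 1000)
	}
}
```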

I also claim we'll never get rid of the batch processor and should fix it. As a shining example, the groupbyattr processor absolutely benefits from the batch processor for the reason stated above (in addition to other benefits). https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/groupbyattrsprocessor

@swiatekm (Contributor, Author) commented

> I'm working on an RFC to describe ideal batching behavior, and there are performance arguments you could test here. When there are processors that do CPU-intensive work, there is likely to be a positive impact from the batch processor because larger batches will perform better due to CPU and memory caches. I.e., if there is compute-intensive work, the batch processor is likely to lead to an optimization because we will invoke the subsequent processors fewer times with the same amount of data.

@jmacd any idea on how I could simulate a CPU-heavy processor here? The processors I did add to my benchmark were intended to simulate the average use case, but I'd be happy to have a more skewed benchmark there as well.

@atoulme (Contributor) commented on Nov 12, 2024

Please rebase, fix conflicts and add changelog.

@swiatekm force-pushed the test/testbed-batcher branch from 89f3353 to ace990a on November 13, 2024 14:00
@swiatekm marked this pull request as ready for review on November 13, 2024 14:00
@swiatekm requested a review from a team as a code owner on November 13, 2024 14:00
@swiatekm requested a review from ChrsMark on November 13, 2024 14:00
@swiatekm force-pushed the test/testbed-batcher branch 2 times, most recently from c57c82c to e6a49ca on November 13, 2024 15:04
@swiatekm force-pushed the test/testbed-batcher branch from e6a49ca to 7061eeb on November 14, 2024 10:54
@sfc-gh-sili commented

@swiatekm Thank you, Mikołaj! This is great.
Here are a couple of things that I'd be curious to know:

  1. Sanity-check that batching in the exporter has about the same resource consumption as batching in the processor, which this PR already does.
  2. Like Joshua mentioned, I am also curious whether not batching earlier causes higher resource usage for more expensive processing further down the pipeline.
  (3. I'd also be curious how the size of the goroutine pool used by the exporter affects performance; it seems that's not something we've looked into so far.)

@swiatekm force-pushed the test/testbed-batcher branch from 7061eeb to f872ba4 on December 3, 2024 16:37
@swiatekm (Contributor, Author) commented on Dec 3, 2024

Can we merge this PR as-is, and I'll add another benchmark in a follow-up? @MovieStoreGuy

@jmacd @sfc-gh-sili I can see the argument that we could get better performance on some processing tasks because batch size can affect CPU cache saturation. I'm not 100% sure how to verify this, though. Maybe spawn 100 transform processors doing the same thing? In that case, though, I think we'd mostly see the overhead of nested function calls rather than anything else.
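One alternative might be a purely synthetic CPU-bound consumer, e.g. hashing every log body a fixed number of times. A minimal sketch using the pdata API (the amount of hashing is arbitrary, and nothing like this is part of this PR):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"

	"go.opentelemetry.io/collector/pdata/plog"
)

// burnCPU hashes each log record's body `rounds` times and stores the digest
// as an attribute, so the cost is dominated by CPU work per record rather
// than by allocation or I/O.
func burnCPU(ld plog.Logs, rounds int) {
	rls := ld.ResourceLogs()
	for i := 0; i < rls.Len(); i++ {
		sls := rls.At(i).ScopeLogs()
		for j := 0; j < sls.Len(); j++ {
			lrs := sls.At(j).LogRecords()
			for k := 0; k < lrs.Len(); k++ {
				lr := lrs.At(k)
				sum := []byte(lr.Body().AsString())
				for r := 0; r < rounds; r++ {
					h := sha256.Sum256(sum)
					sum = h[:]
				}
				lr.Attributes().PutStr("digest", hex.EncodeToString(sum))
			}
		}
	}
}
```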

@swiatekm force-pushed the test/testbed-batcher branch 2 times, most recently from 8fff77a to be0715a on December 4, 2024 12:20
@swiatekm force-pushed the test/testbed-batcher branch from be0715a to f08d866 on December 6, 2024 14:18
@swiatekm (Contributor, Author) commented on Dec 6, 2024

Reran the benchmark on current main:

| Test | Result | Duration | CPU Avg% | CPU Max% | RAM Avg MiB | RAM Max MiB | Sent Items | Received Items |
|---|---|---|---|---|---|---|---|---|
| Log10kDPSNoProcessors/No_batching,_no_queue | PASS | 15s | 19.9 | 25.7 | 75 | 104 | 150100 | 150100 |
| Log10kDPSNoProcessors/No_batching,_queue | PASS | 15s | 16.4 | 17.3 | 71 | 100 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_batch_processor,_no_queue | PASS | 15s | 12.6 | 13.7 | 76 | 108 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_batch_processor,_queue | PASS | 15s | 11.9 | 12.7 | 74 | 106 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_exporter_batcher,_no_queue | PASS | 15s | 13.8 | 15.3 | 70 | 100 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_exporter_batcher,_queue | PASS | 15s | 13.2 | 13.7 | 72 | 104 | 150100 | 150100 |
| Log10kDPSWithProcessors/No_batching,_no_queue | PASS | 15s | 19.3 | 20.3 | 73 | 103 | 150100 | 150100 |
| Log10kDPSWithProcessors/No_batching,_queue | PASS | 15s | 19.6 | 21.0 | 73 | 102 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_batch_processor,_no_queue | PASS | 15s | 20.0 | 21.7 | 73 | 102 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_batch_processor,_queue | PASS | 15s | 19.8 | 20.7 | 73 | 104 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_exporter_batcher,_no_queue | PASS | 15s | 17.1 | 19.7 | 74 | 104 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_exporter_batcher,_queue | PASS | 15s | 17.7 | 20.3 | 74 | 104 | 150100 | 150100 |

I also ran it with the exporter.UsePullingBasedExporterQueueBatcher feature gate (introduced in open-telemetry/opentelemetry-collector#11637) enabled:

| Test | Result | Duration | CPU Avg% | CPU Max% | RAM Avg MiB | RAM Max MiB | Sent Items | Received Items |
|---|---|---|---|---|---|---|---|---|
| Log10kDPSNoProcessors/No_batching,_no_queue | PASS | 15s | 16.8 | 18.0 | 74 | 103 | 150100 | 150100 |
| Log10kDPSNoProcessors/No_batching,_queue | PASS | 15s | 16.5 | 16.7 | 73 | 103 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_batch_processor,_no_queue | PASS | 15s | 11.5 | 12.3 | 77 | 109 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_batch_processor,_queue | PASS | 15s | 12.2 | 13.0 | 76 | 109 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_exporter_batcher,_no_queue | PASS | 15s | 11.7 | 12.3 | 73 | 104 | 150100 | 150100 |
| Log10kDPSNoProcessors/Batch_size_1000_with_exporter_batcher,_queue | PASS | 15s | 12.8 | 14.0 | 76 | 107 | 150100 | 150100 |
| Log10kDPSWithProcessors/No_batching,_no_queue | PASS | 15s | 25.5 | 27.0 | 73 | 104 | 150100 | 150100 |
| Log10kDPSWithProcessors/No_batching,_queue | PASS | 15s | 21.6 | 23.7 | 74 | 106 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_batch_processor,_no_queue | PASS | 15s | 21.0 | 24.0 | 73 | 105 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_batch_processor,_queue | PASS | 15s | 20.7 | 21.0 | 72 | 103 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_exporter_batcher,_no_queue | PASS | 15s | 16.0 | 20.0 | 77 | 109 | 150100 | 150100 |
| Log10kDPSWithProcessors/Batch_size_1000_with_exporter_batcher,_queue | PASS | 15s | 15.5 | 16.0 | 78 | 112 | 150100 | 150100 |

Looks like the new batcher is indeed slightly more performant in these simple scenarios.
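For anyone reproducing this: one way to flip such a gate for an in-process collector is through the feature gate registry. A minimal sketch, assuming the gate is registered under that ID (the mechanism actually used for this run may differ):

```go
package sketch

import "go.opentelemetry.io/collector/featuregate"

// enablePullBatcher turns on the pull-based batcher gate before the collector
// under test starts. It returns an error if the gate ID is not registered.
func enablePullBatcher() error {
	return featuregate.GlobalRegistry().Set("exporter.UsePullingBasedExporterQueueBatcher", true)
}
```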

@ChrsMark (Member) left a comment

lgtm

@djaglowski merged commit 64384a0 into open-telemetry:main on Dec 6, 2024
158 checks passed
The github-actions bot added this to the next release milestone on Dec 6, 2024
ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this pull request Dec 6, 2024
@swiatekm deleted the test/testbed-batcher branch on December 9, 2024 11:46
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
Development

Successfully merging this pull request may close these issues.

[exporterhelper] Run performance tests to compare exporter batching with the batcher processor
7 participants