A common mistake in writing benchmarks is to let noise mask the signal. For example, if a benchmark converts the stream to a list, the time consumed by list creation itself will mask the stream processing time. The total time would be:
total = n * (list cons overhead) + n * (stream processing overhead)
If the list cons overhead is high, it masks the stream processing time, which is what we are really interested in. We need to ensure that any such noise factor is either small enough to neglect, or that we measure it separately and deduct it from the results.
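As an illustration, here is a minimal sketch of measuring the noise term separately so it can be deducted. It assumes the criterion benchmarking API (gauge and tasty-bench provide the same combinators); the pipeline and sizes are made up for the example, with a plain list map standing in for a stream operation.

```haskell
import Criterion.Main (defaultMain, bench, nf)

main :: IO ()
main = defaultMain
  [ -- Baseline: only the noise term, i.e. building an n-element list.
    bench "baseline/toList" $ nf (\m -> [1 .. m]) n
    -- Noise plus the operation of interest (a map standing in for a
    -- stream operation); subtracting the baseline approximates the
    -- cost of the map itself.
  , bench "map/toList" $ nf (\m -> map (+ 1) [1 .. m]) n
  ]
  where
    n = 100000 :: Int
```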
Ideally, each micro benchmark should come with a description of exactly what it measures, i.e. what the benchmarking code is doing. In the example above, we would mention that the benchmark measures the list creation time plus the stream operation time, and that the list creation time dominates.
To prevent list creation time from dominating, we should choose a stream exit operation that takes as little time as possible. For example, we could use a sum fold on the output of the stream instead of converting it into a list, on the assumption that the fold time is much lower. However, this means that we are now measuring the fold time plus the stream operation time: if the fold is expensive in a particular library, we end up measuring that library's fold performance rather than the performance of the operation under test.
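A sketch of the two exit operations side by side, again assuming the criterion-style API; the map over an enumeration is only a stand-in for a real stream operation:

```haskell
import Criterion.Main (defaultMain, bench, nf, whnf)
import Data.List (foldl')

main :: IO ()
main = defaultMain
  [ -- Exit via a toList-style result: the timing includes building and
    -- forcing n cons cells in addition to the map itself.
    bench "map/toList" $ nf (\m -> map (+ 1) [1 .. m]) n
    -- Exit via a strict sum fold: the result is a single Int, so we now
    -- measure (fold time + map time) rather than list construction time.
  , bench "map/sumFold" $ whnf (\m -> foldl' (+) 0 (map (+ 1) [1 .. m])) n
  ]
  where
    n = 100000 :: Int
```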
Ideally, we would break down the components involved in a benchmark and show each one separately. However, in the real world we use combinations of operations. That is why micro benchmarks are only useful for optimizing individual operations; overall performance may be dominated by the most frequently used operations and by the weakest link in the whole chain.
Memory, once requested from the OS, is never released back to the OS by GHC (this has changed since GHC 9.2). This may lead to false maxrss reporting when multiple benchmarks from the same benchmark recipe file run in one process. To avoid that, we run each benchmark in an isolated process. The bench.sh script takes care of this; running the benchmark executable directly, with all benchmarks in one process, may not give correct results.
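The following is a simplified sketch of the idea behind per-benchmark process isolation, not the actual bench.sh logic; the executable path and benchmark names are hypothetical, and how a single benchmark is selected on the command line depends on the benchmark harness in use.

```haskell
import System.Process (callProcess)

-- Hypothetical benchmark names defined in the benchmark executable.
benchmarks :: [String]
benchmarks = ["map", "filter", "fold"]

main :: IO ()
main =
  -- One fresh process per benchmark, so each one's maxrss reflects only
  -- its own allocations rather than memory retained by earlier runs.
  mapM_ (\name -> callProcess "./bench-exe" [name]) benchmarks
```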
Running benchmarks in isolation also avoids other kinds of interference among benchmarks, for example unwanted sharing of evaluated data. It cannot, however, avoid compile time interference: for example, sharing of data across benchmarks may lead GHC to generate suboptimal code.
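A small sketch of the sharing pitfall, with made-up benchmark contents: a top-level input used by two benchmarks becomes a shared CAF, so whichever benchmark runs first pays for (and then retains) its evaluation, whereas a per-benchmark input avoids the run time sharing.

```haskell
import Criterion.Main (defaultMain, bench, whnf)

-- A shared top-level input (a CAF): GHC floats it out at compile time,
-- and at run time it is evaluated once and retained after first use.
sharedInput :: [Int]
sharedInput = [1 .. 100000]

main :: IO ()
main = defaultMain
  [ bench "sum/shared"    $ whnf sum sharedInput    -- evaluates and retains the list
  , bench "length/shared" $ whnf length sharedInput -- runs against the cached list
    -- Constructing the input from a seed inside the benchmark avoids the
    -- run time sharing; the compile time decision to share a CAF can only
    -- be avoided by not writing one in the first place.
  , bench "sum/fresh"     $ whnf (\m -> sum [1 .. m]) (100000 :: Int)
  ]
```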
We have tried to optimize each library's code (with INLINE pragmas etc.) as much as we could. Library authors are encouraged to check whether their library is being used in a fully optimized manner and to report any issues.
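As a rough, hypothetical illustration of the kind of annotation meant here, the small wrapper that a benchmark calls can be marked INLINE so that GHC can inline it at the call site and give the library's rewrite rules a chance to fire across it; the pipeline below is a made-up stand-in for a real library operation.

```haskell
import Data.List (foldl')

-- Without the pragma this wrapper may act as an optimization barrier;
-- with it, GHC can inline the body into the benchmark so that any
-- fusion rules of the underlying library can apply.
{-# INLINE mapThenSum #-}
mapThenSum :: Int -> Int
mapThenSum n = foldl' (+) 0 (map (+ 1) [1 .. n])

main :: IO ()
main = print (mapThenSum 100000)
```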