
[BUG] Certain performance counters are not measured and are always zero #1896

Open

double-fault opened this issue Dec 31, 2024 · 6 comments

@double-fault

Describe the bug

Certain counters, such as frontend_retired.l2_miss, are always reported as zero when using the performance counters integration with libpfm.

To confirm that this is indeed a bug, I have made a small example in which a function is flushed from all caches with the clflush instruction and then called, which should lead to an L2 miss. perf stat and perf record confirm that frontend_retired.l2_miss is non-zero and that the miss happens inside the function; however, the counters reported by google benchmark are always zero. I have also noticed this issue with a few other events, such as frontend_retired.latency_ge_1.

System
Which OS, compiler, and compiler version are you using:

  • OS: Ubuntu 22.04, Kernel 6.8.0-49-generic
  • Compiler and version: GNU assembler version 2.38 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.38, g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • Processor: 11th Gen Intel(R) Core(TM) i5-11400H

To reproduce

Make a file fun.S with the following contents.

.section .data

.section .text

.global fun
.global not_fun

.align 512
not_fun:
        nop
        ret

.align 512
fun:
        clflush not_fun
        mfence
        call not_fun
        ret

The function fun flushes not_fun from all cache levels and then calls it, which causes a cache miss when the CPU fetches the first instruction of not_fun.
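For reference, a roughly equivalent C++ sketch of the same idea using compiler intrinsics (this is only an illustration and not part of the repro below; fun_cpp is a made-up name):

#include <immintrin.h>

extern "C" void not_fun();   // the 512-byte-aligned target defined in fun.S

inline void fun_cpp() {
        _mm_clflush(reinterpret_cast<const void *>(not_fun)); // evict not_fun's first cache line from every level
        _mm_mfence();                                         // ensure the flush completes before the call
        not_fun();                                            // fetching its first instruction should now miss in L1i/L2
}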

Make two more files, prog1.S and prog2.cpp, with the following contents. First, prog1.S:

.section .data

.section .text

.global _start
_start:
        mov $10000000, %rcx

loop:
        call fun
        dec %rcx
        jnz loop       

        xor %rdi,%rdi
        mov $60, %rax
        syscall

and then prog2.cpp:

#include <benchmark/benchmark.h>

extern "C" void fun();

void BM(benchmark::State &state) {
        for (auto _ : state) {
                fun();
        }
}

BENCHMARK(BM);

BENCHMARK_MAIN();

and a Makefile:

all: fun.o prog1.o prog2.o
	g++ prog1.o fun.o -o prog1.out -std=c++17 -nostdlib -no-pie -O3 -Wall -Werror -fno-omit-frame-pointer -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -lpfm 
	g++ prog2.o fun.o -o prog2.out -std=c++17 -no-pie -O3 -Wall -Werror -fno-omit-frame-pointer -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -lpfm 

prog1.o: prog1.S
	as prog1.S -o prog1.o

fun.o: fun.S
	as fun.S -o fun.o

prog2.o: prog2.cpp
	g++ prog2.cpp -c -std=c++17 -no-pie -O3 -Wall -Werror -fno-omit-frame-pointer -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -lpfm

clean:
	rm *.o *.out

First, I run prog1.out under perf to get the expected results without google benchmark's code involved. prog1.S is simply a loop that calls fun 1e7 times.

$ sudo perf stat -e cycles,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 taskset -c 0 nice -n -20 ./prog1.out

 Performance counter stats for 'taskset -c 0 nice -n -20 ./prog1.out':

    6,71,13,49,717      cycles                                                                  (49.99%)
         14,09,768      frontend_retired.l1i_miss                                               (50.04%)
         13,26,441      frontend_retired.l2_miss                                                (24.98%)
       3,53,09,157      frontend_retired.latency_ge_1                                           (24.98%)

       1.517582000 seconds time elapsed

       1.515770000 seconds user
       0.000999000 seconds sys

As expected, we see L2 misses. Using perf record/sampling it can be confirmed that all of the misses occur inside fun (see the attached perf report screenshot).

Now, I run prog2.out to see what google benchmark reports for the counters.

$ sudo taskset -c 0 nice -n -20 ./prog2.out --benchmark_perf_counters=CYCLES,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 --benchmark_counters_tabular=true --benchmark_min_time=10000000x
Running ./prog2.out
Run on (12 X 4279.92 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.18, 0.38, 0.49
-------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark           Time             CPU   Iterations     CYCLES frontend_retired.l1i_miss frontend_retired.l2_miss frontend_retired.latency_ge_1
-------------------------------------------------------------------------------------------------------------------------------------------------
BM                158 ns          158 ns     10000000    702.339                  0.342522                        0                             0

and the L2 misses are zero.

For one final confirmation, we run google benchmark under perf stat. I'm not sure whether two programs trying to use the same counters will interfere with each other, but in any case google benchmark again reports the counter values as zero whereas perf stat gives non-zero values. It is unlikely that these non-zero values are caused by google benchmark's own code rather than the test code, as I have shown above that the test code on its own triggers the event (confirmed using perf stat + record).

$ sudo perf stat -e cycles,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 taskset -c 0 nice -n -20 ./prog2.out --benchmark_perf_counters=CYCLES,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 --benchmark_counters_tabular=true --benchmark_min_time=10000000x
Running ./prog2.out
Run on (12 X 4184.45 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.41, 0.42, 0.47
-------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark           Time             CPU   Iterations     CYCLES frontend_retired.l1i_miss frontend_retired.l2_miss frontend_retired.latency_ge_1
-------------------------------------------------------------------------------------------------------------------------------------------------
BM                158 ns          158 ns     10000000    699.904                   0.28635                        0                             0

 Performance counter stats for 'taskset -c 0 nice -n -20 ./prog2.out --benchmark_perf_counters=CYCLES,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 --benchmark_counters_tabular=true --benchmark_min_time=10000000x':

    7,03,84,92,871      cycles                                                                  (16.85%)
         41,63,656      frontend_retired.l1i_miss                                               (16.84%)
         10,25,334      frontend_retired.l2_miss                                                (0.13%)
      26,30,55,471      frontend_retired.latency_ge_1                                           (0.13%)

       1.592229730 seconds time elapsed

       1.586011000 seconds user
       0.006000000 seconds sys

Expected behavior
The counter values should be non-zero when it is clear that L2 cache misses are happening (confirmed with perf).

@dmah42
Member

dmah42 commented Jan 8, 2025

can you identify what in the benchmark code might be affecting this? we use a library for perf counting and other than the configuration we pass, it's not obvious to me what could cause such a bug unless it's with the perf counting library itself.

@double-fault
Author

@dmah42 Actually, I forgot to post an update to this issue, but a few days ago I realized that the problem is not with the specific counter, but rather with the order of the arguments I pass in.

As in the issue description, when I pass ...,frontend_retired.l1i_miss,frontend_retired.l2_miss,... the L2 misses are zero, whereas if I reverse the order of these two events the L2 misses become non-zero and the L1i misses become zero instead. Maybe the events are not being multiplexed properly, or there is some issue in the configuration being passed; I'm not sure.
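If it helps isolate this, here is a minimal standalone sketch of the kind of check I have in mind: it encodes the same two event names with libpfm, opens them as a single perf event group, and reads both values through the group leader. This is my own diagnostic sketch (the grouping scheme and helper names are assumptions), not google benchmark's actual code:

#include <linux/perf_event.h>
#include <perfmon/pfmlib.h>
#include <perfmon/pfmlib_perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cinttypes>
#include <cstdio>
#include <cstring>

// Encode an event name with libpfm and open it with perf_event_open,
// attaching it to group_fd (-1 makes it a group leader).
static int open_event(const char *name, int group_fd) {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        pfm_perf_encode_arg_t arg;
        std::memset(&arg, 0, sizeof(arg));
        arg.attr = &attr;
        arg.size = sizeof(arg);
        if (pfm_get_os_event_encoding(name, PFM_PLM3, PFM_OS_PERF_EVENT, &arg) != PFM_SUCCESS) {
                std::fprintf(stderr, "cannot encode %s\n", name);
                return -1;
        }
        attr.disabled = (group_fd == -1);      // only the leader starts disabled
        attr.read_format = PERF_FORMAT_GROUP;  // read all group members through the leader
        return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0));
}

int main() {
        if (pfm_initialize() != PFM_SUCCESS) return 1;
        // Same ordering as in the report; swap the two names to test the order dependence.
        const char *names[] = {"frontend_retired.l1i_miss", "frontend_retired.l2_miss"};
        int leader = open_event(names[0], -1);
        int member = open_event(names[1], leader);
        if (leader < 0 || member < 0) return 1;

        ioctl(leader, PERF_EVENT_IOC_ENABLE, 0);
        // ... run the workload under test here, e.g. call fun() in a loop ...
        ioctl(leader, PERF_EVENT_IOC_DISABLE, 0);

        // PERF_FORMAT_GROUP layout: number of events followed by one value per event.
        uint64_t buf[3] = {0, 0, 0};
        if (read(leader, buf, sizeof(buf)) > 0) {
                std::printf("%s = %" PRIu64 "\n%s = %" PRIu64 "\n",
                            names[0], buf[1], names[1], buf[2]);
        }
        return 0;
}

(It needs the same privileges as perf, e.g. running as root or with a permissive perf_event_paranoid setting.)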

@dmah42
Member

dmah42 commented Jan 15, 2025

i suspect maybe something in the reporting side rather than the configuration side. i can't see anything obvious wrong in the perf counters Stop method where we write the counter values, or in how those counters are reported.

out of curiosity, if you request JSON output instead, do you get the results you expect?

@double-fault
Author

No, the same issue exists with JSON output.

@dmah42
Member

dmah42 commented Jan 22, 2025

@mtrofin have you seen any issues like this?

@mtrofin
Contributor

mtrofin commented Jan 22, 2025

@mtrofin have you seen any issues like this?

No. If you're passing 3 or fewer counter names, there should be no multiplexing either.

Libpfm translates string counter names to their underlying IDs. To try to isolate the problem, could you try fetching the raw IDs from google/benchmark and feeding those to perf?
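For example, the translation can be done directly with libpfm. A minimal sketch (my own illustration, not google/benchmark code) that prints the raw perf_event_attr encoding for each event name passed on the command line:

#include <perfmon/pfmlib.h>
#include <perfmon/pfmlib_perf_event.h>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
        if (pfm_initialize() != PFM_SUCCESS) return 1;
        for (int i = 1; i < argc; ++i) {
                perf_event_attr attr;
                std::memset(&attr, 0, sizeof(attr));
                pfm_perf_encode_arg_t arg;
                std::memset(&arg, 0, sizeof(arg));
                arg.attr = &attr;
                arg.size = sizeof(arg);
                if (pfm_get_os_event_encoding(argv[i], PFM_PLM3, PFM_OS_PERF_EVENT, &arg) != PFM_SUCCESS) {
                        std::fprintf(stderr, "cannot encode %s\n", argv[i]);
                        continue;
                }
                // config holds the raw event encoding; config1 carries any extra
                // event MSR value (I believe frontend_retired.* uses one).
                std::printf("%s: type=%u config=0x%llx config1=0x%llx\n", argv[i],
                            attr.type,
                            static_cast<unsigned long long>(attr.config),
                            static_cast<unsigned long long>(attr.config1));
        }
        pfm_terminate();
        return 0;
}

The printed values could then be given to perf as a raw event, e.g. perf stat -e cpu/config=0x...,config1=0x.../, to check whether perf and google benchmark are really programming the same thing.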
