Commit
Merge pull request #12 from thanelmas/possible-typos
Suggesting a possible fix for some typos.
dendibakh authored Aug 11, 2022
2 parents be62045 + c06f1d2 commit dca4186
Showing 15 changed files with 21 additions and 21 deletions.
@@ -4,7 +4,7 @@ typora-root-url: ..\..\img

## Detecting Slow FP Arithmetic {#sec:SlowFPArith}

For applications that operate with floating-point values, there is some probability of hitting an exceptional scenario when the FP values become [denormalized](https://en.wikipedia.org/wiki/Denormal_number)[^1]. Operations on denormal values could be easy `10` times slower or more. When CPU handles instruction that tries to do arithmetic operation on denormal FP values, it requires special treatment for such cases. Since it is exceptional situation, CPU requests a microcode [assist](https://software.intel.com/en-us/vtune-help-assists)[^10]. Microcode Sequencer (MSROM) will then feed the pipeline with lots of uops (see [@sec:sec_UOP]) for handling such a scenario.
For applications that operate with floating-point values, there is some probability of hitting an exceptional scenario when the FP values become [denormalized](https://en.wikipedia.org/wiki/Denormal_number)[^1]. Operations on denormal values can easily be `10` times slower or more. When the CPU handles an instruction that performs an arithmetic operation on denormal FP values, such a case requires special treatment. Since this is an exceptional situation, the CPU requests a microcode [assist](https://software.intel.com/en-us/vtune-help-assists)[^10]. The Microcode Sequencer (MSROM) will then feed the pipeline with many uops (see [@sec:sec_UOP]) to handle such a scenario.

TMA methodology classifies such bottlenecks under the `Retiring` category. This is one of the situations when a high Retiring value doesn't mean a good thing. Since operations on denormal values likely represent unwanted behavior of the program, one can simply collect the `FP_ASSIST.ANY` performance counter. The value should be close to zero. An example of a program that does denormal FP arithmetic and thus experiences many FP assists is presented on the easyperf [blog](https://easyperf.net/blog/2018/11/08/Using-denormal-floats-is-slow-how-to-detect-it)[^2]. C++ developers can prevent their application from falling into operations with subnormal values by using the [`std::isnormal()`](https://en.cppreference.com/w/cpp/numeric/math/isnormal)[^3] function. Alternatively, one can change the mode of SIMD floating-point operations, enabling the "flush-to-zero" (FTZ) and "denormals-are-zero" (DAZ) flags in the CPU control register[^5], which prevents SIMD instructions from producing denormalized numbers[^4]. Disabling denormal floats at the code level can be done using dedicated macros, which can vary for different compilers[^6].
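As a hedged illustration (not taken from the text above), a minimal C++ sketch of both approaches might look as follows; the `_MM_SET_*` macros come from the x86 SSE headers, so an x86 target is assumed:

```cpp
#include <cmath>        // std::isnormal
#include <cstdio>
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE / _MM_DENORMALS_ZERO_ON
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE / _MM_FLUSH_ZERO_ON

int main() {
  // Option 1: let the SIMD units flush denormals to zero.
  // FTZ: results that would be denormal are flushed to zero.
  // DAZ: denormal inputs are treated as zero.
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

  // Option 2: clamp values explicitly before they enter a hot loop.
  float v = 1e-38f;
  for (int i = 0; i < 4; ++i) {
    v *= 0.5f;                        // drifts into the denormal range
    if (v != 0.0f && !std::isnormal(v))
      v = 0.0f;                       // treat subnormal values as zero
  }
  std::printf("%g\n", v);
  return 0;
}
```

Keep in mind that FTZ/DAZ change numerical results, so this trade-off must be acceptable for the application.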

@@ -19,7 +19,7 @@

Applications usually have a configurable number of threads, which allows them to run efficiently on platforms with a different number of cores. Obviously, running an application using a lower number of threads than is available on the system underutilizes its resources. On the other hand, running an excessive number of threads can cause a higher CPU time because some of the threads may be waiting on others to complete, or time may be wasted on context switches.

Besides actual worker threads, multithreaded applications usually have helper threads: main thread, input and output threads, etc. If those threads consume significant time, they require dedicated HW thread themselves. This is why it is important to know the total thread count and configure the number of worker threads properly.
Besides actual worker threads, multithreaded applications usually have helper threads: main thread, input and output threads, etc. If those threads consume significant time, they require dedicated HW threads themselves. This is why it is important to know the total thread count and configure the number of worker threads properly.

To avoid the penalty of thread creation and destruction, engineers usually allocate a [pool of threads](https://en.wikipedia.org/wiki/Thread_pool)[^14] with multiple threads waiting for tasks to be allocated for concurrent execution by the supervising program. This is especially beneficial for executing short-lived tasks.
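As a hedged sketch (not from the text above), the number of worker threads can be derived from the available HW threads while leaving room for helper threads; reserving exactly one HW thread below is just an illustrative policy:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

int main() {
  // Total HW threads on the machine (may return 0 if unknown).
  unsigned hwThreads = std::thread::hardware_concurrency();
  if (hwThreads == 0)
    hwThreads = 1;

  // Keep one HW thread free for the main/IO helper threads and use the
  // rest for the worker pool, but always keep at least one worker.
  unsigned numWorkers = std::max(1u, hwThreads - 1);

  std::vector<std::thread> pool;
  for (unsigned i = 0; i < numWorkers; ++i)
    pool.emplace_back([] { /* worker body: pull tasks from a shared queue (omitted) */ });

  for (auto& t : pool)
    t.join();
  return 0;
}
```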

@@ -69,7 +69,7 @@ False sharing is a frequent source of performance issues for multithreaded applications
When using Intel VTune Profiler, the user needs two types of analysis to find and eliminate false sharing issues. Firstly, run the [Microarchitecture Exploration](https://software.intel.com/en-us/vtune-help-general-exploration-analysis)[^19] analysis that implements the TMA methodology to detect the presence of false sharing in the application. As noted before, a high value for the Contested Accesses metric prompts us to dig deeper and run the [Memory Access](https://software.intel.com/en-us/vtune-help-memory-access-analysis) analysis with the "Analyze dynamic memory objects" option enabled. This analysis helps in finding accesses to the data structure that caused contention issues. Typically, such memory accesses have high latency, which will be revealed by the analysis. See an example of using Intel VTune Profiler for fixing false sharing issues on the [Intel Developer Zone](https://software.intel.com/en-us/vtune-cookbook-false-sharing)[^20].
Linux `perf` has support for finding false sharing as well. As with Intel VTune Profiler, run TMA first (see [@sec:secTMA_perf]) to find out that the program experience false/true sharing issues. If that's the case, use the `perf c2c` tool to detect memory accesses with high cache coherency cost. `perf c2c` matches store/load addresses for different threads and see if the hit in a modified cache line occurred. Readers can find a detailed explanation of the process and how to use the tool in dedicated [blog post](https://joemario.github.io/blog/2016/09/01/c2c-blog/)[^21].
Linux `perf` has support for finding false sharing as well. As with Intel VTune Profiler, run TMA first (see [@sec:secTMA_perf]) to find out whether the program experiences false/true sharing issues. If that's the case, use the `perf c2c` tool to detect memory accesses with high cache coherency cost. `perf c2c` matches store/load addresses for different threads and sees if a hit in a modified cache line occurred. Readers can find a detailed explanation of the process and how to use the tool in a dedicated [blog post](https://joemario.github.io/blog/2016/09/01/c2c-blog/)[^21].
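As a rough illustration of the workflow (exact options may differ between `perf` versions), a `perf c2c` session could look like:

```bash
# Record loads and stores with data addresses for cache-to-cache analysis.
$ perf c2c record -- ./a.out
# Summarize contended cache lines; lines with many HITM events coming
# from different threads are candidates for false sharing.
$ perf c2c report --stdio
```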
It is possible to eliminate false sharing with the help of aligning/padding memory objects. The example in [@sec:secTrueSharing] can be fixed by ensuring `sumA` and `sumB` do not share the same cache line (see details in [@sec:secMemAlign]); a sketch of such a fix follows below.
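This is a hedged illustration, assuming `sumA` and `sumB` are accumulators updated by two different threads; the loop bodies are placeholders rather than the code from the referenced example:

```cpp
#include <cstddef>
#include <thread>

// Assumed cache line size; C++17 also offers
// std::hardware_destructive_interference_size.
constexpr std::size_t kCacheLineSize = 64;

// alignas pads each counter out to its own cache line, so writes from
// different threads no longer invalidate each other's line.
struct alignas(kCacheLineSize) PaddedCounter {
  long value = 0;
};

PaddedCounter sumA;
PaddedCounter sumB;

int main() {
  std::thread t1([] { for (int i = 0; i < 1000000; ++i) sumA.value += i; });
  std::thread t2([] { for (int i = 0; i < 1000000; ++i) sumB.value += i; });
  t1.join();
  t2.join();
  return 0;
}
```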
2 changes: 1 addition & 1 deletion chapters/14-Appendix/Appendix-A.md
@@ -113,7 +113,7 @@ $ perf stat -e context-switches,cpu-migrations -r 10 -- taskset -c 0 a.exe
102 context-switches
0 cpu-migrations
```
Notice the number of `cpu-migrations` gets down to `0`, i.e., the process never leaves the `core0`.
notice the number of `cpu-migrations` gets down to `0`, i.e., the process never leaves the `core0`.

Alternatively, you can use the [cset](https://github.com/lpechacek/cpuset)[^10] tool to reserve CPUs for just the program you are benchmarking. If using `Linux perf`, leave at least two cores so that `perf` runs on one core and your program runs on another. The command below will move all threads out of N1 and N2 (`-k on` means that even kernel threads are moved out):

2 changes: 1 addition & 1 deletion chapters/5-Performance-Analysis-Approaches/5-4 Sampling.md
@@ -120,7 +120,7 @@ $ perf report -n --stdio --no-children

When using Intel VTune Profiler, one can collect call stack data by checking the corresponding "Collect stacks" box while configuring analysis[^2]. When using the command-line interface, specify the `-knob enable-stack-collection=true` option.

\personal{Mechanism of collecting call stacks is very important to understand. I've seen some developers that are not familiar with the concept try to obtain this information by using a debugger. They do this by interrupting the execution of a program and analyze the call stack (like `backtrace` command in `gdb` debugger). Developers should allow profiling tools to do the job, which is much faster and gives more accurate data.}
\personal{The mechanism of collecting call stacks is very important to understand. I've seen some developers who are not familiar with the concept try to obtain this information by using a debugger. They do this by interrupting the execution of a program and analyzing the call stack (like the `backtrace` command in the `gdb` debugger). Developers should allow profiling tools to do the job, which is much faster and gives more accurate data.}
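For comparison, a hedged sketch of letting the profiler collect call stacks with Linux `perf` (DWARF-based unwinding shown; frame-pointer and LBR modes are also available):

```bash
# Sample the program and record call stacks using DWARF unwind info.
$ perf record --call-graph dwarf -- ./a.out
# Show hotspots together with their calling contexts.
$ perf report --stdio --no-children
```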

### Flame Graphs

@@ -116,7 +116,7 @@ S0-C0-T0 BE_Bound.Mem_Bound.Store_Bound: 0.69 % Stalls
S0-C0-T0 BE_Bound.Core_Bound.Divider: 8.56 % Clocks
S0-C0-T0 BE_Bound.Core_Bound.Ports_Util: 11.31 % Clocks
```
We found the bottleneck to be in `DRAM_Bound`. This tells us that many memory accesses miss in all levels of caches and go all the way down to the main memory. We can also confirm if we collect the absolute number of L3 cache misses (DRAM hit) for the program. For Skylake architecture, the `DRAM_Bound` metric is calculated using the `CYCLE_ACTIVITY.STALLS_L3_MISS` performance event. Let’s collect it:
We found the bottleneck to be in `DRAM_Bound`. This tells us that many memory accesses miss in all levels of caches and go all the way down to the main memory. We can also confirm this if we collect the absolute number of L3 cache misses (DRAM hit) for the program. For the Skylake architecture, the `DRAM_Bound` metric is calculated using the `CYCLE_ACTIVITY.STALLS_L3_MISS` performance event. Let’s collect it:

```bash
$ perf stat -e cycles,cycle_activity.stalls_l3_miss -- ./a.out
@@ -130,7 +130,7 @@ According to the definition of `CYCLE_ACTIVITY.STALLS_L3_MISS`, it counts cycles

As the second step in the TMA process, we would locate the place in the code where the bottleneck occurs most frequently. In order to do so, one should sample the workload using a performance event that corresponds to the type of bottleneck that was identified during Step 1.

A recommended way to find such an event is to run `toplev` tool with the `--show-sample` option that will suggest the `perf record` command line that can be used to locate the issue. For the purpose of understanding the mechanics of TMA, we also present the manual way to find an event associated with a particular performance bottleneck. Correspondence between performance bottlenecks and performance events that should be used for locating the place in the code where such bottlenecks take place can be done with the help of [TMA metrics](https://download.01.org/perfmon/TMA_Metrics.xlsx)[^2] table introduced earlier in the chapter. The `Locate-with` column denotes a performance event that should be used to locate the exact place in the code where the issue occurs. For the purpose of our example, in order to find memory accesses that contribute to such a high value of the `DRAM_Bound` metric (miss in the L3 cache), we should sample on `MEM_LOAD_RETIRED.L3_MISS_PS` precise event as shown in the listing above:
A recommended way to find such an event is to run the `toplev` tool with the `--show-sample` option, which will suggest the `perf record` command line that can be used to locate the issue. For the purpose of understanding the mechanics of TMA, we also present the manual way to find an event associated with a particular performance bottleneck. The mapping between performance bottlenecks and the performance events that should be used to locate the corresponding places in the code can be established with the help of the [TMA metrics](https://download.01.org/perfmon/TMA_Metrics.xlsx)[^2] table introduced earlier in the chapter. The `Locate-with` column denotes a performance event that should be used to locate the exact place in the code where the issue occurs. For our example, in order to find memory accesses that contribute to such a high value of the `DRAM_Bound` metric (misses in the L3 cache), we should sample on the `MEM_LOAD_RETIRED.L3_MISS_PS` precise event, as shown in the listing below:

```bash
$ perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp ./a.out
@@ -209,7 +209,7 @@ $ perf stat -e cycles,cycle_activity.stalls_l3_miss -- ./a.out
6,498080824 seconds time elapsed
```
TMA is an iterative process, so we now need to repeat the process starting from the Step1. Likely it will move the bottleneck into some other bucket, in this case, Retiring. This was an easy example demonstrating the workflow of TMA methodology. Analyzing real-world application is unlikely to be that easy. The next entire chapter in this book is organized in a way to be conveniently used with the TMA process. E.g., its sections are broken down to reflect each high-level category of performance bottlenecks. The idea behind such a structure is to provide some kind of checklist which developer can use to drive code changes after performance issue has been found. For instance, when developers see that the application they are working on is `Memory Bound`, they can look up [@sec:MemBound] for ideas.
TMA is an iterative process, so we now need to repeat the process starting from Step 1. Most likely it will move the bottleneck into some other bucket, in this case Retiring. This was an easy example demonstrating the workflow of TMA methodology. Analyzing a real-world application is unlikely to be that easy. The next entire chapter in this book is organized in a way to be conveniently used with the TMA process. For example, its sections are broken down to reflect each high-level category of performance bottlenecks. The idea behind such a structure is to provide some kind of checklist which developers can use to drive code changes after a performance issue has been found. For instance, when developers see that the application they are working on is `Memory Bound`, they can look up [@sec:MemBound] for ideas.
### Summary
@@ -209,7 +209,7 @@ $ perf report -n --sort symbol_from,symbol_to -F +cycles,srcline_from,srcline_to
0.58% 3804 24 dec.c:174 dec.c:174
```

Several not significant lines were removed from the output of `perf record` in order to make it fit on the page. If we now focus on the branch in which source and destination is `dec.c:174`[^10], we can find multiple lines associated with it. Linux `perf` sorts entries by overhead first, which requires us to manually filter entries for the branch which we are interested in. In fact, if we filter them, we will get the latency distribution for the basic block that ends with this branch, as shown in the table {@tbl:bb_latency}. Later user can plot this data and get a chart similar to Figure @fig:LBR_timing_BB.
Several insignificant lines were removed from the output of `perf record` in order to make it fit on the page. If we now focus on the branch whose source and destination are `dec.c:174`[^10], we can find multiple lines associated with it. Linux `perf` sorts entries by overhead first, which requires us to manually filter the entries for the branch we are interested in. In fact, if we filter them, we will get the latency distribution for the basic block that ends with this branch, as shown in Table {@tbl:bb_latency}. Later we can plot this data and get a chart similar to Figure @fig:LBR_timing_BB.

----------------------------------------------
Cycles Number of samples Probability density