Finite Difference Method: How to use omniperf to find the "optimisation hints" to follow? #12

AlexisEspinosaGayosso · 2023-11-08T06:57:06Z

AlexisEspinosaGayosso
Nov 8, 2023

The "Finite Difference Method" analysis is based on an apriori knowledge of the theoretical fetch & write sizes. Then, with some instrumentation of time measurements within the code, these serve for the calculation of the "effective_memory_bandwidth (EffBW)" within the laplacian.cpp code. This gives us a first hint: kernel1_EffBW=808GB/s while theoretical hardware peak is 1638GB/s, so we need to improve that. That works fine, but leave us "unarmed" when using omniperf (or rocprof) for analising a general code where the "theoretical fetch sizes" are unknown a priori. So, your approach suggests (in between the lines) that we should put a strong effort ourselves in estimating a priori the expected "effective_memory_bandwith" (by instrumenting our code with timers and using our paper+pencil+brain to get the theoretical fetch sizes).

I have seen somewhere else (and is some F2F training) the use of the profilers as the very first option without the use of timers instrumentation and paper+pencil calculations. I guess this is because many times it might be difficult to perform all these calculations and we rather go directly to the profiler reports to initiate the investigation. So, could you answer here some of the hints that we may get directly from omniperf (or rocprof) reports without knowing the theoretical fetch sizes and calculated_ffBW a priori? (I prefer, if possible, a response using omniperf, as it is more visual than rocprof.) In other words, can the EffBW be extracted directly from omniperf? (or rocprof).

I tried to use omniperf and, selecting just the 'laplacian_kernel' and normalizing 'per_second' I can see in section 17. L2 Cache shows for kernel1 a kernel1_L2-Read-BW=769GB/s and kernel1_L2-Write-BW=406GB/s. These numbers, unfortunately, do not add up to a number similar to the kernel1_calculated_EffBW=800 GB/s. (Adding them up I got L2-RW-kernel1-BW=1112GB/s or 67% of hardwarePeak.)

Then, I analysed kernel3 with omniperf and got: kernel3_L2-Read-BW=561GB/s and kernel3_L2-Write-BW=552GB/s which add to L2-RW-kernel3-BW=1113GB/s or 67% of hardwarePeak. And, in this case, the numbers indeed add up close to the kernel3_calculated_EffBW=1100GB/s.

And, for kernel5, I got: kernel5_L2-Read-BW=590GB/s and kernel5_L2-Write-BW=584GB/s which add to L2-RW-kernel5-BW=1174GB/s or 71% of hardwarePeak. And, in this case the numbers indeed add up close to the kernel5_calculated_EffBW=1166GB/s.

So, in my efforts to try to "discover" hints for improvement from the omniperf report I was able to see the improvement trend in the numbers reported. Anyway, this was not really a "blind" first investigation, as I have read already the exercise and knew already that I needed to look for the L2 bandwitdh numbers in omniperf. (I guess that to always look into these numbers might help in a real "blind" first search from another code.)

But again, the initial question still persists. I will reformulate the initial question into a list several questions here that I would love to see answered in a listed response too:

What other important hints could we extract from omniperf (or rocm) that can guide us in the optimisation without doing the hard work of instrumenting timers and calculating theoretical fetch sizes?
For example, how to identify that our L2-RW-realisticTarget-BW=1165BG/s from the omniperf report? (or rocprof). (I guess with minimal additional manual calculations if the number itself is not in the reports.)
Also, can we get any hints from the roofline plots?
Finally, for kernel1, why the numbers from omniperf (L2-RW-kernel1-BW) are not adding up to a number close to the kernel1_calculated_EffBW calculated in the code? But, indeed, for kernel3 and kernel5 they add up close to the kernel3 and kernel5calculated_EffBW? In other words, can we trust add-up numbers from the omniperf report to be equivalent to the calculated-EffBW?
Finally, finally, the add-ups L2-RW-kernel1-BW and L2-RW-kernel2-BW are almost identical (being both 67% of the hardwarePeak), but the individual numbers of the Read and Write parts are quite different. What could that mean to us when performing this kind of "blind" investigation that I'm trying to ask you to give us hints from?

Please answer here first, but it would be fantastic if, afterwards, you could add an additional part of the "Finite Difference Method" series with guidance on how to use omniperf (or rocm) starting with a "blind" search for hints for optimising the code.

Many thanks.

Answered by jychang48

Nov 10, 2023

Hi @AlexisEspinosaGayosso ,

TL;DR version below

First of all, thank you for these excellent questions. Regarding your omniperf questions, we are currently writing a blog post demonstrating how to extract useful performance data from omniperf. Case studies like this Laplacian post series plus other more complex kernels will be studied. Much of the guidance you're looking for shall be in that blog. Expect a release early next year.

Before I answer your questions, let me take a few steps back with some general comments:

Yes, the very first step should be the use of profilers, we have a whole blog post discussing the different tools at our disposal and when/where to use each one. But I want…

View full answer

jychang48 · 2023-11-10T05:00:33Z

jychang48
Nov 10, 2023
Maintainer

Hi @AlexisEspinosaGayosso ,

TL;DR version below

First of all, thank you for these excellent questions. Regarding your omniperf questions, we are currently writing a blog post demonstrating how to extract useful performance data from omniperf. Case studies like this Laplacian post series plus other more complex kernels will be studied. Much of the guidance you're looking for shall be in that blog. Expect a release early next year.

Before I answer your questions, let me take a few steps back with some general comments:

Yes, the very first step should be the use of profilers, we have a whole blog post discussing the different tools at our disposal and when/where to use each one. But I want to it clear that a "blind" approach was in fact used for this work. Recall that in the first part of the Laplacian blog series, there is a subsection where the users must ask themselves whether the code is memory bound or compute bound. This blog effort was written before omniperf became publicly available, so we were fortunate that this stencil algorithm is simple enough to estimate the arithmetic intensity by hand. In reality though, one should use omniperf and discern from the roofline plots whether your kernel is compute or memory bound. Knowing that the kernel is memory bound shed some on light on what aspects of the code/performance we should be looking at, we now know that the performance bottlenecks has to do with memory traffic, thus naturally one of the most important metrics to look at are those relating to memory traffic like FETCH_SIZE and WRITE_SIZE.
Generally, when optimizing algorithms, one should ask two questions 1) how well am I using the hardware and 2) how "fast" is my implementation. Tools like rocprof and omniperf are designed to help answer the first question. The second question is much harder to answer and relies on a thorough understanding of the algorithm. The roofline model doesn't show you the drawbacks of the chosen implementation - one can easily throw in DGEMM or STREAM operations to make the algorithm appear more compute or memory bound, respectively, but this doesn't mean the code is "fast". This is where the "effective memory bandwidth" (EffBW) metric comes into play - it is a unique metric we derived solely for this Laplacian blog series and is not something you can find in rocprof or omniperf. The 2nd order central finite difference method is simple enough to understand that every stencil point can be reused up to 6 times by neighboring threads. An algorithm requiring a particular grid point to be fetched by the L2 cache 7 different times by 7 different threads from global memory might look "efficient" from a roofline perspective if the accesses are coalesced (which in this case they are), but this would not by no means be a "fast" implementation. A "faster" implementation should be an algorithm where each grid point is fetched and written once, and the EffBW is essentially our way to discern whether we're efficiently fetching/writing the memory only once. When EffBW matches the reported bandwidth, that means we have achieved our goal from a global memory traffic perspective. Keep in mind that this doesn't necessarily mean we're truly hardware efficient as the performance can depend on a number of other factors like latencies, ROCm versions, system configuration, etc.
If you crunch the numbers, you should see the reported FETCH_SIZE and WRITE_SIZE metrics divided by average kernel time equals omniperf's reported L2 read+write BW. There might be some numerical discrepancies because of the use of Kilobytes (1e3) versus Kibibytes (1024) but the numbers should more or less align. In fact, the omniperf metrics are even more useful because it is able to distinguish between the reads and the writes whereas our blog series just computes the entire read-write bandwidth. But again, this post was written/started before the release of omniperf - a follow up blog post will rewalk readers through the entire analysis using omniperf. If our only goal was to make the algorithm hardware efficient, we would have ended the blog after kernel1 as the L2 read+write BW was already very close to 1.2 TB/s, which is very close to the achievable memory bandwidth. However, introducing the EffBW metric was our attempt to also answer the second question.

In short, an algorithmically inefficient implementation can potentially utilize the hardware efficiently. Likewise, an algorithmically efficient implement could poorly utilize the hardware - in HPC we ideally want both an algorithmically and hardware efficient implementation, but in practice one must find the balance between the two. All of the existing kernels utilize the hardware decently well, but the EffBW metric is our way of determining whether the implementation is efficient or not.

All of that said, here are my answers to your questions:

What other important hints could we extract from omniperf (or rocm) that can guide us in the optimisation without doing the hard work of instrumenting timers and calculating theoretical fetch sizes?

If your goal is to simply gauge how well an algorithm is utilizing the hardware and you've already discerned that your algorithm is memory bound, then the FETCH_SIZE and WRITE_SIZE metrics from rocprof divided by kernel duration should be sufficient. From omniperf, this would be the L2 read+write bandwidths. Gauging how "fast" your algorithmic implementation is will require you to dive deeper into the algorithm, there's no other way around this - the finite difference code was simple enough where we were able to derive a metric to help understand this aspect.

For example, how to identify that our L2-RW-realisticTarget-BW=1165BG/s from the omniperf report? (or rocprof). (I guess with minimal additional manual calculations if the number itself is not in the reports.)

The EffBW target of 1165 GB/s was computed by assuming that FETCH_SIZE will be brought down to the theoretical value and that both WRITE_SIZE and average kernel time remain the same. Alternatively, you can compute the 1165 GB/s from just (FETCH_SIZE + WRITE_SIZE) / average kernel time. This blog point deliberately avoids the question of whether 1165 GB/s is the maximum possible value because as mentioned earlier, this can depend on a number of other factors like latencies, ROCm versions, system configuration, etc not covered yet.

Omniperf should calculate the achievable memory bandwidth of your machine, this is needed for the roofline charts. Alternatively, you can compute these yourself with BabelStream

Also, can we get any hints from the roofline plots?

Again the follow on blog will cover this in greater depth but you may find this video helpful. Do keep in mind that this presentation was from SC22, the omniperf tool has made significant improvements ever since

Finally, for kernel1, why the numbers from omniperf (L2-RW-kernel1-BW) are not adding up to a number close to the kernel1_calculated_EffBW calculated in the code? But, indeed, for kernel3 and kernel5 they add up close to the kernel3 and kernel5calculated_EffBW? In other words, can we trust add-up numbers from the omniperf report to be equivalent to the calculated-EffBW?

EffBW is based on theoretical fetches and writes, the omniperf BW based on the actual fetches and writes. Think of it this way, the closer the EffBW is to the reported L2-RW-kernel-BW, the more "algorithmically efficient" the kernel becomes. Kernels 1 and 2 are obviously inefficient, whereas kernels 3 and above begin to close in on the actual BW. In the blog, we made a projection/target of 1165 GB/s, but alternatively you could simply use the (FETCH_SIZE + WRITE_SIZE)/time metric as your target.

Finally, finally, the add-ups L2-RW-kernel1-BW and L2-RW-kernel2-BW are almost identical (being both 67% of the hardwarePeak), but the individual numbers of the Read and Write parts are quite different. What could that mean to us when performing this kind of "blind" investigation that I'm trying to ask you to give us hints from?

From a hardware perspective, these alglorithms are equivalent as you can see from the omniperf numbers, the point of kernel 2 is to emphasize that loop tiling alone doesn't offer much benefit over the baseline implementation if you're not careful with memory access patterns. Kernel 2 is a deliberately inefficient kernel - details concerning the differences in reads and writes are far less important than the fact that the EffBW for kernel 2 is still below the actual BW.

Hope this helps,
Justin

TL;DR - Omniperf/rocprof only tells you how efficiently your algorithm utilizes the hardware. Understanding how "fast" your implementation is cannot be answered through those tools alone - this requires the user to analyze the algorithm. Fortunately this finite difference method is simple enough where we have derived the "effective memory bandwidth" metric to gauge how "efficient" our implementation is. Be on the lookout for a new omniperf blog.

1 reply

AlexisEspinosaGayosso Nov 24, 2023
Author

Thanks a lot, I'm looking forward for that blog. I'll mark this question as answered now, but please add links to the blog that you mentioned is released.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finite Difference Method: How to use omniperf to find the "optimisation hints" to follow? #12

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Finite Difference Method: How to use omniperf to find the "optimisation hints" to follow? #12

AlexisEspinosaGayosso Nov 8, 2023

Replies: 1 comment · 1 reply

jychang48 Nov 10, 2023 Maintainer

AlexisEspinosaGayosso Nov 24, 2023 Author

AlexisEspinosaGayosso
Nov 8, 2023

Replies: 1 comment 1 reply

jychang48
Nov 10, 2023
Maintainer

AlexisEspinosaGayosso Nov 24, 2023
Author