Finite Difference Method: How to use omniperf to find the "optimisation hints" to follow? #12
-
The "Finite Difference Method" analysis is based on an apriori knowledge of the theoretical fetch & write sizes. Then, with some instrumentation of time measurements within the code, these serve for the calculation of the "effective_memory_bandwidth (EffBW)" within the laplacian.cpp code. This gives us a first hint: kernel1_EffBW=808GB/s while theoretical hardware peak is 1638GB/s, so we need to improve that. That works fine, but leave us "unarmed" when using omniperf (or rocprof) for analising a general code where the "theoretical fetch sizes" are unknown a priori. So, your approach suggests (in between the lines) that we should put a strong effort ourselves in estimating a priori the expected "effective_memory_bandwith" (by instrumenting our code with timers and using our paper+pencil+brain to get the theoretical fetch sizes). I have seen somewhere else (and is some F2F training) the use of the profilers as the very first option without the use of timers instrumentation and paper+pencil calculations. I guess this is because many times it might be difficult to perform all these calculations and we rather go directly to the profiler reports to initiate the investigation. So, could you answer here some of the hints that we may get directly from omniperf (or rocprof) reports without knowing the theoretical fetch sizes and calculated_ffBW a priori? (I prefer, if possible, a response using omniperf, as it is more visual than rocprof.) In other words, can the EffBW be extracted directly from omniperf? (or rocprof). I tried to use omniperf and, selecting just the 'laplacian_kernel' and normalizing 'per_second' I can see in section 17. L2 Cache shows for kernel1 a kernel1_L2-Read-BW=769GB/s and kernel1_L2-Write-BW=406GB/s. These numbers, unfortunately, do not add up to a number similar to the kernel1_calculated_EffBW=800 GB/s. (Adding them up I got L2-RW-kernel1-BW=1112GB/s or 67% of hardwarePeak.) 
Then, I analysed kernel3 with omniperf and got: kernel3_L2-Read-BW=561GB/s and kernel3_L2-Write-BW=552GB/s, which add up to L2-RW-kernel3-BW=1113GB/s, or 67% of hardwarePeak. In this case, the numbers indeed add up close to the kernel3_calculated_EffBW=1100GB/s. And for kernel5 I got: kernel5_L2-Read-BW=590GB/s and kernel5_L2-Write-BW=584GB/s, which add up to L2-RW-kernel5-BW=1174GB/s, or 71% of hardwarePeak. In this case, too, the numbers add up close to the kernel5_calculated_EffBW=1166GB/s.

So, in my efforts to "discover" hints for improvement from the omniperf report, I was able to see the improvement trend in the reported numbers. Anyway, this was not really a "blind" first investigation, as I had already read the exercise and knew that I needed to look at the L2 bandwidth numbers in omniperf. (I guess that always looking into these numbers might help in a real "blind" first search on another code.) But the initial question still persists. I will reformulate it into a list of several questions here that I would love to see answered in a listed response too:
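The arithmetic behind the percentages quoted above is just the sum of omniperf's L2 read and write bandwidths compared against the 1638 GB/s hardware peak. A small sketch, using the kernel3 and kernel5 figures from this thread (the truncation to a whole percent matches how the figures are quoted here):

```python
# Sum omniperf's L2 read and write bandwidths and express the total
# as a fraction of the 1638 GB/s theoretical hardware peak.
# Bandwidth figures (GB/s) are the ones quoted in this discussion.

HW_PEAK = 1638.0  # GB/s, theoretical peak quoted above

def l2_rw_summary(read_bw, write_bw, peak=HW_PEAK):
    total = read_bw + write_bw
    return total, 100.0 * total / peak

for name, read_bw, write_bw in [("kernel3", 561, 552), ("kernel5", 590, 584)]:
    total, pct = l2_rw_summary(read_bw, write_bw)
    print(f"{name}: L2-RW-BW = {total:.0f} GB/s ({int(pct)}% of peak)")
    # kernel3: 1113 GB/s (67% of peak); kernel5: 1174 GB/s (71% of peak)
```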
Please answer here first, but it would be fantastic if, afterwards, you could add an additional part to the "Finite Difference Method" series with guidance on how to use omniperf (or rocprof) starting with a "blind" search for hints for optimising the code. Many thanks.
Replies: 1 comment 1 reply
-
TL;DR version below.

First of all, thank you for these excellent questions. Regarding your omniperf questions, we are currently writing a blog post demonstrating how to extract useful performance data from omniperf. Case studies like this Laplacian post series, plus other more complex kernels, will be examined. Much of the guidance you're looking for will be in that blog; expect a release early next year. Before I answer your questions, let me take a few steps back with some general comments:
In short, an algorithmically inefficient implementation can still utilize the hardware efficiently. Likewise, an algorithmically efficient implementation could utilize the hardware poorly. In HPC we ideally want an implementation that is both algorithmically and hardware efficient, but in practice one must find a balance between the two. All of the existing kernels utilize the hardware decently well, and the EffBW metric is our way of determining whether the implementation itself is efficient. All of that said, here are my answers to your questions:
If your goal is simply to gauge how well an algorithm is utilizing the hardware, and you've already discerned that your algorithm is memory bound, then the bandwidth numbers reported by omniperf are sufficient on their own.
The EffBW target of 1165 GB/s was computed from the achievable memory bandwidth of your machine. Omniperf calculates this achievable bandwidth, as it is needed for the roofline charts; alternatively, you can compute it yourself with BabelStream.
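As a rough illustration of what a streaming benchmark like BabelStream measures, here is a minimal "Copy"-style benchmark sketch. Note the heavy caveat: this pure-Python version times a host-side memcpy, so it reports CPU memory bandwidth; BabelStream itself times the equivalent kernels on the GPU, which is what you would actually compare the 1638 GB/s peak against.

```python
import time

# Minimal "Copy"-style streaming benchmark in the spirit of BabelStream's
# Copy kernel (b[i] = a[i]). This is a host-side, pure-Python illustration
# of the measurement idea only; BabelStream times equivalent GPU kernels.

def copy_bandwidth_gbs(nbytes=2**26, iters=5):
    a = bytearray(nbytes)
    b = bytearray(nbytes)
    best = float("inf")
    for _ in range(iters):            # take the best of several runs
        t0 = time.perf_counter()
        b[:] = a                      # memcpy-style copy: read a, write b
        best = min(best, time.perf_counter() - t0)
    return 2 * nbytes / best / 1e9    # one read + one write per byte

print(f"copy bandwidth ~ {copy_bandwidth_gbs():.1f} GB/s")
```

The achievable bandwidth measured this way is always below the theoretical peak, which is why a target like 1165 GB/s (rather than 1638 GB/s) is the realistic goal.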
Again, the follow-on blog will cover this in greater depth, but you may find this video helpful. Do keep in mind that the presentation was from SC22, and the omniperf tool has made significant improvements since then.
EffBW is based on the theoretical fetches and writes, whereas the omniperf BW is based on the actual fetches and writes. Think of it this way: the closer the EffBW is to the reported L2-RW-kernel-BW, the more "algorithmically efficient" the kernel is. Kernels 1 and 2 are obviously inefficient, whereas kernels 3 and above begin to close in on the actual BW. In the blog we made a projection/target of 1165 GB/s, but alternatively you could simply use the actual BW reported by omniperf as your target.
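Using the figures quoted earlier in this thread, this "closeness" can be made concrete by taking the ratio of the calculated EffBW to the omniperf-reported L2-RW bandwidth. The ratio is my own illustrative shorthand, not an official omniperf metric; the GB/s values are the ones from this discussion.

```python
# Ratio of calculated EffBW (theoretical fetches/writes) to the
# omniperf-reported L2-RW bandwidth (actual fetches/writes).
# The closer the ratio is to 1.0, the more "algorithmically
# efficient" the kernel. Figures (GB/s) are from this discussion.

kernels = {
    "kernel1": (808, 1112),   # (calculated EffBW, reported L2-RW BW)
    "kernel3": (1100, 1113),
    "kernel5": (1166, 1174),
}

for name, (eff_bw, l2_rw_bw) in kernels.items():
    ratio = eff_bw / l2_rw_bw
    print(f"{name}: EffBW/L2-RW = {ratio:.2f}")
    # kernel1 ~0.73; kernel3 ~0.99; kernel5 ~0.99
```

Kernel 1 moves far more data through L2 than the algorithm theoretically requires, while kernels 3 and 5 are close to moving only what is necessary.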
From a hardware perspective, these algorithms are equivalent, as you can see from the omniperf numbers. The point of kernel 2 is to emphasize that loop tiling alone doesn't offer much benefit over the baseline implementation if you're not careful with memory access patterns. Kernel 2 is a deliberately inefficient kernel; details concerning the differences in reads and writes are far less important than the fact that the EffBW for kernel 2 is still below the actual BW.

Hope this helps. TL;DR: Omniperf/rocprof only tells you how efficiently your algorithm utilizes the hardware. Understanding how "fast" your implementation is cannot be answered through those tools alone; it requires the user to analyze the algorithm. Fortunately, this finite difference method is simple enough that we could derive the "effective memory bandwidth" metric to gauge how "efficient" our implementation is. Be on the lookout for a new omniperf blog.
Hi @AlexisEspinosaGayosso ,
Yes, the very first step should be the use of profilers; we have a whole blog post discussing the different tools at our disposal and when/where to use each one. But I want…