Finite Difference P4: Optimal block, grid, indexing and splitting. Why do you choose what you choose? #14
-
Finite Difference Method - Part 4: Your lab note shows us the chosen strategies with clearly good results. The problem is that you do not fully explain the reasoning for selecting those strategies. I mean, you clearly make a strong effort to explain things, but some important details are left out. Could you please expand on how you selected those specific strategies? (I'll try to be more concrete below.)
(By the way, it's clear that you are assuming the reader has perfect knowledge and control of the grid, block, and indexing concepts and tweaks. I think that is not the case for many of your readers, especially not at the level needed to understand your "new grid indexing strategy" on a first, second, or even third read. This strategy is extremely "hacky" and confusing, and it requires a lot of background knowledge and an expanded explanation. Anyway, I would like to encourage you to add a pre-note (or parallel note) on this topic, covering the grid, block, and indexing concepts together with wavefronts, along with a couple of the "top" exercises/examples you know of that clearly show the effects of the chosen parameters and that can help us understand and master these technicalities and tweaks much better.)
Question 3 here is: why is splitting only in Y better? Why did you decide to split the problem only in the Y direction? What expert knowledge should we use to make the correct decision without needing to experiment as in the table I wrote above? (4. Your kernel7 is faster for the 1024x1024x1024 problem, but not for the original 512x512x512 problem. When using kernel7 for the original size I get 986 GB/s. Clearly, in practice, we would need a kernel that can better decide when to divide and by how much. It would be great if you could publish an "intelligent" kernel (kernel8) that could decide whether domain division is needed and then perform well on different sizes.) Thank you very much, I'm enjoying your lab notes a lot!
-
@AlexisEspinosaGayosso glad to hear you're enjoying these AMD lab notes! As much as we all enjoy writing these blogs, it is largely considered a side gig and not our primary job function. This final blog was a bit rushed and came out at a time when several of us were heavily resource constrained. I understand why you believe several details have been left out, so I'll attempt to provide some clarification here.

Your first three questions basically come down to why we didn't showcase or try other similar options. To be frank, as you may have guessed by now, we've only shown the combinations and parameters that provided the best performance numbers with the ROCm version we had at the time. But as I mentioned before in #13 , the purpose of these blogs is to cover various optimizations in the hope of pointing people in the right direction for their own applications and codes. There's no "one size fits all" for these optimizations; it's entirely possible that different block/grid/indexing choices will give better results for your particular hardware or ROCm version. Our ROCm ecosystem is continuing to evolve and mature, so we are hesitant to place any emphasis on the exact configurations behind the "best" performance numbers in any of our blogs at this time. Nevertheless, unlike the loop tiling strategy, these last three optimizations require little to no code restructuring to experiment with.

Regarding the block, grid, and indexing concepts not being entirely known to the average reader, that is a fair point. We've had several presentations and hackathons that cover these in more detail, and there's an ongoing effort to organize some of these training presentations into a single, easily locatable repository. Keep in mind that the AMD lab notes were never intended to be the "one documentation to rule them all". That said, a standalone lab note covering this does not sound like a bad idea at all.

I'm not sure I follow your question about using a different norm for each line of the fetch efficiency.
The FOMs have nothing to do with the fetch efficiency calculations; the latter are strictly based on the rocprof FETCH_SIZE and WRITE_SIZE metrics, which are not shown in the blog.

Kernel 7 was hard-coded to split the y dimension into four blocks. Running the original 512x512x512 size on kernel 7 without modifying the code will give you the same performance as running kernel 5 on a 512x128x512 grid. So yes, the host function responsible for launching kernel 7 could be written in an "intelligent" way such that it dynamically determines bny based on the provided nx, ny sizes.

Hope this helps!
-
Thanks a lot for this, Justin!
Also, regarding references: among the ones our team created/presented, we have this relatively old one that's still relevant today:
https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf
Another one that might be helpful is this one:
https://enccs.github.io/amd-rocm-development/porting_hip/
Also, check out the ROCm developer hub, which also has links to several training videos:
https://www.amd.com/en/developer/resources/rocm-hub.html
Hope these help!