Finite Difference P2: About your loop tiling strategy. Can you explain a bit more of your reasoning? #13
-
In your note Finite Difference Method-part2, you decided to tile the loop in the y direction "for demonstration". But afterwards you never move beyond the demonstration to explain clearly why you kept tiling in the y direction, and in the y direction only. Could you please provide an extended explanation of why the y direction is the best candidate for loop tiling? And, if possible, present the results of your experiments with tiling in other directions. This would allow us to make this practical decision in other codes without having to test tiling in every direction. (The following question is coming out of my fingers without too much thinking:) Would it make sense to tile in more than one direction at the same time?
Replies: 3 comments
-
@AlexisEspinosaGayosso recall that the purpose of loop tiling is to keep reusable stencil values available in registers rather than relying on cache. It therefore doesn't make much sense to apply loop tiling in the x direction: elements along x are contiguous in memory, so the neighboring stencil points are naturally found in cache. This leaves the y and z directions as the only logical choices, and there's actually no wrong choice here. The purpose of this blog (and all our AMD lab notes articles) is to touch on various possible optimizations and to provide accompanying code examples that readers are encouraged to experiment with. Providing implementations in both directions would make the blog unnecessarily long, so only the y direction was showcased. It is our hope that readers are motivated enough to try implementing it in the z direction on their own. Tiling in the y and z directions simultaneously is not recommended because it will only proliferate register usage, especially for higher-order finite difference stencils. Tiling in just one direction already creates huge register pressure. In fact, the loop tiling strategy might be limited to low-order finite difference stencils. One of our upcoming blog series focuses on seismic stencils (8th-order finite difference for the acoustic wave equation). Computing this stencil means each wavefront needs access to 9 xy-planes; this will clearly not fit in the L2 cache of an MI200 accelerator, so completely different optimization strategies are needed, particularly for the z direction.
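To make the register-reuse argument concrete, here is a minimal CPU-side sketch in Python. It is not the HIP kernel from the blog. The function names, the `(z, y, x)` array layout, the unit grid spacing, and the `y_loads` counter are all assumptions made for illustration. Local variables stand in for registers, and the counter tracks how many values are read along the y direction.

```python
import numpy as np


def laplacian_naive(u):
    """7-point Laplacian (unit spacing) on interior points, no tiling.

    u has shape (nz, ny, nx), so x is the contiguous direction.
    Returns the result and the number of y-direction reads.
    """
    nz, ny, nx = u.shape
    out = np.zeros_like(u)
    y_loads = 0
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                # Every point re-reads u[j-1], u[j] and u[j+1].
                y_loads += 3
                out[k, j, i] = (u[k, j, i - 1] + u[k, j, i + 1]
                                + u[k, j - 1, i] + u[k, j + 1, i]
                                + u[k - 1, j, i] + u[k + 1, j, i]
                                - 6.0 * u[k, j, i])
    return out, y_loads


def laplacian_y_tiled(u, m):
    """Same stencil, but each (k, i) pair walks a tile of m points in y.

    below/centre/above play the role of registers: a value read once is
    shifted down and reused for the next point in the tile.
    """
    nz, ny, nx = u.shape
    out = np.zeros_like(u)
    y_loads = 0
    for k in range(1, nz - 1):
        for j0 in range(1, ny - 1, m):
            j1 = min(j0 + m, ny - 1)  # last tile may be shorter
            for i in range(1, nx - 1):
                below = u[k, j0 - 1, i]
                centre = u[k, j0, i]
                y_loads += 2
                for j in range(j0, j1):
                    above = u[k, j + 1, i]  # the only new y read per point
                    y_loads += 1
                    out[k, j, i] = (u[k, j, i - 1] + u[k, j, i + 1]
                                    + below + above
                                    + u[k - 1, j, i] + u[k + 1, j, i]
                                    - 6.0 * centre)
                    below, centre = centre, above  # shift the "registers"
    return out, y_loads
```

For a full tile of `m` points, the y-direction reads drop from `3 * m` to `m + 2`, while the number of live values per thread grows with the stencil width. Applying the same trick in z at the same time would require holding a separate set of values for every point of a two-dimensional tile, which is the register proliferation described above.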
-
Thanks a lot for your response. It makes a lot of sense, and I'm really looking forward to all the new labs you have mentioned. I will try loop tiling in Z, but I have a "hunch" that it will perform worse than tiling in Y, simply because you always present the best result in your notes and because tiling in Y already gives close to optimal results. (But that is me finding cues in other places that have nothing to do with the code.) I do have a cue from the code as well: Z values are separated further apart than Y values, so bringing them into the registers will occupy more space than bringing in the Y values. Therefore I guess in advance that Z tiling will not give better results than Y tiling. What do you think?
-
I agree with your hunch. We tried loop tiling in the Z direction as well before this blog was released. I recall the performance was similar, maybe slightly worse, on ROCm-5.2.3, though it could be very different with today's ROCm-5.7.0+. However, it will still face the same problem when you look at a grid larger than 512x512x512. So we left it at the y direction because we wanted to try some other optimizations for the more problematic z direction.
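A rough way to see where the grid-size problem comes from is to estimate how many bytes of xy-planes a stencil touches and compare that with the L2 capacity. The sketch below is only back-of-the-envelope arithmetic. The function names are hypothetical, the 8 MiB L2 figure is an assumption about a single MI200 GCD, and the element size is a parameter because the thread does not state the precision used.

```python
# Assumption: 8 MiB of L2 per GCD on an MI200; check your device's specs.
ASSUMED_L2_BYTES = 8 * 1024**2


def plane_working_set_bytes(nx, ny, planes, bytes_per_value=8):
    """Bytes needed to hold `planes` full xy-planes of an nx-by-ny grid."""
    return nx * ny * planes * bytes_per_value


def fits_in_l2(nx, ny, planes, bytes_per_value=8, l2_bytes=ASSUMED_L2_BYTES):
    """True if the planes touched by the stencil fit in the assumed L2."""
    return plane_working_set_bytes(nx, ny, planes, bytes_per_value) <= l2_bytes


# A 2nd-order stencil touches 3 xy-planes, an 8th-order stencil touches 9.
small_low_order = fits_in_l2(512, 512, planes=3)     # 6 MiB in double precision
large_low_order = fits_in_l2(1024, 1024, planes=3)   # 24 MiB in double precision
small_high_order = fits_in_l2(512, 512, planes=9)    # 18 MiB in double precision
```

Under these assumptions the three planes of a low-order stencil fit at 512x512 but not at 1024x1024, and the nine planes of the 8th-order stencil do not fit even at 512x512. Tiling in y or z does not shrink this working set, which is consistent with treating the z direction as the more problematic one.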