Finite Difference P4: Optimal block, grid, indexing and splitting. Why do you choose what you choose? #14
-
Finite Difference Method - Part 4: Your lab note shows us the chosen strategies with clearly good results. The problem is that you do not fully explain the reasoning for selecting those strategies. I mean, you clearly make a strong effort to explain things, but some important details are left out. Could you please expand on how you selected those specific strategies? (I'll try to be more concrete below.)
(By the way, it's clear that you are assuming the reader has perfect knowledge and control of the grid, block, and indexing concepts and tweaks. I think that is not the case for many of your readers, especially not at the level needed to understand your "new grid indexing strategy" on a first, second, or even third read. This strategy is extremely "hacky" and confusing, and it requires a lot of background knowledge and an expanded explanation. Anyway, I would like to encourage you to add a pre-note (or parallel note) on this topic, covering the grid, block, and indexing concepts together with wavefronts, along with a couple of the "top" exercises/examples you know of that clearly show the effects of the chosen parameters and that can help us understand and master these technicalities and tweaks much better.)
Question 3 here is: why is splitting only in Y better? Why did you decide to split the problem only in the Y direction? What expert knowledge should we use to make the correct decision without needing to experiment as in the table I wrote above? (4. Your kernel7 is faster for the 1024x1024x1024 problem, but not for the original 512x512x512 problem. When using kernel7 for the original size I get 986 GB/s. Clearly, in practice, we would need a kernel that can better decide when to divide and by how much. It would be great if you could publish an "intelligent" kernel (kernel8) that could decide whether domain division is needed and then perform well on different sizes.) Thank you very much, I'm enjoying your lab notes a lot!
-
@AlexisEspinosaGayosso glad to hear you're enjoying these AMD lab notes! As much as we all enjoy writing these blogs, it is largely considered a side gig and not our primary job function. This final blog was a bit rushed and came out at a time when several of us were heavily resource constrained. I understand why you believe several details have been left out, so I'll attempt to provide some clarification here.

Your first three questions basically come down to why we didn't showcase or try other similar options. To be frank, as you may have guessed by now, we've only shown the combinations and parameters that provided the best performance numbers with the ROCm version we had at the time. But as I mentioned before in #13 , the purpose of these blogs is to cover various optimizations in the hope of pointing people in the right direction for their own applications and codes. There's no "one size fits all" for these optimizations; it's entirely possible that different block/grid/indexing choices will give better results for your particular hardware or ROCm version. Our ROCm ecosystem is continuing to evolve and mature, so we are hesitant to place any emphasis on the exact configurations behind the "best" performance numbers in any of our blogs at this time. Nevertheless, unlike the loop tiling strategy, these last three optimizations require little to no code restructuring to experiment with.

Regarding the block, grid, and indexing concepts not being entirely known to the average reader, that is a fair point. We've had several presentations and hackathons that cover these in more detail, and there's an ongoing effort to organize some of these training presentations into a single, easily locatable repository. Keep in mind that the AMD lab notes were never intended to be the "one documentation to rule them all". That said, a standalone lab note covering this does not sound like a bad idea at all.

I'm not sure I follow your question about using a different norm for each line of the fetch efficiency.
The FOMs have nothing to do with the fetch efficiency calculations; the latter are strictly based on the rocprof FETCH_SIZE and WRITE_SIZE metrics, which are not shown in the blog.

Kernel 7 was hard-coded to split the y dimension into four blocks. Running the original 512x512x512 size on kernel 7 without modifying the code will give you the same performance as running kernel 5 on a 512x128x512 grid. So yes, the host function responsible for launching kernel 7 could be written in an "intelligent" way such that it dynamically determines bny based on the provided nx, ny sizes.

Hope this helps!
-
Thanks a lot for this, Justin!
Also, regarding references: among the ones our team created/presented, we have this relatively old one that's still relevant today:
https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf
Another one that might be helpful is this one:
https://enccs.github.io/amd-rocm-development/porting_hip/
Also, check out the ROCm developer hub, which also has links to several training videos:
https://www.amd.com/en/developer/resources/rocm-hub.html
Hope these help!