Hi,
I'm interested in optimizing a matmul operation where I know the dims (e.g. 1024x512 @ 512x2048) and the GPU involved (A100).
I know I could just empirically test a lot of different options, but I'd like to get a better understanding of how CUDA and Triton interact, so that in other cases I have a better idea of what to do (I usually work in a context where some dims are variable).
Since I know the GPU, I know the number of SMs available, the size of the caches, etc. Might it be possible to get a guide on what values one should pick depending on those?
Based on this I think I should be making sure that BLOCK_SIZE_M and BLOCK_SIZE_N (to use the vars from the tutorial) result in tiles that optimize the use of the SMs. Is that going in the right direction?
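To make the SM-utilization idea concrete, here is a rough back-of-envelope sketch (my own, not from the tutorial) that counts how many output tiles a given BLOCK_SIZE_M/BLOCK_SIZE_N pair produces and how many "waves" that is across the A100's 108 SMs. It assumes one Triton program instance per tile and ignores per-SM occupancy of multiple CTAs, so it is only a first-order check:

```python
import math

def tile_occupancy(M, N, block_m, block_n, num_sms=108):
    """Rough launch-grid sizing check for a tiled matmul.

    Each Triton program instance computes one block_m x block_n output
    tile, so the grid has ceil(M/block_m) * ceil(N/block_n) programs.
    'waves' is that count divided by the SM count: values just above an
    integer mean a nearly idle last wave (the "tail effect").
    """
    num_tiles = math.ceil(M / block_m) * math.ceil(N / block_n)
    waves = num_tiles / num_sms
    return num_tiles, waves

# Output of 1024x512 @ 512x2048 is 1024x2048.
print(tile_occupancy(1024, 2048, 128, 128))  # 8 * 16 = 128 tiles
print(tile_occupancy(1024, 2048, 128, 256))  # 8 * 8  = 64 tiles
```

With 128x128 tiles you get 128 tiles, i.e. just over one full wave of 108 SMs, so the second wave runs only 20 tiles; with 128x256 tiles you get 64 tiles and leave 44 SMs idle. In practice this interacts with occupancy, shared-memory usage, and BLOCK_SIZE_K, which is partly why autotuning is usually still needed.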
I can find a lot of info about optimizing CUDA kernels, but it's difficult to understand how that translates to writing Triton code. Any help with that would be much appreciated!