[AMD] inThreadTranspose: Transpose between global load and local store for non-TN layouts: part 2 of 4 #5223
inThreadTranspose: part 2 of 4
Introduction
This PR introduces the AMD-specific inThreadTranspose feature to improve shared memory access efficiency for non-TN GEMM and 2nd dotOp in Flash Attention.
The entire feature has been broken into 4 pieces for more reliable integration, and this PR is the 2nd of 4.
Feature description
Currently on AMD hardware, if the dot operand is K-major, we'd use the same vectorization for `ds_write` as for `global_load`, but wouldn't coalesce on `ds_read`, resulting in poor shared memory (LDS) read efficiency prior to the MFMA operation.
This feature, inThreadTranspose, groups multiple `global_load`s together and packs vectors across the grain to write to LDS with vectorization, so that when the matrix is written into LDS it is already contiguous along the K dimension, and vectorized `ds_read` is therefore also enabled. This is achieved with the `v_perm_b32` AMDGCN assembly instruction, which allows independent registers to be made contiguous in VGPR space so that we can write them together into LDS.
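To make the repacking concrete, below is a minimal NumPy sketch of the idea (an illustration with made-up tile sizes, not the compiler's implementation, which performs this repacking on VGPRs via `v_perm_b32` during lowering): a thread that has issued several M-contiguous `global_load`s regroups its register contents so that each vector it writes to LDS is contiguous along K.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the PR):
M_VEC = 4   # elements per global_load, contiguous along M (the "grain")
K_REP = 4   # number of global_loads grouped together, one per K offset

# Registers after K_REP global_loads: row r holds K offset r, and its columns
# are M-contiguous elements. K is NOT contiguous within any single register.
regs = np.arange(K_REP * M_VEC).reshape(K_REP, M_VEC)

# In-thread transpose: regroup so that each packed vector holds K_REP elements
# that are consecutive along K for a fixed M index.
packed = regs.T  # shape (M_VEC, K_REP)

# Each row of `packed` can now be written to LDS with one vectorized ds_write,
# and later read back with a vectorized ds_read along K for the MFMA operand.
for m in range(M_VEC):
    print(f"ds_write vector for m={m}:", packed[m])
```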
PR description
Continuing from the previous PR, this one updates SharedEncodingAttr on AMD hardware so that SharedEncodingAttr always guarantees coalesced LDS reads, and lets inThreadTranspose explore whether the LDS write can be coalesced for non-KContig tensors.
Beyond the TTGIR-level update, lowerToLLVM has to be updated in order to properly align the shared memory address with the vector data about to be written.
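As a rough illustration of that alignment requirement (a minimal sketch with hypothetical constants, not the actual lowering code): after the transpose, the store vector runs along K, so its LDS byte offset has to be derived from the K-contiguous shared layout and land on a vector-width boundary.

```python
ELEM_BYTES = 2        # e.g. fp16 (assumed element type)
VEC_ELEMS = 4         # elements packed into one ds_write (assumed)
K_TILE = 64           # K extent of the shared-memory tile (assumed)

def lds_offset_bytes(m: int, k: int) -> int:
    # K-contiguous [M, K] tile in LDS, ignoring swizzling for simplicity.
    return (m * K_TILE + k) * ELEM_BYTES

# A store vector starts at a K index that is a multiple of VEC_ELEMS, so the
# computed address is a multiple of the vector's byte width.
offset = lds_offset_bytes(m=3, k=8)
assert offset % (VEC_ELEMS * ELEM_BYTES) == 0
```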
`storeDistributedToShared` is updated to activate a special linear-layout conversion for the blocked layout when it sees a blocked -> shared transfer with a different order.
In summary, there are three changes:
- A new `blockedToLinearLayoutThreadRake` conversion, used when transferring blocked to shared for non-KContig (see the sketch below).
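As a rough intuition for why that transfer has a different order (a conceptual sketch, not Triton's actual linear-layout machinery, and all names below are hypothetical): in the default blocked layout a thread's registers advance along the contiguous non-K grain, whereas a thread-rake style layout lets a thread's registers advance along K, so the elements it stores form one K-contiguous vector.

```python
REGS_PER_THREAD = 4  # illustrative value

def blocked_register_coords(lane: int, reg: int):
    # Default blocked layout: consecutive registers map to consecutive
    # M indices (the contiguous grain), all at the same K.
    return (0, lane * REGS_PER_THREAD + reg)          # (k, m)

def thread_rake_register_coords(lane: int, reg: int):
    # Thread-rake style layout: consecutive registers map to consecutive
    # K indices, so the thread's registers form one K-contiguous vector.
    return (reg, lane)                                 # (k, m)

for reg in range(REGS_PER_THREAD):
    print("blocked:", blocked_register_coords(0, reg),
          "rake:", thread_rake_register_coords(0, reg))
```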