
[NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting #5222

Open · ggengnv wants to merge 7 commits into main
Conversation

ggengnv (Contributor) commented Nov 21, 2024

This is a follow-up to the dotOp hoisting optimization for WGMMA (MMAv3). See #5003 (comment)

In short, when operand A is upcast in registers prior to WGMMA and pipelining is enabled, AsyncCopyGlobalToLocal's src gmem blocked encoding will have sizePerThread greater than the smem view's vec along the contiguous dimension. This results in multiple cp.async instructions being generated for one contiguous segment of global data, i.e. uncoalesced loads. This was previously confirmed in ncu; see the comment above for an example.

I've added a generalized fix in a new pass that runs after the pipeliner. It reuses the logic from the LLVM lowering of AsyncCopyGlobalToLocal to calculate the max contiguous copy size, and compares that to the blocked encoding's sizePerThread along the inner (contiguous) dimension. If the former is less than the latter, the latter is clamped to the former.

When A is k-major, I can verify a small perf improvement and that ncu no longer reports uncoalesced loads.
When A is m-major, this pass is a no-op because copy size == sizePerThread == 16.
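
For reference, a minimal sketch of the rewrite described above (hypothetical, not the PR's actual code: getCopyVecSize stands in for the copy-size logic reused from the LLVM lowering, and the direct BlockedEncodingAttr builder is assumed):

    struct ClampAsyncCopySizePerThread
        : public OpRewritePattern<AsyncCopyGlobalToLocalOp> {
      using OpRewritePattern::OpRewritePattern;

      LogicalResult matchAndRewrite(AsyncCopyGlobalToLocalOp copyOp,
                                    PatternRewriter &rewriter) const override {
        auto srcTy = cast<RankedTensorType>(copyOp.getSrc().getType());
        auto blockEnc = dyn_cast<BlockedEncodingAttr>(srcTy.getEncoding());
        if (!blockEnc)
          return failure();

        // Max contiguous elements a single cp.async can copy, bounded by
        // the smem view's vec (hypothetical helper standing in for the
        // logic reused from the LLVM lowering).
        unsigned copyVec = getCopyVecSize(copyOp);

        // Inner (most contiguous) dimension of the blocked gmem layout.
        unsigned innerDim = blockEnc.getOrder()[0];
        SmallVector<unsigned> sizePerThread(blockEnc.getSizePerThread());
        if (sizePerThread[innerDim] <= copyVec)
          return failure(); // loads are already coalesced; nothing to do

        // Clamp sizePerThread so each thread's contiguous global segment
        // matches what one cp.async instruction can move.
        sizePerThread[innerDim] = copyVec;
        auto newEnc = BlockedEncodingAttr::get(
            getContext(), sizePerThread, blockEnc.getThreadsPerWarp(),
            blockEnc.getWarpsPerCTA(), blockEnc.getOrder(),
            blockEnc.getCTALayout());

        // Convert src (and mask/other, whose encodings must match) to the
        // clamped layout, then rebuild the copy with converted operands.
        auto cvt = [&](Value v) -> Value {
          if (!v)
            return v;
          auto ty = cast<RankedTensorType>(v.getType());
          auto newTy =
              RankedTensorType::get(ty.getShape(), ty.getElementType(), newEnc);
          return rewriter.create<ConvertLayoutOp>(copyOp.getLoc(), newTy, v);
        };
        auto newCopy = rewriter.create<AsyncCopyGlobalToLocalOp>(
            copyOp.getLoc(), cvt(copyOp.getSrc()), copyOp.getResult(),
            cvt(copyOp.getMask()), cvt(copyOp.getOther()), copyOp.getCache(),
            copyOp.getEvict(), copyOp.getIsVolatile());
        rewriter.replaceOp(copyOp, newCopy);
        return success();
      }
    };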

ptal, thanks @ThomasRaoux

@ggengnv ggengnv changed the title Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting [Nvidia][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting Nov 21, 2024
@ggengnv ggengnv changed the title [Nvidia][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting [NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting Nov 21, 2024
ThomasRaoux (Collaborator) left a comment:

looks good, few minor comments

Value mask = copyOp.getMask();
Value other = copyOp.getOther();
auto srcTy = cast<RankedTensorType>(src.getType());
auto blockEnc = cast<BlockedEncodingAttr>(srcTy.getEncoding());
ThomasRaoux (Collaborator):

you can't assume the copy will use blocked layout
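
For illustration, one way to address this (a sketch, not necessarily the fix adopted) is to guard with dyn_cast instead of asserting:

    // Skip copies whose src is not in a blocked layout rather than
    // asserting with cast<>, which would crash on other encodings.
    auto srcTy = cast<RankedTensorType>(src.getType());
    auto blockEnc = dyn_cast<BlockedEncodingAttr>(srcTy.getEncoding());
    if (!blockEnc)
      return failure(); // leave non-blocked layouts untouched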

Comment on lines 98 to 102
// replace the asyncCopy
auto newCopyOp = rewriter.create<AsyncCopyGlobalToLocalOp>(
copyOp.getLoc(), src, copyOp.getResult(), mask, other,
copyOp.getCache(), copyOp.getEvict(), copyOp.getIsVolatile());
rewriter.replaceOp(copyOp, newCopyOp);
ThomasRaoux (Collaborator):

nit: you could do an in-place update
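
For illustration, a sketch of the suggested in-place update, assuming the rewriter's modifyOpInPlace (updateRootInPlace in older MLIR) and that newSrc/newMask/newOther are the layout-converted operands built earlier in the pattern:

    // Mutate the existing op under the rewriter instead of creating a
    // replacement op and calling replaceOp.
    rewriter.modifyOpInPlace(copyOp, [&] {
      copyOp.getSrcMutable().assign(newSrc);
      if (newMask)
        copyOp.getMaskMutable().assign(newMask);
      if (newOther)
        copyOp.getOtherMutable().assign(newOther);
    });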

#include "mlir/Support/LLVM.h"
#include "mlir/Transforms/Passes.h"
#include "triton/Analysis/Utility.h"
#include "triton/Conversion/TritonGPUToLLVM/Utility.h"
ThomasRaoux (Collaborator):

nit: this is a bit of a layering violation, getRegToSharedLayout probably belongs to triton gpu dialect utils.

ggengnv (Contributor, Author) commented Nov 22, 2024

Addressed comments - moved util to lib/Dialect/Transforms/Utility.cpp
