[Stream] Add layouts to encodings for all stream tensor AffinityOp. #19726
base: main
Conversation
The revision adds support for the rest of the AffinityOp ops that have the TensorPhase trait, i.e., TensorCloneOp, TensorSliceOp, TensorFillOp, and TensorUpdateOp. Two stream tensor ops do not implement the AffinityOpInterface, so they are not supported in this revision: the stream.tensor.load op and the stream.tensor.store op. We should be able to track the resource affinity for these two ops, but it requires additional analysis, so they are out of scope for this revision. The revision also adds the missing documentation to the `addLayoutsToTensorPhaseOps` method. Signed-off-by: hanhanW <[email protected]>
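For context, a minimal sketch of what adding a resolved layout to one op's result encoding might look like, assuming TypeAttr-style `getResultEncoding`/`setResultEncoding` accessors and an already-resolved layout attribute (both assumptions; the PR's actual helpers may differ):

```cpp
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Hypothetical sketch: rebuild a stream tensor op's result encoding type so
// its tensor encoding carries the resolved layout. Accessor names follow the
// stream dialect's TypeAttr conventions but are assumptions here.
template <typename OpTy>
static LogicalResult updateResultEncoding(RewriterBase &rewriter, OpTy op,
                                          Attribute resolvedLayout) {
  auto tensorType = dyn_cast<RankedTensorType>(op.getResultEncoding());
  if (!tensorType)
    return failure(); // Only ranked tensor encodings are handled.
  // Rebuild the tensor type with the resolved layout as its encoding,
  // dropping whatever (now outdated) encoding attribute was there before.
  auto newType = RankedTensorType::get(
      tensorType.getShape(), tensorType.getElementType(), resolvedLayout);
  rewriter.modifyOpInPlace(op, [&] { op.setResultEncoding(newType); });
  return success();
}
```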
Hmm, transfer ops are ones where we can't really have encoding changes, or possibly even encodings at all - we could at the cost of never hitting the DMA path with them, and that's probably bad. We may have to brainstorm on that - today I'd say we need to add a trait (SameOperandsAndResultEncodings) to these ops and verify that they can be implemented with memcpy, but I'm not sure even that is enough. Basically, if any transfer op besides clone (which is "entire tensor") has an encoding, we have to generate a dispatch (codegen it, bundle it with the binary, and ship it) for that specific set of operand and result encodings, and those dispatches will run on execution units instead of DMA ones (we can't cuMemcpy a slice of an encoded tensor if it has funny padding, and can't have cuMemcpy swizzle/change encodings for us). Any code doing this today with this change would be incorrect (unless it's an entire tensor with the same operand and result encoding).
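For illustration, here's a sketch of the check such a trait's verifier might perform if it existed - the trait name comes from the comment above, and everything else (the function, the type-walking approach) is an assumption, not existing IREE code. For stream.tensor.* ops the tensor types actually live in encoding attributes rather than operand/result types, so a real verifier would inspect those instead:

```cpp
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Sketch of a hypothetical SameOperandsAndResultEncodings verification:
// every ranked tensor type on the op must carry the same encoding attribute,
// which keeps the op expressible as a flat memcpy.
static LogicalResult verifySameOperandsAndResultEncodings(Operation *op) {
  Attribute expected;
  bool seenTensor = false;
  auto matches = [&](Type type) {
    auto tensorType = dyn_cast<RankedTensorType>(type);
    if (!tensorType)
      return true; // Ignore non-tensor types (sizes, affinities, ...).
    if (!seenTensor) {
      expected = tensorType.getEncoding();
      seenTensor = true;
      return true;
    }
    return tensorType.getEncoding() == expected;
  };
  for (Type type : op->getOperandTypes())
    if (!matches(type))
      return op->emitOpError("operand encoding differs from prior encodings");
  for (Type type : op->getResultTypes())
    if (!matches(type))
      return op->emitOpError("result encoding differs from operand encodings");
  return success();
}
```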
Such conversions to dispatches happen in MaterializeBuiltinsPass - e.g., if you try to fill with an i64 value we have to use a dispatch, as GPUs don't support i64 fills. That pass would need to look at any of these ops, decide if they can be implemented with memcpy/memset, and if not, emit new executables with custom load/store logic for the encodings.
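As a sketch of that kind of decision (the function name and the 8/16/32-bit whitelist are assumptions - actual fill-width support varies per device/HAL backend):

```cpp
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// Illustrative predicate in the spirit of that pass: a fill can stay on the
// DMA path only if the target is unencoded and the pattern width is one the
// device fill API supports.
static bool canLowerFillToMemset(Type patternType, Attribute encoding) {
  if (encoding)
    return false; // Encoded layouts may pad/swizzle; memset can't honor that.
  if (!patternType.isIntOrFloat())
    return false;
  unsigned bitWidth = patternType.getIntOrFloatBitWidth();
  return bitWidth == 8 || bitWidth == 16 || bitWidth == 32;
}
```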
We may have to look at some IR and see if we can handle that better earlier on - after MaterializeBuiltinsPass we have to have everything be either supportable by devices or turned into dispatches, but that's pretty late in the flow and we want to try to guarantee that we rarely (if ever) end up down those paths (we're properly fusing with producers/consumers).
IOW I'm worried this may not be implementable - we need to iterate a bit.
return success();
}

/// Updates the update_encoding for `op`. The op have to define a
Suggested change:
- /// Updates the update_encoding for `op`. The op have to define a
+ /// Updates the update_encoding for `op`. The op has to define a
@@ -141,6 +159,52 @@ updateResultEncoding(RewriterBase &rewriter, OpTy op,
  return success();
}

/// Updates the target_encoding for `op`. The op have to define a
Suggested change:
- /// Updates the target_encoding for `op`. The op have to define a
+ /// Updates the target_encoding for `op`. The op has to define a
/// tensor type, the method resolves the layouts, strips outdated information,
/// and adds the resolved layouts to the encodings. The updated encodings should
/// have enough information for other lowering transformations.
/// TODO(hanchung): Add support for stream.tensor.load ops and
/// stream.tensor.store ops.
There's some TBD work to make load/store better - it'll likely require the same behavior as the current implementation, and it's something we'll want to do earlier on in flow. Otherwise we'll need a stream builtin that can adjust to different data types and perform the conversion (for multiple elements), or something like a switch that translates the load/store indices (for single element) on the host.
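For the single-element case, the index translation could look something like this sketch - the padded layout struct is hypothetical, and a real implementation would derive strides from the encoding's layout attribute:

```cpp
#include <array>
#include <cstdint>

// Illustrative host-side index translation: mapping a logical 2-D index into
// the physical offset of a tensor stored with padded rows.
struct PaddedRowMajor2D {
  int64_t rowStride; // Elements per row, including trailing padding.
  int64_t offsetOf(std::array<int64_t, 2> index) const {
    return index[0] * rowStride + index[1];
  }
};

// Single-element load against the padded storage, performed on the host.
static float hostLoad(const float *storage, PaddedRowMajor2D layout,
                      std::array<int64_t, 2> index) {
  return storage[layout.offsetOf(index)];
}
```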