Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for executable duplication in encoding specialization pass. #19527

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

hanhanW
Copy link
Contributor

@hanhanW hanhanW commented Dec 19, 2024

Duplicates stream.executables based on the affinity analysis of stream.async.dispatch ops. Some executables can be launched by different devices. It can produce wrong codegen artifacts when bindings types are encoded (i.e., the tensor type has an encoding attribute). Because they can result in different layouts, especially when multi-device is involved. E.g., say that device_a and device_b interpret a tensor type with encodings in different layouts, and there is an executable that can be launch with resources from either device_a or device_b. It is confusing what the input layouts for the executable because there are two possibilities. In this case, we have to duplicate the executable with updated encoding, and modify the dispatch to launch proper executable based on device analysis.

@hanhanW hanhanW requested review from lialan and Max191 December 19, 2024 09:59
@hanhanW hanhanW requested a review from benvanik as a code owner December 19, 2024 09:59
@hanhanW
Copy link
Contributor Author

hanhanW commented Dec 19, 2024

The PR depends on #19502. It is ready for review, which compares the diff to the parent mirror branch.

@hanhanW hanhanW force-pushed the specialize-encodings-2-n branch 2 times, most recently from 6e605a4 to 9e70643 Compare December 19, 2024 10:04
@hanhanW hanhanW marked this pull request as draft January 8, 2025 04:53
@hanhanW hanhanW force-pushed the users/hanhanW/specialize-encodings-1-n branch from cb19fce to 86dd9b0 Compare January 8, 2025 04:54
@hanhanW hanhanW force-pushed the specialize-encodings-2-n branch 2 times, most recently from 931032c to 4ca9b91 Compare January 8, 2025 04:55
@hanhanW hanhanW changed the base branch from users/hanhanW/specialize-encodings-1-n to main January 9, 2025 08:42
@hanhanW hanhanW force-pushed the specialize-encodings-2-n branch from 4ca9b91 to a150f11 Compare January 9, 2025 08:43
@hanhanW hanhanW marked this pull request as ready for review January 9, 2025 09:22
@hanhanW
Copy link
Contributor Author

hanhanW commented Jan 9, 2025

I spent some time on the bazel failure, but I can't reproduce it on cmake build even with ASAN enabled. I'll try to set up the bazel one my env, the PR is ready to review. Please take a look, thanks!

Duplicates stream.executables based on the affinity analysis of
stream.async.dispatch ops. Some executables can be launched by different
devices. It can produce wrong codegen artifacts when bindings types are
encoded (i.e., the tensor type has an encoding attribute). Because they
can result in different layouts, especially when multi-device is
involved. E.g., say that device_a and device_b interpret a tensor type
with encodings in different layouts, and there is an executable that can
be launch with resources from either device_a or device_b. It is
confusing what the input layouts for the executable because there are
two possibilities. In this case, we have to duplicate the executable
with updated encoding, and modify the dispatch to launch proper
executable based on device analysis.

Signed-off-by: hanhanW <[email protected]>
@hanhanW hanhanW force-pushed the specialize-encodings-2-n branch from 989dc8c to 8ba01c3 Compare January 13, 2025 18:12
@hanhanW
Copy link
Contributor Author

hanhanW commented Jan 13, 2025

I spent some time on the bazel failure, but I can't reproduce it on cmake build even with ASAN enabled. I'll try to set up the bazel one my env, the PR is ready to review. Please take a look, thanks!

I think I'd be able to reproduce it if I build it in release mode. I put a statement in an assertion because I wanted to check if the return value is true or not. However, the statement is dropped in release build..

Copy link
Collaborator

@benvanik benvanik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to duplicate based on affinity - in your test we can see why: we don't want to double compile executables that are going to have the same exact compilation flow as we're just doubling our compile time (or on an 8 way config x8!).

IREE::Stream::AffinityAttr has a areCompatible / canExecuteTogether which indicates when two operations can execute in the same place (which is not what you want - that's for execution scheduling). Here you need something like areTranslationCompatible which indicates when two affinities share the same execution configuration and only duplicate when that's not the case (you want the smallest set of unique executable targets). We could probably rename areCompatible to areExecutionCompatible to differentiate (at some point we'll add an areAllocationCompatible for unified memory systems and stuff where execution and translation may differ but buffers can be shared).

Note to make the compatibility checks scale better you can gather all (like you do), use llvm::EquivalenceClasses + areTranslationCompatible to build the sets of compatible devices, then run down your executables and duplicate for each non-intersecting set (this avoids doing areTranslationCompatible n^2).

You'll want to have a test like you have with two devices that share targets ensuring they don't get duplicated and two devices with different targets ensuring they do.

MLIRContext *ctx = moduleOp.getContext();
IRRewriter rewriter(ctx);

// 1. Gather per-export [execution affinity -> [resource affinities]] map.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: prefer not using numbers - they always end up wrong when code is moved around or when code is inserted/removed. You can call out preconditions/postconditions if blocks of code do have to be performed in order (e.g. "build cache of valid ops after removing dead ops" etc)

if the algorithm has a flow then that can be documented in the function comment where someone landing on the function wants to know how it works. By keeping them inline a reader has to read the whole function (like I just did, wondering where 2 was for a page :)

@hanhanW
Copy link
Contributor Author

hanhanW commented Jan 17, 2025

// TODO(benvanik): replace with more fine-grained compatibility checks.
// "Compatible" can mean a lot of things: are they cache-coherent, are they
// a shared address space, are they able to perform collective operations,
// etc. May be able to represent it with a bitfield or a dedicated
// compatibility struct result.
// Returns true if |desiredAffinity| (if any) is compatible with
// |requiredAffinity|.
static bool areCompatible(AffinityAttr desiredAffinity,
AffinityAttr requiredAffinity);

I found the above comment in areCompatible documentation, and I have a question. It seems like it'd return true if one of desiredAffinity is compatible with requiredAffinity. Does it mean that [device_a, device_b] is compatible with [device_a]? Does it mean that the compatibility could be unidirectional? I.e., X is compatible with Y, but Y is not compatible with X.

In the areTranslationCompatible, should we make it bidirectional?

Note to make the compatibility checks scale better you can gather all (like you do), use llvm::EquivalenceClasses + areTranslationCompatible to build the sets of compatible devices, then run down your executables and duplicate for each non-intersecting set (this avoids doing areTranslationCompatible n^2).

IIUC, the llvm::EquivalenceClasses is a disjoint-set. Say that we have areTranslationCompatible implemented. The algorithm would be:

  1. Collect all the affinities.
  2. Categorize the affinities into few groups, using llvm::EquivalenceClasses.
  3. Each group can map to a unique hash/affinity/whatever. We then use the unique affinity in my current implementation, which can reduce the number of duplications.

Do I understand it correctly?

Meanwhile, I'm going to study the compatibility things in Stream dialect. It's a new thing to me!

@hanhanW
Copy link
Contributor Author

hanhanW commented Jan 17, 2025

I just found that the current implementation is simple. It does not consider the case that I listed above. I think I can start with a similar version.

bool AffinityAttr::areCompatible(AffinityAttr desiredAffinity,
AffinityAttr requiredAffinity) {
if (desiredAffinity == requiredAffinity)
return true;
if ((desiredAffinity && !requiredAffinity) ||
(requiredAffinity && !desiredAffinity)) {
return true;
}
// We could do a fuzzier match here (interface isCompatible() etc).
return false;

@benvanik
Copy link
Collaborator

the executable targets are what we care about matching, not the affinities - we don't want to cluster by affinities, but by those sets of targets, and then generate based on required sets of targets. CUDA GPU 0 and CUDA GPU 1 (two different affinities) have the same executable targets and we don't want to duplicate, CUDA GPU 0, CUDA GPU 1, and CPU (three different affinities) have CUDA GPU 0 and CUDA GPU 1 sharing the same targets but the CPU is different and we'd want one copy for the CUDA GPUs and one copy for the CPU (because we're talking about targets, not affinities).

@benvanik
Copy link
Collaborator

(that may be what you're saying, I just realized I didn't word what I said very well :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants