Add support for executable duplication in encoding specialization pass. #19527

hanhanW · 2024-12-19T09:59:33Z

Duplicates stream.executables based on the affinity analysis of stream.async.dispatch ops. Some executables can be launched by different devices. It can produce wrong codegen artifacts when bindings types are encoded (i.e., the tensor type has an encoding attribute). Because they can result in different layouts, especially when multi-device is involved. E.g., say that device_a and device_b interpret a tensor type with encodings in different layouts, and there is an executable that can be launch with resources from either device_a or device_b. It is confusing what the input layouts for the executable because there are two possibilities. In this case, we have to duplicate the executable with updated encoding, and modify the dispatch to launch proper executable based on device analysis.

hanhanW · 2024-12-19T10:00:43Z

The PR depends on #19502. It is ready for review, which compares the diff to the parent mirror branch.

hanhanW · 2025-01-09T09:22:18Z

I spent some time on the bazel failure, but I can't reproduce it on cmake build even with ASAN enabled. I'll try to set up the bazel one my env, the PR is ready to review. Please take a look, thanks!

Duplicates stream.executables based on the affinity analysis of stream.async.dispatch ops. Some executables can be launched by different devices. It can produce wrong codegen artifacts when bindings types are encoded (i.e., the tensor type has an encoding attribute). Because they can result in different layouts, especially when multi-device is involved. E.g., say that device_a and device_b interpret a tensor type with encodings in different layouts, and there is an executable that can be launch with resources from either device_a or device_b. It is confusing what the input layouts for the executable because there are two possibilities. In this case, we have to duplicate the executable with updated encoding, and modify the dispatch to launch proper executable based on device analysis. Signed-off-by: hanhanW <[email protected]>

hanhanW · 2025-01-13T18:15:08Z

I spent some time on the bazel failure, but I can't reproduce it on cmake build even with ASAN enabled. I'll try to set up the bazel one my env, the PR is ready to review. Please take a look, thanks!

I think I'd be able to reproduce it if I build it in release mode. I put a statement in an assertion because I wanted to check if the return value is true or not. However, the statement is dropped in release build..

Signed-off-by: hanhanW <[email protected]>

benvanik

We don't want to duplicate based on affinity - in your test we can see why: we don't want to double compile executables that are going to have the same exact compilation flow as we're just doubling our compile time (or on an 8 way config x8!).

IREE::Stream::AffinityAttr has a areCompatible / canExecuteTogether which indicates when two operations can execute in the same place (which is not what you want - that's for execution scheduling). Here you need something like areTranslationCompatible which indicates when two affinities share the same execution configuration and only duplicate when that's not the case (you want the smallest set of unique executable targets). We could probably rename areCompatible to areExecutionCompatible to differentiate (at some point we'll add an areAllocationCompatible for unified memory systems and stuff where execution and translation may differ but buffers can be shared).

Note to make the compatibility checks scale better you can gather all (like you do), use llvm::EquivalenceClasses + areTranslationCompatible to build the sets of compatible devices, then run down your executables and duplicate for each non-intersecting set (this avoids doing areTranslationCompatible n^2).

You'll want to have a test like you have with two devices that share targets ensuring they don't get duplicated and two devices with different targets ensuring they do.

benvanik · 2025-01-15T16:33:37Z

compiler/src/iree/compiler/Dialect/Stream/Transforms/SpecializeEncodings.cpp

+  MLIRContext *ctx = moduleOp.getContext();
+  IRRewriter rewriter(ctx);
+
+  // 1. Gather per-export [execution affinity -> [resource affinities]] map.


style: prefer not using numbers - they always end up wrong when code is moved around or when code is inserted/removed. You can call out preconditions/postconditions if blocks of code do have to be performed in order (e.g. "build cache of valid ops after removing dead ops" etc)

if the algorithm has a flow then that can be documented in the function comment where someone landing on the function wants to know how it works. By keeping them inline a reader has to read the whole function (like I just did, wondering where 2 was for a page :)

hanhanW · 2025-01-17T13:39:26Z

iree/compiler/src/iree/compiler/Dialect/Stream/IR/StreamInterfaces.td

Lines 108 to 116 in 36c2353

    
           // TODO(benvanik): replace with more fine-grained compatibility checks. 
        
           // "Compatible" can mean a lot of things: are they cache-coherent, are they 
        
           // a shared address space, are they able to perform collective operations, 
        
           // etc. May be able to represent it with a bitfield or a dedicated 
        
           // compatibility struct result. 
        
           // Returns true if |desiredAffinity| (if any) is compatible with 
        
           // |requiredAffinity|. 
        
           static bool areCompatible(AffinityAttr desiredAffinity, 
        
                                     AffinityAttr requiredAffinity);

I found the above comment in areCompatible documentation, and I have a question. It seems like it'd return true if one of desiredAffinity is compatible with requiredAffinity. Does it mean that [device_a, device_b] is compatible with [device_a]? Does it mean that the compatibility could be unidirectional? I.e., X is compatible with Y, but Y is not compatible with X.

In the areTranslationCompatible, should we make it bidirectional?

Note to make the compatibility checks scale better you can gather all (like you do), use llvm::EquivalenceClasses + areTranslationCompatible to build the sets of compatible devices, then run down your executables and duplicate for each non-intersecting set (this avoids doing areTranslationCompatible n^2).

IIUC, the llvm::EquivalenceClasses is a disjoint-set. Say that we have areTranslationCompatible implemented. The algorithm would be:

Collect all the affinities.
Categorize the affinities into few groups, using llvm::EquivalenceClasses.
Each group can map to a unique hash/affinity/whatever. We then use the unique affinity in my current implementation, which can reduce the number of duplications.

Do I understand it correctly?

Meanwhile, I'm going to study the compatibility things in Stream dialect. It's a new thing to me!

hanhanW · 2025-01-17T13:51:46Z

I just found that the current implementation is simple. It does not consider the case that I listed above. I think I can start with a similar version.

iree/compiler/src/iree/compiler/Dialect/Stream/IR/StreamTypes.cpp

Lines 375 to 384 in 36c2353

    
           bool AffinityAttr::areCompatible(AffinityAttr desiredAffinity, 
        
                                            AffinityAttr requiredAffinity) { 
        
             if (desiredAffinity == requiredAffinity) 
        
               return true; 
        
             if ((desiredAffinity && !requiredAffinity) || 
        
                 (requiredAffinity && !desiredAffinity)) { 
        
               return true; 
        
             } 
        
             // We could do a fuzzier match here (interface isCompatible() etc). 
        
             return false;

benvanik · 2025-01-17T16:43:31Z

the executable targets are what we care about matching, not the affinities - we don't want to cluster by affinities, but by those sets of targets, and then generate based on required sets of targets. CUDA GPU 0 and CUDA GPU 1 (two different affinities) have the same executable targets and we don't want to duplicate, CUDA GPU 0, CUDA GPU 1, and CPU (three different affinities) have CUDA GPU 0 and CUDA GPU 1 sharing the same targets but the CPU is different and we'd want one copy for the CUDA GPUs and one copy for the CPU (because we're talking about targets, not affinities).

benvanik · 2025-01-17T16:43:53Z

(that may be what you're saying, I just realized I didn't word what I said very well :)

hanhanW requested review from lialan and Max191 December 19, 2024 09:59

hanhanW requested a review from benvanik as a code owner December 19, 2024 09:59

hanhanW force-pushed the specialize-encodings-2-n branch 2 times, most recently from 6e605a4 to 9e70643 Compare December 19, 2024 10:04

hanhanW marked this pull request as draft January 8, 2025 04:53

hanhanW force-pushed the users/hanhanW/specialize-encodings-1-n branch from cb19fce to 86dd9b0 Compare January 8, 2025 04:54

hanhanW force-pushed the specialize-encodings-2-n branch 2 times, most recently from 931032c to 4ca9b91 Compare January 8, 2025 04:55

hanhanW changed the base branch from users/hanhanW/specialize-encodings-1-n to main January 9, 2025 08:42

hanhanW force-pushed the specialize-encodings-2-n branch from 4ca9b91 to a150f11 Compare January 9, 2025 08:43

hanhanW marked this pull request as ready for review January 9, 2025 09:22

hanhanW force-pushed the specialize-encodings-2-n branch from 989dc8c to 8ba01c3 Compare January 13, 2025 18:12

Move the statement out of assertion.

c2756a0

Signed-off-by: hanhanW <[email protected]>

hanhanW force-pushed the specialize-encodings-2-n branch from 8ba01c3 to c2756a0 Compare January 13, 2025 18:15

hanhanW mentioned this pull request Jan 14, 2025

[Stream] Update executable functions in encoding specialization pass. #19700

Open

benvanik requested changes Jan 15, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for executable duplication in encoding specialization pass. #19527

Add support for executable duplication in encoding specialization pass. #19527

hanhanW commented Dec 19, 2024

hanhanW commented Dec 19, 2024

hanhanW commented Jan 9, 2025

hanhanW commented Jan 13, 2025

benvanik left a comment •

edited

Loading

benvanik Jan 15, 2025

hanhanW commented Jan 17, 2025

hanhanW commented Jan 17, 2025

benvanik commented Jan 17, 2025

benvanik commented Jan 17, 2025

Add support for executable duplication in encoding specialization pass. #19527

Are you sure you want to change the base?

Add support for executable duplication in encoding specialization pass. #19527

Conversation

hanhanW commented Dec 19, 2024

hanhanW commented Dec 19, 2024

hanhanW commented Jan 9, 2025

hanhanW commented Jan 13, 2025

benvanik left a comment • edited Loading

Choose a reason for hiding this comment

benvanik Jan 15, 2025

Choose a reason for hiding this comment

hanhanW commented Jan 17, 2025

hanhanW commented Jan 17, 2025

benvanik commented Jan 17, 2025

benvanik commented Jan 17, 2025

benvanik left a comment •

edited

Loading