[LV][VPlan] Add initial support for CSA vectorization #106560

Open
wants to merge 9 commits into
base: main

michaelmaitland
Contributor

@michaelmaitland michaelmaitland commented Aug 29, 2024

This patch adds initial support for CSA (conditional scalar assignment)
vectorization to LLVM. This new class of vectorization covers assignment to a
scalar in a loop, where the assignment is conditional from the perspective of
its use. An assignment is conditional in a loop if a value may or may not be
assigned in the loop body.

For example:

int t = init_val;
for (int i = 0; i < N; i++) {
  if (cond[i]) 
    t = a[i];
}
s = t; // use t

Using pseudo-LLVM code this can be vectorized as

vector.ph:
  ...
  %t = %init_val
  %init.mask = <all-false-vec>
  %init.data = <poison-vec> ; uninitialized
vector.body:
  ...
  %mask.phi = phi [%init.mask, %vector.ph], [%new.mask, %vector.body]
  %data.phi = phi [%init.data, %vector.ph], [%new.data, %vector.body]
  %cond.vec = <widened-cmp> ...
  %a.vec    = <widened-load> %a, %i
  %b        = <any-lane-active> %cond.vec
  %new.mask = select %b, %cond.vec, %mask.phi
  %new.data = select %b, %a.vec, %data.phi
  ...
middle.block:
  %s = <extract-last-active-lane> %new.mask, %new.data

On each iteration, we track whether any lane in the widened condition was active,
and if so, take the current mask and data as the new mask and data vectors.
At the end of the loop, the scalar then needs to be extracted only once.
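
The scheme above can be modeled in plain C++ to check that it agrees with the scalar loop. This is only a simulation sketch, not code from the patch: the fixed lane count `VF`, the function names, and the choice to fall back to the initial value when no lane was ever active are assumptions made for illustration, and the trip count is assumed to be a multiple of `VF` (no tail handling).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed vectorization factor used only by this simulation.
constexpr std::size_t VF = 4;

// Scalar reference: the loop from the example in the description.
int csaScalar(const std::vector<bool> &cond, const std::vector<int> &a,
              int init) {
  int t = init;
  for (std::size_t i = 0; i < a.size(); ++i)
    if (cond[i])
      t = a[i];
  return t;
}

// Simulated vector version. Assumes a.size() is a multiple of VF.
int csaVectorSim(const std::vector<bool> &cond, const std::vector<int> &a,
                 int init) {
  std::array<bool, VF> mask{}; // %init.mask: all false
  std::array<int, VF> data{};  // %init.data: stand-in for poison
  for (std::size_t i = 0; i + VF <= a.size(); i += VF) {
    std::array<bool, VF> condVec{};
    std::array<int, VF> aVec{};
    bool anyActive = false; // <any-lane-active> %cond.vec
    for (std::size_t l = 0; l < VF; ++l) {
      condVec[l] = cond[i + l];
      aVec[l] = a[i + l];
      anyActive = anyActive || condVec[l];
    }
    // Whole-vector selects: keep the old mask/data unless some lane fired.
    if (anyActive) {
      mask = condVec;
      data = aVec;
    }
  }
  // Middle block: extract the last active lane, or keep the initial value.
  for (std::size_t l = VF; l-- > 0;)
    if (mask[l])
      return data[l];
  return init;
}
```

For any input whose length is a multiple of `VF`, `csaVectorSim` should return the same value as `csaScalar`, which is the property the transformation relies on.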

This transformation works the same way for integer, pointer, and floating point
conditional assignment, since the transformation does not require inspection
of the data being assigned.

Vectorizing a CSA introduces recipes into the vector preheader, the vector
body, and the middle block. Recipes in the preheader and middle block execute
only once, while recipes in the vector body may execute many times. The more
times the vector body executes, the smaller the impact of the preheader and
middle block costs on the overall cost of a CSA.
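
The amortization argument can be written down as a small calculation. This is an illustrative sketch only: the `CSACost` struct, its field names, and the function are hypothetical and are not part of LLVM's cost model.

```cpp
#include <cassert>

// Hypothetical cost breakdown for one vectorized CSA (arbitrary units).
struct CSACost {
  double Preheader; // paid once, before the vector loop
  double Body;      // paid on every vector iteration
  double Middle;    // paid once, after the vector loop
};

// Average cost per vector iteration: the one-time costs are spread over
// all iterations, so their share shrinks as the iteration count grows.
double amortizedCostPerIteration(const CSACost &C, unsigned VectorIterations) {
  assert(VectorIterations > 0 && "need at least one vector iteration");
  return C.Body + (C.Preheader + C.Middle) / VectorIterations;
}
```

With a body cost of 2 and one-time costs totalling 10, a single vector iteration averages 12 units while ten iterations average 3, approaching the body cost alone as the count grows.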

A detailed explanation of the concept can be found here.

This patch is further tested in llvm/llvm-test-suite#155.

@llvmbot
Member

llvmbot commented Aug 29, 2024

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-llvm-transforms

Author: Michael Maitland (michaelmaitland)

Changes

This PR contains a series of commits which together implement an initial version of conditional scalar vectorization. I will edit this description with a link to the RFC which gives a more in-depth explanation of the changes.

This patch is further tested in llvm/llvm-test-suite#155.


Patch is 331.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106560.diff

19 Files Affected:

  • (added) llvm/include/llvm/Analysis/CSADescriptors.h (+78)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h (+18)
  • (modified) llvm/lib/Analysis/CMakeLists.txt (+1)
  • (added) llvm/lib/Analysis/CSADescriptors.cpp (+73)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+5)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+30-4)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+203-10)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+5-2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+195)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+367-2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+49)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+9)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+3-3)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/csa.ll (+4234)
diff --git a/llvm/include/llvm/Analysis/CSADescriptors.h b/llvm/include/llvm/Analysis/CSADescriptors.h
new file mode 100644
index 00000000000000..3f95b3484d1e22
--- /dev/null
+++ b/llvm/include/llvm/Analysis/CSADescriptors.h
@@ -0,0 +1,78 @@
+//===- llvm/Analysis/CSADescriptors.h - CSA Descriptors --*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file "describes" conditional scalar assignments (CSA).
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Value.h"
+
+#ifndef LLVM_ANALYSIS_CSADESCRIPTORS_H
+#define LLVM_ANALYSIS_CSADESCRIPTORS_H
+
+namespace llvm {
+
+/// A Conditional Scalar Assignment (CSA) is an assignment from an initial
+/// scalar that may or may not occur.
+class CSADescriptor {
+  /// If the conditional assignment occurs inside a loop, then Phi chooses
+  /// the value of the assignment from the entry block or the loop body block.
+  PHINode *Phi = nullptr;
+
+  /// The initial value of the CSA. If the condition guarding the assignment is
+  /// not met, then the assignment retains this value.
+  Value *InitScalar = nullptr;
+
+  /// The Instruction that is conditionally assigned to inside the loop.
+  Instruction *Assignment = nullptr;
+
+  /// Create a CSA Descriptor that models an invalid CSA.
+  CSADescriptor() = default;
+
+  /// Create a CSA Descriptor that models a valid CSA with its members
+  /// initialized correctly.
+  CSADescriptor(PHINode *Phi, Instruction *Assignment, Value *InitScalar)
+      : Phi(Phi), InitScalar(InitScalar), Assignment(Assignment) {}
+
+public:
+  /// If Phi is the root of a CSA, return the CSADescriptor of the CSA rooted by
+  /// Phi. Otherwise, return an invalid CSADescriptor.
+  static CSADescriptor isCSAPhi(PHINode *Phi, Loop *TheLoop);
+
+  operator bool() const { return isValid(); }
+
+  /// Returns whether SI is the Assignment in CSA
+  static bool isCSASelect(CSADescriptor Desc, SelectInst *SI) {
+    return Desc.getAssignment() == SI;
+  }
+
+  /// Return whether this CSADescriptor models a valid CSA.
+  bool isValid() const { return Phi && InitScalar && Assignment; }
+
+  /// Return the PHI that roots this CSA.
+  PHINode *getPhi() const { return Phi; }
+
+  /// Return the initial value of the CSA. This is the value if the conditional
+  /// assignment does not occur.
+  Value *getInitScalar() const { return InitScalar; }
+
+  /// The Instruction that is used after the loop
+  Instruction *getAssignment() const { return Assignment; }
+
+  /// Return the condition that this CSA is conditional upon.
+  Value *getCond() const {
+    if (auto *SI = dyn_cast_or_null<SelectInst>(Assignment))
+      return SI->getCondition();
+    return nullptr;
+  }
+};
+} // namespace llvm
+
+#endif // LLVM_ANALYSIS_CSADESCRIPTORS_H
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index b2124c6106198e..d0192f7d90a812 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1767,6 +1767,10 @@ class TargetTransformInfo {
         : EVLParamStrategy(EVLParamStrategy), OpStrategy(OpStrategy) {}
   };
 
+  /// \returns true if the loop vectorizer should vectorize conditional
+  /// scalar assignments for the target.
+  bool enableCSAVectorization() const;
+
   /// \returns How the target needs this vector-predicated operation to be
   /// transformed.
   VPLegalization getVPLegalizationStrategy(const VPIntrinsic &PI) const;
@@ -2175,6 +2179,7 @@ class TargetTransformInfo::Concept {
   virtual bool supportsScalableVectors() const = 0;
   virtual bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
                                      Align Alignment) const = 0;
+  virtual bool enableCSAVectorization() const = 0;
   virtual VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
   virtual bool hasArmWideBranch(bool Thumb) const = 0;
@@ -2940,6 +2945,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.hasActiveVectorLength(Opcode, DataType, Alignment);
   }
 
+  bool enableCSAVectorization() const override {
+    return Impl.enableCSAVectorization();
+  }
+
   VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const override {
     return Impl.getVPLegalizationStrategy(PI);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 11b07ac0b7fc47..dbf0cf888e168a 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -956,6 +956,8 @@ class TargetTransformInfoImplBase {
     return false;
   }
 
+  bool enableCSAVectorization() const { return false; }
+
   TargetTransformInfo::VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const {
     return TargetTransformInfo::VPLegalization(
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index 0f4d1355dd2bfe..7ef29a8cb36e49 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -27,6 +27,7 @@
 #define LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONLEGALITY_H
 
 #include "llvm/ADT/MapVector.h"
+#include "llvm/Analysis/CSADescriptors.h"
 #include "llvm/Analysis/LoopAccessAnalysis.h"
 #include "llvm/Support/TypeSize.h"
 #include "llvm/Transforms/Utils/LoopUtils.h"
@@ -257,6 +258,10 @@ class LoopVectorizationLegality {
   /// induction descriptor.
   using InductionList = MapVector<PHINode *, InductionDescriptor>;
 
+  /// CSAList contains the CSA descriptors for all the CSAs that were found
+  /// in the loop, rooted by their phis.
+  using CSAList = MapVector<PHINode *, CSADescriptor>;
+
   /// RecurrenceSet contains the phi nodes that are recurrences other than
   /// inductions and reductions.
   using RecurrenceSet = SmallPtrSet<const PHINode *, 8>;
@@ -309,6 +314,12 @@ class LoopVectorizationLegality {
   /// Returns True if V is a Phi node of an induction variable in this loop.
   bool isInductionPhi(const Value *V) const;
 
+  /// Returns the CSAs found in the loop.
+  const CSAList& getCSAs() const { return CSAs; }
+
+  /// Returns true if Phi is the root of a CSA in the loop.
+  bool isCSAPhi(PHINode *Phi) const { return CSAs.count(Phi) != 0; }
+
   /// Returns a pointer to the induction descriptor, if \p Phi is an integer or
   /// floating point induction.
   const InductionDescriptor *getIntOrFpInductionDescriptor(PHINode *Phi) const;
@@ -463,6 +474,10 @@ class LoopVectorizationLegality {
   void addInductionPhi(PHINode *Phi, const InductionDescriptor &ID,
                        SmallPtrSetImpl<Value *> &AllowedExit);
 
+  // Updates the vectorization state by adding \p Phi to the CSA list.
+  void addCSAPhi(PHINode *Phi, const CSADescriptor &CSADesc,
+                 SmallPtrSetImpl<Value *> &AllowedExit);
+
   /// The loop that we evaluate.
   Loop *TheLoop;
 
@@ -507,6 +522,9 @@ class LoopVectorizationLegality {
   /// variables can be pointers.
   InductionList Inductions;
 
+  /// Holds the conditional scalar assignments
+  CSAList CSAs;
+
   /// Holds all the casts that participate in the update chain of the induction
   /// variables, and that have been proven to be redundant (possibly under a
   /// runtime guard). These casts can be ignored when creating the vectorized
diff --git a/llvm/lib/Analysis/CMakeLists.txt b/llvm/lib/Analysis/CMakeLists.txt
index 393803fad89383..24ca426990d9ed 100644
--- a/llvm/lib/Analysis/CMakeLists.txt
+++ b/llvm/lib/Analysis/CMakeLists.txt
@@ -46,6 +46,7 @@ add_llvm_component_library(LLVMAnalysis
   CostModel.cpp
   CodeMetrics.cpp
   ConstantFolding.cpp
+  CSADescriptors.cpp
   CtxProfAnalysis.cpp
   CycleAnalysis.cpp
   DDG.cpp
diff --git a/llvm/lib/Analysis/CSADescriptors.cpp b/llvm/lib/Analysis/CSADescriptors.cpp
new file mode 100644
index 00000000000000..d0377c8c16de33
--- /dev/null
+++ b/llvm/lib/Analysis/CSADescriptors.cpp
@@ -0,0 +1,73 @@
+//=== llvm/Analysis/CSADescriptors.cpp - CSA Descriptors -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file "describes" conditional scalar assignments (CSA).
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/CSADescriptors.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+
+using namespace llvm;
+using namespace llvm::PatternMatch;
+
+#define DEBUG_TYPE "csa-descriptors"
+
+CSADescriptor CSADescriptor::isCSAPhi(PHINode *Phi, Loop *TheLoop) {
+  // Return CSADescriptor that describes a CSA that matches one of these
+  // patterns:
+  //   phi loop_inv, (select cmp, value, phi)
+  //   phi loop_inv, (select cmp, phi, value)
+  //   phi (select cmp, value, phi), loop_inv
+  //   phi (select cmp, phi, value), loop_inv
+  // If the CSA does not match any of these patterns, return a CSADescriptor
+  // that describes an InvalidCSA.
+
+  // Must be a scalar
+  Type *Type = Phi->getType();
+  if (!Type->isIntegerTy() && !Type->isFloatingPointTy() &&
+      !Type->isPointerTy())
+    return CSADescriptor();
+
+  // Match phi loop_inv, (select cmp, value, phi)
+  //    or phi loop_inv, (select cmp, phi, value)
+  //    or phi (select cmp, value, phi), loop_inv
+  //    or phi (select cmp, phi, value), loop_inv
+  if (Phi->getNumIncomingValues() != 2)
+    return CSADescriptor();
+  auto SelectInstIt = find_if(Phi->incoming_values(), [&Phi](Use &U) {
+    return match(U.get(), m_Select(m_Value(), m_Specific(Phi), m_Value())) ||
+           match(U.get(), m_Select(m_Value(), m_Value(), m_Specific(Phi)));
+  });
+  if (SelectInstIt == Phi->incoming_values().end())
+    return CSADescriptor();
+  auto LoopInvIt = find_if(Phi->incoming_values(), [&](Use &U) {
+    return U.get() != *SelectInstIt && TheLoop->isLoopInvariant(U.get());
+  });
+  if (LoopInvIt == Phi->incoming_values().end())
+    return CSADescriptor();
+
+  // Phi or Sel must be used only outside the loop,
+  // excluding if Phi use Sel or Sel use Phi
+  auto IsOnlyUsedOutsideLoop = [=](Value *V, Value *Ignore) {
+    return all_of(V->users(), [Ignore, TheLoop](User *U) {
+      if (U == Ignore)
+        return true;
+      if (auto *I = dyn_cast<Instruction>(U))
+        return !TheLoop->contains(I);
+      return true;
+    });
+  };
+  auto *Sel = cast<SelectInst>(SelectInstIt->get());
+  auto *LoopInv = LoopInvIt->get();
+  if (!IsOnlyUsedOutsideLoop(Phi, Sel) || !IsOnlyUsedOutsideLoop(Sel, Phi))
+    return CSADescriptor();
+
+  return CSADescriptor(Phi, Sel, LoopInv);
+}
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 2c26493bd3f1ca..4f882475b74e74 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1304,6 +1304,10 @@ bool TargetTransformInfo::preferEpilogueVectorization() const {
   return TTIImpl->preferEpilogueVectorization();
 }
 
+bool TargetTransformInfo::enableCSAVectorization() const {
+  return TTIImpl->enableCSAVectorization();
+}
+
 TargetTransformInfo::VPLegalization
 TargetTransformInfo::getVPLegalizationStrategy(const VPIntrinsic &VPI) const {
   return TTIImpl->getVPLegalizationStrategy(VPI);
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 537c62bb0aacd1..b76f2fc72c6f47 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1985,6 +1985,11 @@ bool RISCVTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
                   C2.ScaleCost, C2.ImmCost, C2.SetupCost);
 }
 
+bool RISCVTTIImpl::enableCSAVectorization() const {
+  return ST->hasVInstructions() &&
+         ST->getProcFamily() == RISCVSubtarget::SiFive7;
+}
+
 bool RISCVTTIImpl::isLegalMaskedCompressStore(Type *DataTy, Align Alignment) {
   auto *VTy = dyn_cast<VectorType>(DataTy);
   if (!VTy || VTy->isScalableTy())
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index cc69e1d118b5a1..17245150ec10ae 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -287,6 +287,10 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
     return TLI->isVScaleKnownToBeAPowerOfTwo();
   }
 
+  /// \returns true if the loop vectorizer should vectorize conditional
+  /// scalar assignments for the target.
+  bool enableCSAVectorization() const;
+
   /// \returns How the target needs this vector-predicated operation to be
   /// transformed.
   TargetTransformInfo::VPLegalization
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 66a779da8c25bc..9633ba9cc70ee9 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -79,6 +79,10 @@ static cl::opt<LoopVectorizeHints::ScalableForceKind>
                 "Scalable vectorization is available and favored when the "
                 "cost is inconclusive.")));
 
+static cl::opt<bool>
+    EnableCSA("enable-csa-vectorization", cl::init(false), cl::Hidden,
+              cl::desc("Control whether CSA loop vectorization is enabled"));
+
 /// Maximum vectorization interleave count.
 static const unsigned MaxInterleaveFactor = 16;
 
@@ -749,6 +753,15 @@ bool LoopVectorizationLegality::setupOuterLoopInductions() {
   return llvm::all_of(Header->phis(), IsSupportedPhi);
 }
 
+void LoopVectorizationLegality::addCSAPhi(
+    PHINode *Phi, const CSADescriptor &CSADesc,
+    SmallPtrSetImpl<Value *> &AllowedExit) {
+  assert(CSADesc.isValid() && "Expected Valid CSADescriptor");
+  LLVM_DEBUG(dbgs() << "LV: found legal CSA opportunity" << *Phi << "\n");
+  AllowedExit.insert(Phi);
+  CSAs.insert({Phi, CSADesc});
+}
+
 /// Checks if a function is scalarizable according to the TLI, in
 /// the sense that it should be vectorized and then expanded in
 /// multiple scalar calls. This is represented in the
@@ -866,14 +879,23 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
           continue;
         }
 
-        // As a last resort, coerce the PHI to a AddRec expression
-        // and re-try classifying it a an induction PHI.
+        // Try to coerce the PHI to an AddRec expression and re-try classifying
+        // it as an induction PHI.
         if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true) &&
             !IsDisallowedStridedPointerInduction(ID)) {
           addInductionPhi(Phi, ID, AllowedExit);
           continue;
         }
 
+        // Check if the PHI can be classified as a CSA PHI.
+        if (EnableCSA || (TTI->enableCSAVectorization() &&
+                          EnableCSA.getNumOccurrences() == 0)) {
+          if (auto CSADesc = CSADescriptor::isCSAPhi(Phi, TheLoop)) {
+            addCSAPhi(Phi, CSADesc, AllowedExit);
+            continue;
+          }
+        }
+
         reportVectorizationFailure("Found an unidentified PHI",
             "value that could not be identified as "
             "reduction is used outside the loop",
@@ -1555,11 +1577,15 @@ bool LoopVectorizationLegality::canFoldTailByMasking() const {
   for (const auto &Reduction : getReductionVars())
     ReductionLiveOuts.insert(Reduction.second.getLoopExitInstr());
 
+  SmallPtrSet<const Value *, 8> CSALiveOuts;
+  for (const auto &CSA: getCSAs())
+    CSALiveOuts.insert(CSA.second.getAssignment());
+
   // TODO: handle non-reduction outside users when tail is folded by masking.
   for (auto *AE : AllowedExit) {
     // Check that all users of allowed exit values are inside the loop or
-    // are the live-out of a reduction.
-    if (ReductionLiveOuts.count(AE))
+    // are the live-out of a reduction or a CSA
+    if (ReductionLiveOuts.count(AE) || CSALiveOuts.count(AE))
       continue;
     for (User *U : AE->users()) {
       Instruction *UI = cast<Instruction>(U);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 56f51e14a6eba9..5e45f500482826 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -180,6 +180,8 @@ const char LLVMLoopVectorizeFollowupEpilogue[] =
 STATISTIC(LoopsVectorized, "Number of loops vectorized");
 STATISTIC(LoopsAnalyzed, "Number of loops analyzed for vectorization");
 STATISTIC(LoopsEpilogueVectorized, "Number of epilogues vectorized");
+STATISTIC(CSAsVectorized,
+          "Number of conditional scalar assignments vectorized");
 
 static cl::opt<bool> EnableEpilogueVectorization(
     "enable-epilogue-vectorization", cl::init(true), cl::Hidden,
@@ -500,6 +502,10 @@ class InnerLoopVectorizer {
   virtual std::pair<BasicBlock *, Value *>
   createVectorizedLoopSkeleton(const SCEV2ValueTy &ExpandedSCEVs);
 
+  /// For all vectorized CSAs, replace uses of live-out scalar from the original
+  /// loop with the extracted scalar from the vector loop.
+  void fixCSALiveOuts(VPTransformState &State, VPlan &Plan);
+
   /// Fix the vectorized code, taking care of header phi's, live-outs, and more.
   void fixVectorizedLoop(VPTransformState &State, VPlan &Plan);
 
@@ -2932,6 +2938,25 @@ LoopVectorizationCostModel::getVectorIntrinsicCost(CallInst *CI,
                                    TargetTransformInfo::TCK_RecipThroughput);
 }
 
+void InnerLoopVectorizer::fixCSALiveOuts(VPTransformState &State, VPlan &Plan) {
+  for (const auto &CSA: Plan.getCSAStates()) {
+    VPCSADataUpdateRecipe *VPDataUpdate = CSA.second->getDataUpdate();
+    assert(VPDataUpdate &&
+           "VPDataUpdate must have been introduced prior to fixing live outs");
+    Value *V = VPDataUpdate->getUnderlyingValue();
+    Value *ExtractedScalar = State.get(CSA.second->getExtractScalarRecipe(), 0,
+                                       /*NeedsScalar=*/true);
+    // Fix LCSSAPhis
+    llvm::SmallPtrSet<PHINode *, 2> ToFix;
+    for (User *U : V->users())
+      if (auto *Phi = dyn_cast<PHINode>(U);
+          Phi && Phi->getParent() == LoopExitBlock)
+        ToFix.insert(Phi);
+    for (PHINode *Phi : ToFix)
+      Phi->addIncoming(ExtractedScalar, LoopMiddleBlock);
+  }
+}
+
 void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                                             VPlan &Plan) {
   // Fix widened non-induction PHIs by setting up the PHI operands.
@@ -2972,6 +2997,8 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                    getOrCreateVectorTripCount(VectorLoop->getLoopPreheader()),
                    IVEndValues[Entry.first], LoopMiddleBlock,
                    VectorLoop->getHeader(), Plan, State);
+
+    fixCSALiveOuts(State, Plan);
   }
 
   // Fix live-out phis not already fixed earlier.
@@ -4110,7 +4137,6 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // found modulo the vectorization factor is not zero, try to fold the tail
   // by masking.
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
-  setTailFoldingStyles(MaxFactors.ScalableVF.isScalable(), UserIC);
   if (foldTailByMasking()) {
     if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
       LLVM_DEBUG(
@@ -4482,6 +4508,9 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
       case VPDef::VPEVLBasedIVPHISC:
       case VPDef::VPPredInstPHISC:
       case VPDef::VPBranchOnMaskSC:
+      case VPRecipeBase::VPCSADataUpdateSC:
+      case VPRecipeBase::VPCSAExtractScalarSC:
+      case VPRecipeBase::VPCSAHe...
[truncated]


github-actions bot commented Aug 29, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@artagnon artagnon left a comment

I've done a very rough initial review, to get the ball rolling. Can you kindly look at iv-select-cmp.ll (and associated tests) to increase test coverage to include negative tests as well? It would be nice if this patch created changes in iv-select-cmp.ll, although I'll need to review in more detail to find out why.

@artagnon
Contributor

Another high-level question: is it possible to not nest the test under RISCV? I realize that you've said it's disabled in BasicTTIImpl, but having a separate csa.ll under each target wouldn't be ideal, no?

Contributor Author

@michaelmaitland michaelmaitland left a comment

@artagnon thanks so much for the initial review!

I've addressed comments and made some responses :)

@michaelmaitland
Contributor Author

> It would be nice if this patch created changes in iv-select-cmp.ll, although I'll need to review in more detail to find out why.

CSA vectorization is not enabled by default.

@michaelmaitland
Contributor Author

> is it possible to not nest the test under RISCV? I realize that you've said it's disabled in BasicTTIImpl, but having a separate csa.ll under each target wouldn't be ideal, no?

I can do this. It's a good idea.

@artagnon
Contributor

> This PR contains a series of commits which together implement an initial version of conditional scalar vectorization.
>
> A detailed explanation of the changes can be found here.

Kindly note that the only merge option is squash-and-merge, so this description will be the final commit message: "series of commits" doesn't make sense there. Also, it would be nice to have a short explanation in the commit message itself (possibly with a few examples). Finally, kindly ensure that markdown doesn't end up in the git log output: commit messages are plain text.

@artagnon
Contributor

> > is it possible to not nest the test under RISCV? I realize that you've said it's disabled in BasicTTIImpl, but having a separate csa.ll under each target wouldn't be ideal, no?
>
> I can do this. It's a good idea.

Perhaps a good idea to name the test conditional-scalar-assignment.ll instead of the more cryptic csa.ll.

@michaelmaitland
Contributor Author

I have updated the tests to use -NOT: vector.body for the cases that cannot be vectorized.
I have also moved the tests up a directory. I had to keep a copy of some in the RISCV target directory since they depend on TTI specific to RISC-V (i.e. pointer size).

@michaelmaitland michaelmaitland force-pushed the csa-vectorization branch 2 times, most recently from c59efaf to ecd9285 Compare August 30, 2024 18:50
Contributor

@artagnon artagnon left a comment

I did a closer inspection of the patch, and found mostly small things to fix. Otherwise, I'm a fan of the approach you've taken: the code is quite easy to understand, and there is little bloat. Thanks again for the patch, and I hope the other reviewers also have similar sentiments!

On a minor note, it might be helpful if you could locally enable CSA vectorization, and post the diff of the test changes somewhere, just to make sure we're doing the right thing.

michaelmaitland added a commit to michaelmaitland/llvm-project that referenced this pull request Aug 31, 2024
The docstring of this function:

> Returns true if this VPInstruction's operands are single scalars and the
> result is also a single scalar.

ExplicitVectorLength fits this description. I don't have a test for it now,
but it is needed by llvm#106560.
@michaelmaitland michaelmaitland force-pushed the csa-vectorization branch 3 times, most recently from a03e73b to c2e26ff Compare September 3, 2024 18:18
@michaelmaitland
Contributor Author

ping


/// A Conditional Scalar Assignment (CSA) is an assignment from an initial
/// scalar that may or may not occur.
class CSADescriptor {
Collaborator

Is there a reason to introduce a whole new class + analysis file instead of adding a new RecurKind in IVDescriptors?

@Mel-Chen 's #67812 seems to be quite similar work which just extends the existing code. This CSA patch is more general in that it can handle values that aren't just the induction variable, but I think the amount of extra code can be cut down a little.

Contributor Author

> Is there a reason to introduce a whole new class + analysis file instead of adding a new RecurKind in IVDescriptors?

I’ve gone ahead and put CSADescriptor inside the IVDescriptors file. But there isn't anything that CSADescriptor looks to reuse from RecurrenceDescriptor or InductionDescriptor, so I think it warrants a new class. CSADescriptor also adds new functions and data members that aren't relevant for other recurrences or inductions. I expect the CSADescriptor class to diverge even further as we extend CSA to support IR with control flow.

public:
  /// If Phi is the root of a CSA, return the CSADescriptor of the CSA rooted by
  /// Phi. Otherwise, return a CSADescriptor with IsValidCSA set to false.
  static CSADescriptor isCSAPhi(PHINode *Phi, Loop *TheLoop);
Collaborator

I think returning a std::optional instead of having valid/invalid descriptors would work better.

Contributor Author

This was the approach that RecurrenceDescriptor and InductionDescriptor used to take. They now return bool here and take a descriptor by reference. I have updated CSADescriptor to use the new approach.
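The two API shapes discussed in this exchange can be contrasted with a small sketch. `Phi`, `Descriptor`, `matchOptional`, and `match` are hypothetical stand-ins invented for illustration. They are not the LLVM classes or the signatures used in this patch.

```cpp
#include <optional>

// Hypothetical stand-in for a PHI node; not an LLVM type.
struct Phi {
  bool LooksLikeCSA;
};

// Hypothetical descriptor used only to compare the two API shapes.
class Descriptor {
  const Phi *Root = nullptr;

public:
  Descriptor() = default;
  explicit Descriptor(const Phi *P) : Root(P) {}
  const Phi *getRoot() const { return Root; }

  // Shape 1: the result either holds a descriptor or is empty.
  static std::optional<Descriptor> matchOptional(const Phi *P) {
    if (!P || !P->LooksLikeCSA)
      return std::nullopt;
    return Descriptor(P);
  }

  // Shape 2: report success with a bool and fill the descriptor by reference.
  static bool match(const Phi *P, Descriptor &Out) {
    if (!P || !P->LooksLikeCSA)
      return false;
    Out = Descriptor(P);
    return true;
  }
};
```

Both shapes avoid a descriptor that carries its own validity flag. The second one matches the bool-plus-reference convention described in the reply above.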

; EVL-NEXT: [[TMP9:%.*]] = sext <vscale x 4 x i32> [[VP_OP_LOAD]] to <vscale x 4 x i64>
; EVL-NEXT: [[TMP10:%.*]] = icmp slt <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP9]]
; EVL-NEXT: [[CSA_COND_ANYACTIVE:%.*]] = call i1 @llvm.vp.reduce.or.nxv4i1(i1 false, <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP5]])
; EVL-NEXT: [[CSA_VL_SEL]] = select i1 [[CSA_COND_ANYACTIVE]], i32 [[TMP5]], i32 [[CSA_VL_PHI]]
Collaborator

I'm curious what the eventual asm looks like for this. I had a look through the RVV spec and couldn't find an instruction which selected between whole registers based on a single condition, just merging moves which do a per-lane select. I may have missed something though. Does this require branches inside the loop?

I ask because I'm trying to evaluate this work from an SVE perspective, and I think I would prefer to do the extract inside the loop using clastb with a scalar phi instead of outside the loop with a branch on the inside to conditionally update the whole register.

I guess we may end up with diverging recipes for vectorizing the same pattern.
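The two strategies compared in this comment can be modelled in plain C++. This is a scalar simulation under stated assumptions, not generated code: `VF`, `lastActiveOr`, `loadChunk`, `extractAfterLoop`, and `extractInLoop` are illustrative names, `Cond` is assumed to be as long as `A`, and `lastActiveOr` only approximates a "last active element, else keep the old scalar" operation such as the one attributed to `clastb` above.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // illustrative vector width, not from the patch

using MaskVec = std::array<bool, VF>;
using DataVec = std::array<int, VF>;

// Value of the last active lane, or Fallback when no lane is active.
int lastActiveOr(const MaskVec &M, const DataVec &D, int Fallback) {
  int R = Fallback;
  for (std::size_t L = 0; L < VF; ++L)
    if (M[L])
      R = D[L];
  return R;
}

// Loads one chunk of condition and data; lanes past the end stay inactive.
void loadChunk(const std::vector<int> &A, const std::vector<bool> &Cond,
               std::size_t I, MaskVec &M, DataVec &D) {
  M.fill(false);
  D.fill(0);
  for (std::size_t L = 0; L < VF && I + L < A.size(); ++L) {
    M[L] = Cond[I + L];
    D[L] = A[I + L];
  }
}

// Strategy 1: carry whole mask/data vectors and extract once after the loop.
int extractAfterLoop(const std::vector<int> &A, const std::vector<bool> &Cond,
                     int Init) {
  MaskVec Mask{};
  DataVec Data{};
  for (std::size_t I = 0; I < A.size(); I += VF) {
    MaskVec M;
    DataVec D;
    loadChunk(A, Cond, I, M, D);
    bool Any = false;
    for (bool B : M)
      Any = Any || B;
    if (Any) { // whole-vector update guarded by a single scalar condition
      Mask = M;
      Data = D;
    }
  }
  return lastActiveOr(Mask, Data, Init);
}

// Strategy 2: keep a scalar recurrence and extract inside every iteration.
int extractInLoop(const std::vector<int> &A, const std::vector<bool> &Cond,
                  int Init) {
  int T = Init;
  for (std::size_t I = 0; I < A.size(); I += VF) {
    MaskVec M;
    DataVec D;
    loadChunk(A, Cond, I, M, D);
    T = lastActiveOr(M, D, T);
  }
  return T;
}
```

Both functions return the same scalar. They differ in where the extraction happens and in whether the loop-carried state is a pair of vectors or a single scalar.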

Contributor Author

This is the generated code:

# %bb.0:                                # %entry
        sext.w  a3, a0
        blez    a3, .LBB0_6
# %bb.1:                                # %loop.preheader
        li      a3, 0
        li      a6, 0
        slli    a0, a0, 32
        srli    a0, a0, 32
        and     a5, a5, a7
        vsetvli a7, zero, e8, mf2, ta, ma
        vmclr.m v8
        j       .LBB0_3
.LBB0_2:                                # %vector.body
        snez    a6, t1
        vsetvli t1, zero, e8, mf2, ta, ma
        vmv.v.x v14, a6
        vmsne.vi        v0, v14, 0
        vmand.mm        v9, v9, v0
        vmandn.mm       v8, v8, v0
        vmor.mm v8, v9, v8
        vsetvli zero, zero, e32, m2, ta, ma
        vmerge.vvm      v10, v10, v12, v0
        sub     a5, a5, a4
        add     a3, a7, a3
        mv      a6, t0
        beqz    a5, .LBB0_5
.LBB0_3:                                # %vector.body
        sub     t0, a0, a3
        vsetvli a7, t0, e8, mf2, ta, ma
        slli    t1, a3, 2
        add     t1, a1, t1
        vle32.v v12, (t1)
        vsetvli t1, zero, e64, m4, ta, ma
        vsext.vf2       v16, v12
        vmsgt.vx        v9, v16, a2
        vsetvli zero, t0, e8, mf2, ta, ma
        vcpop.m t1, v9
        mv      t0, a7
        bnez    t1, .LBB0_2
# %bb.4:                                # %vector.body
        mv      t0, a6
        j       .LBB0_2
.LBB0_5:                                # %middle.block
        slli    t0, t0, 32
        srli    a0, t0, 32
        vsetvli zero, a0, e32, m2, ta, ma
        vid.v   v12
        vsetivli        zero, 1, e32, m1, ta, ma
        vmv.v.i v9, -1
        vmv1r.v v0, v8
        vsetvli zero, a0, e32, m2, ta, ma
        vredmax.vs      v9, v12, v9, v0.t
        vmv.x.s a0, v9
        slli    a1, a0, 32
        srli    a1, a1, 32
        vsetivli        zero, 1, e32, m2, ta, ma
        vslidedown.vx   v8, v10, a1
        vmv.x.s a1, v8
        srai    a0, a0, 63
        or      a0, a0, a1
        ret
.LBB0_6:
        li      a0, -1
        ret
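As a rough reading of the `%middle.block` sequence above (an interpretation, not something stated in the thread): `vid.v` produces lane ids, the masked `vredmax.vs` seeded with -1 finds the highest active lane id, and `vslidedown.vx` fetches that lane. The final `srai`/`or` pair appears to rely on the initial value being -1 in this test. The sketch below models the general idea with an explicit `Init`. The names `lastActiveIndex` and `extractLastActive` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the last active lane among the first VL lanes, or -1 if none.
// Models a masked max-reduction over lane ids, seeded with -1.
int32_t lastActiveIndex(const std::vector<bool> &Mask, uint32_t VL) {
  int32_t Idx = -1;
  for (uint32_t L = 0; L < VL && L < Mask.size(); ++L)
    if (Mask[L])
      Idx = std::max(Idx, static_cast<int32_t>(L));
  return Idx;
}

// Picks Data[Idx] when some lane was active and Init otherwise.
int32_t extractLastActive(const std::vector<int32_t> &Data,
                          const std::vector<bool> &Mask, uint32_t VL,
                          int32_t Init) {
  int32_t Idx = lastActiveIndex(Mask, VL);
  return Idx < 0 ? Init : Data[static_cast<std::size_t>(Idx)];
}
```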

Contributor Author

@michaelmaitland left a comment

@huntergr-arm thanks for the review. I have addressed or responded to your comments.

assert(VPDataUpdate &&
       "VPDataUpdate must have been introduced prior to fixing live outs");
Value *V = VPDataUpdate->getUnderlyingValue();
Value *ExtractedScalar =
Contributor

Can those fixups be directly done in VPlan, similar to how other live-outs are handled (with the exception of induction users, which is the last remaining case that needs fixing up outside of VPlan)?

Contributor Author

addressed

@@ -669,6 +673,65 @@ Value *VPInstruction::generate(VPTransformState &State) {
    }
    return NewPhi;
  }
  case VPInstruction::CSAMaskPhi: {
    IRBuilder<>::InsertPointGuard Guard(State.Builder);
    State.Builder.SetInsertPoint(State.CFG.PrevBB->getFirstNonPHI());
Contributor

Why does the builder need to be reset here?

Contributor Author

addressed

@@ -3497,6 +3695,8 @@ class VPlan {
  /// live-outs are fixed via VPLiveOut::fixPhi.
  MapVector<PHINode *, VPLiveOut *> LiveOuts;

  MapVector<PHINode *, VPCSAState *> CSAStates;
Contributor

Why does this need to be part of VPlan? Is it not possible to encode all required info in recipes?

Contributor Author

removed

@@ -0,0 +1,2932 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Contributor

Can the general code-gen parts be tested independently of RISCV? Could you add a test checking the printed VPlan containing the CSA recipes?

Contributor Author

To my knowledge, there is no way to make a generic target support predicated vectorization. As a result, I have moved these tests back to RISC-V. I think we can consider moving them back out once a generic target can handle it. We do something similar for other code that uses this option. WDYT?

/// VPCSAState holds information required to vectorize a conditional scalar
/// assignment.
class VPCSAState {
  VPValue *VPInitScalar = nullptr;
Contributor

Would be good to avoid holding references to VPValues/Recipes, as they may be removed/replaced, in case the state here gets stale.

Contributor Author

removed

@Mel-Chen
Contributor

> @Mel-Chen's #67812 seems to be quite similar work which just extends the existing code. This CSA patch is more general in that it can handle values that aren't just the induction variable, but I think the amount of extra code can be cut down a little.

#67812 only handles increasing induction variables because FindLastIV was designed for the index part of min/max-with-index reductions, where the index is typically an increasing induction variable.
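To see why an increasing index makes this case simpler, note that the last index at which the condition held is also the largest such index, so a plain max reduction can find it. The sketch below assumes a max reduction with a sentinel for "never assigned". That implementation detail is an assumption and is not stated in this thread. `findLastIndex` and `Sentinel` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Last index I with Cond[I] set, or Start if there is none.
// Correct only because I increases monotonically, so "last" equals "max".
int64_t findLastIndex(const std::vector<bool> &Cond, int64_t Start) {
  // Sentinel marks "no index selected yet"; it is never a valid index here.
  constexpr int64_t Sentinel = std::numeric_limits<int64_t>::min();
  int64_t Red = Sentinel;
  for (std::size_t I = 0; I < Cond.size(); ++I)
    Red = std::max(Red, Cond[I] ? static_cast<int64_t>(I) : Sentinel);
  return Red == Sentinel ? Start : Red;
}
```

The same trick does not carry over when the assigned value is arbitrary data such as `a[i]`, because the last assigned value need not be the largest one. That is the more general case this PR targets.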

Currently, I do not have plans to support FindLast, mainly because RISC-V has vfirst but no vlast instruction. However, I can still provide some suggestions.
For idiom recognition, we could directly extend the RecurrenceDescriptor.
For vectorization, I recommend an in-loop reduction approach for this idiom, because an out-of-loop reduction would require an additional recurrence phi in the vectorized loop, which makes vectorization too costly. To implement the in-loop reduction, it would be best to introduce a new intrinsic, vector.extractlast(%scalar_start, %vec_val, %vec_mask), with the following semantics:

  %c = vcpop(%vec_mask)
  if(%c == 0)
    return %scalar_start;
  %vec_com = compress(%vec_val, %vec_mask)
  return extractelement %vec_com, %c - 1

For targets without a corresponding instruction, such as RISC-V, we could implement the same semantics in the backend using multiple instructions (e.g., vcpop + beqz + vcompress) if it is profitable. In VPlan, we might need to create a new recipe, VPExtractLastRecipe, to emit this intrinsic. I hope these suggestions are helpful.
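The proposed semantics can be written as a scalar reference model. This is only a sketch of the pseudo-code above. The function name `extractLast` and the use of `std::vector` for the operands are illustrative choices, and `Val` and `Mask` are assumed to have the same length.

```cpp
#include <cstddef>
#include <vector>

// Reference model: last active element of Val, or ScalarStart if none.
int extractLast(int ScalarStart, const std::vector<int> &Val,
                const std::vector<bool> &Mask) {
  // vcpop: count the active lanes.
  std::size_t C = 0;
  for (bool B : Mask)
    C += B ? 1 : 0;
  if (C == 0)
    return ScalarStart;

  // compress: pack the active elements to the front, preserving order.
  std::vector<int> Com;
  for (std::size_t I = 0; I < Val.size(); ++I)
    if (Mask[I])
      Com.push_back(Val[I]);

  // extractelement at C - 1: the last packed element.
  return Com[C - 1];
}
```

In an in-loop reduction, the result of one call would feed the `ScalarStart` operand of the next iteration, so the loop-carried state stays a single scalar.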

In the end, if there is indeed a need to support this semantic, please let me know, and I can discuss internally whether to include it in the plan. :)

@michaelmaitland
Contributor Author

rebase; ping

huntergr-arm added a commit that referenced this pull request Nov 14, 2024
As discussed in #112738, it may be better to have an intrinsic to represent vector element extracts based on mask bits. This intrinsic is for the case of extracting the last active element, if any, or a default value if the mask is all-false.

The target-agnostic SelectionDAG lowering is similar to the IR in #106560.
akshayrdeodhar pushed a commit to akshayrdeodhar/llvm-project that referenced this pull request Nov 18, 2024
As discussed in llvm#112738, it may be better to have an intrinsic to represent vector element extracts based on mask bits. This intrinsic is for the case of extracting the last active element, if any, or a default value if the mask is all-false.

The target-agnostic SelectionDAG lowering is similar to the IR in llvm#106560.
@michaelmaitland
Contributor Author

ping
