-
Notifications
You must be signed in to change notification settings - Fork 12.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LV][VPlan] Add initial support for CSA vectorization #106560
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-llvm-transforms Author: Michael Maitland (michaelmaitland) ChangesThis PR contains a series of commits which together implement an initial version of conditional scalar vectorization. I will edit this description with a link to the RFC which gives a more in depth explanation of the changes. This patch is further tested in llvm/llvm-test-suite#155. Patch is 331.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106560.diff 19 Files Affected:
diff --git a/llvm/include/llvm/Analysis/CSADescriptors.h b/llvm/include/llvm/Analysis/CSADescriptors.h
new file mode 100644
index 00000000000000..3f95b3484d1e22
--- /dev/null
+++ b/llvm/include/llvm/Analysis/CSADescriptors.h
@@ -0,0 +1,78 @@
+//===- llvm/Analysis/CSADescriptors.h - CSA Descriptors --*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file "describes" conditional scalar assignments (CSA).
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Value.h"
+
+#ifndef LLVM_ANALYSIS_SIFIVECSADESCRIPTORS_H
+#define LLVM_ANALYSIS_SIFIVECSADESCRIPTORS_H
+
+namespace llvm {
+
+/// A Conditional Scalar Assignment (CSA) is an assignment from an initial
+/// scalar that may or may not occur.
+class CSADescriptor {
+ /// If the conditional assignment occurs inside a loop, then Phi chooses
+ /// the value of the assignment from the entry block or the loop body block.
+ PHINode *Phi = nullptr;
+
+ /// The initial value of the CSA. If the condition guarding the assignment is
+ /// not met, then the assignment retains this value.
+ Value *InitScalar = nullptr;
+
+ /// The Instruction that conditionally assigned to inside the loop.
+ Instruction *Assignment = nullptr;
+
+ /// Create a CSA Descriptor that models an invalid CSA.
+ CSADescriptor() = default;
+
+ /// Create a CSA Descriptor that models a valid CSA with its members
+ /// initialized correctly.
+ CSADescriptor(PHINode *Phi, Instruction *Assignment, Value *InitScalar)
+ : Phi(Phi), InitScalar(InitScalar), Assignment(Assignment) {}
+
+public:
+ /// If Phi is the root of a CSA, return the CSADescriptor of the CSA rooted by
+ /// Phi. Otherwise, return a CSADescriptor with IsValidCSA set to false.
+ static CSADescriptor isCSAPhi(PHINode *Phi, Loop *TheLoop);
+
+ operator bool() const { return isValid(); }
+
+ /// Returns whether SI is the Assignment in CSA
+ static bool isCSASelect(CSADescriptor Desc, SelectInst *SI) {
+ return Desc.getAssignment() == SI;
+ }
+
+ /// Return whether this CSADescriptor models a valid CSA.
+ bool isValid() const { return Phi && InitScalar && Assignment; }
+
+ /// Return the PHI that roots this CSA.
+ PHINode *getPhi() const { return Phi; }
+
+ /// Return the initial value of the CSA. This is the value if the conditional
+ /// assignment does not occur.
+ Value *getInitScalar() const { return InitScalar; }
+
+ /// The Instruction that is used after the loop
+ Instruction *getAssignment() const { return Assignment; }
+
+ /// Return the condition that this CSA is conditional upon.
+ Value *getCond() const {
+ if (auto *SI = dyn_cast_or_null<SelectInst>(Assignment))
+ return SI->getCondition();
+ return nullptr;
+ }
+};
+} // namespace llvm
+
+#endif // LLVM_ANALYSIS_CSADESCRIPTORS_H
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index b2124c6106198e..d0192f7d90a812 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1767,6 +1767,10 @@ class TargetTransformInfo {
: EVLParamStrategy(EVLParamStrategy), OpStrategy(OpStrategy) {}
};
+ /// \returns true if the loop vectorizer should vectorize conditional
+ /// scalar assignments for the target.
+ bool enableCSAVectorization() const;
+
/// \returns How the target needs this vector-predicated operation to be
/// transformed.
VPLegalization getVPLegalizationStrategy(const VPIntrinsic &PI) const;
@@ -2175,6 +2179,7 @@ class TargetTransformInfo::Concept {
virtual bool supportsScalableVectors() const = 0;
virtual bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
Align Alignment) const = 0;
+ virtual bool enableCSAVectorization() const = 0;
virtual VPLegalization
getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
virtual bool hasArmWideBranch(bool Thumb) const = 0;
@@ -2940,6 +2945,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
return Impl.hasActiveVectorLength(Opcode, DataType, Alignment);
}
+ bool enableCSAVectorization() const override {
+ return Impl.enableCSAVectorization();
+ }
+
VPLegalization
getVPLegalizationStrategy(const VPIntrinsic &PI) const override {
return Impl.getVPLegalizationStrategy(PI);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 11b07ac0b7fc47..dbf0cf888e168a 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -956,6 +956,8 @@ class TargetTransformInfoImplBase {
return false;
}
+ bool enableCSAVectorization() const { return false; }
+
TargetTransformInfo::VPLegalization
getVPLegalizationStrategy(const VPIntrinsic &PI) const {
return TargetTransformInfo::VPLegalization(
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index 0f4d1355dd2bfe..7ef29a8cb36e49 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -27,6 +27,7 @@
#define LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONLEGALITY_H
#include "llvm/ADT/MapVector.h"
+#include "llvm/Analysis/CSADescriptors.h"
#include "llvm/Analysis/LoopAccessAnalysis.h"
#include "llvm/Support/TypeSize.h"
#include "llvm/Transforms/Utils/LoopUtils.h"
@@ -257,6 +258,10 @@ class LoopVectorizationLegality {
/// induction descriptor.
using InductionList = MapVector<PHINode *, InductionDescriptor>;
+ /// CSAList contains the CSA descriptors for all the CSAs that were found
+ /// in the loop, rooted by their phis.
+ using CSAList = MapVector<PHINode *, CSADescriptor>;
+
/// RecurrenceSet contains the phi nodes that are recurrences other than
/// inductions and reductions.
using RecurrenceSet = SmallPtrSet<const PHINode *, 8>;
@@ -309,6 +314,12 @@ class LoopVectorizationLegality {
/// Returns True if V is a Phi node of an induction variable in this loop.
bool isInductionPhi(const Value *V) const;
+ /// Returns the CSAs found in the loop.
+ const CSAList& getCSAs() const { return CSAs; }
+
+ /// Returns true if Phi is the root of a CSA in the loop.
+ bool isCSAPhi(PHINode *Phi) const { return CSAs.count(Phi) != 0; }
+
/// Returns a pointer to the induction descriptor, if \p Phi is an integer or
/// floating point induction.
const InductionDescriptor *getIntOrFpInductionDescriptor(PHINode *Phi) const;
@@ -463,6 +474,10 @@ class LoopVectorizationLegality {
void addInductionPhi(PHINode *Phi, const InductionDescriptor &ID,
SmallPtrSetImpl<Value *> &AllowedExit);
+ // Updates the vetorization state by adding \p Phi to the CSA list.
+ void addCSAPhi(PHINode *Phi, const CSADescriptor &CSADesc,
+ SmallPtrSetImpl<Value *> &AllowedExit);
+
/// The loop that we evaluate.
Loop *TheLoop;
@@ -507,6 +522,9 @@ class LoopVectorizationLegality {
/// variables can be pointers.
InductionList Inductions;
+ /// Holds the conditional scalar assignments
+ CSAList CSAs;
+
/// Holds all the casts that participate in the update chain of the induction
/// variables, and that have been proven to be redundant (possibly under a
/// runtime guard). These casts can be ignored when creating the vectorized
diff --git a/llvm/lib/Analysis/CMakeLists.txt b/llvm/lib/Analysis/CMakeLists.txt
index 393803fad89383..24ca426990d9ed 100644
--- a/llvm/lib/Analysis/CMakeLists.txt
+++ b/llvm/lib/Analysis/CMakeLists.txt
@@ -46,6 +46,7 @@ add_llvm_component_library(LLVMAnalysis
CostModel.cpp
CodeMetrics.cpp
ConstantFolding.cpp
+ CSADescriptors.cpp
CtxProfAnalysis.cpp
CycleAnalysis.cpp
DDG.cpp
diff --git a/llvm/lib/Analysis/CSADescriptors.cpp b/llvm/lib/Analysis/CSADescriptors.cpp
new file mode 100644
index 00000000000000..d0377c8c16de33
--- /dev/null
+++ b/llvm/lib/Analysis/CSADescriptors.cpp
@@ -0,0 +1,73 @@
+//=== llvm/Analysis/CSADescriptors.cpp - CSA Descriptors -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file "describes" conditional scalar assignments (CSA).
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/CSADescriptors.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+
+using namespace llvm;
+using namespace llvm::PatternMatch;
+
+#define DEBUG_TYPE "csa-descriptors"
+
+CSADescriptor CSADescriptor::isCSAPhi(PHINode *Phi, Loop *TheLoop) {
+ // Return CSADescriptor that describes a CSA that matches one of these
+ // patterns:
+ // phi loop_inv, (select cmp, value, phi)
+ // phi loop_inv, (select cmp, phi, value)
+ // phi (select cmp, value, phi), loop_inv
+ // phi (select cmp, phi, value), loop_inv
+ // If the CSA does not match any of these paterns, return a CSADescriptor
+ // that describes an InvalidCSA.
+
+ // Must be a scalar
+ Type *Type = Phi->getType();
+ if (!Type->isIntegerTy() && !Type->isFloatingPointTy() &&
+ !Type->isPointerTy())
+ return CSADescriptor();
+
+ // Match phi loop_inv, (select cmp, value, phi)
+ // or phi loop_inv, (select cmp, phi, value)
+ // or phi (select cmp, value, phi), loop_inv
+ // or phi (select cmp, phi, value), loop_inv
+ if (Phi->getNumIncomingValues() != 2)
+ return CSADescriptor();
+ auto SelectInstIt = find_if(Phi->incoming_values(), [&Phi](Use &U) {
+ return match(U.get(), m_Select(m_Value(), m_Specific(Phi), m_Value())) ||
+ match(U.get(), m_Select(m_Value(), m_Value(), m_Specific(Phi)));
+ });
+ if (SelectInstIt == Phi->incoming_values().end())
+ return CSADescriptor();
+ auto LoopInvIt = find_if(Phi->incoming_values(), [&](Use &U) {
+ return U.get() != *SelectInstIt && TheLoop->isLoopInvariant(U.get());
+ });
+ if (LoopInvIt == Phi->incoming_values().end())
+ return CSADescriptor();
+
+ // Phi or Sel must be used only outside the loop,
+ // excluding if Phi use Sel or Sel use Phi
+ auto IsOnlyUsedOutsideLoop = [=](Value *V, Value *Ignore) {
+ return all_of(V->users(), [Ignore, TheLoop](User *U) {
+ if (U == Ignore)
+ return true;
+ if (auto *I = dyn_cast<Instruction>(U))
+ return !TheLoop->contains(I);
+ return true;
+ });
+ };
+ auto *Sel = cast<SelectInst>(SelectInstIt->get());
+ auto *LoopInv = LoopInvIt->get();
+ if (!IsOnlyUsedOutsideLoop(Phi, Sel) || !IsOnlyUsedOutsideLoop(Sel, Phi))
+ return CSADescriptor();
+
+ return CSADescriptor(Phi, Sel, LoopInv);
+}
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 2c26493bd3f1ca..4f882475b74e74 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1304,6 +1304,10 @@ bool TargetTransformInfo::preferEpilogueVectorization() const {
return TTIImpl->preferEpilogueVectorization();
}
+bool TargetTransformInfo::enableCSAVectorization() const {
+ return TTIImpl->enableCSAVectorization();
+}
+
TargetTransformInfo::VPLegalization
TargetTransformInfo::getVPLegalizationStrategy(const VPIntrinsic &VPI) const {
return TTIImpl->getVPLegalizationStrategy(VPI);
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 537c62bb0aacd1..b76f2fc72c6f47 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1985,6 +1985,11 @@ bool RISCVTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
C2.ScaleCost, C2.ImmCost, C2.SetupCost);
}
+bool RISCVTTIImpl::enableCSAVectorization() const {
+ return ST->hasVInstructions() &&
+ ST->getProcFamily() == RISCVSubtarget::SiFive7;
+}
+
bool RISCVTTIImpl::isLegalMaskedCompressStore(Type *DataTy, Align Alignment) {
auto *VTy = dyn_cast<VectorType>(DataTy);
if (!VTy || VTy->isScalableTy())
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index cc69e1d118b5a1..17245150ec10ae 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -287,6 +287,10 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
return TLI->isVScaleKnownToBeAPowerOfTwo();
}
+ /// \returns true if the loop vectorizer should vectorize conditional
+ /// scalar assignments for the target.
+ bool enableCSAVectorization() const;
+
/// \returns How the target needs this vector-predicated operation to be
/// transformed.
TargetTransformInfo::VPLegalization
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 66a779da8c25bc..9633ba9cc70ee9 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -79,6 +79,10 @@ static cl::opt<LoopVectorizeHints::ScalableForceKind>
"Scalable vectorization is available and favored when the "
"cost is inconclusive.")));
+static cl::opt<bool>
+ EnableCSA("enable-csa-vectorization", cl::init(false), cl::Hidden,
+ cl::desc("Control whether CSA loop vectorization is enabled"));
+
/// Maximum vectorization interleave count.
static const unsigned MaxInterleaveFactor = 16;
@@ -749,6 +753,15 @@ bool LoopVectorizationLegality::setupOuterLoopInductions() {
return llvm::all_of(Header->phis(), IsSupportedPhi);
}
+void LoopVectorizationLegality::addCSAPhi(
+ PHINode *Phi, const CSADescriptor &CSADesc,
+ SmallPtrSetImpl<Value *> &AllowedExit) {
+ assert(CSADesc.isValid() && "Expected Valid CSADescriptor");
+ LLVM_DEBUG(dbgs() << "LV: found legal CSA opportunity" << *Phi << "\n");
+ AllowedExit.insert(Phi);
+ CSAs.insert({Phi, CSADesc});
+}
+
/// Checks if a function is scalarizable according to the TLI, in
/// the sense that it should be vectorized and then expanded in
/// multiple scalar calls. This is represented in the
@@ -866,14 +879,23 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
continue;
}
- // As a last resort, coerce the PHI to a AddRec expression
- // and re-try classifying it a an induction PHI.
+ // Try to coerce the PHI to a AddRec expression and re-try classifying
+ // it a an induction PHI.
if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true) &&
!IsDisallowedStridedPointerInduction(ID)) {
addInductionPhi(Phi, ID, AllowedExit);
continue;
}
+ // Check if the PHI can be classified as a CSA PHI.
+ if (EnableCSA || (TTI->enableCSAVectorization() &&
+ EnableCSA.getNumOccurrences() == 0)) {
+ if (auto CSADesc = CSADescriptor::isCSAPhi(Phi, TheLoop)) {
+ addCSAPhi(Phi, CSADesc, AllowedExit);
+ continue;
+ }
+ }
+
reportVectorizationFailure("Found an unidentified PHI",
"value that could not be identified as "
"reduction is used outside the loop",
@@ -1555,11 +1577,15 @@ bool LoopVectorizationLegality::canFoldTailByMasking() const {
for (const auto &Reduction : getReductionVars())
ReductionLiveOuts.insert(Reduction.second.getLoopExitInstr());
+ SmallPtrSet<const Value *, 8> CSALiveOuts;
+ for (const auto &CSA: getCSAs())
+ CSALiveOuts.insert(CSA.second.getAssignment());
+
// TODO: handle non-reduction outside users when tail is folded by masking.
for (auto *AE : AllowedExit) {
// Check that all users of allowed exit values are inside the loop or
- // are the live-out of a reduction.
- if (ReductionLiveOuts.count(AE))
+ // are the live-out of a reduction or a CSA
+ if (ReductionLiveOuts.count(AE) || CSALiveOuts.count(AE))
continue;
for (User *U : AE->users()) {
Instruction *UI = cast<Instruction>(U);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 56f51e14a6eba9..5e45f500482826 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -180,6 +180,8 @@ const char LLVMLoopVectorizeFollowupEpilogue[] =
STATISTIC(LoopsVectorized, "Number of loops vectorized");
STATISTIC(LoopsAnalyzed, "Number of loops analyzed for vectorization");
STATISTIC(LoopsEpilogueVectorized, "Number of epilogues vectorized");
+STATISTIC(CSAsVectorized,
+ "Number of conditional scalar assignments vectorized");
static cl::opt<bool> EnableEpilogueVectorization(
"enable-epilogue-vectorization", cl::init(true), cl::Hidden,
@@ -500,6 +502,10 @@ class InnerLoopVectorizer {
virtual std::pair<BasicBlock *, Value *>
createVectorizedLoopSkeleton(const SCEV2ValueTy &ExpandedSCEVs);
+ /// For all vectorized CSAs, replace uses of live-out scalar from the orignal
+ /// loop with the extracted scalar from the vector loop for.
+ void fixCSALiveOuts(VPTransformState &State, VPlan &Plan);
+
/// Fix the vectorized code, taking care of header phi's, live-outs, and more.
void fixVectorizedLoop(VPTransformState &State, VPlan &Plan);
@@ -2932,6 +2938,25 @@ LoopVectorizationCostModel::getVectorIntrinsicCost(CallInst *CI,
TargetTransformInfo::TCK_RecipThroughput);
}
+void InnerLoopVectorizer::fixCSALiveOuts(VPTransformState &State, VPlan &Plan) {
+ for (const auto &CSA: Plan.getCSAStates()) {
+ VPCSADataUpdateRecipe *VPDataUpdate = CSA.second->getDataUpdate();
+ assert(VPDataUpdate &&
+ "VPDataUpdate must have been introduced prior to fixing live outs");
+ Value *V = VPDataUpdate->getUnderlyingValue();
+ Value *ExtractedScalar = State.get(CSA.second->getExtractScalarRecipe(), 0,
+ /*NeedsScalar=*/true);
+ // Fix LCSSAPhis
+ llvm::SmallPtrSet<PHINode *, 2> ToFix;
+ for (User *U : V->users())
+ if (auto *Phi = dyn_cast<PHINode>(U);
+ Phi && Phi->getParent() == LoopExitBlock)
+ ToFix.insert(Phi);
+ for (PHINode *Phi : ToFix)
+ Phi->addIncoming(ExtractedScalar, LoopMiddleBlock);
+ }
+}
+
void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
VPlan &Plan) {
// Fix widened non-induction PHIs by setting up the PHI operands.
@@ -2972,6 +2997,8 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
getOrCreateVectorTripCount(VectorLoop->getLoopPreheader()),
IVEndValues[Entry.first], LoopMiddleBlock,
VectorLoop->getHeader(), Plan, State);
+
+ fixCSALiveOuts(State, Plan);
}
// Fix live-out phis not already fixed earlier.
@@ -4110,7 +4137,6 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
- setTailFoldingStyles(MaxFactors.ScalableVF.isScalable(), UserIC);
if (foldTailByMasking()) {
if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
LLVM_DEBUG(
@@ -4482,6 +4508,9 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
case VPDef::VPEVLBasedIVPHISC:
case VPDef::VPPredInstPHISC:
case VPDef::VPBranchOnMaskSC:
+ case VPRecipeBase::VPCSADataUpdateSC:
+ case VPRecipeBase::VPCSAExtractScalarSC:
+ case VPRecipeBase::VPCSAHe...
[truncated]
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
c15cf30
to
3dc4a47
Compare
3dc4a47
to
1e31c22
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a very rough initial review, to get the ball rolling. Can you kindly look at iv-select-cmp.ll (and associated tests) to increase test coverage to include negative tests as well? It would be nice if this patch created changes in iv-select-cmp.ll, although I'll need to review in more detail to find out why.
Another high-level question: is it possible to not nest the test under RISCV? I realized that you've said it's disabled in BasicTTIImpl, but having a separate cas.ll under each target wouldn't be ideal, no? |
1e31c22
to
51e4c67
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@artagnon thanks so much for the initial review!
I've addressed comments and made some responses :)
CSA vectorization is not enabled by default. |
I can do this. It's a good idea. |
Kindly note that the only functionality is squash-and-merge, and this will be the final commit message: "series of commits" doesn't make sense in the final commit message. Also, it would be nice to have a short explanation in the commit message itself (possibly with a few examples). Finally, kindly ensure that markdown doesn't end up when someone does a |
Perhaps a good idea to name the test conditional-scalar-assignment.ll instead of the more cryptic csa.ll. |
I have updated the tests to use |
c59efaf
to
ecd9285
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a closer inspection of the patch, and found mostly small things to fix. Otherwise, I'm a fan of the approach you've taken: the code is quite easy to understand, and there is little bloat. Thanks again for the patch, and I hope the other reviewers also have similar sentiments!
On a minor note, it might be helpful if you could locally enable CSA vectorization, and post the diff of the test changes somewhere, just to make sure we're doing the right thing.
llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment.ll
Outdated
Show resolved
Hide resolved
The docstring of this function: > Returns true if this VPInstruction's operands are single scalars and the > result is also a single scalar. ExplicitVectorLength fits this description. I don't have a test for it now, but it is needed by llvm#106560.
a03e73b
to
c2e26ff
Compare
be3ca24
to
f4228d0
Compare
ping |
|
||
/// A Conditional Scalar Assignment (CSA) is an assignment from an initial | ||
/// scalar that may or may not occur. | ||
class CSADescriptor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to introduce a whole new class + analysis file instead of adding a new RecurKind in IVDescriptors?
@Mel-Chen 's #67812 seems to be quite similar work which just extends the existing code. This CSA patch is more general in that it can handle values that aren't just the induction variable, but I think the amount of extra code can be cut down a little.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to introduce a whole new class + analysis file instead of adding a new RecurKind in IVDescriptors?
I’ve gone ahead and put CSADescriptor inside the IVDescriptor file. But there isn't anything that CSADescriptor looks to reuse from RecurrenceDescriptor or InductionDescriptor, so I think it warrants a new class. CSADescriptor also adds new functions and data members that aren't be relevant for other Recurrences or Inductions. I expect CSADescriptors class to diverge even further, as we extend CSA do support IR with control flow.
public: | ||
/// If Phi is the root of a CSA, return the CSADescriptor of the CSA rooted by | ||
/// Phi. Otherwise, return a CSADescriptor with IsValidCSA set to false. | ||
static CSADescriptor isCSAPhi(PHINode *Phi, Loop *TheLoop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think returning a std::optional instead of having valid/invalid descriptors would work better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the approach that RecurrenceDescriptor and InductionDescriptor used to take. They now return bool here and take a descriptor by reference. I have updated CSADesciptors use the new approach.
; EVL-NEXT: [[TMP9:%.*]] = sext <vscale x 4 x i32> [[VP_OP_LOAD]] to <vscale x 4 x i64> | ||
; EVL-NEXT: [[TMP10:%.*]] = icmp slt <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP9]] | ||
; EVL-NEXT: [[CSA_COND_ANYACTIVE:%.*]] = call i1 @llvm.vp.reduce.or.nxv4i1(i1 false, <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP5]]) | ||
; EVL-NEXT: [[CSA_VL_SEL]] = select i1 [[CSA_COND_ANYACTIVE]], i32 [[TMP5]], i32 [[CSA_VL_PHI]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what the eventual asm looks like for this. I had a look through the RVV spec and couldn't find an instruction which selected between whole registers based on a single condition, just merging moves which do a per-lane select. I may have missed something though. Does this require branches inside the loop?
I ask because I'm trying to evaluate this work from an SVE perspective, and I think I would prefer to do the extract inside the loop using clastb
with a scalar phi instead of outside the loop with a branch on the inside to conditionally update the whole register.
I guess we may end up with diverging recipes for vectorizing the same pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the generated code:
# %bb.0: # %entry
sext.w a3, a0
blez a3, .LBB0_6
# %bb.1: # %loop.preheader
li a3, 0
li a6, 0
slli a0, a0, 32
srli a0, a0, 32
and a5, a5, a7
vsetvli a7, zero, e8, mf2, ta, ma
vmclr.m v8
j .LBB0_3
.LBB0_2: # %vector.body
snez a6, t1
vsetvli t1, zero, e8, mf2, ta, ma
vmv.v.x v14, a6
vmsne.vi v0, v14, 0
vmand.mm v9, v9, v0
vmandn.mm v8, v8, v0
vmor.mm v8, v9, v8
vsetvli zero, zero, e32, m2, ta, ma
vmerge.vvm v10, v10, v12, v0
sub a5, a5, a4
add a3, a7, a3
mv a6, t0
beqz a5, .LBB0_5
.LBB0_3: # %vector.body
sub t0, a0, a3
vsetvli a7, t0, e8, mf2, ta, ma
slli t1, a3, 2
add t1, a1, t1
vle32.v v12, (t1)
vsetvli t1, zero, e64, m4, ta, ma
vsext.vf2 v16, v12
vmsgt.vx v9, v16, a2
vsetvli zero, t0, e8, mf2, ta, ma
vcpop.m t1, v9
mv t0, a7
bnez t1, .LBB0_2
# %bb.4: # %vector.body
mv t0, a6
j .LBB0_2
.LBB0_5: # %middle.block
slli t0, t0, 32
srli a0, t0, 32
vsetvli zero, a0, e32, m2, ta, ma
vid.v v12
vsetivli zero, 1, e32, m1, ta, ma
vmv.v.i v9, -1
vmv1r.v v0, v8
vsetvli zero, a0, e32, m2, ta, ma
vredmax.vs v9, v12, v9, v0.t
vmv.x.s a0, v9
slli a1, a0, 32
srli a1, a1, 32
vsetivli zero, 1, e32, m2, ta, ma
vslidedown.vx v8, v10, a1
vmv.x.s a1, v8
srai a0, a0, 63
or a0, a0, a1
ret
.LBB0_6:
li a0, -1
ret
08cd3e7
to
d18455c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huntergr-arm thanks for the review. I have addressed or responded to your comments.
; EVL-NEXT: [[TMP9:%.*]] = sext <vscale x 4 x i32> [[VP_OP_LOAD]] to <vscale x 4 x i64> | ||
; EVL-NEXT: [[TMP10:%.*]] = icmp slt <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP9]] | ||
; EVL-NEXT: [[CSA_COND_ANYACTIVE:%.*]] = call i1 @llvm.vp.reduce.or.nxv4i1(i1 false, <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP5]]) | ||
; EVL-NEXT: [[CSA_VL_SEL]] = select i1 [[CSA_COND_ANYACTIVE]], i32 [[TMP5]], i32 [[CSA_VL_PHI]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the generated code:
# %bb.0: # %entry
sext.w a3, a0
blez a3, .LBB0_6
# %bb.1: # %loop.preheader
li a3, 0
li a6, 0
slli a0, a0, 32
srli a0, a0, 32
and a5, a5, a7
vsetvli a7, zero, e8, mf2, ta, ma
vmclr.m v8
j .LBB0_3
.LBB0_2: # %vector.body
snez a6, t1
vsetvli t1, zero, e8, mf2, ta, ma
vmv.v.x v14, a6
vmsne.vi v0, v14, 0
vmand.mm v9, v9, v0
vmandn.mm v8, v8, v0
vmor.mm v8, v9, v8
vsetvli zero, zero, e32, m2, ta, ma
vmerge.vvm v10, v10, v12, v0
sub a5, a5, a4
add a3, a7, a3
mv a6, t0
beqz a5, .LBB0_5
.LBB0_3: # %vector.body
sub t0, a0, a3
vsetvli a7, t0, e8, mf2, ta, ma
slli t1, a3, 2
add t1, a1, t1
vle32.v v12, (t1)
vsetvli t1, zero, e64, m4, ta, ma
vsext.vf2 v16, v12
vmsgt.vx v9, v16, a2
vsetvli zero, t0, e8, mf2, ta, ma
vcpop.m t1, v9
mv t0, a7
bnez t1, .LBB0_2
# %bb.4: # %vector.body
mv t0, a6
j .LBB0_2
.LBB0_5: # %middle.block
slli t0, t0, 32
srli a0, t0, 32
vsetvli zero, a0, e32, m2, ta, ma
vid.v v12
vsetivli zero, 1, e32, m1, ta, ma
vmv.v.i v9, -1
vmv1r.v v0, v8
vsetvli zero, a0, e32, m2, ta, ma
vredmax.vs v9, v12, v9, v0.t
vmv.x.s a0, v9
slli a1, a0, 32
srli a1, a1, 32
vsetivli zero, 1, e32, m2, ta, ma
vslidedown.vx v8, v10, a1
vmv.x.s a1, v8
srai a0, a0, 63
or a0, a0, a1
ret
.LBB0_6:
li a0, -1
ret
|
||
/// A Conditional Scalar Assignment (CSA) is an assignment from an initial | ||
/// scalar that may or may not occur. | ||
class CSADescriptor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to introduce a whole new class + analysis file instead of adding a new RecurKind in IVDescriptors?
I’ve gone ahead and put CSADescriptor inside the IVDescriptor file. But there isn't anything that CSADescriptor looks to reuse from RecurrenceDescriptor or InductionDescriptor, so I think it warrants a new class. CSADescriptor also adds new functions and data members that aren't be relevant for other Recurrences or Inductions. I expect CSADescriptors class to diverge even further, as we extend CSA do support IR with control flow.
public: | ||
/// If Phi is the root of a CSA, return the CSADescriptor of the CSA rooted by | ||
/// Phi. Otherwise, return a CSADescriptor with IsValidCSA set to false. | ||
static CSADescriptor isCSAPhi(PHINode *Phi, Loop *TheLoop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the approach that RecurrenceDescriptor and InductionDescriptor used to take. They now return bool here and take a descriptor by reference. I have updated CSADesciptors use the new approach.
assert(VPDataUpdate && | ||
"VPDataUpdate must have been introduced prior to fixing live outs"); | ||
Value *V = VPDataUpdate->getUnderlyingValue(); | ||
Value *ExtractedScalar = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can those fixups be directly done in VPlan, similar to how other live-outs are handled (with the exception of induction users, which is the last remaining case that needs fixing up outside of VPlan)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
@@ -669,6 +673,65 @@ Value *VPInstruction::generate(VPTransformState &State) { | |||
} | |||
return NewPhi; | |||
} | |||
case VPInstruction::CSAMaskPhi: { | |||
IRBuilder<>::InsertPointGuard Guard(State.Builder); | |||
State.Builder.SetInsertPoint(State.CFG.PrevBB->getFirstNonPHI()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the builder need to be reset here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
@@ -3497,6 +3695,8 @@ class VPlan { | |||
/// live-outs are fixed via VPLiveOut::fixPhi. | |||
MapVector<PHINode *, VPLiveOut *> LiveOuts; | |||
|
|||
MapVector<PHINode *, VPCSAState *> CSAStates; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be part of VPlan? Is it not possible to encode all required info in recipes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
@@ -0,0 +1,2932 @@ | |||
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the general code-gen parts be tested independent of RISCV? Could you add a test checking the printed VPlan containing the CAS recipes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no way to make a generic target support predicated vectorization to my knowledge. As a result, I have moved these tests back RISC-V. I think we can consider moving them back out until a generic target can handle it. We do something similar for other code that uses this option. WDYT?
/// VPCSAState holds information required to vectorize a conditional scalar | ||
/// assignment. | ||
class VPCSAState { | ||
VPValue *VPInitScalar = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to avoid holding references to VPValues/Recipes, as they may be removed/replaced, in case the state here gets stale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
#67812 only handles increasing induction variables because Currently, I do not have plans to support
For targets without a corresponding instruction, such as RISC-V, we could implement the same semantics in the backend using multiple instructions (e.g., vcpop + beqz + vcompress) if it is profitable. In VPlan, we might need to create a new recipe, In the end, if there is indeed a need to support this semantic, please let me know, and I can discuss internally whether to include it in the plan. :) |
5afb2a3
to
ea5ab07
Compare
e55606a
to
de0df31
Compare
rebase; ping |
de0df31
to
dac2839
Compare
As discussed in #112738, it may be better to have an intrinsic to represent vector element extracts based on mask bits. This intrinsic is for the case of extracting the last active element, if any, or a default value if the mask is all-false. The target-agnostic SelectionDAG lowering is similar to the IR in #106560.
As discussed in llvm#112738, it may be better to have an intrinsic to represent vector element extracts based on mask bits. This intrinsic is for the case of extracting the last active element, if any, or a default value if the mask is all-false. The target-agnostic SelectionDAG lowering is similar to the IR in llvm#106560.
ping |
This patch adds initial support for CSA vectorization LLVM. This new class
can be characterized by vectorization of assignment to a scalar in a loop,
such that the assignment is conditional from the perspective of its use.
An assignment is conditional in a loop if a value may or may not be assigned
in the loop body.
For example:
Using pseudo-LLVM code this can be vectorized as
On each iteration, we track whether any lane in the widened condition was active,
and if it was take the current mask and data as the new mask and data vector.
Then at the end of the loop, the scalar can be extracted only once.
This transformation works the same way for integer, pointer, and floating point
conditional assignment, since the transformation does not require inspection
of the data being assigned.
In the vectorization of a CSA, we will be introducing recipes into the vector
preheader, the vector body, and the middle block. Recipes that are introduced
into the preheader and middle block are executed only one time, and recipes
that are in the vector body will be possibly executed multiple times. The more
times that the vector body is executed, the less of an impact the preheader
and middle block cost have on the overall cost of a CSA.
A detailed explanation of the concept can be found here.
This patch is further tested in llvm/llvm-test-suite#155.