[Snippets] Added Dynamism support to intermediate memory
[Snippets] Renamed BufferID to BufferRegisterGroup

[Snippets] Changed allocation from shape to size

[Snippets] Added Buffer cluster_ID

[Snippets][Tests] Fixed build of insert_load_store test

[Snippets] Split SolveBufferMemory into static and dynamic logic

[Snippets] Rewrote ComputeBufferAllocationSize::get_allocation_size

[Snippets] Added dynamism support to InitBuffersDefault

[Snippets][Tests] Added tests for clusters

[Snippets] Added buffer_expressions to ComputeBufferAllocationSize

[Snippets] Added `m_is_work_amount_const` to LoopInfo for split loops

[Snippets] Removed copy from UpdateLoopInfo

[Snippets] Moved UpdateLoopInfo to RuntimeConfigurator

[Snippets] Add dynamic buffers support to Configurator

[Snippets] Fixed Reduce decomp: add shape infer for outputs

[Snippets] Fixed broadcast_merge_dim in shape inference

[Snippets][CPU][Tests] Enabled dynamic Softmax tests

[Snippets] Removed useless function calculate_size

[Snippets][CPU][Tests] Enabled dynamic reduce test

[Snippets] Small fixes in solve_buffer_memory for dynamic nodes

[CPU][Snippets] Removed useless emitters LoadConvert and StoreConvert

[Snippets] Added missed consumers cloning

[Snippets][CPU] Added buffer offsets to call_args

[Snippets][CPU] Added dynamic offsets support to load and store emitters

[CPU][UnitTests] Fixed build

[Snippets][AArch64] Fixed build

[Snippets] Small fixes
a-sidorova committed May 28, 2024
1 parent b0b4201 commit 38b570f
Showing 66 changed files with 1,175 additions and 905 deletions.
14 changes: 7 additions & 7 deletions src/common/snippets/docs/snippets_design_guide.md
@@ -605,17 +605,17 @@ Again, the explicit operations are needed to emit appropriate instructions later.
As mentioned above, the `op::Buffer` operations are managed by the pass `AllocateBuffers`.
Before describing the algorithm, it is necessary to briefly consider the structure of `Buffer`:
* All `Buffers` together represent the `Buffer scratchpad` (a common memory area used to store intermediate results).
* Each `Buffer` has an `offset` relative to the common data pointer (pointer of `Buffer scratchpad`) and `ID` (the `Buffers` with the same `ID` have the same assigned register).
* Each `Buffer` has an `offset` relative to the common data pointer (the pointer of the `Buffer scratchpad`), a `RegGroup` (`Buffers` with the same `RegGroup` are assigned the same register) and a `ClusterID` (buffers from the same cluster refer to the same memory area - they have the same `offset` relative to the `Buffer scratchpad` data pointer).

The algorithm supports two modes: optimized and non-optimized.
The optimized one calculates minimal memory size and minimal unique `ID` count required to handle all the buffers.
The non-optimized version assigns each buffer an unique `ID` and `offset`.
The optimized one calculates minimal memory size and minimal unique `RegGroup` count required to handle all the buffers.
The non-optimized version assigns each buffer a unique `RegGroup`, `ClusterID` and `offset`.
The first mode is the default one, while the second one might be used for debugging the optimized version.
The optimized algorithm `AllocateBuffers` has the following main steps:
1. `IdentifyBuffers` - analyzes `Buffers` access patterns to avoid redundant pointer increments. A graph coloring algorithm is utilized for this purpose.
2. `DefineBufferClusters` - creates sets of `Buffer` ops - `BufferClusters`.
`Buffers` from one `BufferCluster` refer to the same memory area (they have the same `offset` relative to the `Buffer scratchpad` data pointer).
For example, there is a loop with `Buffer` ops on input and output. If the body of this loop can write data to the memory from which it was read, these `Buffers` are in one `BufferCluster`.
1. `SetBufferRegGroup` - analyzes `Buffers` access patterns to avoid redundant pointer increments. A graph coloring algorithm is utilized for this purpose.
2. `DefineBufferClusters` - creates sets of `Buffer` ops (buffer clusters) and sets the `ClusterID` value for the `Buffer` ops.
As noted above, `Buffers` from one cluster refer to the same memory area.
For example, consider a loop with `Buffer` ops on its input and output. If the body of this loop can write data to the memory from which it was read, these `Buffers` are in one cluster.
3. `SolveBufferMemory` - calculates the optimal memory size of the `Buffer scratchpad` based on `BufferClusters` and the lifetime of `Buffers`.

More details on control flow optimization passes can be found in the `control_flow_transformations(...)` method inside [subgraph.cpp](../src/op/subgraph.cpp).
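To make the relationship between the three attributes concrete, here is a minimal illustrative sketch; `BufferMeta` is a hypothetical struct introduced only for this example, not part of the snippets API.

```cpp
#include <cstddef>

// Hypothetical helper for illustration only - the real attributes live on the Buffer expressions.
struct BufferMeta {
    size_t offset;      // byte offset inside the Buffer scratchpad
    size_t reg_group;   // Buffers with the same reg_group are assigned the same data-pointer register
    size_t cluster_id;  // Buffers with the same cluster_id share the same offset (memory reused in-place)
};

int main() {
    // A loop whose body may write to the memory it reads from:
    // its input and output Buffers land in one cluster, hence identical offsets.
    BufferMeta loop_input{/*offset*/ 0, /*reg_group*/ 0, /*cluster_id*/ 0};
    BufferMeta loop_output{/*offset*/ 0, /*reg_group*/ 0, /*cluster_id*/ 0};
    return loop_input.offset == loop_output.offset ? 0 : 1;
}
```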
9 changes: 6 additions & 3 deletions src/common/snippets/include/snippets/lowered/linear_ir.hpp
@@ -76,13 +76,14 @@ class LinearIR {
ExpressionPtr create_expression(const std::shared_ptr<Node>& n, const std::vector<PortConnectorPtr>& inputs) const;

const container& get_ops() const { return m_expressions; }
const container& get_buffer_ops() const { return m_buffer_expressions; }
const container& get_parameters() const { return m_parameter_expressions; }
const container& get_results() const { return m_result_expressions; }
const Config& get_config() const { return m_config; }
size_t get_buffer_scratchpad_size() const { return m_buffer_scratchpad_size; }
size_t get_static_buffer_scratchpad_size() const { return m_static_buffer_scratchpad_size; }

void set_loop_depth(size_t loop_depth) { m_config.m_loop_depth = loop_depth; }
void set_buffer_scratchpad_size(size_t size) { m_buffer_scratchpad_size = size; }
void set_static_buffer_scratchpad_size(size_t size) { m_static_buffer_scratchpad_size = size; }

const ExpressionPtr& get_expr_by_node(const std::shared_ptr<Node>& n) const;

@@ -278,13 +279,15 @@
std::unordered_map<std::shared_ptr<Node>, std::shared_ptr<Expression>> m_node2expression_map;
container m_parameter_expressions{};
container m_result_expressions{};
container m_buffer_expressions{};
Config m_config{};
LoopManagerPtr m_loop_manager;
std::shared_ptr<IShapeInferSnippetsFactory> m_shape_infer_factory;
std::shared_ptr<ShapeInferSnippetsNode> m_shape_infer = nullptr;
bool m_is_dynamic = false;

size_t m_buffer_scratchpad_size = 0;
// Size of static Buffer Scratchpad (Buffers with defined allocation size)
size_t m_static_buffer_scratchpad_size = 0;
};
using LinearIRPtr = std::shared_ptr<LinearIR>;
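A hedged illustration of how the new static/dynamic split is intended to be used; the helper below and its `dynamic_part` argument are assumptions, not code from this commit: the statically known part of the scratchpad is fixed at compile time, while dynamically sized Buffers contribute an amount computed at runtime.

```cpp
#include <cstddef>

#include "snippets/lowered/linear_ir.hpp"

// Assumed usage sketch: `dynamic_part` would come from the runtime configuration step
// for Buffers whose allocation size is unknown at compile time.
size_t total_scratchpad_size(const ov::snippets::lowered::LinearIR& linear_ir, size_t dynamic_part) {
    return linear_ir.get_static_buffer_scratchpad_size() + dynamic_part;
}
```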

@@ -17,11 +17,24 @@ namespace lowered {
class LinearIRBuilder {
public:
struct Config {
Config(bool deep_copy_of_shapes_ = true) : deep_copy_of_shapes(deep_copy_of_shapes_) {}
Config(bool deep_copy_of_shapes_ = true, bool copy_missed_consumers_ = true)
: deep_copy_of_shapes(deep_copy_of_shapes_), copy_missed_consumers(copy_missed_consumers_) {}

// If True, make a deep copy of the shape stored in `PortDescriptor::m_tensor_shape`.
// If False, copy shapes as shared pointers.
const bool deep_copy_of_shapes = true;
// At the moment, an input port of an expression must have only one source.
// However, after a LinearIR range is inserted into the LinearIR (e.g. by the InsertSpecificIteration pass),
// several operations can feed the same consumer: several `Store` ops from different loop bodies store to the same Buffer/Result.
// Since the `clone` algorithm is linear and, while cloning an expression, creates input port connectors only from its sources,
// the algorithm can miss some consumers. For example:
// The consumers of Store0 : Buffer0
// The consumers of Store1 : Buffer0
// The result: Buffer0 has only one source in its input connector - Store1,
// so the algorithm does not automatically add Buffer0 to the consumers of Store0. Thus,
// If True, the `clone` algorithm adds the missed consumers.
// If False, the cloned LinearIR is built by default (without the extra consumers).
const bool copy_missed_consumers = true;
};

LinearIRBuilder(Config config = {}) : m_config(std::move(config)) {}
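For illustration, a hedged sketch of constructing the builder with the new flag disabled; the header path and the exact clone entry point are assumptions, not taken from this diff.

```cpp
#include "snippets/lowered/linear_ir_builder.hpp"  // assumed header name

using ov::snippets::lowered::LinearIRBuilder;

// Sketch only: reproduce the old cloning behaviour by skipping the missed-consumer fix-up.
LinearIRBuilder make_legacy_builder() {
    LinearIRBuilder::Config cfg(/*deep_copy_of_shapes_=*/true, /*copy_missed_consumers_=*/false);
    return LinearIRBuilder(cfg);  // a subsequent clone would then leave Store0 without Buffer0 as a consumer
}
```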
26 changes: 20 additions & 6 deletions src/common/snippets/include/snippets/lowered/loop_info.hpp
@@ -23,8 +23,9 @@ class LoopInfo {
enum {UNDEFINED_DIM_IDX = std::numeric_limits<size_t>::max()};

LoopInfo() = default;
LoopInfo(size_t work_amount, size_t increment, const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits);
LoopInfo(size_t work_amount, size_t increment, const std::vector<ExpressionPort>& entries, const std::vector<ExpressionPort>& exits);
LoopInfo(size_t work_amount, size_t increment, const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits, bool is_wa_const = false);
LoopInfo(size_t work_amount, size_t increment, const std::vector<ExpressionPort>& entries, const std::vector<ExpressionPort>& exits,
bool is_wa_const = false);
virtual ~LoopInfo() = default;

/**
@@ -76,6 +77,11 @@
* @return m_output_ports
*/
const std::vector<LoopPort>& get_output_ports() const;
/**
* @brief Returns True if `work_amount` cannot be rewritten/updated by passes.
* @return m_is_work_amount_const
*/
bool is_work_amount_const() const;

/**
* @brief Set m_work_amount value
@@ -92,6 +98,11 @@
* @param dim_idx - index
*/
void set_dim_idx(size_t dim_idx);
/**
* @brief Sets `value` to `m_is_work_amount_const`
* @param value - value of the attribute
*/
void set_work_amount_const(bool value);

/**
* @brief Replace the current LoopPort `actual_port` with new `target_ports`
@@ -164,6 +175,9 @@
// Note: Scalars aren't input expressions but can be before first input expr in Linear IR
std::vector<LoopPort> m_input_ports = {};
std::vector<LoopPort> m_output_ports = {};

// If True, no pass is allowed to rewrite the value of `m_work_amount`
bool m_is_work_amount_const = false;
};
using LoopInfoPtr = std::shared_ptr<LoopInfo>;
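A hedged usage sketch of the new flag; the loop manager and loop id are assumed to come from earlier passes.

```cpp
#include "snippets/lowered/loop_manager.hpp"

using namespace ov::snippets::lowered;

// Sketch: freeze the work amount of an already-marked loop so that later passes
// (e.g. ones updating split loops) do not rewrite it.
void freeze_work_amount(const LoopManagerPtr& loop_manager, size_t loop_id) {
    const auto loop_info = loop_manager->get_loop_info<UnifiedLoopInfo>(loop_id);
    loop_info->set_work_amount_const(true);
    // loop_info->is_work_amount_const() now returns true
}
```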

@@ -197,13 +211,13 @@ class UnifiedLoopInfo : public LoopInfo {
UnifiedLoopInfo(size_t work_amount, size_t increment,
const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits,
const std::vector<LoopPortDesc>& in_descs, const std::vector<LoopPortDesc>& out_descs,
const SpecificIterationHandlers& handlers = SpecificIterationHandlers());
const SpecificIterationHandlers& handlers = SpecificIterationHandlers(), bool is_wa_const = false);
UnifiedLoopInfo(size_t work_amount, size_t increment,
const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits,
const SpecificIterationHandlers& handlers = SpecificIterationHandlers());
const SpecificIterationHandlers& handlers = SpecificIterationHandlers(), bool is_wa_const = false);
UnifiedLoopInfo(size_t work_amount, size_t increment,
const std::vector<ExpressionPort>& entries, const std::vector<ExpressionPort>& exits,
const SpecificIterationHandlers& handlers = SpecificIterationHandlers());
const SpecificIterationHandlers& handlers = SpecificIterationHandlers(), bool is_wa_const = false);

/**
* @brief Clone LoopInfo with new expressions
@@ -365,7 +379,7 @@ class ExpandedLoopInfo : public LoopInfo {
ExpandedLoopInfo(size_t work_amount, size_t increment,
const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits,
std::vector<int64_t> ptr_increments, std::vector<int64_t> final_offsets, std::vector<int64_t> data_sizes,
SpecificLoopIterType type, std::shared_ptr<UnifiedLoopInfo> unified_loop_info);
SpecificLoopIterType type, std::shared_ptr<UnifiedLoopInfo> unified_loop_info, bool is_wa_const = false);
/**
* @brief Clone LoopInfo with new expressions
* @param expr_map map of new and old expressions
10 changes: 6 additions & 4 deletions src/common/snippets/include/snippets/lowered/loop_manager.hpp
@@ -99,12 +99,13 @@ class LoopManager {
size_t increment,
const std::vector<T>& entries,
const std::vector<T>& exits,
bool set_default_handlers = true) {
bool set_default_handlers = true,
bool is_work_amount_const = false) {
const auto normalized_increment = utils::is_dynamic_value(work_amount) || work_amount == 0 ? increment : std::min(increment, work_amount);
const auto handlers = set_default_handlers
? SpecificIterationHandlers(work_amount, normalized_increment)
: SpecificIterationHandlers();
const auto loop_info = std::make_shared<UnifiedLoopInfo>(work_amount, normalized_increment, entries, exits, handlers);
const auto loop_info = std::make_shared<UnifiedLoopInfo>(work_amount, normalized_increment, entries, exits, handlers, is_work_amount_const);
const auto loop_id = this->add_loop_info(loop_info);
for (auto expr_it = loop_begin_pos; expr_it != loop_end_pos; ++expr_it) {
insert_loop_id(*expr_it, loop_id);
@@ -131,8 +132,9 @@
size_t dim_idx,
const std::vector<T>& entries,
const std::vector<T>& exits,
bool set_default_handlers = true) {
const auto loop_id = mark_loop(loop_begin_pos, loop_end_pos, work_amount, increment, entries, exits, set_default_handlers);
bool set_default_handlers = true,
bool is_work_amount_const = false) {
const auto loop_id = mark_loop(loop_begin_pos, loop_end_pos, work_amount, increment, entries, exits, set_default_handlers, is_work_amount_const);
const auto loop_info = get_loop_info<UnifiedLoopInfo>(loop_id);
loop_info->set_dim_idx(dim_idx);
return loop_id;
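For reference, a hedged sketch of calling the extended `mark_loop` interface; the iterators and ports are caller-supplied placeholders, and only the trailing flag is new in this commit.

```cpp
#include <vector>

#include "snippets/lowered/loop_manager.hpp"

using namespace ov::snippets::lowered;

// Sketch only: marks a loop whose work amount must stay untouched by later passes.
size_t mark_const_loop(const LoopManagerPtr& loop_manager,
                       LinearIR::constExprIt begin, LinearIR::constExprIt end,
                       size_t work_amount, size_t increment, size_t dim_idx,
                       const std::vector<ExpressionPort>& entries,
                       const std::vector<ExpressionPort>& exits) {
    return loop_manager->mark_loop(begin, end, work_amount, increment, dim_idx,
                                   entries, exits,
                                   /*set_default_handlers=*/true,
                                   /*is_work_amount_const=*/true);
}
```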
@@ -42,8 +42,6 @@ class AllocateBuffers: public RangedPass {
*/
static void set_buffer_offset(const ExpressionPtr& buffer_expr, const size_t offset);

using BufferCluster = std::set<ExpressionPtr>;
using BufferClusters = std::vector<BufferCluster>;
private:
bool m_is_optimized_mode = true;
};
@@ -0,0 +1,38 @@
// Copyright (C) 2018-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include "pass.hpp"

#include "snippets/lowered/loop_manager.hpp"

namespace ov {
namespace snippets {
namespace lowered {
namespace pass {

/**
* @interface ComputeBufferAllocationSize
* @brief The pass calculates the allocation sizes of Buffers.
* @param m_buffer_allocation_rank - rank of shape for memory allocation: shape[m_allocation_rank : -1]
* @ingroup snippets
*/
class ComputeBufferAllocationSize : public RangedPass {
public:
OPENVINO_RTTI("ComputeBufferAllocationSize", "RangedPass")
ComputeBufferAllocationSize(size_t buffer_allocation_rank) : m_buffer_allocation_rank(buffer_allocation_rank) {}

bool run(LinearIR& linear_ir, lowered::LinearIR::constExprIt begin, lowered::LinearIR::constExprIt end) override;

static size_t get_allocation_size(const LoopManagerPtr& loop_manager, const ExpressionPtr& buffer_expr, size_t allocation_rank);

private:
size_t m_buffer_allocation_rank = 0;
};

} // namespace pass
} // namespace lowered
} // namespace snippets
} // namespace ov
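A hedged sketch of wiring the new pass into a lowered pass pipeline; the header location of `PassPipeline` and the exact place in `control_flow_transformations(...)` are assumptions.

```cpp
#include "snippets/lowered/pass/pass.hpp"  // assumed location of PassPipeline; it may live in pass_pipeline.hpp
// plus the header of ComputeBufferAllocationSize shown above (its path is not visible in this diff)

// Sketch only: register the pass with the buffer allocation rank and run it over a LinearIR.
void compute_allocation_sizes(ov::snippets::lowered::LinearIR& linear_ir, size_t buffer_allocation_rank) {
    ov::snippets::lowered::pass::PassPipeline pipeline;
    pipeline.register_pass<ov::snippets::lowered::pass::ComputeBufferAllocationSize>(buffer_allocation_rank);
    pipeline.run(linear_ir);
}
```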
@@ -6,8 +6,6 @@

#include "pass.hpp"

#include "allocate_buffers.hpp"

namespace ov {
namespace snippets {
namespace lowered {
@@ -35,7 +33,7 @@ class DefineBufferClusters : public RangedPass {
public:
OPENVINO_RTTI("DefineBufferClusters", "RangedPass")

DefineBufferClusters(AllocateBuffers::BufferClusters& clusters) : m_clusters(clusters) {}
DefineBufferClusters() = default;

/**
* @brief Apply the pass to the Linear IR
@@ -45,13 +43,15 @@
bool run(lowered::LinearIR& linear_ir, lowered::LinearIR::constExprIt begin, lowered::LinearIR::constExprIt end) override;

private:
using BufferCluster = std::set<ExpressionPtr>;
using BufferClusters = std::vector<BufferCluster>;
using BufferPorts = std::unordered_map<ExpressionPtr, std::set<size_t>>;
/**
* @brief Finds Buffer cluster in set of clusters which contains the target expression with Buffer
* @param target target expression with Buffer op
* @return vector iterator which refers to the found cluster
*/
AllocateBuffers::BufferClusters::iterator find_cluster_by_expr(const ExpressionPtr& target);
BufferClusters::iterator find_cluster_by_expr(const ExpressionPtr& target);
/**
* @brief Returns True if the Buffer is a direct source for the target expr (there are no other loops between the Buffer and the target expr)
* @param buffer_expr expression with assumed Buffer op
@@ -70,7 +70,7 @@
* @param cluster set of Buffer expressions - cluster
* @return common buffer ID or SIZE_MAX - size value
*/
size_t get_cluster_buffer_id(const AllocateBuffers::BufferCluster& cluster) const;
size_t get_cluster_buffer_id(const BufferCluster& cluster) const;

/**
* @brief Analyzes Loop: if Loop has Buffer ops on inputs and outputs, Loop can read and write from/to the same memory.
@@ -126,10 +126,10 @@
* @param is_outer_up true if outer buffer is upper in Linear IR than inner Buffers
* @return Return True if clusters have been united
*/
bool unite_nested_clusters(const AllocateBuffers::BufferClusters::iterator& inner_cluster_it, AllocateBuffers::BufferCluster& outer_cluster,
bool unite_nested_clusters(const BufferClusters::iterator& inner_cluster_it, BufferCluster& outer_cluster,
const ExpressionPtr& outer_buffer, bool is_outer_up);

AllocateBuffers::BufferClusters& m_clusters;
BufferClusters m_clusters;
};

} // namespace pass
@@ -13,7 +13,7 @@ namespace pass {

/**
* @interface InitBuffersDefault
* @brief The pass inits Buffer expressions in LinearIR default (non-optimized): sets unique offsets and ID to Buffers.
* @brief The pass initializes Buffer expressions in LinearIR in the default (non-optimized) way: it sets unique offsets and register groups for Buffers.
* @ingroup snippets
*/

@@ -24,7 +24,7 @@ namespace pass {
class InsertBuffers : public RangedPass {
public:
OPENVINO_RTTI("InsertBuffers", "RangedPass")
InsertBuffers(int32_t buffer_allocation_rank);
InsertBuffers() = default;
bool run(LinearIR& linear_ir, lowered::LinearIR::constExprIt begin, lowered::LinearIR::constExprIt end) override;

private:
@@ -39,8 +39,6 @@ class InsertBuffers : public RangedPass {
const LoopManagerPtr& loop_manager,
const ExpressionPtr& expr,
const ExpressionPtr& down_expr);

int32_t m_buffer_allocation_rank;
};

} // namespace pass
@@ -12,20 +12,20 @@ namespace lowered {
namespace pass {

/**
* @interface NormalizeBufferIDs
* @brief After optimizations some Buffer IDs might be set unevenly: some numbers are missed.
* @interface NormalizeBufferRegisterGroups
* @brief After optimizations some Buffer RegGroups might be set unevenly: some numbers are missing.
* For example,
* [Buffer -> ID]
* Buffer0 -> 0 Two Buffers have ID = 0, one has ID = 2.
* Buffer1 -> 2 Obviosly, we can normalize this IDs to set ID = 1 to Buffer1.
* [Buffer -> RegGroup]
* Buffer0 -> 0 Two Buffers have RegGroup = 0, one has RegGroup = 2.
* Buffer1 -> 2 Obviously, we can normalize these groups and set RegGroup = 1 for Buffer1.
* Buffer2 -> 0 This helps to assign GPR registers in `AssignRegister` more effectively.
* Thus, the pass normalizes the register groups of Buffers in the Linear IR.
* @ingroup snippets
*/

class NormalizeBufferIDs : public RangedPass {
class NormalizeBufferRegisterGroups : public RangedPass {
public:
OPENVINO_RTTI("NormalizeBufferIDs", "RangedPass")
OPENVINO_RTTI("NormalizeBufferRegisterGroups", "RangedPass")
/**
* @brief Apply the pass to the Linear IR
* @param linear_ir the target Linear IR
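The renumbering idea from the comment above can be illustrated with a small standalone sketch; this is not the pass implementation, only the dense remapping of used register groups.

```cpp
#include <cstddef>
#include <map>

int main() {
    // Register groups of Buffer0, Buffer1, Buffer2 from the example above: 0, 2, 0.
    const size_t old_groups[] = {0, 2, 0};
    std::map<size_t, size_t> dense;  // old RegGroup -> normalized RegGroup
    for (const auto g : old_groups)
        dense.emplace(g, dense.size());  // yields {0 -> 0, 2 -> 1}
    return dense.at(2) == 1 ? 0 : 1;
}
```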