[Snippets] Added Dynamism support to intermediate memory
[Snippets] Renamed BufferID to BufferRegisterGroup

[Snippets] Changed allocation from shape to size

[Snippets] Added Buffer cluster_ID

[Snippets][Tests] Fixed build of insert_load_store test

[Snippets] Split SolveBufferMemory into static and dynamic logic

[Snippets] Rewrote ComputeBufferAllocationSize::get_allocation_size

[Snippets] Added dynamism support to InitBuffersDefault

[Snippets][Tests] Added tests for clusters

[Snippets] Added buffer_expressions to ComputeBufferAllocationSize

[Snippets] Added `m_is_work_amount_const` to LoopInfo for split loops

[Snippets] Removed copy from UpdateLoopInfo

[Snippets] Moved UpdateLoopInfo to RuntimeConfigurator

[Snippets] Add dynamic buffers support to Configurator

[Snippets] Fixed Reduce decomp: add shape infer for outputs

[Snippets] Fixed broadcast_merge_dim in shape inference

[Snippets][CPU][Tests] Enabled dynamic Softmax tests

[Snippets] Removed useless function calculate_size

[Snippets][CPU][Tests] Enabled dynamic reduce test

[Snippets] Small fixes in solve_buffer_memory for dynamic nodes

[CPU][Snippets] Removed useless emitters LoadConvert and StoreConvert

[Snippets] Added missed consumers cloning

[Snippets][CPU] Added buffer offsets to call_args

[Snippets][CPU] Added dynamic offsets support to load and store emitters

[CPU][UnitTests] Fixed build

[Snippets][AArch64] Fixed build

[Snippets] Small fixes
a-sidorova committed May 28, 2024
1 parent b0b4201 commit 38b570f
Showing 66 changed files with 1,175 additions and 905 deletions.
14 changes: 7 additions & 7 deletions src/common/snippets/docs/snippets_design_guide.md
@@ -605,17 +605,17 @@ Again, the explicit operations are needed to emit appropriate instructions later.
As mentioned above, the `op::Buffer` operations are managed by the pass `AllocateBuffers`.
Before describing the algorithm, it is necessary to briefly consider the structure of `Buffer`:
* All `Buffers` together represent the `Buffer scratchpad` (a common memory area used to store intermediate results).
* Each `Buffer` has an `offset` relative to the common data pointer (pointer of `Buffer scratchpad`) and `ID` (the `Buffers` with the same `ID` have the same assigned register).
* Each `Buffer` has an `offset` relative to the common data pointer (the pointer of the `Buffer scratchpad`), a `RegGroup` (`Buffers` with the same `RegGroup` are assigned the same register) and a `ClusterID` (buffers from the same cluster refer to the same memory area - they have the same `offset` relative to the `Buffer scratchpad` data pointer).

The algorithm supports two modes: optimized and non-optimized.
The optimized one calculates minimal memory size and minimal unique `ID` count required to handle all the buffers.
The non-optimized version assigns each buffer an unique `ID` and `offset`.
The optimized one calculates minimal memory size and minimal unique `RegGroup` count required to handle all the buffers.
The non-optimized version assigns each buffer a unique `RegGroup`, `ClusterID` and `offset`.
The first mode is the default one, while the second one might be used for debugging the optimized version.
The optimized algorithm `AllocateBuffers` has the following main steps:
1. `IdentifyBuffers` - analyzes `Buffers` access patterns to avoid redundant pointer increments. A graph coloring algorithm is utilized for this purpose.
2. `DefineBufferClusters` - creates sets of `Buffer` ops - `BufferClusters`.
`Buffers` from one `BufferCluster` refer to the same memory area (they have the same `offset` relative to the `Buffer scratchpad` data pointer).
For example, there is a loop with `Buffer` ops on input and output. If the body of this loop can write data to the memory from which it was read, these `Buffers` are in one `BufferCluster`.
1. `SetBufferRegGroup` - analyzes `Buffers` access patterns to avoid redundant pointer increments. A graph coloring algorithm is utilized for this purpose.
2. `DefineBufferClusters` - creates sets of `Buffer` ops (buffer clusters) and sets the `ClusterID` value for the `Buffer` ops.
As noted above, `Buffers` from one cluster refer to the same memory area.
For example, consider a loop with `Buffer` ops on its input and output. If the body of this loop can write data to the memory from which it was read, these `Buffers` are in one cluster.
3. `SolveBufferMemory` - calculates the optimal memory size of the `Buffer scratchpad` based on `BufferClusters` and the lifetime of `Buffers`.

More details on control flow optimization passes can be found in the `control_flow_transformations(...)` method inside [subgraph.cpp](../src/op/subgraph.cpp).
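To make the relationship between the three attributes concrete, here is a minimal illustrative sketch; `BufferMeta` is a hypothetical struct introduced only for this example, not part of the snippets API.

```cpp
#include <cstddef>

// Hypothetical helper for illustration only - the real attributes live on the Buffer expressions.
struct BufferMeta {
    size_t offset;      // byte offset inside the Buffer scratchpad
    size_t reg_group;   // Buffers with the same reg_group are assigned the same data-pointer register
    size_t cluster_id;  // Buffers with the same cluster_id share the same offset (memory reused in-place)
};

int main() {
    // A loop whose body may write to the memory it reads from:
    // its input and output Buffers land in one cluster, hence identical offsets.
    BufferMeta loop_input{/*offset*/ 0, /*reg_group*/ 0, /*cluster_id*/ 0};
    BufferMeta loop_output{/*offset*/ 0, /*reg_group*/ 0, /*cluster_id*/ 0};
    return loop_input.offset == loop_output.offset ? 0 : 1;
}
```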
9 changes: 6 additions & 3 deletions src/common/snippets/include/snippets/lowered/linear_ir.hpp
@@ -76,13 +76,14 @@ class LinearIR {
ExpressionPtr create_expression(const std::shared_ptr<Node>& n, const std::vector<PortConnectorPtr>& inputs) const;

const container& get_ops() const { return m_expressions; }
const container& get_buffer_ops() const { return m_buffer_expressions; }
const container& get_parameters() const { return m_parameter_expressions; }
const container& get_results() const { return m_result_expressions; }
const Config& get_config() const { return m_config; }
size_t get_buffer_scratchpad_size() const { return m_buffer_scratchpad_size; }
size_t get_static_buffer_scratchpad_size() const { return m_static_buffer_scratchpad_size; }

void set_loop_depth(size_t loop_depth) { m_config.m_loop_depth = loop_depth; }
void set_buffer_scratchpad_size(size_t size) { m_buffer_scratchpad_size = size; }
void set_static_buffer_scratchpad_size(size_t size) { m_static_buffer_scratchpad_size = size; }

const ExpressionPtr& get_expr_by_node(const std::shared_ptr<Node>& n) const;

@@ -278,13 +279,15 @@
std::unordered_map<std::shared_ptr<Node>, std::shared_ptr<Expression>> m_node2expression_map;
container m_parameter_expressions{};
container m_result_expressions{};
container m_buffer_expressions{};
Config m_config{};
LoopManagerPtr m_loop_manager;
std::shared_ptr<IShapeInferSnippetsFactory> m_shape_infer_factory;
std::shared_ptr<ShapeInferSnippetsNode> m_shape_infer = nullptr;
bool m_is_dynamic = false;

size_t m_buffer_scratchpad_size = 0;
// Size of static Buffer Scratchpad (Buffers with defined allocation size)
size_t m_static_buffer_scratchpad_size = 0;
};
using LinearIRPtr = std::shared_ptr<LinearIR>;
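A hedged illustration of how the new static/dynamic split is intended to be used; the helper below and its `dynamic_part` argument are assumptions, not code from this commit: the statically known part of the scratchpad is fixed at compile time, while dynamically sized Buffers contribute an amount computed at runtime.

```cpp
#include <cstddef>

#include "snippets/lowered/linear_ir.hpp"

// Assumed usage sketch: `dynamic_part` would come from the runtime configuration step
// for Buffers whose allocation size is unknown at compile time.
size_t total_scratchpad_size(const ov::snippets::lowered::LinearIR& linear_ir, size_t dynamic_part) {
    return linear_ir.get_static_buffer_scratchpad_size() + dynamic_part;
}
```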

@@ -17,11 +17,24 @@ namespace lowered {
class LinearIRBuilder {
public:
struct Config {
Config(bool deep_copy_of_shapes_ = true) : deep_copy_of_shapes(deep_copy_of_shapes_) {}
Config(bool deep_copy_of_shapes_ = true, bool copy_missed_consumers_ = true)
: deep_copy_of_shapes(deep_copy_of_shapes_), copy_missed_consumers(copy_missed_consumers_) {}

// If True, make a deep copy of the shape stored in `PortDescriptor::m_tensor_shape`.
// If False, copy shapes as shared pointers.
const bool deep_copy_of_shapes = true;
// At the moment, an input port of an expression must have only one source.
// However, after a LinearIR range is inserted into the LinearIR (e.g. by the InsertSpecificIteration pass),
// several operations can feed the same consumer: several `Store` ops from different loop bodies store to the same Buffer/Result.
// Since the `clone` algorithm is linear and, while cloning an expression, creates input port connectors only from its sources,
// the algorithm can miss some consumers. For example:
// The consumers of Store0 : Buffer0
// The consumers of Store1 : Buffer0
// The result: Buffer0 has only one source in its input connector - Store1,
// so the algorithm does not automatically add Buffer0 to the consumers of Store0. Thus,
// If True, the `clone` algorithm adds the missed consumers.
// If False, the cloned LinearIR is built by default (without the extra consumers).
const bool copy_missed_consumers = true;
};

LinearIRBuilder(Config config = {}) : m_config(std::move(config)) {}
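For illustration, a hedged sketch of constructing the builder with the new flag disabled; the header path and the exact clone entry point are assumptions, not taken from this diff.

```cpp
#include "snippets/lowered/linear_ir_builder.hpp"  // assumed header name

using ov::snippets::lowered::LinearIRBuilder;

// Sketch only: reproduce the old cloning behaviour by skipping the missed-consumer fix-up.
LinearIRBuilder make_legacy_builder() {
    LinearIRBuilder::Config cfg(/*deep_copy_of_shapes_=*/true, /*copy_missed_consumers_=*/false);
    return LinearIRBuilder(cfg);  // a subsequent clone would then leave Store0 without Buffer0 as a consumer
}
```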
26 changes: 20 additions & 6 deletions src/common/snippets/include/snippets/lowered/loop_info.hpp
@@ -23,8 +23,9 @@ class LoopInfo {
enum {UNDEFINED_DIM_IDX = std::numeric_limits<size_t>::max()};

LoopInfo() = default;
LoopInfo(size_t work_amount, size_t increment, const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits);
LoopInfo(size_t work_amount, size_t increment, const std::vector<ExpressionPort>& entries, const std::vector<ExpressionPort>& exits);
LoopInfo(size_t work_amount, size_t increment, const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits, bool is_wa_const = false);
LoopInfo(size_t work_amount, size_t increment, const std::vector<ExpressionPort>& entries, const std::vector<ExpressionPort>& exits,
bool is_wa_const = false);
virtual ~LoopInfo() = default;

/**
@@ -76,6 +77,11 @@
* @return m_output_ports
*/
const std::vector<LoopPort>& get_output_ports() const;
/**
* @brief Returns True if `work_amount` cannot be rewritten/updated by passes.
* @return m_is_work_amount_const
*/
bool is_work_amount_const() const;

/**
* @brief Set m_work_amount value
@@ -92,6 +98,11 @@
* @param dim_idx - index
*/
void set_dim_idx(size_t dim_idx);
/**
* @brief Sets `value` to `m_is_work_amount_const`
* @param value - value of the attribute
*/
void set_work_amount_const(bool value);

/**
* @brief Replace the current LoopPort `actual_port` with new `target_ports`
@@ -164,6 +175,9 @@
// Note: Scalars aren't input expressions but can be before first input expr in Linear IR
std::vector<LoopPort> m_input_ports = {};
std::vector<LoopPort> m_output_ports = {};

// If True, no pass is allowed to rewrite the value of `m_work_amount`
bool m_is_work_amount_const = false;
};
using LoopInfoPtr = std::shared_ptr<LoopInfo>;
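A hedged usage sketch of the new flag; the loop manager and loop id are assumed to come from earlier passes.

```cpp
#include "snippets/lowered/loop_manager.hpp"

using namespace ov::snippets::lowered;

// Sketch: freeze the work amount of an already-marked loop so that later passes
// (e.g. ones updating split loops) do not rewrite it.
void freeze_work_amount(const LoopManagerPtr& loop_manager, size_t loop_id) {
    const auto loop_info = loop_manager->get_loop_info<UnifiedLoopInfo>(loop_id);
    loop_info->set_work_amount_const(true);
    // loop_info->is_work_amount_const() now returns true
}
```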

@@ -197,13 +211,13 @@ class UnifiedLoopInfo : public LoopInfo {
UnifiedLoopInfo(size_t work_amount, size_t increment,
const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits,
const std::vector<LoopPortDesc>& in_descs, const std::vector<LoopPortDesc>& out_descs,
const SpecificIterationHandlers& handlers = SpecificIterationHandlers());
const SpecificIterationHandlers& handlers = SpecificIterationHandlers(), bool is_wa_const = false);
UnifiedLoopInfo(size_t work_amount, size_t increment,
const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits,
const SpecificIterationHandlers& handlers = SpecificIterationHandlers());
const SpecificIterationHandlers& handlers = SpecificIterationHandlers(), bool is_wa_const = false);
UnifiedLoopInfo(size_t work_amount, size_t increment,
const std::vector<ExpressionPort>& entries, const std::vector<ExpressionPort>& exits,
const SpecificIterationHandlers& handlers = SpecificIterationHandlers());
const SpecificIterationHandlers& handlers = SpecificIterationHandlers(), bool is_wa_const = false);

/**
* @brief Clone LoopInfo with new expressions
@@ -365,7 +379,7 @@ class ExpandedLoopInfo : public LoopInfo {
ExpandedLoopInfo(size_t work_amount, size_t increment,
const std::vector<LoopPort>& entries, const std::vector<LoopPort>& exits,
std::vector<int64_t> ptr_increments, std::vector<int64_t> final_offsets, std::vector<int64_t> data_sizes,
SpecificLoopIterType type, std::shared_ptr<UnifiedLoopInfo> unified_loop_info);
SpecificLoopIterType type, std::shared_ptr<UnifiedLoopInfo> unified_loop_info, bool is_wa_const = false);
/**
* @brief Clone LoopInfo with new expressions
* @param expr_map map of new and old expressions
10 changes: 6 additions & 4 deletions src/common/snippets/include/snippets/lowered/loop_manager.hpp
@@ -99,12 +99,13 @@ class LoopManager {
size_t increment,
const std::vector<T>& entries,
const std::vector<T>& exits,
bool set_default_handlers = true) {
bool set_default_handlers = true,
bool is_work_amount_const = false) {
const auto normalized_increment = utils::is_dynamic_value(work_amount) || work_amount == 0 ? increment : std::min(increment, work_amount);
const auto handlers = set_default_handlers
? SpecificIterationHandlers(work_amount, normalized_increment)
: SpecificIterationHandlers();
const auto loop_info = std::make_shared<UnifiedLoopInfo>(work_amount, normalized_increment, entries, exits, handlers);
const auto loop_info = std::make_shared<UnifiedLoopInfo>(work_amount, normalized_increment, entries, exits, handlers, is_work_amount_const);
const auto loop_id = this->add_loop_info(loop_info);
for (auto expr_it = loop_begin_pos; expr_it != loop_end_pos; ++expr_it) {
insert_loop_id(*expr_it, loop_id);
@@ -131,8 +132,9 @@
size_t dim_idx,
const std::vector<T>& entries,
const std::vector<T>& exits,
bool set_default_handlers = true) {
const auto loop_id = mark_loop(loop_begin_pos, loop_end_pos, work_amount, increment, entries, exits, set_default_handlers);
bool set_default_handlers = true,
bool is_work_amount_const = false) {
const auto loop_id = mark_loop(loop_begin_pos, loop_end_pos, work_amount, increment, entries, exits, set_default_handlers, is_work_amount_const);
const auto loop_info = get_loop_info<UnifiedLoopInfo>(loop_id);
loop_info->set_dim_idx(dim_idx);
return loop_id;
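For reference, a hedged sketch of calling the extended `mark_loop` interface; the iterators and ports are caller-supplied placeholders, and only the trailing flag is new in this commit.

```cpp
#include <vector>

#include "snippets/lowered/loop_manager.hpp"

using namespace ov::snippets::lowered;

// Sketch only: marks a loop whose work amount must stay untouched by later passes.
size_t mark_const_loop(const LoopManagerPtr& loop_manager,
                       LinearIR::constExprIt begin, LinearIR::constExprIt end,
                       size_t work_amount, size_t increment, size_t dim_idx,
                       const std::vector<ExpressionPort>& entries,
                       const std::vector<ExpressionPort>& exits) {
    return loop_manager->mark_loop(begin, end, work_amount, increment, dim_idx,
                                   entries, exits,
                                   /*set_default_handlers=*/true,
                                   /*is_work_amount_const=*/true);
}
```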
@@ -42,8 +42,6 @@ class AllocateBuffers: public RangedPass {
*/
static void set_buffer_offset(const ExpressionPtr& buffer_expr, const size_t offset);

using BufferCluster = std::set<ExpressionPtr>;
using BufferClusters = std::vector<BufferCluster>;
private:
bool m_is_optimized_mode = true;
};
@@ -0,0 +1,38 @@
// Copyright (C) 2018-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include "pass.hpp"

#include "snippets/lowered/loop_manager.hpp"

namespace ov {
namespace snippets {
namespace lowered {
namespace pass {

/**
* @interface ComputeBufferAllocationSize
* @brief The pass calculates the allocation sizes of Buffers.
* @param m_buffer_allocation_rank - rank of shape for memory allocation: shape[m_allocation_rank : -1]
* @ingroup snippets
*/
class ComputeBufferAllocationSize : public RangedPass {
public:
OPENVINO_RTTI("ComputeBufferAllocationSize", "RangedPass")
ComputeBufferAllocationSize(size_t buffer_allocation_rank) : m_buffer_allocation_rank(buffer_allocation_rank) {}

bool run(LinearIR& linear_ir, lowered::LinearIR::constExprIt begin, lowered::LinearIR::constExprIt end) override;

static size_t get_allocation_size(const LoopManagerPtr& loop_manager, const ExpressionPtr& buffer_expr, size_t allocation_rank);

private:
size_t m_buffer_allocation_rank = 0;
};

} // namespace pass
} // namespace lowered
} // namespace snippets
} // namespace ov
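A hedged sketch of wiring the new pass into a lowered pass pipeline; the header location of `PassPipeline` and the exact place in `control_flow_transformations(...)` are assumptions.

```cpp
#include "snippets/lowered/pass/pass.hpp"  // assumed location of PassPipeline; it may live in pass_pipeline.hpp
// plus the header of ComputeBufferAllocationSize shown above (its path is not visible in this diff)

// Sketch only: register the pass with the buffer allocation rank and run it over a LinearIR.
void compute_allocation_sizes(ov::snippets::lowered::LinearIR& linear_ir, size_t buffer_allocation_rank) {
    ov::snippets::lowered::pass::PassPipeline pipeline;
    pipeline.register_pass<ov::snippets::lowered::pass::ComputeBufferAllocationSize>(buffer_allocation_rank);
    pipeline.run(linear_ir);
}
```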
@@ -6,8 +6,6 @@

#include "pass.hpp"

#include "allocate_buffers.hpp"

namespace ov {
namespace snippets {
namespace lowered {
@@ -35,7 +33,7 @@ class DefineBufferClusters : public RangedPass {
public:
OPENVINO_RTTI("DefineBufferClusters", "RangedPass")

DefineBufferClusters(AllocateBuffers::BufferClusters& clusters) : m_clusters(clusters) {}
DefineBufferClusters() = default;

/**
* @brief Apply the pass to the Linear IR
@@ -45,13 +43,15 @@
bool run(lowered::LinearIR& linear_ir, lowered::LinearIR::constExprIt begin, lowered::LinearIR::constExprIt end) override;

private:
using BufferCluster = std::set<ExpressionPtr>;
using BufferClusters = std::vector<BufferCluster>;
using BufferPorts = std::unordered_map<ExpressionPtr, std::set<size_t>>;
/**
* @brief Finds Buffer cluster in set of clusters which contains the target expression with Buffer
* @param target target expression with Buffer op
* @return vector iterator which refers to the found cluster
*/
AllocateBuffers::BufferClusters::iterator find_cluster_by_expr(const ExpressionPtr& target);
BufferClusters::iterator find_cluster_by_expr(const ExpressionPtr& target);
/**
* @brief Returns True if the Buffer is a direct source for the target expr (there are no other loops between the Buffer and the target expr)
* @param buffer_expr expression with assumed Buffer op
@@ -70,7 +70,7 @@
* @param cluster set of Buffer expressions - cluster
* @return common buffer ID or SIZE_MAX - size value
*/
size_t get_cluster_buffer_id(const AllocateBuffers::BufferCluster& cluster) const;
size_t get_cluster_buffer_id(const BufferCluster& cluster) const;

/**
* @brief Analyzes Loop: if Loop has Buffer ops on inputs and outputs, Loop can read and write from/to the same memory.
@@ -126,10 +126,10 @@
* @param is_outer_up true if outer buffer is upper in Linear IR than inner Buffers
* @return Return True if clusters have been united
*/
bool unite_nested_clusters(const AllocateBuffers::BufferClusters::iterator& inner_cluster_it, AllocateBuffers::BufferCluster& outer_cluster,
bool unite_nested_clusters(const BufferClusters::iterator& inner_cluster_it, BufferCluster& outer_cluster,
const ExpressionPtr& outer_buffer, bool is_outer_up);

AllocateBuffers::BufferClusters& m_clusters;
BufferClusters m_clusters;
};

} // namespace pass
@@ -13,7 +13,7 @@ namespace pass {

/**
* @interface InitBuffersDefault
* @brief The pass inits Buffer expressions in LinearIR default (non-optimized): sets unique offsets and ID to Buffers.
* @brief The pass initializes Buffer expressions in LinearIR in the default (non-optimized) way: it sets unique offsets and register groups for Buffers.
* @ingroup snippets
*/

@@ -24,7 +24,7 @@ namespace pass {
class InsertBuffers : public RangedPass {
public:
OPENVINO_RTTI("InsertBuffers", "RangedPass")
InsertBuffers(int32_t buffer_allocation_rank);
InsertBuffers() = default;
bool run(LinearIR& linear_ir, lowered::LinearIR::constExprIt begin, lowered::LinearIR::constExprIt end) override;

private:
@@ -39,8 +39,6 @@ class InsertBuffers : public RangedPass {
const LoopManagerPtr& loop_manager,
const ExpressionPtr& expr,
const ExpressionPtr& down_expr);

int32_t m_buffer_allocation_rank;
};

} // namespace pass
@@ -12,20 +12,20 @@ namespace lowered {
namespace pass {

/**
* @interface NormalizeBufferIDs
* @brief After optimizations some Buffer IDs might be set unevenly: some numbers are missed.
* @interface NormalizeBufferRegisterGroups
* @brief After optimizations some Buffer RegGroups might be set unevenly: some numbers are missing.
* For example,
* [Buffer -> ID]
* Buffer0 -> 0 Two Buffers have ID = 0, one has ID = 2.
* Buffer1 -> 2 Obviosly, we can normalize this IDs to set ID = 1 to Buffer1.
* [Buffer -> RegGroup]
* Buffer0 -> 0 Two Buffers have RegGroup = 0, one has RegGroup = 2.
* Buffer1 -> 2 Obviously, we can normalize these groups and set RegGroup = 1 for Buffer1.
* Buffer2 -> 0 This helps to assign GPR registers in `AssignRegister` more effectively.
* Thus, the pass normalizes the register groups of Buffers in the Linear IR.
* @ingroup snippets
*/

class NormalizeBufferIDs : public RangedPass {
class NormalizeBufferRegisterGroups : public RangedPass {
public:
OPENVINO_RTTI("NormalizeBufferIDs", "RangedPass")
OPENVINO_RTTI("NormalizeBufferRegisterGroups", "RangedPass")
/**
* @brief Apply the pass to the Linear IR
* @param linear_ir the target Linear IR
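The renumbering idea from the comment above can be illustrated with a small standalone sketch; this is not the pass implementation, only the dense remapping of used register groups.

```cpp
#include <cstddef>
#include <map>

int main() {
    // Register groups of Buffer0, Buffer1, Buffer2 from the example above: 0, 2, 0.
    const size_t old_groups[] = {0, 2, 0};
    std::map<size_t, size_t> dense;  // old RegGroup -> normalized RegGroup
    for (const auto g : old_groups)
        dense.emplace(g, dense.size());  // yields {0 -> 0, 2 -> 1}
    return dense.at(2) == 1 ? 0 : 1;
}
```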