Optimize merge_sort algorithm for largest data sizes #1977
base: main

Conversation
- …introduce new function __find_start_point_in
- …introduce __parallel_merge_submitter_large for merge of biggest data sizes
- …using __parallel_merge_submitter_large for merge data equal or greater than 4M items
- …fix compile error
- …fix Kernel names
- …rename template parameter names in __parallel_merge_submitter
- …fix review comment
- …fix review comment
- …introduce __starting_size_limit_for_large_submitter into __parallel_merge
- …renames
- …introduce _split_point_t type
- …remove usages of std::make_pair
- …optimize evaluation of split-points on base diagonals
- …renames
- …extract eval_split_points_for_groups function
- …extract run_parallel_merge function
- …using SLM bank size to define chunk in the eval_nd_range_params function
- …using SLM bank size to define chunk in the eval_nd_range_params function (16)
- …restore old implementation of __find_start_point
- …rename: base_diag_part -> steps_between_two_base_diags
- …fix review comment
- …fix an error in __parallel_merge_submitter_large::eval_split_points_for_groups
- …onals is too short
- …erge_submitter_large` into one `__parallel_merge_submitter` (#1956)
- …fix review comment: remove extra condition check from __find_start_point_in
- …fix review comment: fix condition check in __find_start_point_in
- …apply GitHub clang format
Force-pushed from 623cec8 to 74455e3
- ….h - trace Kernel's execution time info
Force-pushed from 74455e3 to 991f32a
- …declare local variables outside of the loop in __serial_merge
- …rge_sort.h - Simple using 32 Kb of base diagonals for every merging part + calc max amount of base diagonals" This reverts commit bdb68d3.
- ….h - calcs get_max_base_diags_count
- ….h - trace Kernel's execution time info
- ….h - remove duplicated creation of DropViews
- ….h - fix comment in __lookup_sp
- …trace merge Kernel's execution time info
- ….h - calcs get_max_base_diags_count
- …trace merge Kernel's execution time info
- …trace merge Kernel's execution time info
- # Conflicts: # include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_merge_sort.h
- ….h - log __n_sorted state in __merge_sort_global_submitter::operator()
- ….h - declare __n_sorted as const in __merge_sort_global_submitter::eval_nd_range_params
- ….h - fix comments for get_max_base_diags_count
- ….h - remove hard-coded numbers from get_max_base_diags_count
Force-pushed from 82be174 to d356335
```cpp
// Calculate offset to the first element in the subrange (i.e. the first element of the first subrange for merge)
offset = std::min<_IndexT>(i_elem - i_elem_local, __n);

// Calculate size of the first and the second subranges
n1 = std::min<_IndexT>(offset + __n_sorted, __n) - offset;
n2 = std::min<_IndexT>(offset + __n_sorted + n1, __n) - (offset + n1);
```
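To make the quoted arithmetic concrete, here is a minimal standalone sketch of a tail block, where the first run is truncated and the second run is empty (the values and the surrounding `main` are illustrative only; the three formulas are copied from the snippet above):

```cpp
// Hypothetical standalone check of the subrange math above; the names mirror
// the snippet, but this is not the oneDPL source.
#include <algorithm>
#include <cassert>
#include <cstdint>

int main()
{
    using _IndexT = std::uint32_t;
    const _IndexT __n = 10;        // total number of elements
    const _IndexT __n_sorted = 4;  // length of each already-sorted run

    // Take an element in the second pair of runs: global index 9,
    // local index 1 within its 2 * __n_sorted block.
    const _IndexT i_elem = 9, i_elem_local = 1;

    const _IndexT offset = std::min<_IndexT>(i_elem - i_elem_local, __n);                 // 8
    const _IndexT n1 = std::min<_IndexT>(offset + __n_sorted, __n) - offset;              // 2 (truncated run)
    const _IndexT n2 = std::min<_IndexT>(offset + __n_sorted + n1, __n) - (offset + n1);  // 0 (no second run)

    assert(offset == 8 && n1 == 2 && n2 == 0);
}
```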
I imagine you can replace these branches for the vast majority of work-groups / work-items with a single branch to check if it is a "full" subproblem.
- There is no work-group / work-item separation here at all, only a planar sycl::range.
- This code existed before too; I simply repacked it into this class to reduce the amount of code duplication.
```cpp
__data_area.is_i_elem_local_inside_merge_matrix()
    ? (__data_in_temp
           ? __find_start_point_w(__data_area, DropViews(__dst, __data_area), __comp)
           : __find_start_point_w(__data_area, DropViews(__rng, __data_area), __comp))
    : _merge_split_point_t{__data_area.n1, __data_area.n2};
```
I guess we might end up with base diagonals at the {0,0} point for individual subproblems, but if the subproblems are big enough, it is not too much wasted effort perhaps...
Nothing really to do here I don't think but I want to point it out.
I tried to explain this design in the comment https://github.com//pull/1977#discussion_r1926601240
```cpp
get_max_base_diags_count(_ExecutionPolicy&& __exec, const _IndexT __chunk, std::size_t __n) const
{
    const std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__exec);
    return oneapi::dpl::__internal::__dpl_ceiling_div(__n, __chunk * __max_wg_size);
}
```
Actually... should this be oneapi::dpl::__internal::__dpl_ceiling_div(__n, __chunk * __max_wg_size) - 1;?
As you mention in a comment below, we don't need to calculate {0,0}, nor do we need the final diagonal.
Think about the situation with a single subproblem, but:
- __max_wg_size = 1024
- __chunk = 4
- __n = 8191

I would think we only want a single base diagonal, a bound for the first work-group's input data. With this function we get 2.
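For what it's worth, plugging these numbers into the ceiling division (a standalone stand-in for __dpl_ceiling_div, shown only to check the arithmetic) confirms the count of 2 versus the desired 1:

```cpp
// Checking the reviewer's numbers with a stand-in for __dpl_ceiling_div;
// only the arithmetic is being verified here, not oneDPL's implementation.
#include <cstddef>

constexpr std::size_t ceiling_div(std::size_t __x, std::size_t __y) { return (__x + __y - 1) / __y; }

static_assert(ceiling_div(8191, 4 * 1024) == 2);     // current formula: two base diagonals
static_assert(ceiling_div(8191, 4 * 1024) - 1 == 1); // with "-1": the single desired diagonal
```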
It simply doesn't make sense to think about this, because all split-points which are outside of the merge matrix are calculated as the constant {n1, n2}, where n1 is the size of the first merging data and n2 is the size of the second merging data. So in any case there are no extra calculations here.
About your example: we strictly get sizes starting from 8192, and each next size is multiplied by 2: 8192, 16384, 32768 and so on.
For base diagonals, for the float data type we will start from 2097152: 2097152, 4194304, 8388608 and so on.
So I believe there is no real problem here.
```cpp
const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

_IndexT __base_diag_count = get_max_base_diags_count(__exec, __chunk, __n_sorted);
_IndexT __steps_between_two_base_diags = oneapi::dpl::__internal::__dpl_ceiling_div(__steps, __base_diag_count);
```
This makes evenly divided base diagonals, but it seems like maybe we would prefer to explicitly place them at the edges of workgroups instead?
Decreasing the number of base diagonals gives us poor performance, as far as I have checked.
Additional comments.
I tried to comment on this design in
```cpp
__lookup_sp(const std::size_t __linear_id_in_steps_range, const nd_range_params& __nd_range_params,
```
There are two reasons:
1. We save the series of base-diagonal data in a common storage for a lot of logical merges.
2. When we calculate base diagonals, we work with their indexes in that storage through all the data, without additional index recalculation.

It follows from (1) and (2) that we cannot save the data for both the first and the last base diagonal of every merge matrix; we can save only one of them. As far as we know that the split-point of the first base diagonal is always {0, 0}, I made the choice to save the last base diagonal's data.
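A possible reading of this scheme as a sketch (the helper below is hypothetical, not the PR's __lookup_sp): each stored entry closes its interval on the right, and the left bound is either the previous stored entry or the implicit {0, 0}:

```cpp
// Hypothetical illustration of the storage scheme described above; names and
// layout are assumptions, not the oneDPL source.
#include <cstddef>
#include <utility>
#include <vector>

using _merge_split_point_t = std::pair<std::size_t, std::size_t>;

_merge_split_point_t
lookup_bounds_sketch(const std::vector<_merge_split_point_t>& __base_diags,
                     std::size_t __diag_idx, _merge_split_point_t& __sp_left)
{
    // Left bound: the previous stored diagonal, or the implicit {0, 0} start.
    __sp_left = (__diag_idx == 0) ? _merge_split_point_t{0, 0} : __base_diags[__diag_idx - 1];
    // Right bound: the stored diagonal that closes this interval.
    return __base_diags[__diag_idx];
}
```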
```cpp
if (2 * __n_sorted < __get_starting_size_limit_for_large_submitter<__value_type>())
```
What I would suggest for now is to take a look at the values within __get_starting_size_limit_for_large_submitter, and see if we get reasonable results for different types when using something like this:
```cpp
template <typename _Tp>
constexpr std::size_t
__get_starting_size_limit_for_large_submitter()
{
    return MagicConstantCacheSize / sizeof(_Tp);
}
```
for some fixed value of MagicConstantCacheSize. We can have this be empirically driven to start, but at least it would be consistent with itself and have a basis in something that we can understand from the hardware.
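For concreteness, one hypothetical instantiation of this suggestion (the 8 MiB constant is an assumption chosen only because it reproduces the 2097152 float threshold quoted earlier in the thread):

```cpp
// Hypothetical instantiation of the suggested helper; MagicConstantCacheSize
// and the 8 MiB value are assumptions for illustration, not the PR's choice.
#include <cstddef>

inline constexpr std::size_t MagicConstantCacheSize = 8 * 1024 * 1024; // 8 MiB, assumed

template <typename _Tp>
constexpr std::size_t
__get_starting_size_limit_for_large_submitter()
{
    return MagicConstantCacheSize / sizeof(_Tp);
}

static_assert(__get_starting_size_limit_for_large_submitter<float>() == 2097152);
static_assert(__get_starting_size_limit_for_large_submitter<double>() == 1048576);
```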
```cpp
const bool __is_base_diagonal =
    __linear_id_in_steps_range % __nd_range_params.steps_between_two_base_diags == 0;

if (__sp_right.first + __sp_right.second > 0)
```
It seems like this branch is asking whether there is a right (upper) bound base diagonal which has been successfully defined.
If this is correct, can we instead replace it with a similar check when initializing __sp_right, to bound it by the end of the sequence if it is "missing"?
It seems like we are throwing away the left-bound information when we don't have the right bound, when we should just keep the left bound and use the end of the sequence as the right bound.
If I'm misunderstanding the point of this check, please explain it.
After further consideration, I think that this check, combined with the assumption that each subproblem (other than the first) starts with a base diagonal, is what allows the final steps_between_two_base_diags steps of each subproblem to work, but it does so in an inefficient way, by falling back to a search bounded by the full subproblem rather than by base diagonals.
I see a couple of issues here:
(1) I'm not sure we can safely assume that subproblems always start with base diagonals. If this assumption is part of our algorithm, we need to make that clear in the setup, and also guarantee that it always holds.
(2) It seems we could still bound the problem by the left-side base diagonal and the end of the subproblem. As we have it now, this search is very inefficient, and happens at regular intervals throughout the sequence based on subproblems. It probably hurts our performance a lot.
If this isn't how it works, then I don't understand how we handle the elements at the end of each subproblem.
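A sketch of what suggestion (2) might look like (hypothetical names, not the PR's code):

```cpp
// Reviewer's suggestion, sketched: when the right base diagonal is "missing",
// fall back to the end of the subproblem instead of discarding the left bound.
#include <cstddef>
#include <utility>

using _merge_split_point_t = std::pair<std::size_t, std::size_t>;

_merge_split_point_t
init_sp_right_sketch(bool __has_right_base_diag, _merge_split_point_t __stored_right,
                     std::size_t __n1, std::size_t __n2)
{
    // Keep the stored right bound when it exists; otherwise bound the binary
    // search by the subproblem's end {n1, n2} so the left bound stays usable.
    return __has_right_base_diag ? __stored_right : _merge_split_point_t{__n1, __n2};
}
```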
- ….h - additional assert in __merge_sort_global_submitter::operator()
- ….h - additional assert in __merge_sort_global_submitter::operator()
```cpp
const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

_IndexT __base_diag_count = get_max_base_diags_count(__exec, __chunk, __n_sorted);
_IndexT __steps_between_two_base_diags = oneapi::dpl::__internal::__dpl_ceiling_div(__steps, __base_diag_count);
```
I think __steps is miscalculated here, and should be using __n_sorted * 2 instead of __rng_size, to represent the subproblem instead.
Let's look at what is done in this function, and how it's adjusted later:
- It calculates the __base_diag_count for the subproblem (merging two __n_sorted sized sequences).
- It calculates the __steps for the full problem, all subproblems combined, by using the __rng_size.
- It calculates the __steps_between_two_base_diags by combining data from the subproblem (__base_diag_count) and from the full problem (__steps).

Then later, outside the function, there is a correction which calculates the number of subproblems and scales two of the fields by the number of subproblems:
```cpp
__nd_range_params_this.base_diag_count *= __portions;
__nd_range_params_this.steps *= __portions;
```
From context, it seems this function is supposed to return values for the subproblem only.
I'll note I think it would be better if this function returned values for the full problem rather than the subproblem, needing no correction later, but that is just my preference.
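To illustrate the two conventions side by side, a hypothetical sketch (the struct and function names are illustrative, not the PR's eval_nd_range_params):

```cpp
// Illustrative-only sketch contrasting the two conventions discussed above.
#include <cstddef>

constexpr std::size_t ceiling_div(std::size_t __x, std::size_t __y) { return (__x + __y - 1) / __y; }

struct nd_range_params_sketch
{
    std::size_t base_diag_count;
    std::size_t steps;
};

// Convention A: values for one subproblem (2 * __n_sorted elements); the caller
// then scales both fields by the number of subproblems (__portions).
constexpr nd_range_params_sketch
eval_for_subproblem(std::size_t __n_sorted, std::size_t __chunk, std::size_t __base_diags_per_subproblem)
{
    return {__base_diags_per_subproblem, ceiling_div(2 * __n_sorted, __chunk)};
}

// Convention B (the reviewer's preference): values for the full problem
// directly, so no later correction is needed.
constexpr nd_range_params_sketch
eval_for_full_problem(std::size_t __rng_size, std::size_t __chunk,
                      std::size_t __base_diags_per_subproblem, std::size_t __portions)
{
    return {__base_diags_per_subproblem * __portions, ceiling_div(__rng_size, __chunk)};
}

// With __rng_size an exact multiple of the subproblem size, the two agree:
static_assert(eval_for_subproblem(4096, 4, 2).steps * 4 ==
              eval_for_full_problem(4 * 2 * 4096, 4, 2, 4).steps);
```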
```cpp
inline static _merge_split_point_t
__find_start_point_w(const WorkDataArea& __data_area, const DropViews& __views, _Compare __comp)
{
    return __find_start_point(__views.rng1, decltype(__data_area.n1){0}, __data_area.n1, __views.rng2,
```
The type of __data_area.n1 is known in this context to be _IndexT. Can we replace decltype(__data_area.n1){0} with _IndexT{0}?
```cpp
template <typename DropViews, typename _Compare>
inline static _merge_split_point_t
__find_start_point_w(const WorkDataArea& __data_area, const DropViews& __views, _Compare __comp)
```
What does the _w suffix in this function's name denote? Could you leave a comment here and for __serial_merge_w?
```cpp
sycl::accessor __dst(__temp_buf, __cgh, sycl::read_write, sycl::no_init);

__cgh.parallel_for<_DiagonalsKernelName...>(
    // +1 doesn't required here, because we need to calculate split points for each base diagonal
```
Suggested change:
```diff
- // +1 doesn't required here, because we need to calculate split points for each base diagonal
+ // +1 is not required here, because we need to calculate split points for each base diagonal
```
In this PR we extend the approach from #1933 to the merge_sort algorithm.