
Optimize merge_sort algorithm for largest data sizes #1977

Open · wants to merge 143 commits into main
Conversation

@SergeyKopienko (Contributor) commented Dec 19, 2024

In this PR we extend the approach from #1933 to the merge_sort algorithm.

…introduce new function __find_start_point_in

Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __parallel_merge_submitter_large for merge of biggest data sizes

Signed-off-by: Sergey Kopienko <[email protected]>
…using __parallel_merge_submitter_large for merging data equal to or greater than 4M items

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…rename template parameter names in __parallel_merge_submitter

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __starting_size_limit_for_large_submitter into __parallel_merge

Signed-off-by: Sergey Kopienko <[email protected]>
…introduce _split_point_t type

Signed-off-by: Sergey Kopienko <[email protected]>
…remove usages of std::make_pair

Signed-off-by: Sergey Kopienko <[email protected]>
…optimize evaluation of split-points on base diagonals

Signed-off-by: Sergey Kopienko <[email protected]>
…extract eval_split_points_for_groups function

Signed-off-by: Sergey Kopienko <[email protected]>
…extract run_parallel_merge function

Signed-off-by: Sergey Kopienko <[email protected]>
…using SLM bank size to define chunk in the eval_nd_range_params function

Signed-off-by: Sergey Kopienko <[email protected]>
…using SLM bank size to define chunk in the eval_nd_range_params function (16)

Signed-off-by: Sergey Kopienko <[email protected]>
…restore old implementation of __find_start_point

Signed-off-by: Sergey Kopienko <[email protected]>
…rename: base_diag_part -> steps_between_two_base_diags

Signed-off-by: Sergey Kopienko <[email protected]>
…fix an error in __parallel_merge_submitter_large::eval_split_points_for_groups

Signed-off-by: Sergey Kopienko <[email protected]>
…erge_submitter_large` into one `__parallel_merge_submitter` (#1956)
…fix review comment: remove extra condition check from __find_start_point_in

Signed-off-by: Sergey Kopienko <[email protected]>
…fix review comment: fix condition check in __find_start_point_in

Signed-off-by: Sergey Kopienko <[email protected]>
…apply GitHub clang format

Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_sort_V1 branch from 623cec8 to 74455e3 Compare January 17, 2025 11:29
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_sort_V1 branch from 74455e3 to 991f32a Compare January 17, 2025 12:00
…declare local variables outside of the loop in __serial_merge
…rge_sort.h - Simple using 32 Kb of base diagonals for every merging part + calc max amount of base diagonals"

This reverts commit bdb68d3.
….h - remove duplicated creation of DropViews
# Conflicts:
#	include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_merge_sort.h
….h - log __n_sorted state in __merge_sort_global_submitter::operator()
….h - declare __n_sorted as const in __merge_sort_global_submitter::eval_nd_range_params
….h - fix comments for get_max_base_diags_count
….h - remove hard-coded numbers from get_max_base_diags_count
@SergeyKopienko SergeyKopienko marked this pull request as ready for review January 22, 2025 15:19
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_sort_V1 branch from 82be174 to d356335 Compare January 22, 2025 16:29
Comment on lines +304 to +310

// Calculate offset to the first element in the subrange (i.e. the first element of the first subrange for merge)
offset = std::min<_IndexT>(i_elem - i_elem_local, __n);

// Calculate size of the first and the second subranges
n1 = std::min<_IndexT>(offset + __n_sorted, __n) - offset;
n2 = std::min<_IndexT>(offset + __n_sorted + n1, __n) - (offset + n1);
Contributor:

I imagine you can replace these branches for the vast majority of work-groups / work-items with a single branch to check if it is a "full" subproblem.
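A minimal sketch of the reviewer's idea, with hypothetical names (not the PR's actual code): handle the common case, where the subproblem lies entirely inside the data, with one cheap test, and fall back to the clamped computation only for the tail.

```cpp
#include <algorithm>
#include <cstddef>

using _IndexT = std::size_t;

struct _Subranges { _IndexT __offset, __n1, __n2; };

// Hypothetical helper: compute the two subrange sizes for one merge subproblem.
_Subranges
__compute_subranges(_IndexT __i_elem, _IndexT __i_elem_local, _IndexT __n_sorted, _IndexT __n)
{
    const _IndexT __offset = std::min<_IndexT>(__i_elem - __i_elem_local, __n);
    if (__offset + 2 * __n_sorted <= __n)
        return {__offset, __n_sorted, __n_sorted}; // "full" subproblem: no clamping needed

    // Tail subproblem: clamp both subranges against the end of the data, as before.
    const _IndexT __n1 = std::min<_IndexT>(__offset + __n_sorted, __n) - __offset;
    const _IndexT __n2 = std::min<_IndexT>(__offset + __n_sorted + __n1, __n) - (__offset + __n1);
    return {__offset, __n1, __n2};
}
```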

Contributor Author @SergeyKopienko commented Jan 23, 2025:

  1. There is no work-group / work-item separation at all, only a planar sycl::range.
  2. This code existed before too; I simply repacked it into this class to reduce the amount of code duplication.

Comment on lines +417 to +421
__data_area.is_i_elem_local_inside_merge_matrix()
? (__data_in_temp
? __find_start_point_w(__data_area, DropViews(__dst, __data_area), __comp)
: __find_start_point_w(__data_area, DropViews(__rng, __data_area), __comp))
: _merge_split_point_t{__data_area.n1, __data_area.n2};
Contributor:

I guess we might end up with base diagonals at the {0,0} point for individual subproblems, but if the subproblems are big enough, it is perhaps not too much wasted effort...
Nothing really to do here, I think, but I want to point it out.

Contributor Author:

I tried to explain this design in comment https://github.com//pull/1977#discussion_r1926601240

get_max_base_diags_count(_ExecutionPolicy&& __exec, const _IndexT __chunk, std::size_t __n) const
{
const std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__exec);
return oneapi::dpl::__internal::__dpl_ceiling_div(__n, __chunk * __max_wg_size);
Contributor:

Actually... should this be oneapi::dpl::__internal::__dpl_ceiling_div(__n, __chunk * __max_wg_size)-1; ?

As you mention in a comment below, we don't need to calculate {0,0}, nor do we need the final diagonal.

Think about the situation with a single subproblem, but
__max_wg_size = 1024
__chunk = 4
__n = 8191

I would think we only want a single base diagonal, bounding the first work-group's input data. With this function we get 2.
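For reference, a standalone recomputation of that example, assuming __dpl_ceiling_div is a plain ceiling division:

```cpp
#include <cstddef>
#include <iostream>

constexpr std::size_t
ceiling_div(std::size_t __a, std::size_t __b) { return (__a + __b - 1) / __b; }

int main()
{
    const std::size_t __max_wg_size = 1024, __chunk = 4, __n = 8191;
    std::cout << ceiling_div(__n, __chunk * __max_wg_size) << '\n';     // current: 2
    std::cout << ceiling_div(__n, __chunk * __max_wg_size) - 1 << '\n'; // proposed: 1
}
```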

Contributor Author:

It simply doesn't make sense to worry about this, because all split-points that fall outside the merge matrix are calculated as the constant {n1, n2}, where n1 is the size of the first merging sequence and n2 is the size of the second. So in any case there are no extra calculations here.

Regarding your example: we only get sizes starting from 8192, with each next size multiplied by 2:
8192, 16384, 32768, and so on.

For base diagonals with the float data type we start from 2097152:
2097152, 4194304, 8388608, and so on.

So I believe there is no real problem here.

const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

_IndexT __base_diag_count = get_max_base_diags_count(__exec, __chunk, __n_sorted);
_IndexT __steps_between_two_base_diags = oneapi::dpl::__internal::__dpl_ceiling_div(__steps, __base_diag_count);
Contributor:

This makes evenly divided base diagonals, but it seems like maybe we would prefer to explicitly place them at the edges of workgroups instead?

Contributor Author:

Decreasing the number of base diagonals gives us poor performance, as far as I have checked.

Contributor Author:

Additional comments.

I tried to describe this design in

__lookup_sp(const std::size_t __linear_id_in_steps_range, const nd_range_params& __nd_range_params,

The reasoning is as follows (see also the sketch below):

  1. We save a series of base-diagonal data in common storage for many logical merges.
  2. When we calculate base diagonals, we work with their indexes in that storage across all the data, without additional index recalculation.
  3. It follows from (1) and (2) that we cannot save the data for both the first and the last base diagonal of every merge matrix; we can save only one of them.
  4. Since the split-point of the first base diagonal is always {0, 0}, I chose to save the last base diagonal's data.
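A hypothetical sketch of that storage scheme (names and layout assumed, not the PR's actual code): split points for many logical merges live in one flat array, and the implicit {0, 0} of each merge's first base diagonal is never stored.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using _split_point_t = std::pair<std::size_t, std::size_t>;

// Hypothetical lookup: each merge matrix stores __diags_per_merge split points.
// The first base diagonal of every merge is always {0, 0}, so it is not stored;
// slot 0 of each merge holds its second base diagonal, and the last stored slot
// holds the final diagonal of the merge matrix.
_split_point_t
__lookup_base_diag(const std::vector<_split_point_t>& __storage,
                   std::size_t __merge_idx,       // which logical merge
                   std::size_t __local_diag_idx,  // diagonal index inside that merge
                   std::size_t __diags_per_merge) // stored diagonals per merge
{
    if (__local_diag_idx == 0)
        return {0, 0};
    return __storage[__merge_idx * __diags_per_merge + (__local_diag_idx - 1)];
}
```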

}
});
});
if (2 * __n_sorted < __get_starting_size_limit_for_large_submitter<__value_type>())
Contributor:

What I would suggest for now is to take a look at the values within __get_starting_size_limit_for_large_submitter, and see if we get reasonable results for different types when using something like this:

template <typename _Tp>
constexpr std::size_t
__get_starting_size_limit_for_large_submitter()
{
    return MagicConstantCacheSize / sizeof(_Tp);
}

for some fixed value of MagicConstantCacheSize. We can have this be empirically driven to start, but at least it would be consistent with itself and have a basis in something that we can understand from the hardware.
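To sanity-check the shape of this proposal: with a hypothetical MagicConstantCacheSize of 8 MiB, the limit for float comes out to the 2097152 figure mentioned elsewhere in this thread.

```cpp
#include <cstddef>
#include <iostream>

inline constexpr std::size_t MagicConstantCacheSize = 8 * 1024 * 1024; // assumed: 8 MiB

template <typename _Tp>
constexpr std::size_t
__get_starting_size_limit_for_large_submitter()
{
    return MagicConstantCacheSize / sizeof(_Tp);
}

int main()
{
    std::cout << __get_starting_size_limit_for_large_submitter<float>()  << '\n'; // 2097152
    std::cout << __get_starting_size_limit_for_large_submitter<double>() << '\n'; // 1048576
}
```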

const bool __is_base_diagonal =
__linear_id_in_steps_range % __nd_range_params.steps_between_two_base_diags == 0;

if (__sp_right.first + __sp_right.second > 0)
Contributor:

It seems like this branch is asking whether there is a right (upper) bound base diagonal which has been successfully defined.

If this is correct, can we instead replace it with a similar check when initializing __sp_right, bounding it by the end of the sequence if it is "missing"?

It seems like we are throwing away the left-bound information when we don't have the right bound, when we should just keep the left bound and use the end of the sequence as the right bound.

If I'm misunderstanding the point of this check, please explain it.

Contributor:

After further consideration, I think that this check, combined with an assumption that each subproblem (other than the first) starts with a base diagonal, is what allows the final steps_between_two_base_diags steps of each subproblem to work, but it does so in an inefficient way, by falling back to a search bounded by the full subproblem rather than by base diagonals.

I see a couple of issues here:
(1) I'm not sure we can safely assume that subproblems always start with base diagonals. If this assumption is part of our algorithm, we need to make that clear in the setup, and also guarantee that it always holds.
(2) It seems we could still bound the problem by the left-side base diagonal and the end of the subproblem (see the sketch below). As we have it now, this search is very inefficient at regular intervals throughout the sequence, based on the subproblem boundaries. It probably hurts our performance a lot.

If this isn't how it works, then I don't understand how we handle the elements at the end of each subproblem.
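A minimal sketch of the fallback proposed in (2), with assumed names and the assumption that an unwritten right split point reads as {0, 0}:

```cpp
#include <cstddef>
#include <utility>

using _split_point_t = std::pair<std::size_t, std::size_t>;

// If the right base diagonal was never defined, bound the search by the end of
// the subproblem {__n1, __n2} instead of discarding the left bound entirely.
_split_point_t
__right_search_bound(_split_point_t __sp_right, std::size_t __n1, std::size_t __n2)
{
    const bool __missing = (__sp_right.first + __sp_right.second == 0);
    return __missing ? _split_point_t{__n1, __n2} : __sp_right;
}
```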

Comment on lines +351 to +354
const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

_IndexT __base_diag_count = get_max_base_diags_count(__exec, __chunk, __n_sorted);
_IndexT __steps_between_two_base_diags = oneapi::dpl::__internal::__dpl_ceiling_div(__steps, __base_diag_count);
Contributor:

I think __steps is miscalculated here: it should use __n_sorted * 2 instead of __rng_size, to represent the subproblem.

Let's look at what is done in this function, and how it's adjusted later:

It calculates __base_diag_count for the subproblem (merging two __n_sorted sized sequences).
It calculates __steps for the full problem, all subproblems combined, by using __rng_size.
It calculates __steps_between_two_base_diags by combining data from the subproblem (__base_diag_count) and from the full problem (__steps).

Then later, outside the function, there is a correction which calculates the number of subproblems and scales two of the fields by that number:

__nd_range_params_this.base_diag_count *= __portions;
__nd_range_params_this.steps *= __portions;

From context, it seems this function is supposed to return values for the subproblem only.
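A standalone illustration of that reading, with assumed sizes: computing __steps from __rng_size and then scaling by __portions over-counts, whereas computing it from the subproblem size (2 * __n_sorted) matches the later correction.

```cpp
#include <cstddef>
#include <iostream>

constexpr std::size_t
ceiling_div(std::size_t __a, std::size_t __b) { return (__a + __b - 1) / __b; }

int main()
{
    const std::size_t __n_sorted = 8192, __chunk = 4, __portions = 4;
    const std::size_t __rng_size = 2 * __n_sorted * __portions; // four subproblems

    const std::size_t __steps_from_rng = ceiling_div(__rng_size, __chunk);     // 16384
    const std::size_t __steps_from_sub = ceiling_div(2 * __n_sorted, __chunk); // 4096

    std::cout << __steps_from_rng * __portions << '\n'; // 65536: over-counted
    std::cout << __steps_from_sub * __portions << '\n'; // 16384: intended total
}
```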

Contributor:

I'll note that I think it would be better if this function returned values for the full problem rather than the subproblem, so that no correction was needed later, but that is just my preference.

inline static _merge_split_point_t
__find_start_point_w(const WorkDataArea& __data_area, const DropViews& __views, _Compare __comp)
{
return __find_start_point(__views.rng1, decltype(__data_area.n1){0}, __data_area.n1, __views.rng2,
Contributor:

The type of __data_area.n1 is known in this context to be _IndexT. Can we replace decltype(__data_area.n1){0} with _IndexT{0}?


template <typename DropViews, typename _Compare>
inline static _merge_split_point_t
__find_start_point_w(const WorkDataArea& __data_area, const DropViews& __views, _Compare __comp)
Contributor:

What does the _w suffix in this function's name denote? Could you leave a comment here and for __serial_merge_w?

sycl::accessor __dst(__temp_buf, __cgh, sycl::read_write, sycl::no_init);

__cgh.parallel_for<_DiagonalsKernelName...>(
// +1 doesn't required here, because we need to calculate split points for each base diagonal
Contributor:

Suggested change:
- // +1 doesn't required here, because we need to calculate split points for each base diagonal
+ // +1 is not required here, because we need to calculate split points for each base diagonal
