Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Merged 103 commits on Jan 31, 2025

Changes from all commits (103 commits)
0e2b0c8
Optimize memory transactions in SYCL backend parallel for
mmichel11 Sep 5, 2024
e581887
clang-format
mmichel11 Sep 5, 2024
8918565
Correct comment and error handling.
mmichel11 Sep 6, 2024
2d43c6c
__num_groups bugfix
mmichel11 Sep 10, 2024
70c97bb
Introduce stride recommender for different targets and better distrib…
mmichel11 Sep 16, 2024
0956223
Cleanup
mmichel11 Sep 16, 2024
b774d51
Unroll loop if possible
mmichel11 Sep 18, 2024
3e033a6
Revert "Unroll loop if possible"
mmichel11 Sep 18, 2024
dced316
Use a small and large kernel in parallel for
mmichel11 Sep 20, 2024
faba937
Improve __iters_per_work_item heuristic.
mmichel11 Sep 20, 2024
d7edb2d
Code cleanup
mmichel11 Sep 20, 2024
6cf59d3
Clang format
mmichel11 Sep 23, 2024
fdd169d
Update comments
mmichel11 Sep 23, 2024
87a5df0
Bugfix in comment
mmichel11 Sep 23, 2024
ec5526b
More cleanup and better handle non-full case
mmichel11 Sep 23, 2024
af23e92
Rename __ndi to __item for consistency with codebase
mmichel11 Sep 24, 2024
ce3070b
Update all comments on kernel naming trick
mmichel11 Sep 24, 2024
513e2fd
Handle non-full case in a cleaner way
mmichel11 Sep 24, 2024
36b5510
Switch min tuple type utility to return size of type
mmichel11 Sep 24, 2024
3c2f897
Remove unnecessary template parameter
mmichel11 Sep 24, 2024
a197c54
Make non-template function inline for ODR compliance
mmichel11 Sep 24, 2024
0195183
If the iters per work item is 1, then only compile the basic pfor kernel
mmichel11 Sep 24, 2024
bce68cc
Address several PR comments
mmichel11 Sep 25, 2024
e1f01a4
Remove free function __stride_recommender
mmichel11 Sep 25, 2024
608d765
Accept ranges as forwarding references in __parallel_for_large_submitter
mmichel11 Sep 25, 2024
dd6df76
Address reviewer comments
mmichel11 Nov 6, 2024
79f652a
Introduce vectorized for-path for small types and parallel_backend_sy…
mmichel11 Dec 16, 2024
07fd583
Improve testing and cleanup of code
mmichel11 Dec 16, 2024
8e8bd30
clang format
mmichel11 Dec 16, 2024
b0d6f16
Miscellaneous fixes identified during testing
mmichel11 Dec 17, 2024
2590e6e
clang-format
mmichel11 Dec 17, 2024
7dc2997
Fix ordering to __vector_load call
mmichel11 Dec 17, 2024
778000c
Add support for vectorization with C++20 parallel range APIs
mmichel11 Dec 17, 2024
1459270
Add device copyable specializations for new walk patterns
mmichel11 Dec 17, 2024
b532411
Align vector_walk implementation with other vector functors
mmichel11 Dec 17, 2024
50fe1f3
Add back non-spirv path
mmichel11 Dec 17, 2024
44efb49
Further improve test coverage
mmichel11 Dec 17, 2024
2af0b39
Restore original shift_left due to implicit implementation requiremen…
mmichel11 Dec 17, 2024
408c35e
Fix issues in vectorized rotate
mmichel11 Dec 18, 2024
fdaa0be
Fix fpga parallel for compilation issues
mmichel11 Dec 18, 2024
38ff6c6
Restore initial shift_left_right.pass.cpp
mmichel11 Dec 18, 2024
61db1b9
Fix test side issue when unnamed lambdas are disabled
mmichel11 Dec 18, 2024
4c6d4ca
Add a vector path specialization for std::swap_ranges
mmichel11 Dec 18, 2024
aa512b0
General code cleanup
mmichel11 Dec 18, 2024
94ad89c
Bugfix with __pattern_swap using nanoranges
mmichel11 Dec 18, 2024
cce778f
clang-format
mmichel11 Dec 19, 2024
4a15912
Address applicable comments from PR #1870
mmichel11 Dec 20, 2024
1939a46
Refactor __lazy_ctor_storage deleter
mmichel11 Jan 2, 2025
ee19e4b
Address review comments
mmichel11 Jan 2, 2025
462a821
Remove intrusive test macro and adjust input sizes in test framework
mmichel11 Jan 4, 2025
b5e50fc
Make walk_scalar_base and walk_vector_or_scalar_base structs
mmichel11 Jan 4, 2025
8833f40
Add missing max_n
mmichel11 Jan 4, 2025
5152fa4
Add constructors for for-based bricks
mmichel11 Jan 4, 2025
6c2fcc9
Remove extraneous {} and add constructor to custom_brick
mmichel11 Jan 6, 2025
32833d8
Limit recursive searching of __min_nested_type_size to tuples
mmichel11 Jan 6, 2025
cf7cb72
Work around compiler vectorization issue
mmichel11 Jan 6, 2025
701893c
Add missing decays
mmichel11 Jan 7, 2025
45741cf
Add compile time check to ensure we do not get buffer pointer on host
mmichel11 Jan 7, 2025
4a9be6f
Revert "Work around compiler vectorization issue"
mmichel11 Jan 7, 2025
18ea224
Remove all begin() calls on views in vectorization paths
mmichel11 Jan 7, 2025
017bacf
Remove unused __is_passed_directly_range utility
mmichel11 Jan 7, 2025
a57ce48
Rename __scalar_path / __vector_path to __scalar_path_impl / __vector…
mmichel11 Jan 8, 2025
e2779cf
Correct __vector_walk deleters and a type in __reverse_copy
mmichel11 Jan 8, 2025
39bbac8
Set upper limit of 10,000,000 for get_pattern_for_max_n
mmichel11 Jan 9, 2025
3921505
General cleanup and renaming for consistency
mmichel11 Jan 9, 2025
6c8aa77
Explicitly list template types in specializations of __is_vectorizabl…
mmichel11 Jan 13, 2025
fcc1701
Remove unnecessary local variables
mmichel11 Jan 14, 2025
8eaf940
Remove unnecessary local variables in async and numeric headers
mmichel11 Jan 14, 2025
62a5ea4
Correct optimization in __reverse_functor and improve explanation
mmichel11 Jan 16, 2025
c8fa19c
Rename custom_brick to __custom_brick
mmichel11 Jan 16, 2025
15a7675
Rename __n to __full_range_size in vec utils and fix potential unused…
mmichel11 Jan 17, 2025
1b0658c
Remove unnecessary ternary operator and replace _Idx template with st…
mmichel11 Jan 17, 2025
5df35be
Add note to __reverse_copy, __rotate_copy, and minor cleanup
mmichel11 Jan 21, 2025
cf672c2
Switch runtime check to compile time check in __reverse_copy
mmichel11 Jan 21, 2025
2ef4adc
Update comment in __reverse_copy
mmichel11 Jan 21, 2025
350909d
Remove the usage of __lazy_ctor_storage from all vectorization paths
mmichel11 Jan 16, 2025
3af0a0b
Remove unneeded template
mmichel11 Jan 21, 2025
946a4ec
Remove __lazy_ctor_storage::__get_callable_deleter
mmichel11 Jan 21, 2025
c1c284f
Address review comments
mmichel11 Jan 22, 2025
f38907b
Cleanup some types
mmichel11 Jan 22, 2025
7bc541a
Use __pstl_assign instead of operator= and revert bad change
mmichel11 Jan 22, 2025
763149f
Avoid modulo in loop body of __rotate_copy
mmichel11 Jan 22, 2025
a490706
Make variables const where appropriate
mmichel11 Jan 22, 2025
14dd218
::std -> std changes, add missing include, and clang-format
mmichel11 Jan 23, 2025
030598e
Refactor __vector_path_impl of __brick_shift_left
mmichel11 Jan 24, 2025
fa656b5
Add TODO comment to unify vector and strided loop utils
mmichel11 Jan 24, 2025
a0e95e6
Add a vectorized path for __brick_shift_left
mmichel11 Jan 24, 2025
60a6e23
Update comment in __brick_shift_left
mmichel11 Jan 24, 2025
356505c
Clarify comment
mmichel11 Jan 24, 2025
3b1cd71
Disambiguate kernel names when testing with multiple types and cleanup
mmichel11 Jan 24, 2025
d2ec643
Add comments on access modes and link to GitHub issue
mmichel11 Jan 24, 2025
51d678b
Update __pattern_unique comment
mmichel11 Jan 24, 2025
6f4d701
Limit maximum work-group size in pattern_for_max_n utility
mmichel11 Jan 26, 2025
eaa72a0
Directly call __deferrable_wait() and remove unused variable
mmichel11 Jan 27, 2025
21a53d0
Additional specializations for __is_vectorizable_range
mmichel11 Jan 27, 2025
634f771
Add nanorange checks
mmichel11 Jan 27, 2025
5232c6a
::std -> std
mmichel11 Jan 27, 2025
bff5c15
Fix typo in __is_vectorizable_range
mmichel11 Jan 28, 2025
f19d3b4
Use std::is_pointer in __is_vectorizable_range specialization
mmichel11 Jan 28, 2025
9ae1225
Introduce aliases to simplify vector functor calls and reduce creatio…
mmichel11 Jan 29, 2025
303d4f4
Use ternary in walk_adjacent_difference
mmichel11 Jan 29, 2025
bb950f4
Resolve ambiguous __is_vectorizable_range specializations
mmichel11 Jan 31, 2025
8568fcc
Address review comments
mmichel11 Jan 31, 2025
@@ -27,7 +27,7 @@ namespace oneapi::dpl::experimental::kt::gpu::esimd::__impl
{

//------------------------------------------------------------------------
// Please see the comment for __parallel_for_submitter for optional kernel name explanation
// Please see the comment above __parallel_for_small_submitter for optional kernel name explanation
//------------------------------------------------------------------------

template <bool __is_ascending, ::std::uint8_t __radix_bits, ::std::uint16_t __data_per_work_item,
18 changes: 12 additions & 6 deletions include/oneapi/dpl/internal/async_impl/async_impl_hetero.h
@@ -44,7 +44,9 @@ __pattern_walk1_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For

auto __future_obj = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view());
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__buf.all_view())>{
Contributor:

  1. walk1_vector_or_scalar is such a long name...
    Maybe keep walk_n?
    As far as I understand, it is just a renaming, not a second "walker"?
  2. Probably it makes sense to add a constructor for automatic type deduction?
    (for example, see https://godbolt.org/z/z3Yfhbo5W )

Contributor Author:

  1. walk_n is still used in some other places and is currently more generic, so we do need a separate name. I do think something that reflects the different vector / scalar paths is best.
  2. I am not as familiar with CTAD, but my understanding is that all template types must be deduced from the constructor. The problem with this is that _Ranges... is only passed through the class template to establish tuning parameters, so it cannot be deduced from a constructor and must be explicitly specified by the caller. Since there is no partial CTAD as far as I am aware, I do not think it is possible to implement unless we pass some unused ranges through the constructor to deduce types. Is this correct, and if so, do you think it is still the best approach?

Contributor:

One thing you could consider is a "make" function which provides the partial deduction for you. You can provide the _ExecutionPolicy and _Ranges... types explicitly as template args to the make function, and _Function could be deduced.
I personally think it's a bit overkill for little benefit, but it's an option.
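For illustration, a minimal sketch of such a make function, assuming a hypothetical __walk1_vector_or_scalar functor whose _Ranges... exist only to tune compile-time parameters (the names here are illustrative, not the actual oneDPL API):

// Hypothetical functor and make function; not the real oneDPL types.
#include <cstddef>
#include <utility>

template <typename _ExecutionPolicy, typename _Function, typename... _Ranges>
struct __walk1_vector_or_scalar
{
    _Function __f;
    std::size_t __n;
};

// _ExecutionPolicy and _Ranges... are supplied explicitly; _Function is deduced
// from the function argument, which provides the "partial deduction" discussed above.
template <typename _ExecutionPolicy, typename... _Ranges, typename _Function>
auto
__make_walk1(_Function __f, std::size_t __n)
{
    return __walk1_vector_or_scalar<_ExecutionPolicy, _Function, _Ranges...>{std::move(__f), __n};
}

// Usage sketch:
//   auto __brick = __make_walk1<_ExecutionPolicy, decltype(__buf.all_view())>(__f, static_cast<std::size_t>(__n));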

Contributor Author:

Thanks for pointing out how this can be done. I agree that it does not save much, and just listing the template types is the most straightforward approach in this case.

Contributor (@MikeDvorskiy, Jan 24, 2025):

Regarding

  2. The problem with this is that _Ranges... is only passed through the class template to establish tuning parameters

Basically, walk_vector_or_scalar_base is not really a base class. It just calculates 3 compile-time constants based on the input Ranges... types.
There is nothing to inherit: neither implementation nor API.

In other words, it is a kind of "Params" type. It can be defined on the fly wherever you need the mentioned 3 compile-time constants __can_vectorize, __preferred_vector_size and __preferred_iters_per_item:

Params<Range1>::__preferred_iters_per_item
or
Params<Range1, Range2>::__preferred_vector_size
or, in the general case with a parameter pack:
Params<Ranges...>::__preferred_vector_size
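To make the suggestion concrete, a minimal sketch of such a Params type is shown below; it assumes each range type exposes a value_type, and the heuristic values are placeholders, not the real oneDPL tuning logic:

// Hypothetical "Params" trait computing the three compile-time constants directly
// from the range types; placeholder heuristics for illustration only.
#include <cstdint>

template <typename... _Ranges>
struct __vector_or_scalar_params
{
    // Placeholder heuristic: vectorize only when every range's value type is small.
    constexpr static bool __can_vectorize = ((sizeof(typename _Ranges::value_type) <= 4) && ...);
    constexpr static std::uint8_t __preferred_vector_size = __can_vectorize ? 4 : 1;
    constexpr static std::uint8_t __preferred_iters_per_item = __can_vectorize ? 4 : 1;
};

// Queried on the fly at the call site, e.g.:
//   constexpr auto __vec = __vector_or_scalar_params<_Range1, _Range2>::__preferred_vector_size;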

Contributor:

I see your point, @MikeDvorskiy, for the majority of the usages of these compile-time constants: they could be calculated inline in the functions based on the input range types. However, there are a few external uses, as described in this comment from the parallel_for launch code. That usage requires the derived struct to contain the Range type information prior to the actual function calls, and it's easiest if you can just query it like this.

You could instead have traits or helpers where you could pass range info when querying this stuff at the parallel_for launch level. I don't have a strong opinion between the two without having the other full implementation to compare to.

Contributor (@danhoeflinger, Jan 24, 2025):

To try to clarify further, it could maybe look like this:

auto [__idx, __stride, __is_full] = __stride_recommender(
    __item, __count, __iters_per_work_item, _Fp::__preferred_vector_size<_Ranges...>, __work_group_size);

Contributor Author:

I see the initial point now. My understanding is that this is implementable through a variable template within the brick.

Something like:

template <typename _Rngs>
constexpr static std::uint8_t __preferred_vector_size = ...

There would be duplicated definitions of the variables in each brick, but we would not need to pass the ranges through the brick's template parameters. There are pros and cons with each approach and functionality / performance of each should be identical.

At this point in the review, I believe it is too late to make such a large design decision if we want to make it to the milestone. My suggestion is we defer this to an issue and address in the 2022.9.0 milestone.
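As a rough sketch of that variable-template alternative, assuming each range type exposes a value_type (names and heuristic values below are illustrative only):

// Hypothetical brick exposing the tuning constant as a member variable template,
// so _Ranges... no longer needs to flow through the class template parameters.
#include <cstddef>
#include <cstdint>

template <typename _ExecutionPolicy, typename _Function>
struct __walk1_brick
{
    _Function __f;
    std::size_t __n;

    // Placeholder heuristic; each brick would carry its own definition.
    template <typename... _Rngs>
    constexpr static std::uint8_t __preferred_vector_size =
        ((sizeof(typename _Rngs::value_type) <= 4) && ...) ? 4 : 1;
};

// Launch-side query, mirroring the __stride_recommender call above:
//   _Fp::template __preferred_vector_size<_Ranges...>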

Contributor:

Ok, I see.
Thank you for discussing this.

Contributor Author:

Issue has been created and attached to 2022.9.0: #2023

__f, static_cast<std::size_t>(__n)},
Contributor:

static_cast<std::size_t> looks suspicious... What is the reason for doing that?

Contributor Author (@mmichel11, Jan 22, 2025):

The reason here is that __n may be a signed difference type obtained from taking the difference of two iterators while the constructor accepts a std::size_t, so we see compilation errors without this cast.

If it's preferred, I can add a templated type for the size to the constructor, so we can avoid the need for this cast.
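For context, a minimal sketch of the mismatch, assuming a brick whose constructor takes std::size_t (the brick name is hypothetical):

// Iterator difference types are signed, so brace-initializing the unsigned
// std::size_t parameter from them is a narrowing conversion and fails to compile.
#include <cstddef>
#include <vector>

struct __walk_brick
{
    std::size_t __n;
    explicit __walk_brick(std::size_t __n) : __n(__n) {}
};

int
main()
{
    std::vector<int> __v(10);
    auto __n = __v.end() - __v.begin();               // std::ptrdiff_t (signed)
    // __walk_brick __bad{__n};                       // error: narrowing conversion
    __walk_brick __ok{static_cast<std::size_t>(__n)}; // the cast resolves it
    (void)__ok;
}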

__n, __buf.all_view());
return __future_obj;
}

@@ -67,7 +69,9 @@ __pattern_walk2_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view());
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view())>{__f, static_cast<std::size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view());

return __future.__make_future(__first2 + __n);
}
@@ -91,10 +95,12 @@ __pattern_walk3_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator3>();
auto __buf3 = __keep3(__first3, __first3 + __n);

auto __future =
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n,
__buf1.all_view(), __buf2.all_view(), __buf3.all_view());
auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
decltype(__buf2.all_view()), decltype(__buf3.all_view())>{
__f, static_cast<size_t>(__n)},
__n, __buf1.all_view(), __buf2.all_view(), __buf3.all_view());

return __future.__make_future(__first3 + __n);
}
37 changes: 26 additions & 11 deletions include/oneapi/dpl/internal/binary_search_impl.h
@@ -37,13 +37,19 @@ enum class search_algorithm
binary_search
};

template <typename Comp, typename T, search_algorithm func>
struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct __custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
{
Comp comp;
T size;
bool use_32bit_indexing;

__custom_brick(Comp comp, T size, bool use_32bit_indexing)
: comp(std::move(comp)), size(size), use_32bit_indexing(use_32bit_indexing)
{
}

template <typename _Size, typename _ItemId, typename _Acc>
void
search_impl(_ItemId idx, _Acc acc) const
@@ -68,17 +74,23 @@ struct custom_brick
get<2>(acc[idx]) = (value != end_orig) && (get<1>(acc[idx]) == get<0>(acc[value]));
}
}

template <typename _ItemId, typename _Acc>
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_ItemId idx, _Acc acc) const
__scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const
Contributor:

I believe we may be able to improve this code by replacing the run-time bool value use_32bit_indexing with a compile-time indexing-type specialization.
I found only 3 places with the code

const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();

so it's not a big deal to add an if statement outside and call __parallel_for for both branches with the different index types. That way, the condition check is excluded from the brick entirely.
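A minimal sketch of that compile-time dispatch, assuming the index type becomes a template parameter of the brick (the names and the simplified search body are illustrative, not the actual patch):

#include <cstdint>
#include <limits>

// The index type is now a template parameter, so the per-element runtime branch
// on use_32bit_indexing disappears from the brick body.
template <typename _IndexT, typename Comp, typename T>
struct __custom_brick_sketch
{
    Comp comp;
    T size;

    template <typename _ItemId, typename _Acc>
    void
    operator()(_ItemId idx, _Acc acc) const
    {
        // lower_bound-style search using _IndexT for all index arithmetic,
        // standing in for the real search_impl<_IndexT>(idx, acc); the result
        // would then be written back as in the original brick.
        _IndexT first = 0, last = static_cast<_IndexT>(size);
        while (first < last)
        {
            _IndexT mid = first + (last - first) / 2;
            if (comp(acc[mid], acc[idx]))
                first = mid + 1;
            else
                last = mid;
        }
    }
};

// Launch side: dispatch once per call, at the cost of compiling two kernels.
//   if (size <= std::numeric_limits<std::uint32_t>::max())
//       __bknd::__parallel_for(..., __custom_brick_sketch<std::uint32_t, Comp, T>{comp, size}, ...);
//   else
//       __bknd::__parallel_for(..., __custom_brick_sketch<std::uint64_t, Comp, T>{comp, size}, ...);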

Contributor Author:

As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel, whereas your suggestion may improve kernel performance at the cost of increased JIT overhead.

Contributor Author:

I re-checked performance here, and the results are similar to my initial experimentation. For small problem sizes (e.g. <16k elements) there is a noticeable performance benefit from adding the second kernel. It only saves a few microseconds (e.g. ~10 us with 2 kernels vs. ~13 us with one kernel and runtime dispatch). I would consider this case less important, however, since I do not expect binary search to be used with so few search keys.

For larger inputs, the effect of the runtime dispatch is not measurable. I suspect this is because __custom_brick can be quite heavy for a brick, as it performs multiple memory accesses, making the impact of the if ... else dispatch less noticeable. For this reason, I suggest we keep the current approach, which compiles faster.

We can discuss more if needed, but I suggest it be separate from this PR as we do not touch the implementation details of binary_search here apart from adjusting the brick to work with the new design.

{
if (use_32bit_indexing)
search_impl<std::uint32_t>(idx, acc);
else
search_impl<std::uint64_t>(idx, acc);
}
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_IsFull __is_full, _ItemId idx, _Acc acc) const
{
__scalar_path_impl(__is_full, idx, acc);
}
};
#endif

template <class _Tag, typename Policy, typename InputIterator1, typename InputIterator2, typename OutputIterator,
typename StrictWeakOrdering>
@@ -155,7 +167,8 @@ lower_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, ::std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::lower_bound>{comp, size, use_32bit_indexing},
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::lower_bound>{
Contributor:

I would suggest using automatic type deduction via a constructor of __custom_brick.

comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
@@ -187,7 +200,8 @@ upper_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::upper_bound>{comp, size, use_32bit_indexing},
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::upper_bound>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
@@ -217,10 +231,11 @@ binary_search_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, Input
auto result_buf = keep_result(result, result + value_size);
auto zip_vw = make_zip_view(input_buf.all_view(), value_buf.all_view(), result_buf.all_view());
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
__custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
}