Re-implement SYCL backend parallel_for to improve bandwidth utilization (#1976)
```diff
@@ -44,7 +44,9 @@ __pattern_walk1_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
     auto __future_obj = oneapi::dpl::__par_backend_hetero::__parallel_for(
         _BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
-        unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view());
+        unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__buf.all_view())>{
+            __f, static_cast<std::size_t>(__n)},
+        __n, __buf.all_view());
     return __future_obj;
 }
```

Review comment: If it's preferred, I can add a templated type for the size to the constructor, so we can avoid the need for this cast.
```diff
@@ -67,7 +69,9 @@ __pattern_walk2_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
     auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
         _BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
-        unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view());
+        unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
+                                                decltype(__buf2.all_view())>{__f, static_cast<std::size_t>(__n)},
+        __n, __buf1.all_view(), __buf2.all_view());
 
     return __future.__make_future(__first2 + __n);
 }
```
```diff
@@ -91,10 +95,12 @@ __pattern_walk3_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
         oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator3>();
     auto __buf3 = __keep3(__first3, __first3 + __n);
 
-    auto __future =
-        oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
-                                                          unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n,
-                                                          __buf1.all_view(), __buf2.all_view(), __buf3.all_view());
+    auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
+        _BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
+        unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__buf1.all_view()),
+                                                decltype(__buf2.all_view()), decltype(__buf3.all_view())>{
+            __f, static_cast<size_t>(__n)},
+        __n, __buf1.all_view(), __buf2.all_view(), __buf3.all_view());
 
     return __future.__make_future(__first3 + __n);
 }
```
```diff
@@ -37,13 +37,19 @@ enum class search_algorithm
     binary_search
 };
 
-template <typename Comp, typename T, search_algorithm func>
-struct custom_brick
+#if _ONEDPL_BACKEND_SYCL
+template <typename Comp, typename T, typename _Range, search_algorithm func>
+struct __custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
 {
     Comp comp;
     T size;
     bool use_32bit_indexing;
 
+    __custom_brick(Comp comp, T size, bool use_32bit_indexing)
+        : comp(std::move(comp)), size(size), use_32bit_indexing(use_32bit_indexing)
+    {
+    }
+
     template <typename _Size, typename _ItemId, typename _Acc>
     void
     search_impl(_ItemId idx, _Acc acc) const
```
```diff
@@ -68,17 +74,23 @@ struct custom_brick
             get<2>(acc[idx]) = (value != end_orig) && (get<1>(acc[idx]) == get<0>(acc[value]));
         }
     }
 
-    template <typename _ItemId, typename _Acc>
+    template <typename _IsFull, typename _ItemId, typename _Acc>
     void
-    operator()(_ItemId idx, _Acc acc) const
+    __scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const
     {
         if (use_32bit_indexing)
             search_impl<std::uint32_t>(idx, acc);
         else
             search_impl<std::uint64_t>(idx, acc);
     }
+    template <typename _IsFull, typename _ItemId, typename _Acc>
+    void
+    operator()(_IsFull __is_full, _ItemId idx, _Acc acc) const
+    {
+        __scalar_path_impl(__is_full, idx, acc);
+    }
 };
+#endif
 
 template <class _Tag, typename Policy, typename InputIterator1, typename InputIterator2, typename OutputIterator,
           typename StrictWeakOrdering>
```

Review comment: I believe we may try to improve this code by replacing the run-time check `const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();` with compile-time dispatch, so it's not a big deal to add a second kernel.

Reply: As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel, whereas your suggestion may improve kernel performance at the cost of increased JIT overhead.

Reply: I re-checked performance here, and the results are similar to my initial experimentation. For small problem sizes (e.g. fewer than 16k elements) there is a noticeable performance benefit to adding the second kernel, but it only saves a few microseconds (e.g. ~10 us with two kernels vs. ~13 us with one kernel and runtime dispatch). I consider this case less important, however, since I do not expect binary search to be used with so few search keys. For larger inputs, the effect of the runtime dispatch is not measurable. We can discuss more if needed, but I suggest handling it separately from this PR, since we do not touch those implementation details here.
```diff
@@ -155,7 +167,8 @@ lower_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
     const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
     __bknd::__parallel_for(
         _BackendTag{}, ::std::forward<decltype(policy)>(policy),
-        custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::lower_bound>{comp, size, use_32bit_indexing},
+        __custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::lower_bound>{
+            comp, size, use_32bit_indexing},
         value_size, zip_vw)
         .__deferrable_wait();
     return result + value_size;
```

Review comment: I would suggest using auto type deduction via the constructor of `__custom_brick`.
```diff
@@ -187,7 +200,8 @@ upper_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
     const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
     __bknd::__parallel_for(
         _BackendTag{}, std::forward<decltype(policy)>(policy),
-        custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::upper_bound>{comp, size, use_32bit_indexing},
+        __custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::upper_bound>{
+            comp, size, use_32bit_indexing},
         value_size, zip_vw)
         .__deferrable_wait();
     return result + value_size;
```
```diff
@@ -217,10 +231,11 @@ binary_search_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, Input
     auto result_buf = keep_result(result, result + value_size);
     auto zip_vw = make_zip_view(input_buf.all_view(), value_buf.all_view(), result_buf.all_view());
     const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
-    __bknd::__parallel_for(_BackendTag{}, std::forward<decltype(policy)>(policy),
-                           custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::binary_search>{
-                               comp, size, use_32bit_indexing},
-                           value_size, zip_vw)
+    __bknd::__parallel_for(
+        _BackendTag{}, std::forward<decltype(policy)>(policy),
+        __custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::binary_search>{
+            comp, size, use_32bit_indexing},
+        value_size, zip_vw)
         .__deferrable_wait();
     return result + value_size;
 }
```
Review comment: `walk1_vector_or_scalar` is such a long name... Maybe keep `walk_n`? As far as I understand, this is just a renaming, not a second "walker"? (For example, see https://godbolt.org/z/z3Yfhbo5W.)
Reply: `walk_n` is still used in some other places and is currently more generic, so we do need a separate name, and I think something that reflects the different vector / scalar paths is best. `_Ranges...` is only passed through the class template to establish tuning parameters, so it cannot be deduced from a constructor and must be explicitly specified by the caller. Since there is no partial CTAD as far as I am aware, I do not think deduction is possible unless we pass some unused ranges through the constructor just to deduce types. Is this correct, and if so, do you think it is still the best approach?
is only passed through the class template to establish tuning parameters, so it cannot be deduced from a constructor and must be explicitly specified by the caller. Since there is no partial CTAD as far as I am aware, then I do not think it is possible to implement unless we pass some unused ranges through the constructor to deduce types. Is this correct and if so, do you think it is still the best approach?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing you could consider is a "make" function which provides the partial deduction for you. You can provide
_ExecutionPolicy
and_Ranges...
types explicitly as template args to a make function, and_Function
could be deduced.I personally think its a bit overkill for little benefit, but its an option.
Reply: Thanks for pointing out how this can be done. I agree that it does not save much, and just listing the template types is the most straightforward option in this case.
Reply: Regarding `walk_vector_or_scalar_base`: basically, it is not really a base class. It exists only to calculate three compile-time constants based on the input `Ranges...` types. There is nothing to inherit, neither implementation nor API. In other words, it is a kind of "Params" type. It could be defined on the fly wherever you need the three compile-time constants `__can_vectorize`, `__preferred_vector_size`, and `__preferred_iters_per_item`: `Params<Range1>::__preferred_iters_per_item`, or `Params<Range1, Range2>::__preferred_vector_size`, or, in the general case with a parameter pack, `Params<Ranges...>::__preferred_vector_size`.
Reply: I see your point, @MikeDvorskiy, for the majority of the usage of these compile-time constants; they could be calculated inline in the functions based on the input range types. However, there are a few exterior public uses, as described in this comment from the parallel_for launch code. That usage requires the derived struct to carry the `Range` type information prior to the actual function calls, and it's easiest if you can just query it like this. You could instead have traits or helpers that take range info when querying this at the parallel_for launch level. I don't have a strong opinion between the two without a full implementation of the alternative to compare against.
Reply: To try to clarify further, it could maybe look like this:
Reply: I see the initial point now. My understanding is that this is implementable through a variable template within the brick. Something like:

There would be duplicated definitions of the variables in each brick, but we would not need to pass the ranges through the brick's template parameters. There are pros and cons with each approach, and the functionality / performance of each should be identical.
Reply: At this point in the review, I believe it is too late to make such a large design decision if we want to make it to the milestone. My suggestion is that we defer this to an issue and address it in the `2022.9.0` milestone.
Reply: OK, I see. Thank you for discussing that.
Reply: An issue has been created and attached to `2022.9.0`: #2023