Situation
CUB kernel templates that implement algorithms have a template parameter, usually named OffsetT, which is used to index into the iterators.
While the kernels, and hence the implementation, are templatized on OffsetT, the API entry points of many CUB device-scope algorithms are hard-coded to use int as the offset type.
This is problematic: algorithms that today use int as the offset type are limited to processing at most INT_MAX items. As device memory capacity keeps increasing, our users run into this limitation more and more frequently.
Current state of offset type handling in CUB
CUB algorithms handle the offset types with which kernel templates are instantiated in a wide range of different ways:
NumItemsT/OffsetT: Some algorithms use a template parameter on the Device* interface for the num_items parameter and pass that template parameter, NumItemsT or OffsetT, on to the corresponding kernel template.
choose_offset_t: The aforementioned NumItemsT/OffsetT approach has the downside that users may unintentionally instantiate the kernels for all possible offset types, like int32_t, uint32_t, int64_t, and uint64_t, which drastically increases compilation time on the user side. To limit the number of kernel template instantiations, we have a choose_offset_t alias template that maps any type of up to four bytes to uint32_t and all larger types to uint64_t.
int32_t/uint32_t: These algorithms use a hard-coded offset type, usually int32_t, and always instantiate the kernel templates with that type.
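For illustration, the selection performed by such an alias template can be sketched as follows. This is a hypothetical reimplementation for illustration only, not CUB's actual code:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical sketch of a choose_offset_t-style alias template: integral
// types of up to four bytes map to uint32_t, all larger types to uint64_t.
template <typename NumItemsT>
struct choose_offset
{
  static_assert(std::is_integral<NumItemsT>::value,
                "NumItemsT must be an integral type");
  using type = typename std::conditional<sizeof(NumItemsT) <= 4,
                                         std::uint32_t,
                                         std::uint64_t>::type;
};

template <typename NumItemsT>
using choose_offset_t = typename choose_offset<NumItemsT>::type;

// All 32-bit-or-narrower types collapse onto uint32_t, so at most two
// kernel instantiations exist per algorithm.
static_assert(std::is_same<choose_offset_t<std::int16_t>, std::uint32_t>::value, "");
static_assert(std::is_same<choose_offset_t<std::int32_t>, std::uint32_t>::value, "");
static_assert(std::is_same<choose_offset_t<std::int64_t>, std::uint64_t>::value, "");
```
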
Overview of CUB algorithms and the current state of their offset type handling (as of 15 Aug 2024); please refer to #50 (comment) as the source of truth:
legend for offset type: ✅ considered done | 🟡 considered lower priority | 🟠 considered higher priority, as it prevents usage for larger-than-INT_MAX number of items | ⏳ in progress
legend for testing columns: ✅ considered done | 🟡 to be done | 🟠 needs to support wider offset types first

| Algorithm | Offset type handling |
| --- | --- |
| device_adjacent_difference.cuh | choose_offset_t |
| device_copy.cuh | num_ranges: uint32_t 🟡; buffer sizes: iterator_traits<SizeIteratorT>::value_type |
| device_for.cuh | NumItemsT: ForEachN, ForEachCopyN, Bulk; difference_type: ForEach, ForEachCopy |
| device_histogram.cuh | OffsetT otherwise |
| device_memcpy.cuh | num_ranges: uint32_t 🟡; buffer sizes: iterator_traits<SizeIteratorT>::value_type |
| device_merge_sort.cuh | NumItemsT |
| device_partition.cuh | int |
| device_radix_sort.cuh | choose_offset_t |
| device_reduce.cuh | choose_offset_t: Reduce, Sum, Min, Max, ReduceByKey, TransformReduce; int: ArgMin, ArgMax |
| device_run_length_encode.cuh | int |
| device_scan.cuh | int |
| device_segmented_radix_sort.cuh | num_items & num_segments: int |
| device_segmented_reduce.cuh | common_iterator_value_t({Begin,End}OffsetIteratorT): Reduce, Sum, Min, Max; int: ArgMin, ArgMax; num_segments: int |
| device_segmented_sort.cuh | num_items & num_segments: int |
| device_select.cuh | choose_offset_t: UniqueByKey 🟠; int: Flagged, If, Unique |
| device_spmv.cuh | int |

Summary:
u32/u64: 4 algorithm groups
i32: 9 algorithm groups
OffsetT: 5 algorithm groups
Dynamic: 1 algorithm group

Goals
Primary goals:
Enable users to run algorithms on more than INT_MAX number of items on all algorithms that currently do not support that
Secondary goals:
Unify offset types across CUB algorithms to reduce mental load during maintenance and give users a common offset type passed to their potentially custom-built fancy iterators
Potentially improve performance of existing algorithms, in particular when instantiated with a large number of items, e.g., by making use of bit-packed tile states, etc.
Considerations:
Compile time & binary size
Performance
Immediate performance impact without re-tuning algorithms
Long-term achievable performance with re-tuning algorithms
Maintenance cost:
Testing effort
Tuning effort
One-time effort (e.g., to mitigate performance downside after changing the offset types)
Options
This section presents the options that we consider to support a large number of items for the algorithms that currently do not support 64-bit offset types.
Using the user-provided offset type
This option will make the offset type a template parameter of the Device* interface and pass that template parameter through to the kernel template.
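As a sketch, the pass-through might look like this. All names here are hypothetical (device_sum and dispatch_sum stand in for a Device* entry point and its Dispatch*/kernel layer; none of this is CUB's actual code):

```cpp
#include <cstdint>
#include <iterator>

// Hypothetical dispatch layer: indexes with exactly the offset type it was
// instantiated with, as a kernel template would.
template <typename InputIt, typename OffsetT>
auto dispatch_sum(InputIt in, OffsetT num_items)
{
  typename std::iterator_traits<InputIt>::value_type sum{};
  for (OffsetT i = 0; i < num_items; ++i)
    sum += in[i];
  return sum;
}

// Hypothetical Device* entry point: NumItemsT is forwarded unchanged, so
// each distinct user-provided type yields a distinct instantiation.
template <typename InputIt, typename NumItemsT>
auto device_sum(InputIt in, NumItemsT num_items)
{
  return dispatch_sum<InputIt, NumItemsT>(in, num_items);
}
```
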
Pros
There’s no performance change when the user provides the offset type that matches the previously hard-coded offset type of that algorithm.
Cons
Risk for users to unintentionally instantiate kernel templates with all the different offset types
Some algorithms may exhibit quite poor performance for some offset types
We need to test for all offset types
We need to tune for all offset types
Using i32 & i64 offset types
This option will simply instantiate the existing kernel templates with either int32_t or int64_t offset types, depending on the user-provided offset type. The mapping from user-provided NumItemsT to the offset type used in the kernel template looks as follows:
| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | int32_t |
| int32_t | int32_t |
| uint32_t | int64_t |
| int64_t | int64_t |
| uint64_t | int64_t |
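This mapping can be expressed as an alias template. The following is a hypothetical sketch with made-up names, not actual CUB code:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical sketch of the i32/i64 mapping above: smaller-than-32-bit
// types and int32_t map to int32_t; uint32_t and all 64-bit types map to
// int64_t (since uint32_t's full range does not fit into int32_t).
template <typename NumItemsT>
using promote_offset_t = typename std::conditional<
  (sizeof(NumItemsT) < 4) ||
    (sizeof(NumItemsT) == 4 && std::is_signed<NumItemsT>::value),
  std::int32_t,
  std::int64_t>::type;

static_assert(std::is_same<promote_offset_t<std::uint16_t>, std::int32_t>::value, "");
static_assert(std::is_same<promote_offset_t<std::int32_t>, std::int32_t>::value, "");
static_assert(std::is_same<promote_offset_t<std::uint32_t>, std::int64_t>::value, "");
static_assert(std::is_same<promote_offset_t<std::int64_t>, std::int64_t>::value, "");
```
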
Pros
No re-tuning needed for algorithms tuned for i32, as long as the user-provided offset type is i32 or of a smaller width
Cons
If the user provides an offset type of u32, we need to dispatch to a kernel template that uses a 64-bit offset type, which usually causes more severe performance degradations than when using an unsigned 32-bit offset type.
We'll probably have to return a CUDA error if, at run time, the user provides more than INT64_MAX items
Using u32 & u64 offset types
This option will simply instantiate the existing kernel templates with either uint32_t or uint64_t offset types, depending on the user-provided offset type. The mapping from user-provided NumItemsT to the offset type used in the kernel template looks as follows:
| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | uint32_t |
| int32_t | uint32_t |
| uint32_t | uint32_t |
| int64_t | uint64_t |
| uint64_t | uint64_t |
Pros
When the user provides a 32-bit offset type, we keep the offset type to 32 bits.
Cons
For algorithms previously using a hard-coded offset type of int32_t, changing to uint32_t may degrade performance.
Using bit-packed tile state for algorithms carrying offset type in decoupled look-back
DevicePartition, DeviceSelect, DeviceReduceByKey, and DeviceRunLengthEncode carry the offset type in the tile value of the decoupled look-back, requiring 128-bit wide memory transactions for a tile that comprises a value and a status. A tile can be in one of four different statuses, which requires two bits to represent. We can allocate two bits from the offset (i.e., the tile’s value) and keep the memory transactions at 64 bits.
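The packing described above can be sketched as follows. These are hypothetical helper functions for illustration, not CUB's actual TileState implementation:

```cpp
#include <cstdint>

// Hypothetical sketch: pack a 2-bit tile status together with a 62-bit
// offset value into one 64-bit word, so the decoupled look-back only needs
// 64-bit memory transactions instead of 128-bit ones.
constexpr std::uint64_t status_bits = 2;
constexpr std::uint64_t value_bits  = 64 - status_bits;
constexpr std::uint64_t value_mask  = (std::uint64_t{1} << value_bits) - 1;

// value must fit into 62 bits, i.e., be at most 2^62 - 1
constexpr std::uint64_t pack(std::uint64_t status, std::uint64_t value)
{
  return (status << value_bits) | (value & value_mask);
}

constexpr std::uint64_t unpack_status(std::uint64_t packed)
{
  return packed >> value_bits;
}

constexpr std::uint64_t unpack_value(std::uint64_t packed)
{
  return packed & value_mask;
}
```
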
Pros
This may generally be a more efficient approach than using off-the-shelf TileStates for the decoupled look-back of these algorithms
Cons
We need to return a CUDA error at run time, if the number of items exceeds (2^62)-1
Streaming approach using i32 offsets
We can repeatedly invoke the algorithms on partitions of up to INT_MAX items at a time. In simple terms, with each invocation we process one partition and offset the input iterators by the previous partitions’ accumulated size.
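The driver loop for this streaming approach might be sketched as follows. All names are hypothetical; process_partition stands in for a single algorithm invocation on one partition:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the streaming driver loop: a call on num_items
// items is split into partitions of at most INT_MAX items each, and the
// input iterator is advanced by the accumulated size of the previous
// partitions, so each partition can be indexed with 32-bit offsets.
template <typename InputIt, typename F>
void streaming_invoke(InputIt input, std::uint64_t num_items, F process_partition)
{
  constexpr std::uint64_t max_partition_size = 2147483647; // INT_MAX
  for (std::uint64_t offset = 0; offset < num_items;)
  {
    const std::int32_t partition_size = static_cast<std::int32_t>(
      std::min(max_partition_size, num_items - offset));
    process_partition(input + offset, partition_size);
    offset += partition_size;
  }
}
```
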
Variants
Partitioning point:
Specialization on the Dispatch* layer
Specialization on the Device* interface
Implementation:
Only specialize kernels for uint32_t and larger user-provided offset types. For int32_t, the SASS would remain the same in this case.
One and the same kernel for any offset type: the SASS for int32_t would differ from what it was before the change, but we have one and the same kernel template for any user-provided offset type.
| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | int32_t; (a) unchanged code path, (b) streaming implementation |
| int32_t | int32_t; (a) unchanged code path, (b) streaming implementation |
| uint32_t | int32_t |
| int64_t | int32_t |
| uint64_t | int32_t |
Pros
Supports more than 2^32-1 items while using a 32-bit offset type.
Possibly, just one kernel for all user-provided offset types.
Lower memory requirements (no need to allocate memory for all tile states but only for the tile states of a single partition)
May serve as a stepping stone for exposing a streaming interface that allows users to:
support pipelining (interleaving host-to-device memcpies with algorithm execution)
lower memory requirements (only need to fit one partition of input at a time)
Cons
User invokes one algorithm but may observe repeated kernel invocations
If implemented on the Dispatch layer, this causes a different behavior for users previously instantiating the Dispatch template class with larger offset types.
We may run into load imbalance within one partition if, say, one thread block has a very complex tile of items. That thread block may still be working away as the last of its partition, yet the kernel for the subsequent partition cannot be launched. This may be less of a concern for something like DeviceSelect but more of a problem for batched and segmented algorithms.