Describe potential solutions for algorithm implementations and their pros/cons #1771

elstehle opened this issue May 23, 2024 · 0 comments

elstehle commented May 23, 2024
Situation

CUB kernel templates that implement algorithms have a template parameter, usually named OffsetT, which is used to index into the iterators.

[Figure: kernel interface]

[Figure: kernel implementation]

While the kernels and, hence, the implementations are templatized on an OffsetT, the API entry points of many CUB device-scope algorithms are hard-coded to use int as the offset type.

[Figure: device interface]

This is problematic because algorithms that today use int as the offset type are limited to processing at most INT_MAX items. As device memory capacity keeps increasing, our users run into this limitation more and more frequently.

Current state of offset type handling in CUB

The CUB algorithms use a wide range of different approaches to handle the offset types with which kernel templates are instantiated:

  • NumItemsT/OffsetT: Some algorithms expose a template parameter for the num_items parameter on the Device* interface and pass that template parameter, NumItemsT or OffsetT, through to the corresponding kernel template.
  • choose_offset_t: The aforementioned NumItemsT/OffsetT approach has the downside that users may unknowingly instantiate the kernels for all possible offset types, like int32_t, uint32_t, int64_t, and uint64_t, which drastically increases compilation time on the user side. To limit the number of kernel template instantiations, we have a choose_offset_t alias template that maps any type of up to four bytes to uint32_t and all larger types to uint64_t (see the sketch below this list).
  • int32_t/uint32_t: These algorithms use a hard-coded offset type, usually int32_t, and always instantiate the kernel templates with that type.
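
For illustration, a minimal sketch of such an alias template, assuming plain standard C++ (the actual implementation lives in CUB's detail headers and handles more cases):

```cpp
#include <cstdint>
#include <type_traits>

// Minimal sketch of a choose_offset_t-style alias template: any integral
// type of up to four bytes maps to uint32_t, and all larger types map to
// uint64_t, so kernels are only ever instantiated for two offset types.
template <typename NumItemsT>
struct choose_offset
{
  static_assert(std::is_integral<NumItemsT>::value,
                "NumItemsT must be an integral type");
  using type = typename std::conditional<(sizeof(NumItemsT) <= 4),
                                         std::uint32_t,
                                         std::uint64_t>::type;
};

template <typename NumItemsT>
using choose_offset_t = typename choose_offset<NumItemsT>::type;
```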

Overview of CUB algorithms and their current state of offset type handling (as of 15 Aug 2024); please refer to #50 (comment) as the source of truth:

legend for offset type: ✅ considered done | 🟡 considered lower priority | 🟠 considered higher priority, as it prevents usage with more than INT_MAX items | ⏳ in progress

legend for testing columns: ✅ considered done | 🟡 to be done | 🟠 needs to support wider offset types first

| algorithm | offset type | tests larger than INT_MAX | tests close to [U]INT_MAX |
| --- | --- | --- | --- |
| device_adjacent_difference.cuh | choose_offset_t | ✅ 2^33, sanity check, iterators | 🟡 |
| device_copy.cuh | 🟡 num_ranges: uint32_t; 🟡 buffer sizes: iterator_traits<SizeIteratorT>::value_type | 🟡 | 🟡 |
| device_for.cuh | 🟡 NumItemsT: ForEachN, ForEachCopyN, Bulk; difference_type: ForEach, ForEachCopy | 🟡 | 🟡 |
| device_histogram.cuh | 🟡 dynamic dispatch: int for (num_rows * row_stride_bytes) < INT_MAX; OffsetT otherwise | 🟡 | 🟡 |
| device_memcpy.cuh | 🟡 num_ranges: uint32_t; 🟡 buffer sizes: iterator_traits<SizeIteratorT>::value_type | 🟡 | 🟡 |
| device_merge_sort.cuh | 🟡 NumItemsT | ✅ extensive check | ✅ extensive check |
| device_partition.cuh | 🟠 int | 🟠 | 🟠 |
| device_radix_sort.cuh | choose_offset_t | ✅ extensive check | ✅ extensive check |
| device_reduce.cuh | choose_offset_t: Reduce, Sum, Min, Max, ReduceByKey, TransformReduce; ⚠️ (note) int: ArgMin, ArgMax | ✅ sanity, 2^{30,31,33} | ✅ sanity, 2^32-1 |
| device_run_length_encode.cuh | 🟠 int | 🟠 | 🟠 |
| device_scan.cuh | 🟠 int | 🟠 | 🟠 |
| device_segmented_radix_sort.cuh | 🟠 num_items & num_segments: int | 🟠 | 🟠 |
| device_segmented_reduce.cuh | 🟡 common_iterator_value_t({Begin,End}OffsetIteratorT): Reduce, Sum, Min, Max; ⚠️ (note) int: ArgMin, ArgMax; num_segments: int | ✅ sanity, rnd [2^31; 2^33] | 🟡 |
| device_segmented_sort.cuh | 🟠 num_items & num_segments: int | 🟠 | 🟠 |
| device_select.cuh | choose_offset_t: UniqueByKey; 🟠 int: Flagged, If, Unique | | |
| device_spmv.cuh | 🟠 int | 🟠 | 🟠 |

Summary:

  • u32/u64: 4 algorithm groups
  • i32: 9 algorithm groups
  • OffsetT: 5 algorithm groups
  • Dynamic: 1 algorithm group

Goals

Primary goals:

  • Enable users to process more than INT_MAX items with all algorithms that currently do not support that

Secondary goals:

  • Unify offset types across CUB algorithms to reduce mental load during maintenance and give users a common offset type passed to their potentially custom-built fancy iterators
  • Potentially improve performance of existing algorithms, in particular when instantiated with a large number of items, e.g., by making use of bit-packed tile states, etc.

Considerations:

  • Compile time & binary size
  • Performance
    • Immediate performance impact without re-tuning algorithms
    • Long-term achievable performance with re-tuning algorithms
  • Maintenance cost:
    • Testing effort
    • Tuning effort
  • One-time effort (e.g., to mitigate performance downside after changing the offset types)

Options

This section presents the options we are considering to support a large number of items in the algorithms that currently do not support 64-bit offset types.

Using the user-provided offset type

This option will make the offset type a template parameter of the Device* interface and pass that template parameter through to the kernel template.
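
As a hedged sketch (all names here are illustrative; real Device* entry points also take temporary storage, a stream, etc.), the pass-through could look like this:

```cpp
#include <cuda_runtime.h>

// Illustrative kernel template indexing its iterators with OffsetT.
template <typename InputIt, typename OutputIt, typename OffsetT>
__global__ void algorithm_kernel(InputIt in, OutputIt out, OffsetT num_items)
{
  // Grid-stride loop over all items, using OffsetT for indexing.
  for (OffsetT i = blockIdx.x * blockDim.x + threadIdx.x; i < num_items;
       i += static_cast<OffsetT>(gridDim.x) * blockDim.x)
  {
    out[i] = in[i];
  }
}

// The Device* entry point forwards the user-provided NumItemsT to the
// kernel template, so each distinct NumItemsT is a new instantiation.
template <typename InputIt, typename OutputIt, typename NumItemsT>
cudaError_t DeviceAlgorithm(InputIt in, OutputIt out, NumItemsT num_items)
{
  algorithm_kernel<InputIt, OutputIt, NumItemsT><<<256, 256>>>(in, out, num_items);
  return cudaGetLastError();
}
```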

Pros

  • There’s no performance change when the user provides the offset type that matches the previously hard-coded offset type of that algorithm.

Cons

  • Users risk unintentionally instantiating kernel templates with all the different offset types
  • Some offset types may exhibit quite poor performance
  • We need to test all offset types
  • We need to tune for all offset types

Using i32 & i64 offset types

This option will simply instantiate the existing kernel templates with either int32_t or int64_t offset types, depending on the user-provided offset type. The mapping from user-provided NumItemsT to the offset type used in the kernel template looks as follows:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | int32_t |
| int32_t | int32_t |
| uint32_t | int64_t |
| int64_t | int64_t |
| uint64_t | int64_t |
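
A hypothetical alias template implementing this mapping (not existing CUB code) might look like:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical alias for the i32/i64 option: use int32_t whenever every
// value of NumItemsT fits into int32_t, and int64_t otherwise. Note that
// uint32_t already promotes to int64_t, matching the table above.
template <typename NumItemsT>
using choose_signed_offset_t =
  typename std::conditional<(sizeof(NumItemsT) < 4) ||
                              (sizeof(NumItemsT) == 4 &&
                               std::is_signed<NumItemsT>::value),
                            std::int32_t,
                            std::int64_t>::type;
```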

Pros

  • No re-tuning is needed for algorithms tuned for i32, as long as the user-provided offset type is i32 or of a smaller width

Cons

  • If the user provides an offset type of u32, we need to dispatch to a kernel template that uses a 64-bit offset type, which usually causes a more severe performance degradation than an unsigned 32-bit offset type would.
  • We'll probably have to return a CUDA error if, at run time, the user passes more than INT64_MAX items (see the sketch below).
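
Such a run-time guard might look roughly like the following (the helper name and the exact error code are assumptions, not existing CUB behavior):

```cpp
#include <cstdint>
#include <limits>
#include <cuda_runtime.h>

// Hypothetical run-time check for the i32/i64 option: a uint64_t item
// count above INT64_MAX is not representable in the kernel's int64_t
// offset type, so the dispatch has to fail.
inline cudaError_t check_num_items(std::uint64_t num_items)
{
  if (num_items >
      static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max()))
  {
    return cudaErrorInvalidValue; // assumed choice of error code
  }
  return cudaSuccess;
}
```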

Using u32 & u64 offset types

This option will simply instantiate the existing kernel templates with either uint32_t or uint64_t offset types, depending on the user-provided offset type. The mapping from user-provided NumItemsT to the offset type used in the kernel template looks as follows:

| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | uint32_t |
| int32_t | uint32_t |
| uint32_t | uint32_t |
| int64_t | uint64_t |
| uint64_t | uint64_t |

Pros

  • When the user provides a 32-bit offset type, we keep the kernel's offset type at 32 bits.

Cons

  • For algorithms previously using a hard-coded offset type of int32_t, changing to uint32_t may degrade performance.

Using bit-packed tile state for algorithms carrying offset type in decoupled look-back

DevicePartition, DeviceSelect, DeviceReduceByKey, and DeviceRunLengthEncode carry the offset type in the tile value of the decoupled look-back, requiring 128-bit wide memory transactions for a tile that comprises a value and a status. A tile can have one of four different statuses, which requires two bits to represent. We can steal those two bits from the offset (i.e., the tile’s value) and keep the memory transactions at 64 bits.
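
A minimal sketch of the packing, assuming illustrative status names and layout (this is not CUB's actual ScanTileState):

```cpp
#include <cstdint>

// Two status bits are stolen from a 64-bit word so that a {status, value}
// pair fits into a single 64-bit memory transaction instead of a 128-bit
// one. The value (offset) is limited to 62 bits, i.e., at most 2^62 - 1.
enum class tile_status : std::uint8_t
{
  oob       = 0, // out-of-bounds tile
  invalid   = 1, // tile not yet ready
  partial   = 2, // tile aggregate available
  inclusive = 3  // inclusive prefix available
};

struct packed_tile_state
{
  static constexpr std::uint64_t value_mask = (std::uint64_t{1} << 62) - 1;

  std::uint64_t bits; // bits [63:62] = status, bits [61:0] = value

  static packed_tile_state pack(tile_status status, std::uint64_t value)
  {
    return {(static_cast<std::uint64_t>(status) << 62) | (value & value_mask)};
  }

  tile_status status() const { return static_cast<tile_status>(bits >> 62); }
  std::uint64_t value() const { return bits & value_mask; }
};
```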

Pros

  • This may generally be a more efficient approach than using off-the-shelf TileStates for the decoupled look-back of these algorithms

Cons

  • We need to return a CUDA error at run time if the number of items exceeds (2^62)-1

Streaming approach using i32 offsets

We can repeatedly invoke the algorithms on partitions of up to INT_MAX items at a time. In simple terms, each invocation processes one partition, and we offset the input iterators by the accumulated size of the previous partitions. The sketch below illustrates the idea.
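
A hedged sketch of such a dispatch loop (run_partition is a hypothetical stand-in for one 32-bit-offset invocation of the algorithm):

```cpp
#include <algorithm>
#include <climits>
#include <cstdint>
#include <cuda_runtime.h>

// Assumed hook: runs the algorithm on a single partition with i32 offsets.
template <typename InputIt, typename OutputIt>
cudaError_t run_partition(InputIt in, OutputIt out, std::int32_t num_items);

// Hypothetical streaming dispatch: each pass handles at most INT_MAX items
// using 32-bit offsets; the iterators are advanced by the accumulated size
// of the previously processed partitions.
template <typename InputIt, typename OutputIt>
cudaError_t streaming_dispatch(InputIt in, OutputIt out, std::uint64_t num_items)
{
  std::uint64_t processed = 0;
  while (processed < num_items)
  {
    const auto partition_size = static_cast<std::int32_t>(
      std::min<std::uint64_t>(num_items - processed, INT_MAX));
    const cudaError_t error =
      run_partition(in + processed, out + processed, partition_size);
    if (error != cudaSuccess)
    {
      return error;
    }
    processed += partition_size;
  }
  return cudaSuccess;
}
```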

Variants

Partitioning point:

  • Specialization on the Dispatch* layer
  • Specialization on the Device* interface

Implementation:

  • Only specialize kernels for uint32_t and larger user-provided offset types. In this case, the SASS for int32_t would remain the same.
  • One and the same kernel for any offset type: the SASS for int32_t would differ from what it was before the change, but we have a single kernel template for any user-provided offset type.
| User-provided offset type | Offset type used by kernel |
| --- | --- |
| smaller-than-32-bit | int32_t; (a) unchanged code path, (b) using streaming implementation |
| int32_t | int32_t; (a) unchanged code path, (b) using streaming implementation |
| uint32_t | int32_t |
| int64_t | int32_t |
| uint64_t | int32_t |

Pros

  • Supports more than 2^32-1 items with a 32-bit offset type.
    • Possibly, just one kernel for all user-provided offset types.
  • Lower memory requirements (no need to allocate memory for all tile states but only for the tile states of a single partition)
  • May serve as a stepping stone toward exposing a streaming interface that allows users to:
    • support pipelining (interleaving host-to-device memcpies with algorithm execution)
    • lower memory requirements (only need to fit one partition of input at a time)

Cons

  • The user invokes one algorithm but may see repeated kernel invocations
  • If implemented on the Dispatch layer, this causes different behavior for users who previously instantiated the Dispatch template class with larger offset types.
  • We may run into load imbalance within one partition if, say, one thread block has a very complex tile of items. That thread block may still be working away as the last of its partition, yet the kernel for the subsequent partition cannot be launched yet. This may be less of a concern for something like DeviceSelect, but more of a problem for batched and segmented algorithms.
github-project-automation bot moved this to Todo in CCCL May 23, 2024
elstehle self-assigned this May 23, 2024
wmaxey moved this from Todo to In Progress in CCCL Jul 17, 2024
github-project-automation bot moved this from In Progress to Done in CCCL Aug 15, 2024