Adds support for large number of items in DeviceSelect and DevicePartition #2400 (Merged)

Conversation
elstehle changed the title from Enh/streaming selection to Adds support for large number of items in DeviceSelect and DevicePartition on Sep 11, 2024
🟨 CI finished in 3h 40m: Pass: 97%/251 | Total: 2d 01h | Avg: 11m 45s | Max: 1h 29m | Hits: 99%/20079
Modified projects: CUB. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16

🟩 CI finished in 4h 03m: Pass: 100%/251 | Total: 5d 11h | Avg: 31m 24s | Max: 1h 05m | Hits: 86%/24441
Modified projects: CUB. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16

🟨 CI finished in 5h 40m: Pass: 98%/251 | Total: 5d 12h | Avg: 31m 46s | Max: 1h 19m | Hits: 84%/22260
Modified projects: CUB. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16

🟩 CI finished in 3h 33m: Pass: 100%/251 | Total: 6d 10h | Avg: 36m 57s | Max: 1h 41m | Hits: 78%/24441
Modified projects: CUB. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16

🟨 CI finished in 4h 37m: Pass: 99%/251 | Total: 5d 21h | Avg: 33m 49s | Max: 1h 20m | Hits: 87%/24441
Modified projects: CUB. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16

🟨 CI finished in 5h 27m: Pass: 96%/251 | Total: 5d 14h | Avg: 32m 10s | Max: 1h 30m | Hits: 81%/15567
Modified projects: CUB. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16
🟩 CI finished in 1h 27m: Pass: 100%/208 | Total: 4d 18h | Avg: 33m 02s | Max: 1h 02m | Hits: 84%/14058
Modified projects: CUB. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 208): 171 linux-amd64-cpu16, 16 linux-arm64-cpu16, 12 linux-amd64-gpu-v100-latest-1, 9 windows-amd64-cpu16

🟩 CI finished in 1h 59m: Pass: 100%/208 | Total: 5d 18h | Avg: 39m 49s | Max: 1h 11m | Hits: 74%/14058
Modified projects: CUB, Thrust. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 208): 171 linux-amd64-cpu16, 16 linux-arm64-cpu16, 12 linux-amd64-gpu-v100-latest-1, 9 windows-amd64-cpu16

🟩 CI finished in 1h 40m: Pass: 100%/208 | Total: 4d 20h | Avg: 33m 31s | Max: 1h 00m | Hits: 84%/14058
Modified projects: CUB, Thrust. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 208): 171 linux-amd64-cpu16, 16 linux-arm64-cpu16, 12 linux-amd64-gpu-v100-latest-1, 9 windows-amd64-cpu16
gevtushenko approved these changes on Oct 8, 2024
🟩 CI finished in 2h 19m: Pass: 100%/208 | Total: 5d 17h | Avg: 39m 39s | Max: 1h 27m | Hits: 53%/16003
Modified projects: CUB, Thrust. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 208): 171 linux-amd64-cpu16, 16 linux-arm64-cpu16, 12 linux-amd64-gpu-v100-latest-1, 9 windows-amd64-cpu16
gevtushenko approved these changes on Oct 8, 2024
🟩 CI finished in 2h 06m: Pass: 100%/208 | Total: 4d 23h | Avg: 34m 30s | Max: 1h 17m | Hits: 86%/16003
Modified projects: CUB, Thrust. Modifications in project or dependencies: CUB, Thrust, pycuda, CUDA C Core Library.
🏃 Runner counts (total jobs: 208): 171 linux-amd64-cpu16, 16 linux-arm64-cpu16, 12 linux-amd64-gpu-v100-latest-1, 9 windows-amd64-cpu16
This was referenced Oct 10, 2024
Is this going in the 2.7.0 release?

There is no (normal) tag yet containing the commit made by this PR, so we will ship it in the next release, CCCL 2.8.
This was referenced Nov 27, 2024
Description

This PR implements streaming `DeviceSelect` and `DevicePartition`: for very large inputs exceeding `INT_MAX` items, the input is split into partitions of at most `INT_MAX` items each, and one partition is processed at a time.

Closes #2238
Closes #1422
Closes #1437
Closes #1614
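For illustration, here is a minimal usage sketch of what this enables on the caller side, assuming the 64-bit `num_items` support described in this PR. The predicate type (`less_than`), the buffer sizes, and the iterator choices are illustrative assumptions, not taken from the PR:

```cpp
// Sketch only: selecting from more items than fit in a 32-bit signed offset.
// Assumes a CCCL/CUB version that accepts 64-bit num_items, as described above.
// Compile with nvcc as a .cu file.
#include <cub/device/device_select.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical predicate: keeps values smaller than `bound`.
struct less_than
{
  std::int64_t bound;
  __host__ __device__ bool operator()(std::int64_t x) const { return x < bound; }
};

int main()
{
  // More items than INT_MAX; a counting iterator avoids materializing ~34 GB of input.
  const std::int64_t num_items = std::int64_t{1} << 32;
  thrust::counting_iterator<std::int64_t> d_in(0);

  // The predicate selects exactly the values 0..999, so a small output buffer suffices.
  thrust::device_vector<std::int64_t> d_out(1000);
  thrust::device_vector<std::int64_t> d_num_selected(1);

  // Usual two-phase CUB pattern: query temporary storage size, then run.
  void* d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in,
                        thrust::raw_pointer_cast(d_out.data()),
                        thrust::raw_pointer_cast(d_num_selected.data()),
                        num_items, less_than{1000});

  thrust::device_vector<unsigned char> d_temp(temp_storage_bytes);
  d_temp_storage = thrust::raw_pointer_cast(d_temp.data());

  // Inputs larger than INT_MAX are processed internally in partitions of at
  // most INT_MAX items each (the streaming approach this PR introduces).
  cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in,
                        thrust::raw_pointer_cast(d_out.data()),
                        thrust::raw_pointer_cast(d_num_selected.data()),
                        num_items, less_than{1000});

  const std::int64_t h_num_selected = d_num_selected[0];
  std::printf("selected %lld items\n", static_cast<long long>(h_num_selected));
  return 0;
}
```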
TODOs

- `DeviceSelect` interface to take `num_items` as `int64_t`.
- `DevicePartition` interface to take templatized `num_items`; use the kernel with minimal changes for `int32_t` and smaller `NumItemsT`, and use the full-fledged streaming kernel for the remaining types.
- `DeviceSelect::Flagged`.
- `DevicePartition::If`.
- `DevicePartition::Flagged`.
- `DeviceSelect::Unique`.
- `DevicePartition` with two distinct iterators (one for selected and one for rejected items).
- `DeviceSelect::If`.
- `DeviceSelect::Flagged`.
- `DeviceSelect::Unique`.
- `DevicePartition::If`.
- `DevicePartition::Flagged`.
- `DevicePartition::If` with distinct partitions.
- `DeviceSelect::If`.
- `DevicePartition::If` using `ThreeWayPartition` and tests.
- `copy_if` et al. for the optimal choice of offset types.

Checklist
Latest benchmark results on 95de26f
Approach

- `DeviceSelect` and `DevicePartition` split up inputs larger than `INT_MAX` into partitions of up to `INT_MAX` items each, repeatedly invoking the respective algorithm once per partition.
- A `streaming_context` object is passed to the algorithm that provides all information about the current partition (i.e., for the current kernel invocation), like offsets into the input and output iterators (see the sketch after this list).
- For `DevicePartition::{If,Flagged}` we use the `streaming_context` iff the user-provided offset type is `uint32_t` or wider than 4 bytes; otherwise we use a dummy `streaming_context` that basically returns a `0` immediate value for the offsets et al. to keep the performance impact minimal.
- For `DeviceSelect::{If,Flagged,Unique}` we always use the `streaming_context`, as there's negligible performance downside. Always using the `streaming_context` provides the benefit that we compile just one single kernel, no matter the user-provided offset type. Another benefit is that, in future, we only have to tune one kernel template specialization.
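To make the scheme above concrete, the following is a purely conceptual sketch of the partition loop and the two kinds of context, not the actual CUB implementation; all names (`full_streaming_context`, `dummy_streaming_context`, `run_partition`, `run_streaming`) are hypothetical:

```cpp
// Conceptual sketch only -- not the actual CUB code paths.
#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>

// Full context: carries 64-bit base offsets into the overall input/output for
// the partition that is currently being processed.
struct full_streaming_context
{
  std::int64_t input_offset;   // items consumed by previous partitions
  std::int64_t output_offset;  // items written by previous partitions
  std::int64_t base_input_offset() const { return input_offset; }
  std::int64_t base_output_offset() const { return output_offset; }
};

// Dummy context: returns immediate zeros, so per-partition indexing stays
// 32-bit and the generated code matches the non-streaming path.
struct dummy_streaming_context
{
  std::int64_t base_input_offset() const { return 0; }
  std::int64_t base_output_offset() const { return 0; }
};

// Stand-in for one invocation of the select/partition kernel on a single
// partition; a real implementation would launch the kernel and use the
// context's offsets to index into the overall input and output.
template <class StreamingContext>
std::int64_t run_partition(const StreamingContext& /*ctx*/, std::int64_t partition_items)
{
  return partition_items; // pretend every item is selected
}

// Driver: split the full problem into partitions of at most INT_MAX items and
// process them one after another, threading the running offsets through.
std::int64_t run_streaming(std::int64_t num_items)
{
  std::int64_t processed = 0;
  std::int64_t written   = 0;
  while (processed < num_items)
  {
    const std::int64_t partition_items =
      std::min<std::int64_t>(INT_MAX, num_items - processed);
    full_streaming_context ctx{processed, written};
    written   += run_partition(ctx, partition_items);
    processed += partition_items;
  }
  return written;
}

int main()
{
  const std::int64_t n = (std::int64_t{1} << 32) + 123;
  std::printf("processed %lld items\n", static_cast<long long>(run_streaming(n)));
  return 0;
}
```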
Summaries

How to interpret:

- `Diff i32 vs i32.main`: We use only a dummy `streaming_context`. These columns compare the algorithm with a dummy `streaming_context` to the performance numbers we got in `main` (using `i32` offsets) today.
- `Diff i64 vs i32.main`: We use a `streaming_context` that provides offsets using `i64`. I.e., whenever we need to index into the overall inputs/outputs across partitions, we use `i64` offsets; other indexing happens using `i32`. These columns compare the algorithm with a proper `streaming_context` to what we have in `main` today using `i32` (!) offsets.
What to focus on:

- `DeviceSelect` is, from now on, always using `i64`, because there is only a very limited performance downside from using `i64` instead of `i32` with the streaming approach, while we gain the benefit of having to maintain and tune only a single kernel template instantiation going forward. Given that, we want to focus on the rightmost two columns for `DeviceSelect` in the following summary.
- `DevicePartition` uses a static dispatch, i.e., `i32` (dummy streaming context) or `i64` (full-fledged streaming), depending on the user-provided `OffsetT` (see the sketch below). Using `i32` retains SASS compatibility with what we have in `main` today. So, basically unchanged for an `i32` user-provided offset type.
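As an equally hypothetical sketch of that static dispatch (not the real CUB dispatch code), the rule stated in the approach above, full streaming iff the user-provided offset type is `uint32_t` or wider than 4 bytes, can be expressed as a compile-time choice of context type:

```cpp
// Illustrative only -- not the actual CUB dispatch logic.
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the two contexts from the previous sketch.
struct dummy_streaming_context {};
struct full_streaming_context {};

// Full streaming iff the user-provided OffsetT is uint32_t or wider than
// 4 bytes; otherwise keep the dummy context (and the existing SASS).
template <class OffsetT>
inline constexpr bool use_full_streaming_v =
  std::is_same_v<OffsetT, std::uint32_t> || (sizeof(OffsetT) > 4);

template <class OffsetT>
using partition_streaming_context_t =
  std::conditional_t<use_full_streaming_v<OffsetT>,
                     full_streaming_context,
                     dummy_streaming_context>;

static_assert(std::is_same_v<partition_streaming_context_t<std::int32_t>,  dummy_streaming_context>);
static_assert(std::is_same_v<partition_streaming_context_t<std::uint32_t>, full_streaming_context>);
static_assert(std::is_same_v<partition_streaming_context_t<std::int64_t>,  full_streaming_context>);

int main() { return 0; }
```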
Summary of benefits of streaming `i64` versus `main` `i64` offset type:

In the following we compare the worst-case slowdown at 2^28 items for the two approaches:

- `DeviceSelect::If`: 4.64% versus 169%
- `DeviceSelect::Flagged`: 1.5% versus 91%
- `DevicePartition::If`: 18.9% versus 30.95%
- `DevicePartition::Flagged`: 11.8% versus 36.9%

Per-algorithm summary tables (columns: any num items, 2^28 num items): `Select.If`, `Select.Flagged`, `Select.Unique`, `Partition.If`, `Partition.Flagged`.
Detailed benchmark results

- H100 select.if
- H100 select.flagged
- H100 select.unique
- H100 partition.if
- H100 partition.flagged