fix thread-reduce performance regression #2944
base: main
Conversation
Could you please show a benchmark diff of the three algorithms before #2756 and after this PR? We should see a net benefit then. Instructions on how to benchmark, in case you need them: https://nvidia.github.io/cccl/cub/benchmarking.html
[Benchmark result plots: Reduce Max (NVIDIA H100 80GB HBM3), Select Flagged, Reduce by-Key]
🟩 CI finished in 3h 35m: Pass: 100%/224 | Total: 6d 16h | Avg: 42m 56s | Max: 1h 18m | Hits: 61%/12288
| Project | Modifications in project? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper | |
| Project | Modifications in project or dependencies? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| python | +/- |
| CCCL C Parallel Library | +/- |
| Catch2Helper | +/- |
🏃 Runner counts (total jobs: 224)
| # | Runner |
|---|---|
| 185 | linux-amd64-cpu16 |
| 16 | linux-arm64-cpu16 |
| 14 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
```cpp
{
  return cub::internal::ThreadReduceSequential<AccumT>(input, reduction_op);
}
_CCCL_IF_CONSTEXPR (cuda::std::is_same<ReductionOp, ::cuda::std::plus<>>::value
```
Suggestion: Sadly, we may want to include more operators here:

```cpp
_CCCL_IF_CONSTEXPR (cuda::std::is_same<ReductionOp, ::cuda::std::plus<>>::value ||
                    cuda::std::is_same<ReductionOp, ::cuda::std::plus<ValueT>>::value
```
We may also just test whether ReductionOp is any instantiation of ::cuda::std::plus.
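A minimal sketch of such a check, assuming a hand-rolled trait (`is_cuda_std_plus` is a hypothetical name, not an existing CUB utility):

```cpp
#include <cuda/std/functional>
#include <cuda/std/type_traits>

// Matches any instantiation of cuda::std::plus, including the
// transparent specialization plus<void> (i.e. plus<>).
template <class Op>
struct is_cuda_std_plus : ::cuda::std::false_type
{};

template <class T>
struct is_cuda_std_plus<::cuda::std::plus<T>> : ::cuda::std::true_type
{};

// Usage: a single check covers both plus<> and plus<ValueT>.
// _CCCL_IF_CONSTEXPR (is_cuda_std_plus<ReductionOp>::value) { ... }
```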
Technically, this also does not include thrust::plus, but that will be fixed at the next major release.
Thanks for reporting the benchmarks. Looks good except for Reduce Max on I8, I32, 2^28. A 14% slowdown unfortunately violates @gevtushenko's rule of "no regressions of more than 2% compared to the previous implementation on 2^24+ problem sizes". Could you please investigate the cause of this regression? We should try to fix this.
Fix nvbug: 4965585
The following routines showed performance regressions after PR #2756:

- Reduce Max
- Select Flagged
- Reduce by-Key

The PR includes the following changes:

- `plus` operator and `int`/`unsigned` data types
- `int64_t`/`uint64_t`: use a binary-level reduction instead of a ternary reduction (illustrated in the sketch below)
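To illustrate the distinction, here is a simplified host-side sketch (my own illustration under stated assumptions, not the actual CUB device code): a ternary-tree reduction combines three partial results per step, while a binary tree combines two, which presumably suits 64-bit integer types better since three-input integer instructions do not help for 64-bit operands.

```cpp
#include <cstddef>

// Simplified illustration of a binary-tree reduction over a fixed-size
// array: each round combines pairs of partial results, halving the count
// until a single value remains (ceil(log2 N) rounds in total).
template <class T, class Op, std::size_t N>
T binary_tree_reduce(const T (&input)[N], Op op)
{
  T tmp[N];
  for (std::size_t i = 0; i < N; ++i)
  {
    tmp[i] = input[i];
  }
  std::size_t len = N;
  while (len > 1)
  {
    const std::size_t half = len / 2;
    for (std::size_t i = 0; i < half; ++i)
    {
      tmp[i] = op(tmp[2 * i], tmp[2 * i + 1]); // two inputs per combine
    }
    if (len % 2 != 0)
    {
      tmp[half] = tmp[len - 1]; // carry the odd element into the next round
    }
    len = half + len % 2;
  }
  return tmp[0];
}

// A ternary-tree variant would combine three partials per step instead,
// e.g. op(op(tmp[3 * i], tmp[3 * i + 1]), tmp[3 * i + 2]), trading tree
// depth for wider combines that can map to three-input hardware ops.
```

For example, `binary_tree_reduce(values, ::cuda::std::plus<>{})` over an `int64_t` array performs only two-input additions per round.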