fix thread-reduce performance regression #2944

Open
wants to merge 1 commit into
base: main

@fbusato fbusato commented Nov 22, 2024

Fix nvbug: 4965585

The following routines showed performance regressions after PR #2756:

  • Select (flagged)
  • Reduce-by-key
  • Reduce Max/Sum with large value types

The PR includes the following changes:

  • All non-standard binary operators now use a thread-level sequential algorithm
  • SM90 uses a thread-level sequential algorithm for the plus operator with int/unsigned data types
  • int64_t/uint64_t use a binary reduction instead of a ternary reduction
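
To make the three strategies named in this list concrete, here is a minimal host-side C++ sketch. It is not the CUB implementation: `reduce_sequential`, `reduce_binary_tree`, and `reduce_ternary_tree` are hypothetical names chosen for illustration, and the real code works on per-thread arrays inside device functions. The sketch only shows how the order of operations differs. All three give the same result for an associative operator.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper (not CUB code): left-to-right fold,
// one dependent operation per element.
template <typename T, std::size_t N, typename Op>
T reduce_sequential(const T (&input)[N], Op op)
{
  T acc = input[0];
  for (std::size_t i = 1; i < N; ++i)
  {
    acc = op(acc, input[i]);
  }
  return acc;
}

// Hypothetical helper: binary tree, each pass combines neighbours in pairs.
template <typename T, std::size_t N, typename Op>
T reduce_binary_tree(const T (&input)[N], Op op)
{
  T buf[N];
  for (std::size_t i = 0; i < N; ++i)
  {
    buf[i] = input[i];
  }
  for (std::size_t stride = 1; stride < N; stride *= 2)
  {
    for (std::size_t i = 0; i + stride < N; i += 2 * stride)
    {
      buf[i] = op(buf[i], buf[i + stride]);
    }
  }
  return buf[0];
}

// Hypothetical helper: ternary tree, each pass combines up to three operands.
template <typename T, std::size_t N, typename Op>
T reduce_ternary_tree(const T (&input)[N], Op op)
{
  T buf[N];
  for (std::size_t i = 0; i < N; ++i)
  {
    buf[i] = input[i];
  }
  for (std::size_t stride = 1; stride < N; stride *= 3)
  {
    for (std::size_t i = 0; i + stride < N; i += 3 * stride)
    {
      T acc = op(buf[i], buf[i + stride]);
      if (i + 2 * stride < N)
      {
        acc = op(acc, buf[i + 2 * stride]);
      }
      buf[i] = acc;
    }
  }
  return buf[0];
}

// Usage: the three orderings agree for an associative operator.
inline void usage_example()
{
  int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  assert(reduce_sequential(data, std::plus<>{}) == 36);
  assert(reduce_binary_tree(data, std::plus<>{}) == 36);
  assert(reduce_ternary_tree(data, std::plus<>{}) == 36);
}
```

The sequential version has the longest dependency chain but the simplest code, which is presumably why it is a safe default for operators the library knows nothing about.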

@bernhardmgruber

Could you please show a benchmark diff of the three algorithms before #2756 and after this PR? We should see a net benefit then.

Instructions on how to benchmark, in case you need them: https://nvidia.github.io/cccl/cub/benchmarking.html

@fbusato
fbusato commented Nov 22, 2024

Reduce Max

[0] NVIDIA H100 80GB HBM3

T{ct} OffsetT{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^28 107.167 us 0.39% 122.152 us 0.43% 14.986 us 13.98% SLOW
I8 I64 2^28 108.946 us 1.96% 109.357 us 2.03% 0.411 us 0.38% SAME
I16 I32 2^28 188.094 us 1.98% 190.235 us 2.02% 2.142 us 1.14% SAME
I16 I64 2^28 188.058 us 1.88% 189.965 us 1.94% 1.907 us 1.01% SAME
I32 I32 2^28 351.776 us 1.38% 352.035 us 1.45% 0.259 us 0.07% SAME
I32 I64 2^28 352.221 us 1.38% 352.644 us 1.41% 0.424 us 0.12% SAME
I64 I32 2^28 688.110 us 0.76% 687.958 us 0.81% -0.152 us -0.02% SAME
I64 I64 2^28 688.580 us 0.83% 688.623 us 0.85% 0.043 us 0.01% SAME
I128 I32 2^28 1.400 ms 0.27% 1.403 ms 0.28% 2.806 us 0.20% SAME
I128 I64 2^28 1.404 ms 1.43% 1.397 ms 1.56% -6.640 us -0.47% SAME
F32 I32 2^28 359.793 us 3.51% 359.978 us 3.54% 0.185 us 0.05% SAME
F32 I64 2^28 352.525 us 1.47% 352.488 us 1.44% -0.037 us -0.01% SAME
F64 I32 2^28 688.145 us 0.82% 688.022 us 0.78% -0.123 us -0.02% SAME
F64 I64 2^28 688.338 us 0.83% 688.361 us 0.90% 0.023 us 0.00% SAME
C64 I32 2^28 1.479 ms 0.06% 1.468 ms 0.07% -11.115 us -0.75% FAST
C64 I64 2^28 1.550 ms 0.07% 1.524 ms 0.07% -25.813 us -1.67% FAST

Select Flagged

T{ct} OffsetT{ct} IsInPlace{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 false 2^28 1 666.930 us 0.32% 646.117 us 0.29% -20.812 us -3.12% FAST
I8 I32 false 2^28 0.544 646.175 us 0.29% 624.590 us 0.24% -21.586 us -3.34% FAST
I8 I32 false 2^28 0 539.994 us 0.27% 525.318 us 0.22% -14.675 us -2.72% FAST
I8 I32 true 2^28 1 766.635 us 0.22% 747.091 us 0.21% -19.544 us -2.55% FAST
I8 I32 true 2^28 0.544 752.011 us 0.21% 727.137 us 0.20% -24.873 us -3.31% FAST
I8 I32 true 2^28 0 642.715 us 0.22% 629.653 us 0.20% -13.062 us -2.03% FAST
I8 I64 false 2^28 1 687.225 us 0.27% 655.494 us 0.23% -31.731 us -4.62% FAST
I8 I64 false 2^28 0.544 669.616 us 0.28% 638.069 us 0.21% -31.547 us -4.71% FAST
I8 I64 false 2^28 0 550.467 us 0.28% 530.086 us 0.26% -20.381 us -3.70% FAST
I8 I64 true 2^28 1 791.146 us 0.19% 755.288 us 0.23% -35.858 us -4.53% FAST
I8 I64 true 2^28 0.544 773.481 us 0.20% 739.127 us 0.21% -34.354 us -4.44% FAST
I8 I64 true 2^28 0 653.690 us 0.24% 636.169 us 0.22% -17.521 us -2.68% FAST
I16 I32 false 2^28 1 751.677 us 0.38% 765.957 us 0.36% 14.280 us 1.90% SLOW
I16 I32 false 2^28 0.544 704.825 us 0.30% 723.212 us 0.29% 18.387 us 2.61% SLOW
I16 I32 false 2^28 0 595.444 us 0.26% 563.434 us 0.24% -32.010 us -5.38% FAST
I16 I32 true 2^28 1 850.084 us 0.26% 862.771 us 0.25% 12.687 us 1.49% SLOW
I16 I32 true 2^28 0.544 812.488 us 0.22% 824.056 us 0.23% 11.568 us 1.42% SLOW
I16 I32 true 2^28 0 711.684 us 0.20% 674.914 us 0.21% -36.770 us -5.17% FAST
I16 I64 false 2^28 1 784.931 us 0.28% 783.574 us 0.31% -1.357 us -0.17% SAME
I16 I64 false 2^28 0.544 733.121 us 0.26% 731.942 us 0.28% -1.179 us -0.16% SAME
I16 I64 false 2^28 0 570.107 us 0.25% 571.236 us 0.26% 1.129 us 0.20% SAME
I16 I64 true 2^28 1 876.321 us 0.24% 875.572 us 0.24% -0.749 us -0.09% SAME
I16 I64 true 2^28 0.544 831.914 us 0.20% 832.294 us 0.20% 0.380 us 0.05% SAME
I16 I64 true 2^28 0 685.799 us 0.18% 687.535 us 0.18% 1.736 us 0.25% SLOW
I32 I32 false 2^28 1 1.122 ms 0.43% 1.022 ms 0.44% -100.502 us -8.96% FAST
I32 I32 false 2^28 0.544 1.018 ms 0.27% 892.510 us 0.33% -125.439 us -12.32% FAST
I32 I32 false 2^28 0 799.558 us 0.24% 664.654 us 0.27% -134.904 us -16.87% FAST
I32 I32 true 2^28 1 1.253 ms 0.23% 1.118 ms 0.34% -134.791 us -10.76% FAST
I32 I32 true 2^28 0.544 1.178 ms 0.19% 1.012 ms 0.30% -165.656 us -14.06% FAST
I32 I32 true 2^28 0 984.614 us 0.15% 793.354 us 0.23% -191.260 us -19.42% FAST
I32 I64 false 2^28 1 1.062 ms 0.58% 1.026 ms 0.43% -36.118 us -3.40% FAST
I32 I64 false 2^28 0.544 913.881 us 0.54% 888.767 us 0.36% -25.114 us -2.75% FAST
I32 I64 false 2^28 0 690.710 us 0.35% 668.789 us 0.29% -21.921 us -3.17% FAST
I32 I64 true 2^28 1 1.124 ms 0.43% 1.121 ms 0.35% -3.175 us -0.28% SAME
I32 I64 true 2^28 0.544 1.006 ms 0.31% 1.006 ms 0.30% -0.527 us -0.05% SAME
I32 I64 true 2^28 0 805.761 us 0.23% 798.519 us 0.22% -7.242 us -0.90% FAST
I64 I32 false 2^28 1 1.821 ms 0.43% 1.823 ms 0.45% 1.098 us 0.06% SAME
I64 I32 false 2^28 0.544 1.496 ms 0.61% 1.496 ms 0.59% 0.517 us 0.03% SAME
I64 I32 false 2^28 0 1.010 ms 0.39% 1.009 ms 0.40% -1.132 us -0.11% SAME
I64 I32 true 2^28 1 1.936 ms 0.33% 1.935 ms 0.31% -1.034 us -0.05% SAME
I64 I32 true 2^28 0.544 1.639 ms 0.40% 1.639 ms 0.43% 0.101 us 0.01% SAME
I64 I32 true 2^28 0 1.192 ms 0.26% 1.191 ms 0.26% -0.858 us -0.07% SAME
I64 I64 false 2^28 1 1.819 ms 0.41% 1.816 ms 0.41% -3.146 us -0.17% SAME
I64 I64 false 2^28 0.544 1.496 ms 0.60% 1.493 ms 0.60% -3.054 us -0.20% SAME
I64 I64 false 2^28 0 1.021 ms 0.43% 1.019 ms 0.46% -2.154 us -0.21% SAME
I64 I64 true 2^28 1 1.936 ms 0.33% 1.932 ms 0.32% -4.479 us -0.23% SAME
I64 I64 true 2^28 0.544 1.638 ms 0.41% 1.633 ms 0.43% -4.543 us -0.28% SAME
I64 I64 true 2^28 0 1.200 ms 0.28% 1.202 ms 0.26% 2.752 us 0.23% SAME
I128 I32 false 2^28 1 3.603 ms 0.46% 3.604 ms 0.45% 1.313 us 0.04% SAME
I128 I32 false 2^28 0.544 2.859 ms 0.84% 2.858 ms 0.82% -0.687 us -0.02% SAME
I128 I32 false 2^28 0 1.943 ms 0.69% 1.944 ms 0.68% 0.080 us 0.00% SAME
I128 I32 true 2^28 1 3.820 ms 0.44% 3.820 ms 0.45% 0.541 us 0.01% SAME
I128 I32 true 2^28 0.544 3.192 ms 0.59% 3.192 ms 0.56% 0.359 us 0.01% SAME
I128 I32 true 2^28 0 2.421 ms 0.40% 2.421 ms 0.43% -0.185 us -0.01% SAME
I128 I64 false 2^28 1 3.609 ms 0.59% 3.609 ms 0.59% -0.008 us -0.00% SAME
I128 I64 false 2^28 0.544 2.864 ms 0.82% 2.864 ms 0.84% 0.521 us 0.02% SAME
I128 I64 false 2^28 0 1.953 ms 0.69% 1.954 ms 0.72% 0.404 us 0.02% SAME
I128 I64 true 2^28 1 3.832 ms 0.44% 3.831 ms 0.43% -0.698 us -0.02% SAME
I128 I64 true 2^28 0.544 3.203 ms 0.57% 3.203 ms 0.56% -0.236 us -0.01% SAME
I128 I64 true 2^28 0 2.435 ms 0.40% 2.435 ms 0.39% -0.436 us -0.02% SAME
F32 I32 false 2^28 1 1.123 ms 0.85% 1.024 ms 1.03% -99.082 us -8.82% FAST
F32 I32 false 2^28 0.544 1.018 ms 0.27% 892.420 us 0.34% -125.529 us -12.33% FAST
F32 I32 false 2^28 0 799.450 us 0.22% 664.718 us 0.27% -134.732 us -16.85% FAST
F32 I32 true 2^28 1 1.253 ms 0.25% 1.117 ms 0.34% -136.310 us -10.88% FAST
F32 I32 true 2^28 0.544 1.178 ms 0.20% 1.011 ms 0.28% -166.523 us -14.14% FAST
F32 I32 true 2^28 0 984.513 us 0.15% 793.035 us 0.23% -191.478 us -19.45% FAST
F32 I64 false 2^28 1 1.062 ms 0.59% 1.025 ms 0.41% -36.465 us -3.43% FAST
F32 I64 false 2^28 0.544 913.749 us 0.53% 888.043 us 0.37% -25.705 us -2.81% FAST
F32 I64 false 2^28 0 689.779 us 0.36% 668.831 us 0.30% -20.948 us -3.04% FAST
F32 I64 true 2^28 1 1.123 ms 0.44% 1.120 ms 0.36% -2.997 us -0.27% SAME
F32 I64 true 2^28 0.544 1.006 ms 0.31% 1.005 ms 0.28% -0.527 us -0.05% SAME
F32 I64 true 2^28 0 805.759 us 0.24% 798.182 us 0.23% -7.577 us -0.94% FAST
F64 I32 false 2^28 1 1.822 ms 0.46% 1.823 ms 0.46% 0.258 us 0.01% SAME
F64 I32 false 2^28 0.544 1.496 ms 0.61% 1.497 ms 0.59% 0.757 us 0.05% SAME
F64 I32 false 2^28 0 1.010 ms 0.38% 1.009 ms 0.40% -0.575 us -0.06% SAME
F64 I32 true 2^28 1 1.936 ms 0.30% 1.935 ms 0.31% -1.120 us -0.06% SAME
F64 I32 true 2^28 0.544 1.639 ms 0.41% 1.639 ms 0.42% -0.057 us -0.00% SAME
F64 I32 true 2^28 0 1.192 ms 0.26% 1.192 ms 0.26% -0.607 us -0.05% SAME
F64 I64 false 2^28 1 1.819 ms 0.41% 1.816 ms 0.38% -2.957 us -0.16% SAME
F64 I64 false 2^28 0.544 1.496 ms 0.60% 1.493 ms 0.61% -3.213 us -0.21% SAME
F64 I64 false 2^28 0 1.021 ms 0.45% 1.019 ms 0.44% -2.082 us -0.20% SAME
F64 I64 true 2^28 1 1.936 ms 0.33% 1.931 ms 0.31% -4.830 us -0.25% SAME
F64 I64 true 2^28 0.544 1.638 ms 0.41% 1.633 ms 0.42% -4.501 us -0.27% SAME
F64 I64 true 2^28 0 1.200 ms 0.25% 1.203 ms 0.27% 2.579 us 0.21% SAME

Reduce-by-key

KeyT{ct} ValueT{ct} OffsetT{ct} Elements{io} MaxSegSize Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I8 I32 2^28 2^1 1.042 ms 0.48% 1.070 ms 0.47% 27.182 us 2.61% SLOW
I8 I8 I32 2^28 2^4 912.502 us 0.45% 936.005 us 0.43% 23.503 us 2.58% SLOW
I8 I8 I32 2^28 2^8 905.978 us 0.29% 928.088 us 0.30% 22.110 us 2.44% SLOW
I8 I16 I32 2^28 2^1 1.595 ms 0.28% 1.592 ms 0.29% -2.567 us -0.16% SAME
I8 I16 I32 2^28 2^4 1.262 ms 0.38% 1.240 ms 0.43% -22.581 us -1.79% FAST
I8 I16 I32 2^28 2^8 1.194 ms 0.29% 1.172 ms 0.28% -22.151 us -1.86% FAST
I8 I32 I32 2^28 2^1 1.332 ms 0.59% 1.316 ms 0.60% -16.514 us -1.24% FAST
I8 I32 I32 2^28 2^4 1.015 ms 0.49% 1.002 ms 0.51% -13.342 us -1.31% FAST
I8 I32 I32 2^28 2^8 946.310 us 0.39% 938.284 us 0.42% -8.026 us -0.85% FAST
I8 I64 I32 2^28 2^1 2.028 ms 0.46% 2.032 ms 0.47% 4.866 us 0.24% SAME
I8 I64 I32 2^28 2^4 1.442 ms 0.36% 1.450 ms 0.34% 8.921 us 0.62% SLOW
I8 I64 I32 2^28 2^8 1.308 ms 0.21% 1.320 ms 0.21% 11.956 us 0.91% SLOW
I8 I128 I32 2^28 2^1 5.109 ms 0.17% 5.127 ms 0.18% 17.669 us 0.35% SLOW
I8 I128 I32 2^28 2^4 4.173 ms 0.27% 4.195 ms 0.27% 21.329 us 0.51% SLOW
I8 I128 I32 2^28 2^8 4.015 ms 0.31% 4.034 ms 0.32% 18.978 us 0.47% SLOW
I8 F32 I32 2^28 2^1 1.337 ms 0.75% 1.322 ms 0.78% -14.694 us -1.10% FAST
I8 F32 I32 2^28 2^4 1.018 ms 0.52% 1.007 ms 0.49% -10.626 us -1.04% FAST
I8 F32 I32 2^28 2^8 952.242 us 0.41% 947.759 us 0.42% -4.483 us -0.47% FAST
I8 F64 I32 2^28 2^1 2.043 ms 0.50% 2.046 ms 0.47% 2.364 us 0.12% SAME
I8 F64 I32 2^28 2^4 1.459 ms 0.35% 1.465 ms 0.33% 6.160 us 0.42% SLOW
I8 F64 I32 2^28 2^8 1.325 ms 0.19% 1.332 ms 0.19% 7.488 us 0.57% SLOW
I8 C64 I32 2^28 2^1 5.539 ms 0.18% 5.340 ms 0.18% -199.691 us -3.60% FAST
I8 C64 I32 2^28 2^4 4.998 ms 0.10% 4.802 ms 0.10% -195.590 us -3.91% FAST
I8 C64 I32 2^28 2^8 4.848 ms 0.09% 4.655 ms 0.09% -193.759 us -4.00% FAST
I16 I8 I32 2^28 2^1 1.228 ms 0.54% 1.382 ms 0.44% 153.882 us 12.53% SLOW
I16 I8 I32 2^28 2^4 961.367 us 0.50% 1.107 ms 0.47% 145.262 us 15.11% SLOW
I16 I8 I32 2^28 2^8 928.405 us 0.38% 1.070 ms 0.28% 141.610 us 15.25% SLOW
I16 I16 I32 2^28 2^1 1.136 ms 0.48% 1.139 ms 0.48% 2.700 us 0.24% SAME
I16 I16 I32 2^28 2^4 988.966 us 0.26% 990.565 us 0.31% 1.599 us 0.16% SAME
I16 I16 I32 2^28 2^8 961.984 us 0.16% 963.443 us 0.20% 1.459 us 0.15% SAME
I16 I32 I32 2^28 2^1 1.395 ms 0.54% 1.391 ms 0.60% -4.513 us -0.32% SAME
I16 I32 I32 2^28 2^4 994.327 us 0.44% 994.688 us 0.48% 0.361 us 0.04% SAME
I16 I32 I32 2^28 2^8 930.997 us 0.28% 931.975 us 0.29% 0.978 us 0.11% SAME
I16 I64 I32 2^28 2^1 1.954 ms 0.61% 1.956 ms 0.60% 1.804 us 0.09% SAME
I16 I64 I32 2^28 2^4 1.367 ms 0.35% 1.378 ms 0.35% 10.904 us 0.80% SLOW
I16 I64 I32 2^28 2^8 1.241 ms 0.25% 1.254 ms 0.23% 12.715 us 1.02% SLOW
I16 I128 I32 2^28 2^1 5.191 ms 0.17% 5.202 ms 0.17% 10.509 us 0.20% SLOW
I16 I128 I32 2^28 2^4 4.221 ms 0.29% 4.239 ms 0.30% 17.871 us 0.42% SLOW
I16 I128 I32 2^28 2^8 4.044 ms 0.32% 4.061 ms 0.32% 16.941 us 0.42% SLOW
I16 F32 I32 2^28 2^1 1.394 ms 0.73% 1.257 ms 0.94% -136.863 us -9.82% FAST
I16 F32 I32 2^28 2^4 993.771 us 0.46% 861.820 us 0.55% -131.951 us -13.28% FAST
I16 F32 I32 2^28 2^8 938.546 us 0.25% 791.470 us 0.41% -147.076 us -15.67% FAST
I16 F64 I32 2^28 2^1 1.967 ms 0.55% 1.962 ms 0.63% -4.929 us -0.25% SAME
I16 F64 I32 2^28 2^4 1.397 ms 0.33% 1.382 ms 0.34% -14.949 us -1.07% FAST
I16 F64 I32 2^28 2^8 1.274 ms 0.20% 1.259 ms 0.23% -14.515 us -1.14% FAST
I16 C64 I32 2^28 2^1 5.203 ms 0.18% 4.865 ms 0.20% -337.605 us -6.49% FAST
I16 C64 I32 2^28 2^4 4.583 ms 0.11% 4.219 ms 0.12% -364.973 us -7.96% FAST
I16 C64 I32 2^28 2^8 4.399 ms 0.10% 4.021 ms 0.11% -377.759 us -8.59% FAST
I32 I8 I32 2^28 2^1 1.267 ms 0.53% 1.275 ms 0.52% 8.074 us 0.64% SLOW
I32 I8 I32 2^28 2^4 921.451 us 0.41% 934.302 us 0.40% 12.851 us 1.39% SLOW
I32 I8 I32 2^28 2^8 868.209 us 0.31% 876.769 us 0.33% 8.560 us 0.99% SLOW
I32 I16 I32 2^28 2^1 1.667 ms 0.53% 1.423 ms 0.75% -244.123 us -14.65% FAST
I32 I16 I32 2^28 2^4 1.285 ms 0.33% 975.869 us 0.61% -308.814 us -24.04% FAST
I32 I16 I32 2^28 2^8 1.178 ms 0.32% 902.246 us 0.56% -275.914 us -23.42% FAST
I32 I32 I32 2^28 2^1 1.436 ms 0.86% 1.445 ms 0.93% 8.957 us 0.62% SAME
I32 I32 I32 2^28 2^4 965.283 us 0.63% 966.743 us 0.64% 1.460 us 0.15% SAME
I32 I32 I32 2^28 2^8 887.400 us 0.46% 911.902 us 0.40% 24.502 us 2.76% SLOW
I32 I64 I32 2^28 2^1 2.147 ms 0.64% 2.150 ms 0.67% 3.266 us 0.15% SAME
I32 I64 I32 2^28 2^4 1.453 ms 0.43% 1.460 ms 0.42% 6.317 us 0.43% SLOW
I32 I64 I32 2^28 2^8 1.294 ms 0.27% 1.303 ms 0.25% 8.674 us 0.67% SLOW
I32 I128 I32 2^28 2^1 5.310 ms 0.20% 5.322 ms 0.20% 11.899 us 0.22% SLOW
I32 I128 I32 2^28 2^4 4.277 ms 0.28% 4.293 ms 0.28% 15.860 us 0.37% SLOW
I32 I128 I32 2^28 2^8 4.080 ms 0.33% 4.096 ms 0.35% 15.467 us 0.38% SLOW
I32 F32 I32 2^28 2^1 1.628 ms 0.78% 1.625 ms 0.78% -3.238 us -0.20% SAME
I32 F32 I32 2^28 2^4 1.239 ms 0.53% 1.230 ms 0.56% -8.616 us -0.70% FAST
I32 F32 I32 2^28 2^8 1.135 ms 0.89% 1.118 ms 0.62% -16.887 us -1.49% FAST
I32 F64 I32 2^28 2^1 2.246 ms 0.57% 2.247 ms 0.58% 0.285 us 0.01% SAME
I32 F64 I32 2^28 2^4 1.570 ms 0.34% 1.573 ms 0.32% 3.117 us 0.20% SAME
I32 F64 I32 2^28 2^8 1.418 ms 0.18% 1.422 ms 0.18% 3.940 us 0.28% SLOW
I32 C64 I32 2^28 2^1 5.787 ms 0.19% 5.586 ms 0.20% -201.635 us -3.48% FAST
I32 C64 I32 2^28 2^4 5.119 ms 0.10% 4.926 ms 0.10% -192.607 us -3.76% FAST
I32 C64 I32 2^28 2^8 4.917 ms 0.10% 4.724 ms 0.10% -193.088 us -3.93% FAST
I64 I8 I32 2^28 2^1 2.092 ms 0.58% 2.059 ms 0.62% -32.968 us -1.58% FAST
I64 I8 I32 2^28 2^4 1.534 ms 0.35% 1.520 ms 0.37% -13.774 us -0.90% FAST
I64 I8 I32 2^28 2^8 1.404 ms 0.26% 1.387 ms 0.28% -16.891 us -1.20% FAST
I64 I16 I32 2^28 2^1 2.022 ms 0.66% 2.017 ms 0.68% -5.418 us -0.27% SAME
I64 I16 I32 2^28 2^4 1.422 ms 0.38% 1.423 ms 0.40% 0.243 us 0.02% SAME
I64 I16 I32 2^28 2^8 1.299 ms 0.22% 1.293 ms 0.23% -5.245 us -0.40% FAST
I64 I32 I32 2^28 2^1 2.166 ms 0.68% 2.166 ms 0.69% 0.117 us 0.01% SAME
I64 I32 I32 2^28 2^4 1.406 ms 0.46% 1.407 ms 0.47% 0.556 us 0.04% SAME
I64 I32 I32 2^28 2^8 1.235 ms 0.19% 1.236 ms 0.19% 0.649 us 0.05% SAME
I64 I64 I32 2^28 2^1 2.670 ms 0.62% 2.688 ms 0.63% 18.454 us 0.69% SLOW
I64 I64 I32 2^28 2^4 1.751 ms 0.42% 1.763 ms 0.49% 12.425 us 0.71% SLOW
I64 I64 I32 2^28 2^8 1.555 ms 0.29% 1.570 ms 0.56% 15.361 us 0.99% SLOW
I64 I128 I32 2^28 2^1 6.180 ms 0.16% 6.196 ms 0.16% 16.385 us 0.27% SLOW
I64 I128 I32 2^28 2^4 4.997 ms 0.26% 5.018 ms 0.25% 21.122 us 0.42% SLOW
I64 I128 I32 2^28 2^8 4.747 ms 0.32% 4.772 ms 0.32% 25.223 us 0.53% SLOW
I64 F32 I32 2^28 2^1 2.169 ms 0.92% 2.171 ms 0.91% 1.402 us 0.06% SAME
I64 F32 I32 2^28 2^4 1.407 ms 0.48% 1.408 ms 0.48% 0.848 us 0.06% SAME
I64 F32 I32 2^28 2^8 1.235 ms 0.19% 1.238 ms 0.18% 3.321 us 0.27% SLOW
I64 F64 I32 2^28 2^1 2.681 ms 0.65% 2.700 ms 0.66% 18.499 us 0.69% SLOW
I64 F64 I32 2^28 2^4 1.778 ms 0.39% 1.763 ms 0.51% -15.560 us -0.88% FAST
I64 F64 I32 2^28 2^8 1.577 ms 0.28% 1.564 ms 0.56% -13.492 us -0.86% FAST
I64 C64 I32 2^28 2^1 5.736 ms 0.20% 5.633 ms 0.22% -103.435 us -1.80% FAST
I64 C64 I32 2^28 2^4 4.897 ms 0.12% 4.777 ms 0.13% -120.014 us -2.45% FAST
I64 C64 I32 2^28 2^8 4.579 ms 0.12% 4.445 ms 0.12% -133.984 us -2.93% FAST
I128 I8 I32 2^28 2^1 3.566 ms 0.59% 3.560 ms 0.55% -5.921 us -0.17% SAME
I128 I8 I32 2^28 2^4 2.428 ms 0.78% 2.445 ms 0.78% 16.930 us 0.70% SAME
I128 I8 I32 2^28 2^8 2.194 ms 1.10% 2.195 ms 1.15% 0.625 us 0.03% SAME
I128 I16 I32 2^28 2^1 3.649 ms 0.78% 3.641 ms 0.74% -7.779 us -0.21% SAME
I128 I16 I32 2^28 2^4 2.362 ms 0.88% 2.361 ms 0.86% -0.855 us -0.04% SAME
I128 I16 I32 2^28 2^8 2.134 ms 1.08% 2.120 ms 1.15% -14.144 us -0.66% SAME
I128 I32 I32 2^28 2^1 3.539 ms 0.64% 3.537 ms 0.71% -1.413 us -0.04% SAME
I128 I32 I32 2^28 2^4 2.379 ms 0.91% 2.381 ms 0.86% 1.736 us 0.07% SAME
I128 I32 I32 2^28 2^8 2.149 ms 1.20% 2.151 ms 1.22% 2.119 us 0.10% SAME
I128 I64 I32 2^28 2^1 4.331 ms 0.65% 4.325 ms 0.64% -6.174 us -0.14% SAME
I128 I64 I32 2^28 2^4 3.016 ms 0.67% 3.016 ms 0.71% -0.017 us -0.00% SAME
I128 I64 I32 2^28 2^8 2.733 ms 1.06% 2.731 ms 1.03% -1.767 us -0.06% SAME
I128 I128 I32 2^28 2^1 6.869 ms 0.35% 6.895 ms 0.38% 26.568 us 0.39% SLOW
I128 I128 I32 2^28 2^4 5.371 ms 0.32% 5.399 ms 0.31% 27.652 us 0.51% SLOW
I128 I128 I32 2^28 2^8 5.069 ms 0.39% 5.088 ms 0.39% 18.667 us 0.37% SAME
I128 F32 I32 2^28 2^1 3.543 ms 0.64% 3.544 ms 0.68% 1.352 us 0.04% SAME
I128 F32 I32 2^28 2^4 2.381 ms 0.87% 2.385 ms 0.84% 3.690 us 0.15% SAME
I128 F32 I32 2^28 2^8 2.157 ms 1.23% 2.160 ms 1.20% 2.648 us 0.12% SAME
I128 F64 I32 2^28 2^1 4.089 ms 0.73% 4.097 ms 0.71% 8.141 us 0.20% SAME
I128 F64 I32 2^28 2^4 2.718 ms 1.00% 2.730 ms 1.06% 11.515 us 0.42% SAME
I128 F64 I32 2^28 2^8 2.407 ms 1.50% 2.415 ms 1.42% 7.752 us 0.32% SAME
I128 C64 I32 2^28 2^1 12.900 ms 0.17% 12.900 ms 0.17% 0.389 us 0.00% SAME
I128 C64 I32 2^28 2^4 11.663 ms 0.24% 11.663 ms 0.24% 0.153 us 0.00% SAME
I128 C64 I32 2^28 2^8 11.350 ms 0.28% 11.350 ms 0.28% -0.591 us -0.01% SAME

@fbusato fbusato enabled auto-merge (squash) November 22, 2024 23:18

🟩 CI finished in 3h 35m: Pass: 100%/224 | Total: 6d 16h | Avg: 42m 56s | Max: 1h 18m | Hits: 61%/12288
  • 🟩 thrust: Pass: 100%/111 | Total: 2d 13h | Avg: 33m 10s | Max: 1h 03m | Hits: 70%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 49m 29s | Avg: 24m 44s | Max: 34m 47s
    🟩 cpu
      🟩 amd64              Pass: 100%/103 | Total:  2d 09h | Avg: 33m 19s | Max:  1h 03m | Hits:  70%/9260  
      🟩 arm64              Pass: 100%/8   | Total:  4h 08m | Avg: 31m 07s | Max: 37m 26s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  7h 55m | Avg: 31m 40s | Max: 56m 44s | Hits:  62%/1852  
      🟩 11.8               Pass: 100%/3   | Total:  2h 04m | Avg: 41m 37s | Max: 46m 35s
      🟩 12.5               Pass: 100%/4   | Total:  3h 39m | Avg: 54m 47s | Max: 57m 00s
      🟩 12.6               Pass: 100%/89  | Total:  1d 23h | Avg: 32m 10s | Max:  1h 03m | Hits:  71%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 47m | Avg: 26m 54s | Max: 29m 35s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  7h 55m | Avg: 31m 40s | Max: 56m 44s | Hits:  62%/1852  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 04m | Avg: 41m 37s | Max: 46m 35s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  3h 39m | Avg: 54m 47s | Max: 57m 00s
      🟩 nvcc12.6           Pass: 100%/85  | Total:  1d 21h | Avg: 32m 24s | Max:  1h 03m | Hits:  71%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 47m | Avg: 26m 54s | Max: 29m 35s
      🟩 nvcc               Pass: 100%/107 | Total:  2d 11h | Avg: 33m 24s | Max:  1h 03m | Hits:  70%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  3h 09m | Avg: 31m 39s | Max: 36m 46s
      🟩 Clang10            Pass: 100%/3   | Total:  1h 48m | Avg: 36m 06s | Max: 40m 07s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 06m | Avg: 31m 41s | Max: 34m 09s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 15m | Avg: 33m 58s | Max: 36m 53s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 18m | Avg: 34m 36s | Max: 37m 45s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 09m | Avg: 32m 18s | Max: 33m 36s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 16m | Avg: 34m 01s | Max: 36m 53s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 14m | Avg: 33m 30s | Max: 37m 37s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 16m | Avg: 34m 01s | Max: 37m 50s
      🟩 Clang18            Pass: 100%/11  | Total:  4h 50m | Avg: 26m 23s | Max: 35m 45s
      🟩 GCC6               Pass: 100%/2   | Total: 55m 39s | Avg: 27m 49s | Max: 31m 27s
      🟩 GCC7               Pass: 100%/6   | Total:  3h 17m | Avg: 32m 58s | Max: 38m 30s
      🟩 GCC8               Pass: 100%/6   | Total:  3h 10m | Avg: 31m 43s | Max: 37m 15s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 13m | Avg: 32m 16s | Max: 38m 34s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 17m | Avg: 34m 26s | Max: 38m 45s
      🟩 GCC11              Pass: 100%/7   | Total:  4h 25m | Avg: 37m 59s | Max: 46m 35s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 13m | Avg: 33m 29s | Max: 36m 10s
      🟩 GCC13              Pass: 100%/16  | Total:  6h 09m | Avg: 23m 06s | Max: 41m 12s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 02m | Avg: 40m 53s | Max: 46m 14s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 56m 44s | Avg: 56m 44s | Max: 56m 44s | Hits:  62%/1852  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m | Hits:  62%/3704  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 28m | Avg: 44m 01s | Max:  1h 03m | Hits:  81%/3704  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  3h 39m | Avg: 54m 47s | Max: 57m 00s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 01h | Avg: 31m 46s | Max: 40m 07s
      🟩 GCC                Pass: 100%/51  | Total:  1d 01h | Avg: 30m 17s | Max: 46m 35s
      🟩 Intel              Pass: 100%/3   | Total:  2h 02m | Avg: 40m 53s | Max: 46m 14s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 30m | Avg: 54m 03s | Max:  1h 03m | Hits:  70%/9260  
      🟩 NVHPC              Pass: 100%/4   | Total:  3h 39m | Avg: 54m 47s | Max: 57m 00s
    🟩 gpu
      🟩 v100               Pass: 100%/111 | Total:  2d 13h | Avg: 33m 10s | Max:  1h 03m | Hits:  70%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  2d 11h | Avg: 34m 39s | Max:  1h 03m | Hits:  62%/7408  
      🟩 TestCPU            Pass: 100%/4   | Total: 46m 39s | Avg: 11m 39s | Max: 24m 21s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/4   | Total:  1h 04m | Avg: 16m 14s | Max: 21m 54s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 04m | Avg: 41m 37s | Max: 46m 35s
      🟩 90a                Pass: 100%/4   | Total:  1h 15m | Avg: 18m 47s | Max: 21m 39s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total: 13h 27m | Avg: 26m 55s | Max: 50m 46s
      🟩 14                 Pass: 100%/29  | Total: 17h 56m | Avg: 37m 06s | Max:  1h 03m | Hits:  62%/3704  
      🟩 17                 Pass: 100%/27  | Total: 16h 35m | Avg: 36m 52s | Max:  1h 01m | Hits:  62%/1852  
      🟩 20                 Pass: 100%/23  | Total: 12h 32m | Avg: 32m 43s | Max:  1h 03m | Hits:  81%/3704  
    
  • 🟩 cub: Pass: 100%/110 | Total: 4d 02h | Avg: 53m 43s | Max: 1h 18m | Hits: 36%/3028

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 18h | Avg: 53m 30s | Max:  1h 18m | Hits:  36%/3028  
      🟩 arm64              Pass: 100%/8   | Total:  7h 31m | Avg: 56m 27s | Max: 57m 35s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 12h 32m | Avg: 50m 09s | Max: 59m 43s | Hits:  36%/757   
      🟩 11.8               Pass: 100%/3   | Total:  3h 45m | Avg:  1h 15m | Max:  1h 18m
      🟩 12.5               Pass: 100%/4   | Total:  4h 30m | Avg:  1h 07m | Max:  1h 14m
      🟩 12.6               Pass: 100%/88  | Total:  3d 05h | Avg: 52m 58s | Max:  1h 11m | Hits:  36%/2271  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  4h 09m | Avg:  1h 02m | Max:  1h 03m
      🟩 nvcc11.1           Pass: 100%/15  | Total: 12h 32m | Avg: 50m 09s | Max: 59m 43s | Hits:  36%/757   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 45m | Avg:  1h 15m | Max:  1h 18m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 30m | Avg:  1h 07m | Max:  1h 14m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  3d 01h | Avg: 52m 31s | Max:  1h 11m | Hits:  36%/2271  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  4h 09m | Avg:  1h 02m | Max:  1h 03m
      🟩 nvcc               Pass: 100%/106 | Total:  3d 22h | Avg: 53m 24s | Max:  1h 18m | Hits:  36%/3028  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  5h 17m | Avg: 52m 55s | Max: 58m 37s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 47m | Avg: 55m 45s | Max: 58m 04s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 43m | Avg: 55m 55s | Max:  1h 00m
      🟩 Clang12            Pass: 100%/4   | Total:  3h 51m | Avg: 57m 55s | Max:  1h 01m
      🟩 Clang13            Pass: 100%/4   | Total:  3h 36m | Avg: 54m 02s | Max: 58m 47s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 37m | Avg: 54m 18s | Max: 58m 45s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 44m | Avg: 56m 11s | Max:  1h 00m
      🟩 Clang16            Pass: 100%/4   | Total:  3h 45m | Avg: 56m 24s | Max: 59m 03s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 39m | Avg: 54m 55s | Max: 58m 14s
      🟩 Clang18            Pass: 100%/11  | Total:  9h 55m | Avg: 54m 08s | Max:  1h 03m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 36m | Avg: 48m 29s | Max: 49m 48s
      🟩 GCC7               Pass: 100%/6   | Total:  5h 16m | Avg: 52m 40s | Max: 58m 11s
      🟩 GCC8               Pass: 100%/6   | Total:  5h 24m | Avg: 54m 05s | Max: 58m 30s
      🟩 GCC9               Pass: 100%/6   | Total:  5h 09m | Avg: 51m 38s | Max: 55m 37s
      🟩 GCC10              Pass: 100%/4   | Total:  4h 00m | Avg:  1h 00m | Max:  1h 02m
      🟩 GCC11              Pass: 100%/7   | Total:  7h 28m | Avg:  1h 04m | Max:  1h 18m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 38m | Avg: 54m 36s | Max: 58m 17s
      🟩 GCC13              Pass: 100%/16  | Total:  9h 55m | Avg: 37m 14s | Max: 57m 51s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  3h 02m | Avg:  1h 00m | Max:  1h 03m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 59m 43s | Avg: 59m 43s | Max: 59m 43s | Hits:  36%/757   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m | Hits:  36%/1514  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 10m | Avg:  1h 10m | Max:  1h 10m | Hits:  36%/757   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 30m | Avg:  1h 07m | Max:  1h 14m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 19h | Avg: 54m 59s | Max:  1h 03m
      🟩 GCC                Pass: 100%/51  | Total:  1d 18h | Avg: 50m 01s | Max:  1h 18m
      🟩 Intel              Pass: 100%/3   | Total:  3h 02m | Avg:  1h 00m | Max:  1h 03m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 26m | Avg:  1h 06m | Max:  1h 11m | Hits:  36%/3028  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 30m | Avg:  1h 07m | Max:  1h 14m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  4d 02h | Avg: 53m 43s | Max:  1h 18m | Hits:  36%/3028  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 22h | Avg: 55m 43s | Max:  1h 18m | Hits:  36%/3028  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 52s | Avg: 21m 52s | Max: 21m 52s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 09m | Avg: 23m 14s | Max: 26m 15s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 59m | Avg: 39m 44s | Max: 40m 29s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 45m | Avg:  1h 15m | Max:  1h 18m
      🟩 90a                Pass: 100%/4   | Total:  1h 37m | Avg: 24m 22s | Max: 25m 25s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 02h | Avg: 52m 02s | Max:  1h 12m
      🟩 14                 Pass: 100%/29  | Total:  1d 02h | Avg: 55m 23s | Max:  1h 18m | Hits:  36%/1514  
      🟩 17                 Pass: 100%/27  | Total:  1d 01h | Avg: 57m 21s | Max:  1h 15m | Hits:  36%/757   
      🟩 20                 Pass: 100%/24  | Total: 19h 53m | Avg: 49m 44s | Max:  1h 11m | Hits:  36%/757   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 05s | Avg: 5m 02s | Max: 7m 41s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  7m 41s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  7m 41s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  7m 41s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  7m 41s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  7m 41s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  7m 41s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  7m 41s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 24s | Avg:  2m 24s | Max:  2m 24s
      🟩 Test               Pass: 100%/1   | Total:  7m 41s | Avg:  7m 41s | Max:  7m 41s
    
  • 🟩 python: Pass: 100%/1 | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 16m 41s | Avg: 16m 41s | Max: 16m 41s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 224)

# Runner
185 linux-amd64-cpu16
16 linux-arm64-cpu16
14 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16

{
return cub::internal::ThreadReduceSequential<AccumT>(input, reduction_op);
}
_CCCL_IF_CONSTEXPR (cuda::std::is_same<ReductionOp, ::cuda::std::plus<>>::value

Suggestion: Sadly, we may want to include more operators here:

Suggested change
- _CCCL_IF_CONSTEXPR (cuda::std::is_same<ReductionOp, ::cuda::std::plus<>>::value
+ _CCCL_IF_CONSTEXPR (cuda::std::is_same<ReductionOp, ::cuda::std::plus<>>::value ||
+ cuda::std::is_same<ReductionOp, ::cuda::std::plus<ValueT>>::value

We may also just test whether ReductionOp is any instantiation of ::cuda::std::plus.

Technically, this also does not include thrust::plus, but that will be fixed at the next major release.
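
One way to test for "any instantiation of plus" is a small partial-specialization trait. The sketch below is an assumption, not code from this PR: `is_plus_instantiation` is a hypothetical name, and it uses `std::plus` from the host standard library as a stand-in for `::cuda::std::plus` so that it compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Hypothetical trait (not part of CCCL): false for arbitrary operator types.
template <typename Op>
struct is_plus_instantiation : std::false_type
{};

// True for every instantiation of plus, including the transparent
// plus<> form, which is plus<void>.
template <typename T>
struct is_plus_instantiation<std::plus<T>> : std::true_type
{};

static_assert(is_plus_instantiation<std::plus<>>::value, "transparent plus matches");
static_assert(is_plus_instantiation<std::plus<int>>::value, "typed plus matches");
static_assert(!is_plus_instantiation<std::multiplies<int>>::value, "other operators do not match");

// Usage: the trait can replace a chain of is_same checks in a dispatch condition.
inline void usage_example()
{
  assert(is_plus_instantiation<std::plus<unsigned>>::value);
  assert(!is_plus_instantiation<std::minus<>>::value);
}
```

A trait like this would still not match `thrust::plus` unless that type is an alias of, or gets its own specialization for, the same template.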

@bernhardmgruber

Thanks for reporting the benchmarks. They look good except for Reduce Max on I8, I32, 2^28. A 14% slowdown unfortunately violates @gevtushenko's rule of "no regressions of more than 2% compared to previous implementation on 2^24+ problem sizes". Could you please investigate the cause of this regression? We should try to fix this.
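
For reference, the arithmetic behind that check can be sketched as follows. This is an illustrative helper, not part of the benchmark tooling: `percent_diff` and `violates_regression_rule` are hypothetical names, and the 2% and 2^24 thresholds are taken from the rule quoted above. The input times are the Ref Time and Cmp Time columns of the tables in this thread.

```cpp
#include <cassert>
#include <cmath>

// Relative change of the compared run versus the reference, in percent.
// Positive values mean the compared run is slower.
constexpr double percent_diff(double ref_time, double cmp_time)
{
  return (cmp_time - ref_time) / ref_time * 100.0;
}

// Hypothetical check for the quoted rule: more than 2% slower
// on a problem size of at least 2^24 elements.
constexpr bool violates_regression_rule(double ref_time, double cmp_time, int log2_elements)
{
  return log2_elements >= 24 && percent_diff(ref_time, cmp_time) > 2.0;
}

// Usage with two rows of the Reduce Max table (times in microseconds).
inline void usage_example()
{
  // I8 / I32 / 2^28: 107.167 us -> 122.152 us, about 13.98% slower.
  assert(std::fabs(percent_diff(107.167, 122.152) - 13.98) < 0.01);
  assert(violates_regression_rule(107.167, 122.152, 28));
  // I8 / I64 / 2^28: 108.946 us -> 109.357 us, well under 2%.
  assert(!violates_regression_rule(108.946, 109.357, 28));
}
```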
