Port thrust::merge[_by_key] to CUB #1817

Merged 12 commits on Jul 23, 2024

@bernhardmgruber bernhardmgruber (Contributor) commented Jun 6, 2024

This PR ports the CUDA implementation of thrust::merge[_by_key] to CUB.

CUB already has an implementation of merge sort, also based on the merge path algorithm, so some infrastructure is taken from there, replacing the thrust implementation. However, whenever I was not sure what to pick, I preferred the thrust implementation and will attempt a unification in a separate PR after benchmarks are available. This especially concerns the agents.

The following new constructs are added:

  • Public API entry points: cub::DeviceMerge with MergeKeys and MergePairs, including documentation.
  • Internal dispatch layer: cub::detail::DispatchMerge with policy selection, a fallback agent, and a vsmem implementation, including a set of tuning policies taken over from the thrust implementation.
  • Internal agent cub::detail::AgentPartitionMergePath implementing merge path partitioning.
  • Internal agent cub::detail::AgentMergeNoSort implementing the device-side merging. It is very similar to merge sort's AgentMerge, and I plan to unify the implementations in a separate step.
  • Extensive unit tests for the new CUB API.
  • A few comments, drive-by fixes and refactorings.
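For readers unfamiliar with merge path, the partition-then-merge scheme that the two agents split between them can be sketched on the host as below. This is a minimal illustration, not the CUB code: `merge_path` and `tiled_merge` are hypothetical names, the tile loop stands in for what would be independent thread blocks on the device, and `operator<` is assumed as the comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merge path search (hypothetical helper, for illustration only):
// given that `diag` output elements have been produced by merging the sorted
// ranges a and b, return how many of them were taken from a.
template <typename T>
std::size_t merge_path(const std::vector<T>& a, const std::vector<T>& b, std::size_t diag)
{
  std::size_t lo = diag > b.size() ? diag - b.size() : 0;
  std::size_t hi = std::min(diag, a.size());
  while (lo < hi)
  {
    const std::size_t mid = lo + (hi - lo) / 2;
    // Compare a[mid] against the element of b on the same diagonal.
    if (b[diag - 1 - mid] < a[mid])
    {
      hi = mid; // a[mid] comes after b's element: take fewer from a
    }
    else
    {
      lo = mid + 1; // ties go to a first, which keeps the merge stable
    }
  }
  return lo;
}

// Merges a and b tile by tile. Every tile is independent of the others,
// which is what makes the scheme parallelizable.
template <typename T>
std::vector<T> tiled_merge(const std::vector<T>& a, const std::vector<T>& b, std::size_t tile_size)
{
  assert(tile_size > 0);
  const std::size_t total = a.size() + b.size();
  std::vector<T> out(total);
  for (std::size_t begin = 0; begin < total; begin += tile_size)
  {
    const std::size_t end = std::min(begin + tile_size, total);
    const std::size_t a0  = merge_path(a, b, begin); // partitioning step
    const std::size_t a1  = merge_path(a, b, end);
    // Merging step: serial merge of the sub-ranges that belong to this tile.
    std::merge(a.begin() + a0, a.begin() + a1,
               b.begin() + (begin - a0), b.begin() + (end - a1),
               out.begin() + begin);
  }
  return out;
}
```

Calling `tiled_merge(a, b, tile_size)` with any positive tile size should give the same result as a single `std::merge` over both inputs.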

A full tuning is probably necessary and will be conducted in a separate PR.

Compile-time of thrust.test.merge:

Before: 0m26.798s
After:  0m36.763s

Benchmark of `thrust.bench.merge`:

Running nvbench_compare.py base.json branch.json

## [0] NVIDIA H100 PCIe

|  T{ct}  |  Elements  |  Entropy  |  InputSizeRatio  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|-----------|------------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^16    |     1     |        25        |  21.842 us |       1.43% |  22.034 us |       5.27% |   0.192 us |   0.88% |   PASS   |
|   I8    |    2^20    |     1     |        25        |  27.602 us |       1.56% |  27.763 us |       2.02% |   0.161 us |   0.58% |   PASS   |
|   I8    |    2^24    |     1     |        25        |  66.583 us |       0.82% |  64.971 us |       0.88% |  -1.613 us |  -2.42% |   FAIL   |
|   I8    |    2^28    |     1     |        25        | 634.330 us |       0.31% | 613.761 us |       0.34% | -20.570 us |  -3.24% |   FAIL   |
|   I8    |    2^16    |   0.201   |        25        |  21.873 us |       1.78% |  22.052 us |       2.17% |   0.179 us |   0.82% |   PASS   |
|   I8    |    2^20    |   0.201   |        25        |  27.326 us |       3.29% |  27.322 us |       2.38% |  -0.004 us |  -0.02% |   PASS   |
|   I8    |    2^24    |   0.201   |        25        |  65.255 us |       0.64% |  63.658 us |       1.00% |  -1.597 us |  -2.45% |   FAIL   |
|   I8    |    2^28    |   0.201   |        25        | 630.302 us |       0.32% | 609.649 us |       0.36% | -20.653 us |  -3.28% |   FAIL   |
|   I8    |    2^16    |     1     |        50        |  22.411 us |       1.52% |  22.510 us |       3.64% |   0.099 us |   0.44% |   PASS   |
|   I8    |    2^20    |     1     |        50        |  28.515 us |       2.31% |  28.510 us |       1.29% |  -0.005 us |  -0.02% |   PASS   |
|   I8    |    2^24    |     1     |        50        |  67.800 us |       0.64% |  66.197 us |       0.66% |  -1.604 us |  -2.37% |   FAIL   |
|   I8    |    2^28    |     1     |        50        | 649.773 us |       0.46% | 629.247 us |       0.41% | -20.526 us |  -3.16% |   FAIL   |
|   I8    |    2^16    |   0.201   |        50        |  22.496 us |       2.60% |  22.547 us |       3.03% |   0.051 us |   0.23% |   PASS   |
|   I8    |    2^20    |   0.201   |        50        |  27.792 us |       1.56% |  27.946 us |       1.89% |   0.154 us |   0.56% |   PASS   |
|   I8    |    2^24    |   0.201   |        50        |  66.455 us |       0.60% |  64.921 us |       1.09% |  -1.534 us |  -2.31% |   FAIL   |
|   I8    |    2^28    |   0.201   |        50        | 637.270 us |       0.34% | 616.673 us |       0.38% | -20.597 us |  -3.23% |   FAIL   |
|   I8    |    2^16    |     1     |        75        |  22.178 us |       1.84% |  22.339 us |       2.51% |   0.161 us |   0.73% |   PASS   |
|   I8    |    2^20    |     1     |        75        |  28.112 us |       2.04% |  28.157 us |       1.54% |   0.045 us |   0.16% |   PASS   |
|   I8    |    2^24    |     1     |        75        |  66.745 us |       0.60% |  65.294 us |       1.18% |  -1.450 us |  -2.17% |   FAIL   |
|   I8    |    2^28    |     1     |        75        | 633.745 us |       0.31% | 612.842 us |       0.38% | -20.904 us |  -3.30% |   FAIL   |
|   I8    |    2^16    |   0.201   |        75        |  22.198 us |       2.97% |  22.337 us |       2.12% |   0.140 us |   0.63% |   PASS   |
|   I8    |    2^20    |   0.201   |        75        |  27.540 us |       2.01% |  27.694 us |       2.33% |   0.154 us |   0.56% |   PASS   |
|   I8    |    2^24    |   0.201   |        75        |  65.302 us |       0.94% |  64.032 us |       0.69% |  -1.270 us |  -1.94% |   FAIL   |
|   I8    |    2^28    |   0.201   |        75        | 626.013 us |       0.31% | 605.203 us |       0.35% | -20.810 us |  -3.32% |   FAIL   |
|   I16   |    2^16    |     1     |        25        |  23.348 us |       4.09% |  23.067 us |       1.57% |  -0.280 us |  -1.20% |   PASS   |
|   I16   |    2^20    |     1     |        25        |  30.352 us |       1.47% |  30.365 us |       1.18% |   0.013 us |   0.04% |   PASS   |
|   I16   |    2^24    |     1     |        25        |  82.447 us |       0.58% |  81.960 us |       0.47% |  -0.488 us |  -0.59% |   FAIL   |
|   I16   |    2^28    |     1     |        25        | 886.267 us |       0.25% | 879.144 us |       0.27% |  -7.123 us |  -0.80% |   FAIL   |
|   I16   |    2^16    |   0.201   |        25        |  23.129 us |       2.16% |  23.080 us |       1.97% |  -0.049 us |  -0.21% |   PASS   |
|   I16   |    2^20    |   0.201   |        25        |  30.261 us |       2.50% |  30.202 us |       1.33% |  -0.058 us |  -0.19% |   PASS   |
|   I16   |    2^24    |   0.201   |        25        |  75.287 us |       0.67% |  75.266 us |       0.66% |  -0.021 us |  -0.03% |   PASS   |
|   I16   |    2^28    |   0.201   |        25        | 748.692 us |       0.74% | 746.226 us |       0.78% |  -2.466 us |  -0.33% |   PASS   |
|   I16   |    2^16    |     1     |        50        |  23.782 us |       2.75% |  23.685 us |       2.17% |  -0.097 us |  -0.41% |   PASS   |
|   I16   |    2^20    |     1     |        50        |  30.969 us |       1.89% |  31.105 us |       1.34% |   0.137 us |   0.44% |   PASS   |
|   I16   |    2^24    |     1     |        50        |  84.724 us |       0.46% |  83.747 us |       0.93% |  -0.978 us |  -1.15% |   FAIL   |
|   I16   |    2^28    |     1     |        50        | 920.192 us |       0.19% | 901.230 us |       0.27% | -18.962 us |  -2.06% |   FAIL   |
|   I16   |    2^16    |   0.201   |        50        |  23.521 us |       3.51% |  23.697 us |       3.24% |   0.176 us |   0.75% |   PASS   |
|   I16   |    2^20    |   0.201   |        50        |  30.280 us |       1.76% |  30.764 us |       2.17% |   0.484 us |   1.60% |   PASS   |
|   I16   |    2^24    |   0.201   |        50        |  75.895 us |       0.59% |  76.160 us |       0.72% |   0.264 us |   0.35% |   PASS   |
|   I16   |    2^28    |   0.201   |        50        | 757.585 us |       0.77% | 755.773 us |       0.82% |  -1.812 us |  -0.24% |   PASS   |
|   I16   |    2^16    |     1     |        75        |  22.821 us |       2.00% |  23.086 us |       1.89% |   0.265 us |   1.16% |   PASS   |
|   I16   |    2^20    |     1     |        75        |  30.086 us |       1.32% |  30.490 us |       1.69% |   0.404 us |   1.34% |   FAIL   |
|   I16   |    2^24    |     1     |        75        |  82.047 us |       0.54% |  82.065 us |       0.57% |   0.018 us |   0.02% |   PASS   |
|   I16   |    2^28    |     1     |        75        | 886.565 us |       0.25% | 880.541 us |       0.28% |  -6.024 us |  -0.68% |   FAIL   |
|   I16   |    2^16    |   0.201   |        75        |  22.921 us |       1.86% |  23.136 us |       1.78% |   0.215 us |   0.94% |   PASS   |
|   I16   |    2^20    |   0.201   |        75        |  29.565 us |       1.29% |  30.108 us |       1.75% |   0.542 us |   1.83% |   FAIL   |
|   I16   |    2^24    |   0.201   |        75        |  75.021 us |       0.65% |  75.377 us |       0.79% |   0.355 us |   0.47% |   PASS   |
|   I16   |    2^28    |   0.201   |        75        | 744.530 us |       0.75% | 742.591 us |       0.80% |  -1.939 us |  -0.26% |   PASS   |
|   I32   |    2^16    |     1     |        25        |  23.432 us |       2.73% |  23.592 us |       2.23% |   0.159 us |   0.68% |   PASS   |
|   I32   |    2^20    |     1     |        25        |  31.594 us |       1.67% |  31.813 us |       1.21% |   0.219 us |   0.69% |   PASS   |
|   I32   |    2^24    |     1     |        25        | 103.510 us |       0.76% | 103.372 us |       0.64% |  -0.138 us |  -0.13% |   PASS   |
|   I32   |    2^28    |     1     |        25        |   1.263 ms |       0.18% |   1.262 ms |       0.14% |  -1.065 us |  -0.08% |   PASS   |
|   I32   |    2^16    |   0.201   |        25        |  24.005 us |       2.79% |  23.727 us |       3.65% |  -0.278 us |  -1.16% |   PASS   |
|   I32   |    2^20    |   0.201   |        25        |  31.509 us |       1.40% |  31.738 us |       1.17% |   0.229 us |   0.73% |   PASS   |
|   I32   |    2^24    |   0.201   |        25        | 103.135 us |       0.91% | 103.225 us |       0.77% |   0.090 us |   0.09% |   PASS   |
|   I32   |    2^28    |   0.201   |        25        |   1.244 ms |       0.12% |   1.245 ms |       0.12% |   1.002 us |   0.08% |   PASS   |
|   I32   |    2^16    |     1     |        50        |  24.019 us |       2.15% |  24.208 us |       2.70% |   0.189 us |   0.79% |   PASS   |
|   I32   |    2^20    |     1     |        50        |  32.874 us |       1.61% |  32.321 us |       1.25% |  -0.553 us |  -1.68% |   FAIL   |
|   I32   |    2^24    |     1     |        50        | 105.442 us |       0.78% | 104.331 us |       0.74% |  -1.111 us |  -1.05% |   FAIL   |
|   I32   |    2^28    |     1     |        50        |   1.280 ms |       0.19% |   1.279 ms |       0.16% |  -1.701 us |  -0.13% |   PASS   |
|   I32   |    2^16    |   0.201   |        50        |  25.211 us |       2.87% |  24.050 us |       2.10% |  -1.161 us |  -4.61% |   FAIL   |
|   I32   |    2^20    |   0.201   |        50        |  32.644 us |       2.14% |  32.125 us |       2.52% |  -0.519 us |  -1.59% |   PASS   |
|   I32   |    2^24    |   0.201   |        50        | 104.596 us |       0.73% | 104.011 us |       0.79% |  -0.586 us |  -0.56% |   PASS   |
|   I32   |    2^28    |   0.201   |        50        |   1.258 ms |       0.12% |   1.259 ms |       0.13% |   0.624 us |   0.05% |   PASS   |
|   I32   |    2^16    |     1     |        75        |  23.757 us |       2.02% |  23.200 us |       1.98% |  -0.557 us |  -2.34% |   FAIL   |
|   I32   |    2^20    |     1     |        75        |  32.017 us |       2.02% |  31.441 us |       1.17% |  -0.575 us |  -1.80% |   FAIL   |
|   I32   |    2^24    |     1     |        75        | 103.727 us |       0.80% | 102.959 us |       0.66% |  -0.768 us |  -0.74% |   FAIL   |
|   I32   |    2^28    |     1     |        75        |   1.264 ms |       0.20% |   1.261 ms |       0.14% |  -2.664 us |  -0.21% |   FAIL   |
|   I32   |    2^16    |   0.201   |        75        |  24.320 us |       2.78% |  23.597 us |       2.68% |  -0.723 us |  -2.97% |   FAIL   |
|   I32   |    2^20    |   0.201   |        75        |  32.081 us |       2.07% |  31.419 us |       1.12% |  -0.662 us |  -2.06% |   FAIL   |
|   I32   |    2^24    |   0.201   |        75        | 103.497 us |       0.77% | 102.911 us |       0.73% |  -0.585 us |  -0.57% |   PASS   |
|   I32   |    2^28    |   0.201   |        75        |   1.243 ms |       0.12% |   1.243 ms |       0.11% |   0.659 us |   0.05% |   PASS   |
|   I64   |    2^16    |     1     |        25        |  24.667 us |       2.80% |  24.259 us |       2.35% |  -0.408 us |  -1.66% |   PASS   |
|   I64   |    2^20    |     1     |        25        |  36.047 us |       1.21% |  35.505 us |       2.18% |  -0.543 us |  -1.51% |   FAIL   |
|   I64   |    2^24    |     1     |        25        | 179.325 us |       0.71% | 178.824 us |       0.63% |  -0.501 us |  -0.28% |   PASS   |
|   I64   |    2^28    |     1     |        25        |   2.508 ms |       0.10% |   2.507 ms |       0.07% |  -0.441 us |  -0.02% |   PASS   |
|   I64   |    2^16    |   0.201   |        25        |  25.525 us |       2.59% |  24.087 us |       2.99% |  -1.438 us |  -5.63% |   FAIL   |
|   I64   |    2^20    |   0.201   |        25        |  36.092 us |       1.31% |  35.179 us |       1.18% |  -0.913 us |  -2.53% |   FAIL   |
|   I64   |    2^24    |   0.201   |        25        | 178.690 us |       0.68% | 178.080 us |       0.68% |  -0.610 us |  -0.34% |   PASS   |
|   I64   |    2^28    |   0.201   |        25        |   2.474 ms |       0.06% |   2.474 ms |       0.06% |  -0.062 us |  -0.00% |   PASS   |
|   I64   |    2^16    |     1     |        50        |  25.163 us |       2.83% |  24.989 us |       2.70% |  -0.174 us |  -0.69% |   PASS   |
|   I64   |    2^20    |     1     |        50        |  37.091 us |       1.96% |  36.944 us |       1.84% |  -0.146 us |  -0.39% |   PASS   |
|   I64   |    2^24    |     1     |        50        | 181.273 us |       0.74% | 180.916 us |       0.71% |  -0.357 us |  -0.20% |   PASS   |
|   I64   |    2^28    |     1     |        50        |   2.548 ms |       0.12% |   2.546 ms |       0.08% |  -2.261 us |  -0.09% |   FAIL   |
|   I64   |    2^16    |   0.201   |        50        |  25.618 us |       3.17% |  25.167 us |       3.04% |  -0.450 us |  -1.76% |   PASS   |
|   I64   |    2^20    |   0.201   |        50        |  36.969 us |       1.10% |  36.682 us |       1.18% |  -0.287 us |  -0.78% |   PASS   |
|   I64   |    2^24    |   0.201   |        50        | 180.366 us |       0.73% | 180.155 us |       0.72% |  -0.211 us |  -0.12% |   PASS   |
|   I64   |    2^28    |   0.201   |        50        |   2.516 ms |       0.06% |   2.516 ms |       0.06% |   0.382 us |   0.02% |   PASS   |
|   I64   |    2^16    |     1     |        75        |  24.480 us |       2.20% |  24.336 us |       2.50% |  -0.144 us |  -0.59% |   PASS   |
|   I64   |    2^20    |     1     |        75        |  35.954 us |       1.26% |  35.794 us |       1.48% |  -0.160 us |  -0.45% |   PASS   |
|   I64   |    2^24    |     1     |        75        | 179.168 us |       0.70% | 178.804 us |       0.66% |  -0.365 us |  -0.20% |   PASS   |
|   I64   |    2^28    |     1     |        75        |   2.509 ms |       0.10% |   2.508 ms |       0.08% |  -1.246 us |  -0.05% |   PASS   |
|   I64   |    2^16    |   0.201   |        75        |  25.163 us |       3.20% |  24.693 us |       1.95% |  -0.469 us |  -1.87% |   PASS   |
|   I64   |    2^20    |   0.201   |        75        |  36.059 us |       1.94% |  35.744 us |       1.72% |  -0.314 us |  -0.87% |   PASS   |
|   I64   |    2^24    |   0.201   |        75        | 178.525 us |       0.65% | 178.218 us |       0.64% |  -0.307 us |  -0.17% |   PASS   |
|   I64   |    2^28    |   0.201   |        75        |   2.475 ms |       0.06% |   2.476 ms |       0.06% |   0.435 us |   0.02% |   PASS   |
|  I128   |    2^16    |     1     |        25        |  25.868 us |       1.83% |  25.704 us |       2.39% |  -0.164 us |  -0.63% |   PASS   |
|  I128   |    2^20    |     1     |        25        |  45.469 us |       1.38% |  45.148 us |       1.43% |  -0.321 us |  -0.71% |   PASS   |
|  I128   |    2^24    |     1     |        25        | 333.161 us |       0.32% | 333.008 us |       0.34% |  -0.154 us |  -0.05% |   PASS   |
|  I128   |    2^28    |     1     |        25        |   5.051 ms |       0.08% |   5.043 ms |       0.04% |  -8.013 us |  -0.16% |   FAIL   |
|  I128   |    2^16    |   0.201   |        25        |  26.753 us |       2.04% |  25.709 us |       1.69% |  -1.044 us |  -3.90% |   FAIL   |
|  I128   |    2^20    |   0.201   |        25        |  46.680 us |       1.72% |  46.337 us |       1.33% |  -0.342 us |  -0.73% |   PASS   |
|  I128   |    2^24    |   0.201   |        25        | 334.717 us |       2.16% | 334.350 us |       2.14% |  -0.368 us |  -0.11% |   PASS   |
|  I128   |    2^28    |   0.201   |        25        |   4.961 ms |       0.89% |   4.961 ms |       0.87% |  -0.137 us |  -0.00% |   PASS   |
|  I128   |    2^16    |     1     |        50        |  27.132 us |       2.11% |  26.928 us |       2.47% |  -0.204 us |  -0.75% |   PASS   |
|  I128   |    2^20    |     1     |        50        |  46.695 us |       1.25% |  46.189 us |       1.32% |  -0.506 us |  -1.08% |   PASS   |
|  I128   |    2^24    |     1     |        50        | 349.007 us |       1.83% | 348.394 us |       1.83% |  -0.614 us |  -0.18% |   PASS   |
|  I128   |    2^28    |     1     |        50        |   5.222 ms |       0.57% |   5.168 ms |       0.18% | -54.041 us |  -1.03% |   FAIL   |
|  I128   |    2^16    |   0.201   |        50        |  28.194 us |       3.60% |  27.417 us |       2.57% |  -0.778 us |  -2.76% |   FAIL   |
|  I128   |    2^20    |   0.201   |        50        |  47.612 us |       1.44% |  47.502 us |       1.90% |  -0.110 us |  -0.23% |   PASS   |
|  I128   |    2^24    |   0.201   |        50        | 342.038 us |       2.15% | 342.103 us |       2.17% |   0.065 us |   0.02% |   PASS   |
|  I128   |    2^28    |   0.201   |        50        |   5.072 ms |       0.74% |   5.072 ms |       0.88% |   0.732 us |   0.01% |   PASS   |
|  I128   |    2^16    |     1     |        75        |  26.335 us |       2.80% |  26.007 us |       1.66% |  -0.328 us |  -1.25% |   PASS   |
|  I128   |    2^20    |     1     |        75        |  45.377 us |       1.49% |  45.100 us |       1.43% |  -0.277 us |  -0.61% |   PASS   |
|  I128   |    2^24    |     1     |        75        | 339.315 us |       1.72% | 336.495 us |       1.85% |  -2.820 us |  -0.83% |   PASS   |
|  I128   |    2^28    |     1     |        75        |   5.067 ms |       0.10% |   5.052 ms |       0.10% | -14.962 us |  -0.30% |   FAIL   |
|  I128   |    2^16    |   0.201   |        75        |  27.758 us |       4.32% |  27.036 us |       1.78% |  -0.722 us |  -2.60% |   FAIL   |
|  I128   |    2^20    |   0.201   |        75        |  46.437 us |       1.21% |  46.439 us |       1.51% |   0.003 us |   0.01% |   PASS   |
|  I128   |    2^24    |   0.201   |        75        | 334.894 us |       2.10% | 334.996 us |       2.08% |   0.101 us |   0.03% |   PASS   |
|  I128   |    2^28    |   0.201   |        75        |   4.962 ms |       0.75% |   4.962 ms |       0.84% |  -0.037 us |  -0.00% |   PASS   |
|   F32   |    2^16    |     1     |        25        |  24.468 us |       2.21% |  23.690 us |       2.03% |  -0.778 us |  -3.18% |   FAIL   |
|   F32   |    2^20    |     1     |        25        |  31.909 us |       2.10% |  31.222 us |       1.67% |  -0.687 us |  -2.15% |   FAIL   |
|   F32   |    2^24    |     1     |        25        | 107.083 us |       1.02% | 104.880 us |       0.69% |  -2.203 us |  -2.06% |   FAIL   |
|   F32   |    2^28    |     1     |        25        |   1.294 ms |       1.44% |   1.272 ms |       2.79% | -21.351 us |  -1.65% |   FAIL   |
|   F32   |    2^16    |   0.201   |        25        |  28.905 us |       7.76% |  26.563 us |       6.99% |  -2.342 us |  -8.10% |   FAIL   |
|   F32   |    2^20    |   0.201   |        25        |  32.249 us |       1.66% |  31.679 us |       1.64% |  -0.570 us |  -1.77% |   FAIL   |
|   F32   |    2^24    |   0.201   |        25        | 103.881 us |       0.80% | 102.963 us |       0.68% |  -0.918 us |  -0.88% |   FAIL   |
|   F32   |    2^28    |   0.201   |        25        |   1.248 ms |       0.15% |   1.247 ms |       0.14% |  -1.469 us |  -0.12% |   PASS   |
|   F32   |    2^16    |     1     |        50        |  26.928 us |       1.97% |  24.492 us |       3.35% |  -2.436 us |  -9.04% |   FAIL   |
|   F32   |    2^20    |     1     |        50        |  33.311 us |       1.48% |  32.632 us |       1.30% |  -0.679 us |  -2.04% |   FAIL   |
|   F32   |    2^24    |     1     |        50        | 106.953 us |       1.16% | 104.617 us |       0.80% |  -2.336 us |  -2.18% |   FAIL   |
|   F32   |    2^28    |     1     |        50        |   1.293 ms |       0.23% |   1.285 ms |       0.16% |  -7.857 us |  -0.61% |   FAIL   |
|   F32   |    2^16    |   0.201   |        50        |  27.595 us |       7.31% |  25.489 us |       4.77% |  -2.106 us |  -7.63% |   FAIL   |
|   F32   |    2^20    |   0.201   |        50        |  32.840 us |       1.39% |  32.344 us |       1.98% |  -0.495 us |  -1.51% |   FAIL   |
|   F32   |    2^24    |   0.201   |        50        | 104.970 us |       0.91% | 104.079 us |       0.69% |  -0.891 us |  -0.85% |   FAIL   |
|   F32   |    2^28    |   0.201   |        50        |   1.263 ms |       0.15% |   1.261 ms |       0.15% |  -1.815 us |  -0.14% |   PASS   |
|   F32   |    2^16    |     1     |        75        |  25.405 us |       2.59% |  24.367 us |       2.35% |  -1.038 us |  -4.08% |   FAIL   |
|   F32   |    2^20    |     1     |        75        |  32.721 us |       2.47% |  32.151 us |       1.91% |  -0.570 us |  -1.74% |   PASS   |
|   F32   |    2^24    |     1     |        75        | 105.707 us |       1.15% | 104.060 us |       0.88% |  -1.647 us |  -1.56% |   FAIL   |
|   F32   |    2^28    |     1     |        75        |   1.303 ms |       1.39% |   1.270 ms |       0.15% | -32.851 us |  -2.52% |   FAIL   |
|   F32   |    2^16    |   0.201   |        75        |  26.951 us |       9.55% |  25.223 us |       5.42% |  -1.727 us |  -6.41% |   FAIL   |
|   F32   |    2^20    |   0.201   |        75        |  32.148 us |       1.67% |  31.612 us |       1.36% |  -0.535 us |  -1.66% |   FAIL   |
|   F32   |    2^24    |   0.201   |        75        | 104.034 us |       0.93% | 102.767 us |       0.67% |  -1.267 us |  -1.22% |   FAIL   |
|   F32   |    2^28    |   0.201   |        75        |   1.250 ms |       0.15% |   1.248 ms |       0.14% |  -1.766 us |  -0.14% |   PASS   |
|   F64   |    2^16    |     1     |        25        |  26.085 us |       2.88% |  24.945 us |       3.05% |  -1.140 us |  -4.37% |   FAIL   |
|   F64   |    2^20    |     1     |        25        |  36.421 us |       1.92% |  35.957 us |       1.28% |  -0.464 us |  -1.27% |   PASS   |
|   F64   |    2^24    |     1     |        25        | 181.732 us |       0.83% | 180.366 us |       0.75% |  -1.365 us |  -0.75% |   PASS   |
|   F64   |    2^28    |     1     |        25        |   2.559 ms |       1.11% |   2.520 ms |       0.14% | -38.735 us |  -1.51% |   FAIL   |
|   F64   |    2^16    |   0.201   |        25        |  28.710 us |      11.78% |  26.538 us |       6.57% |  -2.172 us |  -7.57% |   FAIL   |
|   F64   |    2^20    |   0.201   |        25        |  35.939 us |       1.46% |  35.834 us |       1.35% |  -0.106 us |  -0.29% |   PASS   |
|   F64   |    2^24    |   0.201   |        25        | 180.093 us |       0.76% | 179.075 us |       0.68% |  -1.017 us |  -0.56% |   PASS   |
|   F64   |    2^28    |   0.201   |        25        |   2.490 ms |       0.06% |   2.487 ms |       0.07% |  -3.278 us |  -0.13% |   FAIL   |
|   F64   |    2^16    |     1     |        50        |  26.299 us |       3.27% |  25.883 us |       2.51% |  -0.416 us |  -1.58% |   PASS   |
|   F64   |    2^20    |     1     |        50        |  37.501 us |       1.55% |  37.377 us |       1.36% |  -0.125 us |  -0.33% |   PASS   |
|   F64   |    2^24    |     1     |        50        | 184.810 us |       1.03% | 183.148 us |       0.87% |  -1.662 us |  -0.90% |   FAIL   |
|   F64   |    2^28    |     1     |        50        |   2.636 ms |       1.66% |   2.565 ms |       0.13% | -70.859 us |  -2.69% |   FAIL   |
|   F64   |    2^16    |   0.201   |        50        |  28.315 us |       9.15% |  27.161 us |       6.87% |  -1.154 us |  -4.08% |   PASS   |
|   F64   |    2^20    |   0.201   |        50        |  37.211 us |       1.41% |  36.963 us |       1.43% |  -0.249 us |  -0.67% |   PASS   |
|   F64   |    2^24    |   0.201   |        50        | 182.227 us |       0.80% | 181.027 us |       0.73% |  -1.200 us |  -0.66% |   PASS   |
|   F64   |    2^28    |   0.201   |        50        |   2.532 ms |       0.08% |   2.528 ms |       0.07% |  -4.706 us |  -0.19% |   FAIL   |
|   F64   |    2^16    |     1     |        75        |  26.496 us |       2.94% |  25.177 us |       2.75% |  -1.319 us |  -4.98% |   FAIL   |
|   F64   |    2^20    |     1     |        75        |  36.590 us |       1.36% |  36.486 us |       2.13% |  -0.104 us |  -0.29% |   PASS   |
|   F64   |    2^24    |     1     |        75        | 182.859 us |       1.08% | 181.260 us |       0.79% |  -1.599 us |  -0.87% |   FAIL   |
|   F64   |    2^28    |     1     |        75        |   2.620 ms |       3.19% |   2.525 ms |       0.21% | -95.083 us |  -3.63% |   FAIL   |
|   F64   |    2^16    |   0.201   |        75        |  29.545 us |      12.10% |  27.989 us |      10.39% |  -1.555 us |  -5.26% |   PASS   |
|   F64   |    2^20    |   0.201   |        75        |  36.625 us |       1.47% |  36.318 us |       1.25% |  -0.307 us |  -0.84% |   PASS   |
|   F64   |    2^24    |   0.201   |        75        | 180.201 us |       0.76% | 179.373 us |       0.66% |  -0.828 us |  -0.46% |   PASS   |
|   F64   |    2^28    |   0.201   |        75        |   2.492 ms |       0.07% |   2.487 ms |       0.06% |  -5.111 us |  -0.21% |   FAIL   |
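When reading the table, note that Diff is Cmp Time minus Ref Time, so a negative value means the compared run (presumably the branch, given the argument order `base.json branch.json`) was faster. The sketch below reproduces the derived columns. The status rule in it is an assumption inferred from the rows above, where a row shows FAIL exactly when |%Diff| exceeds the smaller of the two noise values; `compare` and `Comparison` are hypothetical names, and nvbench_compare.py remains the authoritative source for the actual logic.

```cpp
#include <algorithm>
#include <cmath>

struct Comparison
{
  double diff;       // cmp - ref, in the same unit as the inputs
  double pct_diff;   // diff relative to ref, in percent
  bool within_noise; // assumed to correspond to the PASS status
};

// Times must share one unit; noise values are given in percent.
Comparison compare(double ref_time, double ref_noise_pct, double cmp_time, double cmp_noise_pct)
{
  const double diff      = cmp_time - ref_time;
  const double pct_diff  = 100.0 * diff / ref_time;
  const double min_noise = std::min(ref_noise_pct, cmp_noise_pct);
  // Assumption: a difference counts as significant when it exceeds the
  // smaller of the two measured noise levels.
  return {diff, pct_diff, std::abs(pct_diff) <= min_noise};
}
```

Under this reading, most FAIL rows in the table are speedups that exceed the measurement noise rather than regressions.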

Fixes: #1763

Please merge before:

TODO:

  • Check that cub::DeviceMerge correctly uses the fallback policy
  • Check that cub::DeviceMerge correctly uses virtual shared memory
  • Check thrust::merge compile-time before and after this PR
  • Since this PR changes the SASS, show a benchmark of thrust merge on H100.

@bernhardmgruber bernhardmgruber added the cub and thrust labels Jun 6, 2024
@bernhardmgruber bernhardmgruber force-pushed the ref_merge branch 5 times, most recently from a7a72b0 to f0cdf6a Compare June 7, 2024 09:01
@bernhardmgruber bernhardmgruber changed the title Port thrust::merge to CUB Port thrust::merge[_by_key] to CUB Jun 7, 2024
@bernhardmgruber bernhardmgruber force-pushed the ref_merge branch 3 times, most recently from 153ced5 to 2b64f25 Compare June 7, 2024 13:29
@bernhardmgruber bernhardmgruber force-pushed the ref_merge branch 6 times, most recently from 868412d to 797ba58 Compare June 10, 2024 14:00
@bernhardmgruber bernhardmgruber mentioned this pull request Jun 10, 2024
@bernhardmgruber bernhardmgruber force-pushed the ref_merge branch 2 times, most recently from 8cdf145 to 253b8fb Compare June 10, 2024 18:00
@bernhardmgruber bernhardmgruber linked an issue Jun 10, 2024 that may be closed by this pull request
@bernhardmgruber bernhardmgruber force-pushed the ref_merge branch 2 times, most recently from df3cbe7 to 7851e7c Compare June 11, 2024 09:02
@bernhardmgruber bernhardmgruber marked this pull request as ready for review June 11, 2024 09:12
@bernhardmgruber bernhardmgruber requested review from a team as code owners June 11, 2024 09:12
@bernhardmgruber bernhardmgruber (Contributor, Author) commented

/ok to test

@gevtushenko gevtushenko (Collaborator) left a comment

Great work! Left some final comments, let's merge when those are addressed.

🟨 CI finished in 8h 56m: Pass: 99%/250 | Total: 6d 01h | Avg: 34m 58s | Max: 1h 13m | Hits: 62%/248858
  • 🟨 thrust: Pass: 99%/118 | Total: 2d 00h | Avg: 24m 33s | Max: 54m 32s | Hits: 69%/137734

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  99%/110 | Total:  1d 20h | Avg: 24m 24s | Max: 54m 32s | Hits:  69%/128314
      🟩 arm64              Pass: 100%/8   | Total:  3h 32m | Avg: 26m 33s | Max: 34m 50s | Hits:  63%/9420  
    🔍 ctk: 12.5 🔍
      🟩 11.1               Pass: 100%/15  | Total:  6h 05m | Avg: 24m 20s | Max: 45m 49s | Hits:  63%/17660 
      🟩 11.8               Pass: 100%/3   | Total:  1h 42m | Avg: 34m 09s | Max: 37m 12s | Hits:  63%/3534  
      🔍 12.5               Pass:  99%/100 | Total:  1d 16h | Avg: 24m 17s | Max: 54m 32s | Hits:  70%/116540
    🔍 cudacxx: nvcc12.5 🔍
      🟩 ClangCUDA17        Pass: 100%/2   | Total: 51m 35s | Avg: 25m 47s | Max: 26m 08s | Hits:  62%/2354  
      🟩 nvcc11.1           Pass: 100%/15  | Total:  6h 05m | Avg: 24m 20s | Max: 45m 49s | Hits:  63%/17660 
      🟩 nvcc11.8           Pass: 100%/3   | Total:  1h 42m | Avg: 34m 09s | Max: 37m 12s | Hits:  63%/3534  
      🔍 nvcc12.5           Pass:  98%/98  | Total:  1d 15h | Avg: 24m 16s | Max: 54m 32s | Hits:  70%/114186
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 51m 35s | Avg: 25m 47s | Max: 26m 08s | Hits:  62%/2354  
      🔍 nvcc               Pass:  99%/116 | Total:  1d 23h | Avg: 24m 32s | Max: 54m 32s | Hits:  69%/135380
    🔍 cxx: GCC10 🔍
      🟩 Clang9             Pass: 100%/6   | Total:  2h 26m | Avg: 24m 22s | Max: 28m 46s | Hits:  63%/7062  
      🟩 Clang10            Pass: 100%/3   | Total:  1h 17m | Avg: 25m 48s | Max: 27m 43s | Hits:  63%/3531  
      🟩 Clang11            Pass: 100%/4   | Total:  1h 47m | Avg: 26m 46s | Max: 30m 09s | Hits:  63%/4708  
      🟩 Clang12            Pass: 100%/4   | Total:  1h 44m | Avg: 26m 05s | Max: 28m 07s | Hits:  63%/4708  
      🟩 Clang13            Pass: 100%/4   | Total:  1h 43m | Avg: 25m 50s | Max: 27m 51s | Hits:  63%/4708  
      🟩 Clang14            Pass: 100%/4   | Total:  1h 42m | Avg: 25m 32s | Max: 28m 59s | Hits:  63%/4708  
      🟩 Clang15            Pass: 100%/4   | Total:  1h 46m | Avg: 26m 42s | Max: 28m 44s | Hits:  63%/4708  
      🟩 Clang16            Pass: 100%/4   | Total:  1h 44m | Avg: 26m 12s | Max: 28m 27s | Hits:  63%/4708  
      🟩 Clang17            Pass: 100%/18  | Total:  5h 30m | Avg: 18m 20s | Max: 31m 06s | Hits:  79%/21186 
      🟩 GCC6               Pass: 100%/2   | Total: 43m 34s | Avg: 21m 47s | Max: 24m 06s | Hits:  63%/2354  
      🟩 GCC7               Pass: 100%/6   | Total:  2h 26m | Avg: 24m 27s | Max: 28m 21s | Hits:  63%/7068  
      🟩 GCC8               Pass: 100%/6   | Total:  2h 27m | Avg: 24m 31s | Max: 27m 59s | Hits:  63%/7068  
      🟩 GCC9               Pass: 100%/6   | Total:  2h 36m | Avg: 26m 04s | Max: 29m 58s | Hits:  63%/7068  
      🔍 GCC10              Pass:  75%/4   | Total:  1h 22m | Avg: 20m 38s | Max: 30m 49s | Hits:  63%/3534  
      🟩 GCC11              Pass: 100%/7   | Total:  3h 39m | Avg: 31m 25s | Max: 37m 12s | Hits:  63%/8246  
      🟩 GCC12              Pass: 100%/4   | Total:  1h 51m | Avg: 27m 57s | Max: 30m 07s | Hits:  63%/4712  
      🟩 GCC13              Pass: 100%/20  | Total:  5h 52m | Avg: 17m 38s | Max: 34m 50s | Hits:  77%/23560 
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 38m | Avg: 32m 41s | Max: 35m 56s | Hits:  63%/3540  
      🟩 MSVC14.16          Pass: 100%/1   | Total: 45m 49s | Avg: 45m 49s | Max: 45m 49s | Hits:  61%/1173  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 42m | Avg: 51m 20s | Max: 54m 32s | Hits:  61%/2346  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  3h 27m | Avg: 34m 33s | Max: 53m 04s | Hits:  80%/7038  
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/51  | Total: 19h 42m | Avg: 23m 11s | Max: 31m 06s | Hits:  69%/60027 
      🔍 GCC                Pass:  98%/55  | Total: 21h 01m | Avg: 22m 55s | Max: 37m 12s | Hits:  68%/63610 
      🟩 Intel              Pass: 100%/3   | Total:  1h 38m | Avg: 32m 41s | Max: 35m 56s | Hits:  63%/3540  
      🟩 MSVC               Pass: 100%/9   | Total:  5h 55m | Avg: 39m 32s | Max: 54m 32s | Hits:  73%/10557 
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  98%/99  | Total:  1d 21h | Avg: 27m 16s | Max: 54m 32s | Hits:  63%/115375
      🟩 TestCPU            Pass: 100%/11  | Total:  1h 42m | Avg:  9m 18s | Max: 18m 43s | Hits:  99%/12939 
      🟩 TestGPU            Pass: 100%/8   | Total:  1h 34m | Avg: 11m 45s | Max: 13m 27s | Hits:  99%/9420  
    🔍 std: 20 🔍
      🟩 11                 Pass: 100%/30  | Total: 10h 12m | Avg: 20m 25s | Max: 28m 06s | Hits:  69%/35328 
      🟩 14                 Pass: 100%/34  | Total: 15h 02m | Avg: 26m 32s | Max: 53m 04s | Hits:  67%/40020 
      🟩 17                 Pass: 100%/33  | Total: 14h 42m | Avg: 26m 43s | Max: 54m 32s | Hits:  68%/38847 
      🔍 20                 Pass:  95%/21  | Total:  8h 20m | Avg: 23m 49s | Max: 50m 18s | Hits:  71%/23539 
    🟨 gpu
      🟨 v100               Pass:  99%/118 | Total:  2d 00h | Avg: 24m 33s | Max: 54m 32s | Hits:  69%/137734
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  1h 42m | Avg: 34m 09s | Max: 37m 12s | Hits:  63%/3534  
      🟩 90a                Pass: 100%/4   | Total:  1h 02m | Avg: 15m 32s | Max: 17m 01s | Hits:  63%/4712  
    
  • 🟩 cub: Pass: 100%/131 | Total: 4d 01h | Avg: 44m 32s | Max: 1h 13m | Hits: 54%/111124

    🟩 cpu
      🟩 amd64              Pass: 100%/123 | Total:  3d 17h | Avg: 43m 42s | Max:  1h 13m | Hits:  55%/104188
      🟩 arm64              Pass: 100%/8   | Total:  7h 38m | Avg: 57m 15s | Max:  1h 02m | Hits:  40%/6936  
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 16m | Avg: 45m 07s | Max: 53m 24s | Hits:  38%/11792 
      🟩 11.8               Pass: 100%/3   | Total:  3h 33m | Avg:  1h 11m | Max:  1h 13m | Hits:  40%/2601  
      🟩 12.5               Pass: 100%/113 | Total:  3d 10h | Avg: 43m 45s | Max:  1h 10m | Hits:  57%/96731 
    🟩 cudacxx
      🟩 ClangCUDA17        Pass: 100%/2   | Total: 46m 28s | Avg: 23m 14s | Max: 24m 58s | Hits:  42%/1436  
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 16m | Avg: 45m 07s | Max: 53m 24s | Hits:  38%/11792 
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 33m | Avg:  1h 11m | Max:  1h 13m | Hits:  40%/2601  
      🟩 nvcc12.5           Pass: 100%/111 | Total:  3d 09h | Avg: 44m 07s | Max:  1h 10m | Hits:  57%/95295 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 46m 28s | Avg: 23m 14s | Max: 24m 58s | Hits:  42%/1436  
      🟩 nvcc               Pass: 100%/129 | Total:  4d 00h | Avg: 44m 52s | Max:  1h 13m | Hits:  55%/109688
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  4h 54m | Avg: 49m 00s | Max: 57m 33s | Hits:  38%/4980  
      🟩 Clang10            Pass: 100%/3   | Total:  2h 39m | Avg: 53m 12s | Max: 55m 13s | Hits:  39%/2607  
      🟩 Clang11            Pass: 100%/4   | Total:  3h 28m | Avg: 52m 02s | Max: 54m 59s | Hits:  39%/3476  
      🟩 Clang12            Pass: 100%/4   | Total:  3h 36m | Avg: 54m 06s | Max: 57m 49s | Hits:  39%/3476  
      🟩 Clang13            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 21s | Max: 57m 27s | Hits:  39%/3476  
      🟩 Clang14            Pass: 100%/4   | Total:  3h 44m | Avg: 56m 11s | Max:  1h 01m | Hits:  41%/3476  
      🟩 Clang15            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 29s | Max: 54m 37s | Hits:  41%/3468  
      🟩 Clang16            Pass: 100%/4   | Total:  3h 38m | Avg: 54m 34s | Max: 58m 11s | Hits:  41%/3468  
      🟩 Clang17            Pass: 100%/26  | Total: 13h 32m | Avg: 31m 14s | Max:  1h 00m | Hits:  77%/22244 
      🟩 GCC6               Pass: 100%/2   | Total:  1h 29m | Avg: 44m 50s | Max: 46m 31s | Hits:  38%/1582  
      🟩 GCC7               Pass: 100%/6   | Total:  4h 49m | Avg: 48m 12s | Max: 54m 14s | Hits:  39%/4983  
      🟩 GCC8               Pass: 100%/6   | Total:  4h 52m | Avg: 48m 41s | Max: 55m 16s | Hits:  39%/4983  
      🟩 GCC9               Pass: 100%/6   | Total:  5h 00m | Avg: 50m 00s | Max: 56m 58s | Hits:  39%/4983  
      🟩 GCC10              Pass: 100%/4   | Total:  3h 44m | Avg: 56m 03s | Max: 59m 55s | Hits:  40%/3476  
      🟩 GCC11              Pass: 100%/7   | Total:  7h 13m | Avg:  1h 01m | Max:  1h 13m | Hits:  40%/6069  
      🟩 GCC12              Pass: 100%/4   | Total:  3h 44m | Avg: 56m 02s | Max: 59m 31s | Hits:  40%/3468  
      🟩 GCC13              Pass: 100%/28  | Total: 14h 32m | Avg: 31m 08s | Max:  1h 02m | Hits:  74%/24276 
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 56m | Avg: 58m 42s | Max:  1h 01m | Hits:  36%/2379  
      🟩 MSVC14.16          Pass: 100%/1   | Total: 53m 24s | Avg: 53m 24s | Max: 53m 24s | Hits:  37%/709   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 10m | Hits:  37%/1418  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  3h 06m | Avg:  1h 02m | Max:  1h 03m | Hits:  37%/2127  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/59  | Total:  1d 18h | Avg: 43m 24s | Max:  1h 01m | Hits:  56%/50671 
      🟩 GCC                Pass: 100%/63  | Total:  1d 21h | Avg: 43m 15s | Max:  1h 13m | Hits:  55%/53820 
      🟩 Intel              Pass: 100%/3   | Total:  2h 56m | Avg: 58m 42s | Max:  1h 01m | Hits:  36%/2379  
      🟩 MSVC               Pass: 100%/6   | Total:  6h 12m | Avg:  1h 02m | Max:  1h 10m | Hits:  37%/4254  
    🟩 gpu
      🟩 v100               Pass: 100%/131 | Total:  4d 01h | Avg: 44m 32s | Max:  1h 13m | Hits:  54%/111124
    🟩 jobs
      🟩 Build              Pass: 100%/99  | Total:  3d 14h | Avg: 52m 22s | Max:  1h 13m | Hits:  39%/83380 
      🟩 DeviceLaunch       Pass: 100%/8   | Total:  2h 29m | Avg: 18m 38s | Max: 23m 51s | Hits:  99%/6936  
      🟩 GraphCapture       Pass: 100%/8   | Total:  2h 14m | Avg: 16m 45s | Max: 20m 50s | Hits:  99%/6936  
      🟩 HostLaunch         Pass: 100%/8   | Total:  2h 34m | Avg: 19m 18s | Max: 23m 28s | Hits:  99%/6936  
      🟩 TestGPU            Pass: 100%/8   | Total:  3h 31m | Avg: 26m 22s | Max: 30m 00s | Hits:  99%/6936  
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 33m | Avg:  1h 11m | Max:  1h 13m | Hits:  40%/2601  
      🟩 90a                Pass: 100%/4   | Total:  1h 29m | Avg: 22m 17s | Max: 22m 47s | Hits:  40%/3468  
    🟩 std
      🟩 11                 Pass: 100%/34  | Total:  1d 01h | Avg: 45m 17s | Max:  1h 13m | Hits:  54%/29047 
      🟩 14                 Pass: 100%/37  | Total:  1d 05h | Avg: 47m 03s | Max:  1h 11m | Hits:  53%/31174 
      🟩 17                 Pass: 100%/36  | Total:  1d 02h | Avg: 43m 46s | Max:  1h 08m | Hits:  53%/30392 
      🟩 20                 Pass: 100%/24  | Total: 16h 17m | Avg: 40m 43s | Max:  1h 01m | Hits:  60%/20511 
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
pycuda

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda

🏃‍ Runner counts (total jobs: 250)

# Runner
178 linux-amd64-cpu16
41 linux-amd64-gpu-v100-latest-1
16 linux-arm64-cpu16
15 windows-amd64-cpu16

Contributor

🟩 CI finished in 11h 04m: Pass: 100%/250 | Total: 6d 02h | Avg: 35m 05s | Max: 1h 13m | Hits: 62%/250036
  • 🟩 cub: Pass: 100%/131 | Total: 4d 01h | Avg: 44m 32s | Max: 1h 13m | Hits: 54%/111124

    🟩 cpu
      🟩 amd64              Pass: 100%/123 | Total:  3d 17h | Avg: 43m 42s | Max:  1h 13m | Hits:  55%/104188
      🟩 arm64              Pass: 100%/8   | Total:  7h 38m | Avg: 57m 15s | Max:  1h 02m | Hits:  40%/6936  
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 16m | Avg: 45m 07s | Max: 53m 24s | Hits:  38%/11792 
      🟩 11.8               Pass: 100%/3   | Total:  3h 33m | Avg:  1h 11m | Max:  1h 13m | Hits:  40%/2601  
      🟩 12.5               Pass: 100%/113 | Total:  3d 10h | Avg: 43m 45s | Max:  1h 10m | Hits:  57%/96731 
    🟩 cudacxx
      🟩 ClangCUDA17        Pass: 100%/2   | Total: 46m 28s | Avg: 23m 14s | Max: 24m 58s | Hits:  42%/1436  
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 16m | Avg: 45m 07s | Max: 53m 24s | Hits:  38%/11792 
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 33m | Avg:  1h 11m | Max:  1h 13m | Hits:  40%/2601  
      🟩 nvcc12.5           Pass: 100%/111 | Total:  3d 09h | Avg: 44m 07s | Max:  1h 10m | Hits:  57%/95295 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 46m 28s | Avg: 23m 14s | Max: 24m 58s | Hits:  42%/1436  
      🟩 nvcc               Pass: 100%/129 | Total:  4d 00h | Avg: 44m 52s | Max:  1h 13m | Hits:  55%/109688
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  4h 54m | Avg: 49m 00s | Max: 57m 33s | Hits:  38%/4980  
      🟩 Clang10            Pass: 100%/3   | Total:  2h 39m | Avg: 53m 12s | Max: 55m 13s | Hits:  39%/2607  
      🟩 Clang11            Pass: 100%/4   | Total:  3h 28m | Avg: 52m 02s | Max: 54m 59s | Hits:  39%/3476  
      🟩 Clang12            Pass: 100%/4   | Total:  3h 36m | Avg: 54m 06s | Max: 57m 49s | Hits:  39%/3476  
      🟩 Clang13            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 21s | Max: 57m 27s | Hits:  39%/3476  
      🟩 Clang14            Pass: 100%/4   | Total:  3h 44m | Avg: 56m 11s | Max:  1h 01m | Hits:  41%/3476  
      🟩 Clang15            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 29s | Max: 54m 37s | Hits:  41%/3468  
      🟩 Clang16            Pass: 100%/4   | Total:  3h 38m | Avg: 54m 34s | Max: 58m 11s | Hits:  41%/3468  
      🟩 Clang17            Pass: 100%/26  | Total: 13h 32m | Avg: 31m 14s | Max:  1h 00m | Hits:  77%/22244 
      🟩 GCC6               Pass: 100%/2   | Total:  1h 29m | Avg: 44m 50s | Max: 46m 31s | Hits:  38%/1582  
      🟩 GCC7               Pass: 100%/6   | Total:  4h 49m | Avg: 48m 12s | Max: 54m 14s | Hits:  39%/4983  
      🟩 GCC8               Pass: 100%/6   | Total:  4h 52m | Avg: 48m 41s | Max: 55m 16s | Hits:  39%/4983  
      🟩 GCC9               Pass: 100%/6   | Total:  5h 00m | Avg: 50m 00s | Max: 56m 58s | Hits:  39%/4983  
      🟩 GCC10              Pass: 100%/4   | Total:  3h 44m | Avg: 56m 03s | Max: 59m 55s | Hits:  40%/3476  
      🟩 GCC11              Pass: 100%/7   | Total:  7h 13m | Avg:  1h 01m | Max:  1h 13m | Hits:  40%/6069  
      🟩 GCC12              Pass: 100%/4   | Total:  3h 44m | Avg: 56m 02s | Max: 59m 31s | Hits:  40%/3468  
      🟩 GCC13              Pass: 100%/28  | Total: 14h 32m | Avg: 31m 08s | Max:  1h 02m | Hits:  74%/24276 
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 56m | Avg: 58m 42s | Max:  1h 01m | Hits:  36%/2379  
      🟩 MSVC14.16          Pass: 100%/1   | Total: 53m 24s | Avg: 53m 24s | Max: 53m 24s | Hits:  37%/709   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 10m | Hits:  37%/1418  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  3h 06m | Avg:  1h 02m | Max:  1h 03m | Hits:  37%/2127  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/59  | Total:  1d 18h | Avg: 43m 24s | Max:  1h 01m | Hits:  56%/50671 
      🟩 GCC                Pass: 100%/63  | Total:  1d 21h | Avg: 43m 15s | Max:  1h 13m | Hits:  55%/53820 
      🟩 Intel              Pass: 100%/3   | Total:  2h 56m | Avg: 58m 42s | Max:  1h 01m | Hits:  36%/2379  
      🟩 MSVC               Pass: 100%/6   | Total:  6h 12m | Avg:  1h 02m | Max:  1h 10m | Hits:  37%/4254  
    🟩 gpu
      🟩 v100               Pass: 100%/131 | Total:  4d 01h | Avg: 44m 32s | Max:  1h 13m | Hits:  54%/111124
    🟩 jobs
      🟩 Build              Pass: 100%/99  | Total:  3d 14h | Avg: 52m 22s | Max:  1h 13m | Hits:  39%/83380 
      🟩 DeviceLaunch       Pass: 100%/8   | Total:  2h 29m | Avg: 18m 38s | Max: 23m 51s | Hits:  99%/6936  
      🟩 GraphCapture       Pass: 100%/8   | Total:  2h 14m | Avg: 16m 45s | Max: 20m 50s | Hits:  99%/6936  
      🟩 HostLaunch         Pass: 100%/8   | Total:  2h 34m | Avg: 19m 18s | Max: 23m 28s | Hits:  99%/6936  
      🟩 TestGPU            Pass: 100%/8   | Total:  3h 31m | Avg: 26m 22s | Max: 30m 00s | Hits:  99%/6936  
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 33m | Avg:  1h 11m | Max:  1h 13m | Hits:  40%/2601  
      🟩 90a                Pass: 100%/4   | Total:  1h 29m | Avg: 22m 17s | Max: 22m 47s | Hits:  40%/3468  
    🟩 std
      🟩 11                 Pass: 100%/34  | Total:  1d 01h | Avg: 45m 17s | Max:  1h 13m | Hits:  54%/29047 
      🟩 14                 Pass: 100%/37  | Total:  1d 05h | Avg: 47m 03s | Max:  1h 11m | Hits:  53%/31174 
      🟩 17                 Pass: 100%/36  | Total:  1d 02h | Avg: 43m 46s | Max:  1h 08m | Hits:  53%/30392 
      🟩 20                 Pass: 100%/24  | Total: 16h 17m | Avg: 40m 43s | Max:  1h 01m | Hits:  60%/20511 
    
  • 🟩 thrust: Pass: 100%/118 | Total: 2d 00h | Avg: 24m 49s | Max: 54m 32s | Hits: 69%/138912

    🟩 cpu
      🟩 amd64              Pass: 100%/110 | Total:  1d 21h | Avg: 24m 41s | Max: 54m 32s | Hits:  69%/129492
      🟩 arm64              Pass: 100%/8   | Total:  3h 32m | Avg: 26m 33s | Max: 34m 50s | Hits:  63%/9420  
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  6h 05m | Avg: 24m 20s | Max: 45m 49s | Hits:  63%/17660 
      🟩 11.8               Pass: 100%/3   | Total:  1h 42m | Avg: 34m 09s | Max: 37m 12s | Hits:  63%/3534  
      🟩 12.5               Pass: 100%/100 | Total:  1d 17h | Avg: 24m 37s | Max: 54m 32s | Hits:  70%/117718
    🟩 cudacxx
      🟩 ClangCUDA17        Pass: 100%/2   | Total: 51m 35s | Avg: 25m 47s | Max: 26m 08s | Hits:  62%/2354  
      🟩 nvcc11.1           Pass: 100%/15  | Total:  6h 05m | Avg: 24m 20s | Max: 45m 49s | Hits:  63%/17660 
      🟩 nvcc11.8           Pass: 100%/3   | Total:  1h 42m | Avg: 34m 09s | Max: 37m 12s | Hits:  63%/3534  
      🟩 nvcc12.5           Pass: 100%/98  | Total:  1d 16h | Avg: 24m 35s | Max: 54m 32s | Hits:  70%/115364
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 51m 35s | Avg: 25m 47s | Max: 26m 08s | Hits:  62%/2354  
      🟩 nvcc               Pass: 100%/116 | Total:  1d 23h | Avg: 24m 48s | Max: 54m 32s | Hits:  69%/136558
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 26m | Avg: 24m 22s | Max: 28m 46s | Hits:  63%/7062  
      🟩 Clang10            Pass: 100%/3   | Total:  1h 17m | Avg: 25m 48s | Max: 27m 43s | Hits:  63%/3531  
      🟩 Clang11            Pass: 100%/4   | Total:  1h 47m | Avg: 26m 46s | Max: 30m 09s | Hits:  63%/4708  
      🟩 Clang12            Pass: 100%/4   | Total:  1h 44m | Avg: 26m 05s | Max: 28m 07s | Hits:  63%/4708  
      🟩 Clang13            Pass: 100%/4   | Total:  1h 43m | Avg: 25m 50s | Max: 27m 51s | Hits:  63%/4708  
      🟩 Clang14            Pass: 100%/4   | Total:  1h 42m | Avg: 25m 32s | Max: 28m 59s | Hits:  63%/4708  
      🟩 Clang15            Pass: 100%/4   | Total:  1h 46m | Avg: 26m 42s | Max: 28m 44s | Hits:  63%/4708  
      🟩 Clang16            Pass: 100%/4   | Total:  1h 44m | Avg: 26m 12s | Max: 28m 27s | Hits:  63%/4708  
      🟩 Clang17            Pass: 100%/18  | Total:  5h 30m | Avg: 18m 20s | Max: 31m 06s | Hits:  79%/21186 
      🟩 GCC6               Pass: 100%/2   | Total: 43m 34s | Avg: 21m 47s | Max: 24m 06s | Hits:  63%/2354  
      🟩 GCC7               Pass: 100%/6   | Total:  2h 26m | Avg: 24m 27s | Max: 28m 21s | Hits:  63%/7068  
      🟩 GCC8               Pass: 100%/6   | Total:  2h 27m | Avg: 24m 31s | Max: 27m 59s | Hits:  63%/7068  
      🟩 GCC9               Pass: 100%/6   | Total:  2h 36m | Avg: 26m 04s | Max: 29m 58s | Hits:  63%/7068  
      🟩 GCC10              Pass: 100%/4   | Total:  1h 54m | Avg: 28m 36s | Max: 31m 52s | Hits:  63%/4712  
      🟩 GCC11              Pass: 100%/7   | Total:  3h 39m | Avg: 31m 25s | Max: 37m 12s | Hits:  63%/8246  
      🟩 GCC12              Pass: 100%/4   | Total:  1h 51m | Avg: 27m 57s | Max: 30m 07s | Hits:  63%/4712  
      🟩 GCC13              Pass: 100%/20  | Total:  5h 52m | Avg: 17m 38s | Max: 34m 50s | Hits:  77%/23560 
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 38m | Avg: 32m 41s | Max: 35m 56s | Hits:  63%/3540  
      🟩 MSVC14.16          Pass: 100%/1   | Total: 45m 49s | Avg: 45m 49s | Max: 45m 49s | Hits:  61%/1173  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 42m | Avg: 51m 20s | Max: 54m 32s | Hits:  61%/2346  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  3h 27m | Avg: 34m 33s | Max: 53m 04s | Hits:  80%/7038  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/51  | Total: 19h 42m | Avg: 23m 11s | Max: 31m 06s | Hits:  69%/60027 
      🟩 GCC                Pass: 100%/55  | Total: 21h 33m | Avg: 23m 30s | Max: 37m 12s | Hits:  68%/64788 
      🟩 Intel              Pass: 100%/3   | Total:  1h 38m | Avg: 32m 41s | Max: 35m 56s | Hits:  63%/3540  
      🟩 MSVC               Pass: 100%/9   | Total:  5h 55m | Avg: 39m 32s | Max: 54m 32s | Hits:  73%/10557 
    🟩 gpu
      🟩 v100               Pass: 100%/118 | Total:  2d 00h | Avg: 24m 49s | Max: 54m 32s | Hits:  69%/138912
    🟩 jobs
      🟩 Build              Pass: 100%/99  | Total:  1d 21h | Avg: 27m 36s | Max: 54m 32s | Hits:  63%/116553
      🟩 TestCPU            Pass: 100%/11  | Total:  1h 42m | Avg:  9m 18s | Max: 18m 43s | Hits:  99%/12939 
      🟩 TestGPU            Pass: 100%/8   | Total:  1h 34m | Avg: 11m 45s | Max: 13m 27s | Hits:  99%/9420  
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  1h 42m | Avg: 34m 09s | Max: 37m 12s | Hits:  63%/3534  
      🟩 90a                Pass: 100%/4   | Total:  1h 02m | Avg: 15m 32s | Max: 17m 01s | Hits:  63%/4712  
    🟩 std
      🟩 11                 Pass: 100%/30  | Total: 10h 12m | Avg: 20m 25s | Max: 28m 06s | Hits:  69%/35328 
      🟩 14                 Pass: 100%/34  | Total: 15h 02m | Avg: 26m 32s | Max: 53m 04s | Hits:  67%/40020 
      🟩 17                 Pass: 100%/33  | Total: 14h 42m | Avg: 26m 43s | Max: 54m 32s | Hits:  68%/38847 
      🟩 20                 Pass: 100%/21  | Total:  8h 52m | Avg: 25m 20s | Max: 50m 18s | Hits:  71%/24717 
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 11m 05s | Avg: 11m 05s | Max: 11m 05s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
pycuda

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda

🏃‍ Runner counts (total jobs: 250)

# Runner
178 linux-amd64-cpu16
41 linux-amd64-gpu-v100-latest-1
16 linux-arm64-cpu16
15 windows-amd64-cpu16

@bernhardmgruber bernhardmgruber merged commit 8635429 into NVIDIA:main Jul 23, 2024
262 of 264 checks passed
@bernhardmgruber bernhardmgruber deleted the ref_merge branch July 23, 2024 07:08
Comment on lines +224 to +227
// Cannot check output iterators, since they could be discard iterators, which do not have the right value_type
static_assert(::cuda::std::is_same<cub::detail::value_t<KeyIt2>, key_t>::value, "");
static_assert(::cuda::std::is_same<cub::detail::value_t<ValueIt2>, value_t>::value, "");
static_assert(::cuda::std::__invokable<CompareOp, key_t, key_t>::value,
Collaborator

I believe those are not the right constraints, and they have broken cuDF.

Why do we require all keys / values to have the exact same type? Shouldn't it be sufficient to ensure that the CompareOp is invocable with the passed-in types?

Contributor Author

Because this is what the documentation of thrust::merge says:

InputIterator1 and InputIterator2 have the same value_type

Apparently, the iterators for the values don't need to have the same value_type.
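
To make the difference between the two kinds of constraint concrete, here is a minimal host-only sketch. It assumes plain `std::` traits in place of `::cuda::std::`, and the names `value_t`, `same_value_type` and `convertible_value_type` are hypothetical helpers for illustration, not the CUB or Thrust implementation. The strict check mirrors the quoted `static_assert`s. The relaxed check only asks that the second input's value type converts to the first's.

```cpp
#include <iterator>
#include <type_traits>

// Hypothetical stand-in for cub::detail::value_t: the value_type of an iterator.
template <typename It>
using value_t = typename std::iterator_traits<It>::value_type;

// Strict constraint: both input iterators must have exactly the same value_type.
template <typename It1, typename It2>
struct same_value_type : std::is_same<value_t<It1>, value_t<It2>>
{};

// Relaxed constraint (an assumption for illustration): the second iterator's
// value_type only needs to be convertible to the first iterator's value_type.
template <typename It1, typename It2>
struct convertible_value_type : std::is_convertible<value_t<It2>, value_t<It1>>
{};

// Example: the first range holds int, the second range holds short.
static_assert(!same_value_type<const int*, const short*>::value,
              "the strict check rejects int vs. short");
static_assert(convertible_value_type<const int*, const short*>::value,
              "the relaxed check accepts short -> int");
```

With the strict form, a caller whose two inputs differ only in a convertible value type fails to compile. The relaxed form would accept such a caller.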

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this pull request Jul 25, 2024
As long as they are convertible to the value type of the first iterator. This weakens the publicly documented guarantees of equal value types to restore the old behavior of the thrust implementation replaced in NVIDIA#1817.
bernhardmgruber added a commit that referenced this pull request Jul 26, 2024
* Add a cuDF inspired test for merge_by_key
* Allow CUB MergePath to support iterators with different value types
* Allow different input value types for merge, as long as they are convertible to the value type of the first iterator. This weakens the publicly documented guarantees of equal value types to restore the old behavior of the thrust implementation replaced in #1817.
pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Aug 4, 2024
* Refactor thrust/CUB merge
* Port thrust::merge[_by_key] to cub::DeviceMerge

Fixes NVIDIA#1763

Co-authored-by: Georgii Evtushenko <[email protected]>
pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Aug 4, 2024
* Add a cuDF inspired test for merge_by_key
* Allow CUB MergePath to support iterators with different value types
* Allow different input value types for merge, as long as they are convertible to the value type of the first iterator. This weakens the publicly documented guarantees of equal value types to restore the old behavior of the thrust implementation replaced in NVIDIA#1817.
Labels
cub For all items related to CUB thrust For all items related to Thrust.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Port thrust::merge to CUB
4 participants