Move radix sort kernels to separate NVRTC compilable header #3803

Merged
merged 1 commit into from
Feb 14, 2025

Conversation

NaderAlAwar
Contributor

Description

Closes #3796

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@NaderAlAwar NaderAlAwar requested a review from a team as a code owner February 13, 2025 21:01
@NaderAlAwar NaderAlAwar requested a review from fbusato February 13, 2025 21:01
Contributor

🟩 CI finished in 2h 15m: Pass: 100%/93 | Total: 2d 01h | Avg: 31m 41s | Max: 1h 12m | Hits: 89%/134553
  • 🟩 cub: Pass: 100%/45 | Total: 1d 13h | Avg: 49m 22s | Max: 1h 12m | Hits: 84%/53761

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  1d 11h | Avg: 48m 55s | Max:  1h 12m | Hits:  85%/51319 
      🟩 arm64              Pass: 100%/2   | Total:  1h 58m | Avg: 59m 10s | Max:  1h 08m | Hits:  79%/2442  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 07m | Avg: 49m 30s | Max:  1h 04m | Hits:  77%/5939  
      🟩 12.5               Pass: 100%/2   | Total:  1h 48m | Avg: 54m 19s | Max: 56m 54s | Hits:  87%/2260  
      🟩 12.8               Pass: 100%/38  | Total:  1d 07h | Avg: 49m 05s | Max:  1h 12m | Hits:  85%/45562 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 51m | Avg: 55m 38s | Max: 55m 42s | Hits:  91%/2114  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 07m | Avg: 49m 30s | Max:  1h 04m | Hits:  77%/5939  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 48m | Avg: 54m 19s | Max: 56m 54s | Hits:  87%/2260  
      🟩 nvcc12.8           Pass: 100%/36  | Total:  1d 05h | Avg: 48m 43s | Max:  1h 12m | Hits:  85%/43448 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 38s | Max: 55m 42s | Hits:  91%/2114  
      🟩 nvcc               Pass: 100%/43  | Total:  1d 11h | Avg: 49m 04s | Max:  1h 12m | Hits:  84%/51647 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 57m | Avg: 44m 20s | Max: 44m 58s | Hits:  91%/4892  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 30m | Avg: 45m 04s | Max: 46m 20s | Hits:  91%/2442  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 35m | Avg: 47m 50s | Max: 48m 47s | Hits:  91%/2442  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 29m | Avg: 44m 45s | Max: 45m 23s | Hits:  91%/2442  
      🟩 Clang18            Pass: 100%/7   | Total:  4h 57m | Avg: 42m 28s | Max: 55m 42s | Hits:  93%/8219  
      🟩 GCC7               Pass: 100%/2   | Total:  1h 32m | Avg: 46m 13s | Max: 48m 24s | Hits:  90%/2446  
      🟩 GCC8               Pass: 100%/1   | Total: 43m 48s | Avg: 43m 48s | Max: 43m 48s | Hits:  90%/1223  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 38m | Avg: 49m 21s | Max: 50m 02s | Hits:  90%/2446  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 35m | Avg: 47m 58s | Max: 51m 08s | Hits:  90%/2446  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 35m | Avg: 47m 54s | Max: 48m 13s | Hits:  90%/2442  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 35m | Avg: 47m 35s | Max: 48m 18s | Hits:  90%/2442  
      🟩 GCC13              Pass: 100%/11  | Total:  9h 33m | Avg: 52m 10s | Max:  1h 12m | Hits:  89%/13431 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 07m | Hits:  15%/2094  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 08m | Hits:  15%/2094  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 48m | Avg: 54m 19s | Max: 56m 54s | Hits:  87%/2260  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 12h 30m | Avg: 44m 07s | Max: 55m 42s | Hits:  92%/20437 
      🟩 GCC                Pass: 100%/22  | Total: 18h 15m | Avg: 49m 48s | Max:  1h 12m | Hits:  90%/26876 
      🟩 MSVC               Pass: 100%/4   | Total:  4h 27m | Avg:  1h 06m | Max:  1h 08m | Hits:  15%/4188  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 48m | Avg: 54m 19s | Max: 56m 54s | Hits:  87%/2260  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total:  1h 06m | Avg: 22m 04s | Max: 24m 15s | Hits:  96%/3663  
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 04h | Avg: 50m 43s | Max:  1h 08m | Hits:  82%/40330 
      🟩 rtxa6000           Pass: 100%/8   | Total:  7h 10m | Avg: 53m 51s | Max:  1h 12m | Hits:  91%/9768  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 06h | Avg: 50m 05s | Max:  1h 08m | Hits:  82%/43993 
      🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 12m | Avg:  1h 12m | Max:  1h 12m | Hits:  94%/1221  
      🟩 GraphCapture       Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m | Hits:  94%/1221  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 56m | Avg: 38m 58s | Max:  1h 06m | Hits:  97%/3663  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 50m | Avg: 36m 52s | Max:  1h 10m | Hits:  97%/3663  
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total:  1h 06m | Avg: 22m 04s | Max: 24m 15s | Hits:  96%/3663  
      🟩 90;90a;100         Pass: 100%/1   | Total: 55m 41s | Avg: 55m 41s | Max: 55m 41s | Hits:  90%/1221  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 16h 36m | Avg: 49m 50s | Max:  1h 07m | Hits:  80%/23659 
      🟩 20                 Pass: 100%/25  | Total: 20h 24m | Avg: 48m 59s | Max:  1h 12m | Hits:  88%/30102 
    
  • 🟩 thrust: Pass: 100%/45 | Total: 11h 20m | Avg: 15m 07s | Max: 34m 53s | Hits: 92%/80496

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 22m 03s | Avg: 11m 01s | Max: 11m 02s | Hits:  97%/3580  
    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total: 10h 57m | Avg: 15m 17s | Max: 34m 53s | Hits:  92%/76917 
      🟩 arm64              Pass: 100%/2   | Total: 23m 12s | Avg: 11m 36s | Max: 12m 12s | Hits:  94%/3579  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 20m | Avg: 16m 10s | Max: 30m 31s | Hits:  89%/8941  
      🟩 12.5               Pass: 100%/2   | Total: 51m 52s | Avg: 25m 56s | Max: 25m 59s | Hits:  93%/3578  
      🟩 12.8               Pass: 100%/38  | Total:  9h 07m | Avg: 14m 24s | Max: 34m 53s | Hits:  92%/67977 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 23m 51s | Avg: 11m 55s | Max: 11m 59s | Hits:  94%/3578  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 20m | Avg: 16m 10s | Max: 30m 31s | Hits:  89%/8941  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 51m 52s | Avg: 25m 56s | Max: 25m 59s | Hits:  93%/3578  
      🟩 nvcc12.8           Pass: 100%/36  | Total:  8h 43m | Avg: 14m 33s | Max: 34m 53s | Hits:  92%/64399 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 23m 51s | Avg: 11m 55s | Max: 11m 59s | Hits:  94%/3578  
      🟩 nvcc               Pass: 100%/43  | Total: 10h 56m | Avg: 15m 16s | Max: 34m 53s | Hits:  92%/76918 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 51m 54s | Avg: 12m 58s | Max: 13m 46s | Hits:  94%/7156  
      🟩 Clang15            Pass: 100%/2   | Total: 26m 22s | Avg: 13m 11s | Max: 13m 36s | Hits:  94%/3578  
      🟩 Clang16            Pass: 100%/2   | Total: 24m 49s | Avg: 12m 24s | Max: 12m 29s | Hits:  94%/3578  
      🟩 Clang17            Pass: 100%/2   | Total: 24m 51s | Avg: 12m 25s | Max: 12m 49s | Hits:  94%/3578  
      🟩 Clang18            Pass: 100%/7   | Total:  1h 19m | Avg: 11m 20s | Max: 13m 37s | Hits:  96%/12523 
      🟩 GCC7               Pass: 100%/2   | Total: 25m 12s | Avg: 12m 36s | Max: 13m 13s | Hits:  94%/3580  
      🟩 GCC8               Pass: 100%/1   | Total: 13m 36s | Avg: 13m 36s | Max: 13m 36s | Hits:  94%/1790  
      🟩 GCC9               Pass: 100%/2   | Total: 26m 54s | Avg: 13m 27s | Max: 13m 56s | Hits:  94%/3580  
      🟩 GCC10              Pass: 100%/2   | Total: 25m 58s | Avg: 12m 59s | Max: 13m 31s | Hits:  94%/3580  
      🟩 GCC11              Pass: 100%/2   | Total: 25m 45s | Avg: 12m 52s | Max: 13m 19s | Hits:  94%/3580  
      🟩 GCC12              Pass: 100%/2   | Total: 27m 35s | Avg: 13m 47s | Max: 14m 45s | Hits:  94%/3580  
      🟩 GCC13              Pass: 100%/10  | Total:  1h 55m | Avg: 11m 30s | Max: 14m 15s | Hits:  96%/17900 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 05m | Avg: 32m 42s | Max: 34m 53s | Hits:  66%/3566  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  1h 35m | Avg: 31m 54s | Max: 33m 46s | Hits:  67%/5349  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 51m 52s | Avg: 25m 56s | Max: 25m 59s | Hits:  93%/3578  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  3h 27m | Avg: 12m 11s | Max: 13m 46s | Hits:  95%/30413 
      🟩 GCC                Pass: 100%/21  | Total:  4h 20m | Avg: 12m 23s | Max: 14m 45s | Hits:  95%/37590 
      🟩 MSVC               Pass: 100%/5   | Total:  2h 41m | Avg: 32m 13s | Max: 34m 53s | Hits:  67%/8915  
      🟩 NVHPC              Pass: 100%/2   | Total: 51m 52s | Avg: 25m 56s | Max: 25m 59s | Hits:  93%/3578  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 19m 54s | Avg:  9m 57s | Max: 11m 09s | Hits:  97%/3580  
      🟩 rtx2080            Pass: 100%/33  | Total:  8h 30m | Avg: 15m 27s | Max: 34m 53s | Hits:  92%/59033 
      🟩 rtx4090            Pass: 100%/10  | Total:  2h 30m | Avg: 15m 01s | Max: 33m 46s | Hits:  92%/17883 
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total:  9h 50m | Avg: 15m 33s | Max: 34m 53s | Hits:  91%/67975 
      🟩 TestCPU            Pass: 100%/3   | Total: 45m 49s | Avg: 15m 16s | Max: 30m 01s | Hits:  90%/5362  
      🟩 TestGPU            Pass: 100%/4   | Total: 43m 39s | Avg: 10m 54s | Max: 11m 18s | Hits:  99%/7159  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 19m 54s | Avg:  9m 57s | Max: 11m 09s | Hits:  97%/3580  
      🟩 90;90a;100         Pass: 100%/1   | Total: 14m 08s | Avg: 14m 08s | Max: 14m 08s | Hits:  94%/1790  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total:  5h 32m | Avg: 16m 36s | Max: 34m 53s | Hits:  90%/35771 
      🟩 20                 Pass: 100%/23  | Total:  5h 26m | Avg: 14m 10s | Max: 33m 46s | Hits:  93%/41145 
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 12m 59s | Avg: 6m 29s | Max: 10m 45s | Hits: 98%/296

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 12m 59s | Avg:  6m 29s | Max: 10m 45s | Hits:  98%/296   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 12m 59s | Avg:  6m 29s | Max: 10m 45s | Hits:  98%/296   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 12m 59s | Avg:  6m 29s | Max: 10m 45s | Hits:  98%/296   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 12m 59s | Avg:  6m 29s | Max: 10m 45s | Hits:  98%/296   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 12m 59s | Avg:  6m 29s | Max: 10m 45s | Hits:  98%/296   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 12m 59s | Avg:  6m 29s | Max: 10m 45s | Hits:  98%/296   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 12m 59s | Avg:  6m 29s | Max: 10m 45s | Hits:  98%/296   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 14s | Avg:  2m 14s | Max:  2m 14s | Hits:  98%/148   
      🟩 Test               Pass: 100%/1   | Total: 10m 45s | Avg: 10m 45s | Max: 10m 45s | Hits:  98%/148   
    
  • 🟩 python: Pass: 100%/1 | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 32m 17s | Avg: 32m 17s | Max: 32m 17s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 93)

# Runner
66 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1

Contributor

@bernhardmgruber bernhardmgruber left a comment

Since I cannot clearly see in the diff whether any changes were made when moving the code, I think we should have a SASS diff for a radix sort unit test or the benchmark. Thx!

Collaborator

@miscco miscco left a comment

Looks good from my side.

I verified that the content of the new file is identical to the extracted content.

@NaderAlAwar
Contributor Author

> Since I cannot clearly see in the diff whether any changes were made when moving the code, I think we should have a SASS diff for a radix sort unit test or the benchmark. Thx!

My bad, I think you mentioned this before and I forgot. I checked the SASS for cub.test.device_radix_sort_custom.lid_0 and there were no differences.
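
For readers who want to repeat this kind of check, here is a minimal sketch of one way to compare two SASS dumps. It is not the procedure used in this PR. It assumes the dumps were already produced as text (for example with `cuobjdump -sass` on the test binary built before and after the change) and that address and encoding annotations appear as `/* ... */` comments. The helper names `normalize_sass` and `sass_diff` are hypothetical.

```python
import difflib
import re

# Assumption: addresses and instruction encodings appear as C-style comments
# in the dump, so removing them leaves only the instruction text.
_COMMENT = re.compile(r"/\*.*?\*/")


def normalize_sass(text):
    """Return the non-empty lines of a SASS dump with comments and padding removed."""
    lines = []
    for raw in text.splitlines():
        line = _COMMENT.sub("", raw).strip()
        if line:
            lines.append(line)
    return lines


def sass_diff(before, after):
    """Return a unified diff of two normalized dumps; an empty list means no differences."""
    return list(
        difflib.unified_diff(
            normalize_sass(before),
            normalize_sass(after),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

An empty result from `sass_diff` suggests the generated instructions match. A non-empty result lists the instruction lines that changed, which is where a reviewer would start looking.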

@NaderAlAwar NaderAlAwar merged commit 3daa036 into NVIDIA:main Feb 14, 2025
109 of 111 checks passed
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Extract radix sort kernels to NVRTC compilable header
3 participants