Add internal wrapper for cuda driver APIs #2070

Merged

Conversation

pciolkosz
Contributor

Adds an internal header that loads CUDA driver API functions via the CUDA runtime.
It also adds the first few entries needed for current context management.

Each new wrapper should be a function that loads the driver entry point with CUDAX_GET_DRIVER_FUNCTION and then calls it with the appropriate arguments.
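As a rough illustration of that pattern, here is a minimal sketch that builds without <cuda.h> or a GPU. Everything except the names CUDAX_GET_DRIVER_FUNCTION, call_driver_fn and cuCtxPushCurrent is a hypothetical stand-in: driver_result, context_handle, fake_cuCtxPushCurrent and get_push_entry_point are assumptions made for the example, and std::runtime_error replaces whatever error type the real header throws.

```cpp
#include <stdexcept>

// Hypothetical stand-ins so the sketch builds without <cuda.h> or a GPU.
using driver_result = int;                  // stands in for CUresult
constexpr driver_result driver_success = 0; // stands in for CUDA_SUCCESS
using context_handle = void*;               // stands in for CUcontext

// Fake driver entry point that records the context it was given.
context_handle g_pushed_ctx = nullptr;
driver_result fake_cuCtxPushCurrent(context_handle ctx)
{
  g_pushed_ctx = ctx;
  return driver_success;
}

// Hypothetical lookup playing the role of CUDAX_GET_DRIVER_FUNCTION.
using push_fn_t = driver_result (*)(context_handle);
push_fn_t get_push_entry_point()
{
  return &fake_cuCtxPushCurrent;
}

// Calls the entry point and turns a failing status into an exception.
template <class Fn, class... Args>
void call_driver_fn(Fn fn, const char* err_msg, Args... args)
{
  const driver_result status = fn(args...);
  if (status != driver_success)
  {
    throw std::runtime_error(err_msg);
  }
}

// The wrapper itself: resolve the entry point once, then forward the arguments.
void ctx_push(context_handle ctx)
{
  static auto driver_fn = get_push_entry_point();
  call_driver_fn(driver_fn, "Failed to push context", ctx);
}
```

The function-local static means the lookup runs only on the first call; later calls reuse the cached pointer.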

@pciolkosz pciolkosz requested review from a team as code owners July 24, 2024 23:47
@pciolkosz pciolkosz self-assigned this Jul 24, 2024
@pciolkosz pciolkosz linked an issue Jul 25, 2024 that may be closed by this pull request
Contributor

🟩 CI finished in 2h 47m: Pass: 100%/56 | Total: 2h 35m | Avg: 2m 46s | Max: 11m 50s | Hits: 90%/1693
  • 🟩 cudax: Pass: 100%/55 | Total: 2h 23m | Avg: 2m 36s | Max: 8m 06s | Hits: 90%/1693

    🟩 cpu
      🟩 amd64              Pass: 100%/51  | Total:  2h 14m | Avg:  2m 37s | Max:  8m 06s | Hits:  90%/1569  
      🟩 arm64              Pass: 100%/4   | Total:  9m 29s | Avg:  2m 22s | Max:  2m 43s | Hits:  90%/124   
    🟩 ctk
      🟩 12.0               Pass: 100%/23  | Total:  1h 00m | Avg:  2m 38s | Max:  8m 06s | Hits:  90%/707   
      🟩 12.5               Pass: 100%/32  | Total:  1h 22m | Avg:  2m 35s | Max:  6m 31s | Hits:  91%/986   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/23  | Total:  1h 00m | Avg:  2m 38s | Max:  8m 06s | Hits:  90%/707   
      🟩 nvcc12.5           Pass: 100%/32  | Total:  1h 22m | Avg:  2m 35s | Max:  6m 31s | Hits:  91%/986   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/55  | Total:  2h 23m | Avg:  2m 36s | Max:  8m 06s | Hits:  90%/1693  
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  4m 20s | Avg:  2m 10s | Max:  2m 13s | Hits:  93%/62    
      🟩 Clang10            Pass: 100%/2   | Total:  4m 16s | Avg:  2m 08s | Max:  2m 08s | Hits:  93%/62    
      🟩 Clang11            Pass: 100%/4   | Total:  7m 59s | Avg:  1m 59s | Max:  2m 10s | Hits:  93%/124   
      🟩 Clang12            Pass: 100%/4   | Total:  8m 58s | Avg:  2m 14s | Max:  2m 28s | Hits:  93%/124   
      🟩 Clang13            Pass: 100%/4   | Total:  8m 32s | Avg:  2m 08s | Max:  2m 22s | Hits:  93%/124   
      🟩 Clang14            Pass: 100%/6   | Total: 16m 30s | Avg:  2m 45s | Max:  4m 14s | Hits:  95%/186   
      🟩 Clang15            Pass: 100%/2   | Total:  4m 29s | Avg:  2m 14s | Max:  2m 18s | Hits:  93%/62    
      🟩 Clang16            Pass: 100%/6   | Total: 18m 37s | Avg:  3m 06s | Max:  4m 28s | Hits:  95%/186   
      🟩 GCC9               Pass: 100%/2   | Total:  4m 09s | Avg:  2m 04s | Max:  2m 06s | Hits:  87%/62    
      🟩 GCC10              Pass: 100%/4   | Total:  7m 46s | Avg:  1m 56s | Max:  2m 12s | Hits:  87%/124   
      🟩 GCC11              Pass: 100%/4   | Total:  8m 10s | Avg:  2m 02s | Max:  2m 11s | Hits:  87%/124   
      🟩 GCC12              Pass: 100%/12  | Total: 32m 39s | Avg:  2m 43s | Max:  4m 38s | Hits:  89%/372   
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  2m 37s | Avg:  2m 37s | Max:  2m 37s | Hits:  93%/31    
      🟩 MSVC14.36          Pass: 100%/1   | Total:  8m 06s | Avg:  8m 06s | Max:  8m 06s | Hits:  60%/25    
      🟩 MSVC14.39          Pass: 100%/1   | Total:  6m 31s | Avg:  6m 31s | Max:  6m 31s | Hits:  60%/25    
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  1h 13m | Avg:  2m 27s | Max:  4m 28s | Hits:  94%/930   
      🟩 GCC                Pass: 100%/22  | Total: 52m 44s | Avg:  2m 23s | Max:  4m 38s | Hits:  88%/682   
      🟩 Intel              Pass: 100%/1   | Total:  2m 37s | Avg:  2m 37s | Max:  2m 37s | Hits:  93%/31    
      🟩 MSVC               Pass: 100%/2   | Total: 14m 37s | Avg:  7m 18s | Max:  8m 06s | Hits:  60%/50    
    🟩 gpu
      🟩 v100               Pass: 100%/55  | Total:  2h 23m | Avg:  2m 36s | Max:  8m 06s | Hits:  90%/1693  
    🟩 jobs
      🟩 Build              Pass: 100%/47  | Total:  1h 50m | Avg:  2m 21s | Max:  8m 06s | Hits:  89%/1445  
      🟩 Test               Pass: 100%/8   | Total: 33m 10s | Avg:  4m 08s | Max:  4m 38s | Hits:  96%/248   
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  1m 49s | Avg:  1m 49s | Max:  1m 49s | Hits:  87%/31    
      🟩 90a                Pass: 100%/1   | Total:  1m 57s | Avg:  1m 57s | Max:  1m 57s | Hits:  87%/31    
    🟩 std
      🟩 17                 Pass: 100%/31  | Total:  1h 13m | Avg:  2m 22s | Max:  4m 28s | Hits:  91%/961   
      🟩 20                 Pass: 100%/24  | Total:  1h 10m | Avg:  2m 55s | Max:  8m 06s | Hits:  89%/732   
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
pycuda

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
+/- pycuda

🏃‍ Runner counts (total jobs: 56)

# Runner
41 linux-amd64-cpu16
9 linux-amd64-gpu-v100-latest-1
4 linux-arm64-cpu16
2 windows-amd64-cpu16

Comment on lines +47 to +50
if (status != CUDA_SUCCESS)
{
::cuda::__throw_cuda_error(static_cast<cudaError_t>(status), err_msg);
}
Collaborator

Do we want something like _CCCL_TRY_CUDA_API here?

Contributor

It could also be a function.

Contributor Author

This function should be more or less equivalent to _CCCL_TRY_CUDA_API. Am I missing some key difference here? I would have no issue turning it into a macro instead if that's preferred.

Collaborator

I believe a function is "cleaner" than a macro, but the macro cannot go as we cannot depend on cudax.

Otherwise we would need to move the function into libcu++

Contributor Author

We might need two separate functions/macros because the driver API returns CUresult and the runtime returns cudaError_t.

But these have the same values, so maybe we can add a cast to _CCCL_TRY_CUDA_API and remove this function 🤔
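A minimal sketch of that cast idea, assuming the two status types share numeric values as stated above. DriverResult and RuntimeError are hypothetical stand-ins for CUresult and cudaError_t, and check_status and cuda_error_sketch are names invented for the example, not part of libcu++ or cudax.

```cpp
#include <stdexcept>

// Hypothetical stand-ins; real code would use CUresult and cudaError_t.
enum DriverResult { DRIVER_SUCCESS = 0, DRIVER_ERROR_INVALID_VALUE = 1 };
enum RuntimeError { RUNTIME_SUCCESS = 0, RUNTIME_ERROR_INVALID_VALUE = 1 };

// Exception carrying the status in the runtime's error type.
struct cuda_error_sketch : std::runtime_error
{
  RuntimeError code;
  cuda_error_sketch(RuntimeError c, const char* msg)
      : std::runtime_error(msg)
      , code(c)
  {}
};

// One checker for both APIs: cast whatever status comes in to the runtime
// error type, relying on the assumption that the numeric values match.
template <class Status>
void check_status(Status status, const char* msg)
{
  const RuntimeError as_runtime = static_cast<RuntimeError>(status);
  if (as_runtime != RUNTIME_SUCCESS)
  {
    throw cuda_error_sketch(as_runtime, msg);
  }
}
```

With a single templated checker like this, a separate driver-only helper would not be needed, at the cost of depending on the two enumerations staying numerically aligned.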

{
static auto driver_fn = CUDAX_GET_DRIVER_FUNCTION(cuCtxPushCurrent);
call_driver_fn(driver_fn, "Failed to push context", ctx);
}
Collaborator

Why are we dynamically loading these functions instead of including <cuda.h> and linking to libcuda?

Contributor Author

Otherwise we would need to require the -lcuda link flag. This is more in line with the current CUDA runtime, which does not require that flag. There are compatibility reasons why the current CUDA runtime does that, and we probably want the same thing.

Contributor

Directly linking to libcuda.so means that any consuming library would only run on machines with the CUDA driver installed. Any application with runtime logic to dispatch to CUDA vs. CPU based on hardware support would then fail to load when launched on a machine without the CUDA driver.

From a build engineer's standpoint, linking to libcuda.so should never happen.
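To make the dispatch scenario concrete, here is a small sketch in which the driver symbol is looked up at run time and a missing symbol selects the CPU path instead of preventing the program from loading. The loader is injected so the example runs anywhere; symbol_loader, pick_backend and both loaders are hypothetical names, and only the shape of cuInit (an unsigned flags argument, zero on success) is taken from the driver API.

```cpp
#include <string>

using generic_fn    = void (*)();                  // opaque entry point
using symbol_loader = generic_fn (*)(const char*); // hypothetical: nullptr if the driver is absent
using init_fn       = int (*)(unsigned int);       // same shape as cuInit; 0 means success

// Chooses a backend without any link-time dependency on the driver library.
std::string pick_backend(symbol_loader load)
{
  generic_fn sym = load("cuInit");
  if (sym == nullptr)
  {
    return "cpu"; // no driver on this machine: fall back instead of failing to load
  }
  init_fn init = reinterpret_cast<init_fn>(sym);
  return init(0) == 0 ? "cuda" : "cpu";
}

// Two fake loaders for demonstration: one machine without a driver, one with.
generic_fn loader_without_driver(const char*)
{
  return nullptr;
}
int fake_init(unsigned int)
{
  return 0;
}
generic_fn loader_with_fake_driver(const char*)
{
  return reinterpret_cast<generic_fn>(&fake_init);
}
```

Calling pick_backend(&loader_without_driver) takes the CPU branch, which is the case a hard link against libcuda.so would turn into a load failure.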

Collaborator

Cool, thanks. I knew there must be a reason. TIL

@@ -57,4 +57,8 @@ foreach(cn_target IN LISTS cudax_TARGETS)
launch/configuration.cu
)
target_compile_options(${test_target} PRIVATE $<$<COMPILE_LANG_AND_ID:CUDA,NVIDIA>:--extended-lambda>)

Cudax_add_catch2_test(test_target misc_tests ${cn_target}
Collaborator

Suggested change
- Cudax_add_catch2_test(test_target misc_tests ${cn_target}
+ cudax_add_catch2_test(test_target misc_tests ${cn_target}

Contributor

🟩 CI finished in 2h 28m: Pass: 100%/56 | Total: 2h 31m | Avg: 2m 42s | Max: 12m 40s | Hits: 97%/2408
  • 🟩 cudax: Pass: 100%/55 | Total: 2h 19m | Avg: 2m 31s | Max: 6m 44s | Hits: 97%/2408

    🟩 cpu
      🟩 amd64              Pass: 100%/51  | Total:  2h 10m | Avg:  2m 33s | Max:  6m 44s | Hits:  97%/2232  
      🟩 arm64              Pass: 100%/4   | Total:  8m 33s | Avg:  2m 08s | Max:  3m 00s | Hits:  97%/176   
    🟩 ctk
      🟩 12.0               Pass: 100%/23  | Total: 59m 01s | Avg:  2m 33s | Max:  6m 24s | Hits:  97%/1006  
      🟩 12.5               Pass: 100%/32  | Total:  1h 20m | Avg:  2m 30s | Max:  6m 44s | Hits:  97%/1402  
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/23  | Total: 59m 01s | Avg:  2m 33s | Max:  6m 24s | Hits:  97%/1006  
      🟩 nvcc12.5           Pass: 100%/32  | Total:  1h 20m | Avg:  2m 30s | Max:  6m 44s | Hits:  97%/1402  
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/55  | Total:  2h 19m | Avg:  2m 31s | Max:  6m 44s | Hits:  97%/2408  
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  4m 13s | Avg:  2m 06s | Max:  2m 07s | Hits: 100%/88    
      🟩 Clang10            Pass: 100%/2   | Total:  4m 05s | Avg:  2m 02s | Max:  2m 04s | Hits: 100%/88    
      🟩 Clang11            Pass: 100%/4   | Total:  8m 00s | Avg:  2m 00s | Max:  2m 05s | Hits: 100%/176   
      🟩 Clang12            Pass: 100%/4   | Total:  8m 12s | Avg:  2m 03s | Max:  2m 13s | Hits: 100%/176   
      🟩 Clang13            Pass: 100%/4   | Total:  8m 33s | Avg:  2m 08s | Max:  2m 14s | Hits: 100%/176   
      🟩 Clang14            Pass: 100%/6   | Total: 16m 33s | Avg:  2m 45s | Max:  4m 36s | Hits: 100%/264   
      🟩 Clang15            Pass: 100%/2   | Total:  4m 16s | Avg:  2m 08s | Max:  2m 10s | Hits: 100%/88    
      🟩 Clang16            Pass: 100%/6   | Total: 18m 48s | Avg:  3m 08s | Max:  4m 51s | Hits: 100%/264   
      🟩 GCC9               Pass: 100%/2   | Total:  3m 37s | Avg:  1m 48s | Max:  1m 52s | Hits:  95%/88    
      🟩 GCC10              Pass: 100%/4   | Total:  7m 59s | Avg:  1m 59s | Max:  2m 04s | Hits:  95%/176   
      🟩 GCC11              Pass: 100%/4   | Total:  7m 26s | Avg:  1m 51s | Max:  2m 06s | Hits:  95%/176   
      🟩 GCC12              Pass: 100%/12  | Total: 31m 52s | Avg:  2m 39s | Max:  4m 56s | Hits:  95%/528   
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  2m 30s | Avg:  2m 30s | Max:  2m 30s | Hits: 100%/44    
      🟩 MSVC14.36          Pass: 100%/1   | Total:  6m 24s | Avg:  6m 24s | Max:  6m 24s | Hits:  78%/38    
      🟩 MSVC14.39          Pass: 100%/1   | Total:  6m 44s | Avg:  6m 44s | Max:  6m 44s | Hits:  78%/38    
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  1h 12m | Avg:  2m 25s | Max:  4m 51s | Hits: 100%/1320  
      🟩 GCC                Pass: 100%/22  | Total: 50m 54s | Avg:  2m 18s | Max:  4m 56s | Hits:  95%/968   
      🟩 Intel              Pass: 100%/1   | Total:  2m 30s | Avg:  2m 30s | Max:  2m 30s | Hits: 100%/44    
      🟩 MSVC               Pass: 100%/2   | Total: 13m 08s | Avg:  6m 34s | Max:  6m 44s | Hits:  78%/76    
    🟩 gpu
      🟩 v100               Pass: 100%/55  | Total:  2h 19m | Avg:  2m 31s | Max:  6m 44s | Hits:  97%/2408  
    🟩 jobs
      🟩 Build              Pass: 100%/47  | Total:  1h 44m | Avg:  2m 13s | Max:  6m 44s | Hits:  97%/2056  
      🟩 Test               Pass: 100%/8   | Total: 34m 16s | Avg:  4m 17s | Max:  4m 56s | Hits:  97%/352   
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  1m 48s | Avg:  1m 48s | Max:  1m 48s | Hits:  95%/44    
      🟩 90a                Pass: 100%/1   | Total:  2m 01s | Avg:  2m 01s | Max:  2m 01s | Hits:  95%/44    
    🟩 std
      🟩 17                 Pass: 100%/31  | Total:  1h 11m | Avg:  2m 18s | Max:  4m 21s | Hits:  98%/1364  
      🟩 20                 Pass: 100%/24  | Total:  1h 07m | Avg:  2m 49s | Max:  6m 44s | Hits:  96%/1044  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 12m 40s | Avg: 12m 40s | Max: 12m 40s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
pycuda

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
+/- pycuda

🏃‍ Runner counts (total jobs: 56)

# Runner
41 linux-amd64-cpu16
9 linux-amd64-gpu-v100-latest-1
4 linux-arm64-cpu16
2 windows-amd64-cpu16

@pciolkosz pciolkosz merged commit 7a3dae7 into NVIDIA:main Jul 31, 2024
70 checks passed
pciolkosz added a commit to pciolkosz/cccl that referenced this pull request Aug 4, 2024
* Add a header to interact with driver APIs

* Add a test for the driver API interaction

* Format

* Fix formatting
Development

Successfully merging this pull request may close these issues.

Add internal wrapper for CUDA driver APIs
5 participants