Add Bfloat16 Benchmark and Benchmark Suite #71

Open
wants to merge 9 commits into
base: main

isVoid (Collaborator) commented Aug 15, 2024

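This PR adds a bfloat16 kernel benchmark suite that compares the runtime of a raw CUDA kernel against the runtime of an equivalent Numba kernel. The Numba kernel is expected to show high overhead as long as LTOIR is not supported.

Profiling shows the slowdown: the Numba kernel (PY) takes roughly 2.2 to 2.4 times as long as the raw CUDA kernel (GOLD) in average, median, and minimum time: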

GOLD: simple_kernel(float *)
PY:   cudapy::__main__::kernel[abi:v1,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKzLTg4gaGKFsG2oMQGEYakJSQB1PQBk0Bynm21OiwU1a0UoLGhDpQE8oxrNQE_3d](Array<float, (int)1, C, mutable, aligned>)

                       GOLD         PY
Time (%)              100.0      100.0
Total Time (ns)   1164770.0  2753366.0
Instances            1000.0     1000.0
Avg (ns)             1164.8     2753.4
Med (ns)             1152.0     2528.0
Min (ns)             1120.0     2495.0
Max (ns)             1504.0     8992.0
StdDev (ns)            21.1      814.3
Perf Ratio (PY / GOLD, %): 
Avg (ns)        236.383929
Med (ns)        219.444444
Min (ns)        222.767857
Max (ns)        597.872340
StdDev (ns)    3859.241706
dtype: float64
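
For reference, the "Perf Ratio" rows can be reproduced from the summary statistics in the table. The snippet below is a minimal sketch, not the code in this PR: the `perf_ratio` helper and the plain dictionaries are hypothetical, chosen so the example has no dependencies (the `dtype: float64` line suggests the suite itself prints a pandas Series). It divides each PY statistic by the matching GOLD statistic and scales to a percentage, so 100 means parity and values above 100 mean the Numba kernel is slower.

```python
# Summary statistics copied from the first profiling run (nanoseconds).
gold = {
    "Avg (ns)": 1164.8,
    "Med (ns)": 1152.0,
    "Min (ns)": 1120.0,
    "Max (ns)": 1504.0,
    "StdDev (ns)": 21.1,
}
py = {
    "Avg (ns)": 2753.4,
    "Med (ns)": 2528.0,
    "Min (ns)": 2495.0,
    "Max (ns)": 8992.0,
    "StdDev (ns)": 814.3,
}


def perf_ratio(py_stats, gold_stats):
    """Hypothetical helper: PY / GOLD as a percentage for each statistic."""
    return {
        name: py_stats[name] / gold_stats[name] * 100.0
        for name in gold_stats
    }


ratio = perf_ratio(py, gold)
for name, value in ratio.items():
    print(f"{name:<12} {value:12.6f}")
```

Note that the StdDev ratio is far larger than the others because the GOLD run has very little spread, so it is best read as a sign of run-to-run variance in the Numba kernel rather than as a slowdown factor.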

Contributes to #12

isVoid (Collaborator, Author) commented Aug 28, 2024

Update: with NVIDIA/numba-cuda#48 in place for Numba-CUDA, the overhead of the Numba CUDA kernel relative to the raw CUDA kernel becomes very low. Average runtime is now within about 1% of the raw CUDA kernel:

GOLD: simple_kernel(float *)
PY:   cudapy::__main__::kernel[abi:v1,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKiLTj0gIGIFp_2b2oLQFEYYkHSQB1OQAk0Bynm21OizQ1K0UoIGvDpQE8oxrNQE_3d](Array<float, (int)1, C, mutable, aligned>)

                       GOLD         PY
Time (%)              100.0      100.0
Total Time (ns)   1136068.0  1145038.0
Instances            1000.0     1000.0
Avg (ns)             1136.1     1145.0
Med (ns)             1121.0     1152.0
Min (ns)             1119.0     1119.0
Max (ns)             1504.0     1536.0
StdDev (ns)            21.6       53.1
Perf Ratio (PY / GOLD, %): 
Avg (ns)       100.783382
Med (ns)       102.765388
Min (ns)       100.000000
Max (ns)       102.127660
StdDev (ns)    245.833333
dtype: float64

isVoid (Collaborator, Author) commented Oct 7, 2024

We should add a README that documents how to use the benchmark suite.
