Benchmark Data
David Sorber edited this page Aug 25, 2021 · 1 revision
- Used gr-bench approach and scripts
- Tests were run using CUDA loopback blocks in the flow graph shown above for each of the following three cases:
- stock GR 3.9 + legacy (double copy) loopback - shown in blue in the graphs below
- ngsched + legacy (double copy) loopback - shown in orange in the graphs below
- ngsched + single mapped custom buffer - shown in green in the graphs below
- Each test case iterated over various values for "veclen" (batch size) and number of loopback blocks
- "veclen" (batch size) values: 1024, 2048, 4096, 8192, 16384, 32768
- number of loopback blocks: 1, 2, 4, 16
- Each test case was run 10 times
- Each plot below shows execution time plotted against veclen. Because the total amount of data copied is fixed, an equivalent plot could show throughput (MB/s) vs. veclen.
- Total data copied was 100,000,000 8-byte `gr_complex` values, for a total of 800,000,000 bytes (~762.94 MiB)
- Dell XPS 15 laptop
- Intel i9-10885H (8 cores/16 threads)
- 32 GB DDR4-2933
- NVIDIA GTX 1650 GPU
- SuperMicro SM-X11DGQ Server
- 2x Intel Gold 6148 (20 cores/40 threads each)
- 512 GB DDR4-
- NVIDIA P100 GPU
- NVIDIA Jetson AGX Xavier
- 8-core ARM v8.2 CPU
- 32 GB 256-bit LPDDR4x
- 512-core Volta GPU with Tensor Cores
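
The test parameters listed earlier (three cases, six veclen values, four loopback block counts, ten runs each) and the fixed data size make it straightforward to turn a measured execution time into throughput. The following is a minimal Python sketch of that arithmetic, not part of the gr-bench scripts. The names `throughput_mb_per_s`, `CASES`, `VECLENS`, and `NUM_BLOCKS` are hypothetical, and it assumes the 800,000,000 bytes are counted once per run regardless of how many loopback blocks are in the flow graph.

```python
from itertools import product

# Parameters copied from the benchmark description on this page.
CASES = (
    "stock GR 3.9 + legacy (double copy) loopback",
    "ngsched + legacy (double copy) loopback",
    "ngsched + single mapped custom buffer",
)
VECLENS = (1024, 2048, 4096, 8192, 16384, 32768)
NUM_BLOCKS = (1, 2, 4, 16)
RUNS_PER_CONFIG = 10

NUM_SAMPLES = 100_000_000
BYTES_PER_SAMPLE = 8  # one gr_complex is two 32-bit floats
TOTAL_BYTES = NUM_SAMPLES * BYTES_PER_SAMPLE


def throughput_mb_per_s(elapsed_s, total_bytes=TOTAL_BYTES, binary=False):
    """Convert one run's execution time (seconds) into throughput.

    binary=False uses 1 MB = 10**6 bytes; binary=True uses 1 MiB = 2**20 bytes.
    Assumption: total_bytes is counted once per run, independent of the
    number of loopback blocks.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    unit = 2**20 if binary else 10**6
    return total_bytes / unit / elapsed_s


# Every (case, veclen, number of blocks) combination in the test matrix.
configs = list(product(CASES, VECLENS, NUM_BLOCKS))

if __name__ == "__main__":
    print(len(configs), "configurations,",
          len(configs) * RUNS_PER_CONFIG, "runs in total")
    print(f"{TOTAL_BYTES:,} bytes = {TOTAL_BYTES / 2**20:.2f} MiB")
    # 2.0 s is a made-up execution time for illustration, not a measured result.
    print(f"{throughput_mb_per_s(2.0):.1f} MB/s")
```

The `binary` flag is there because the ~762.94 figure quoted above is in units of 2^20 bytes, while an MB/s axis is often read in units of 10^6 bytes. Whichever unit is chosen should be stated on the plot.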