Dense Matrix Multiplication (DMM) is one of the core components in many scientific computations. In this repository, we implement the DMM algorithm for GPUs in CUDA using 4 algorithms, increasing each time the total performance.
- Naive: Simple implementation where each thread just computes one element from the output matrix.
- Coalesced memory acceses of A: Load tiles of the input matrix A in the shared memory.
- Reduced memory accesses: Load tiles of the input matrices A and B in the shared memory.
- Using cuBLAS library
All experiments were performed in a NVIDIA Tesla K40c (kepler architecture and compute capability=3.5)
- Total Performance in 2048×2048 matrices
- Choosing the optimal thread block size
- Performance in different problem sizes
cuda:
Source code for DMM.common:
Helper source code.make:
Scripts for compiling the source code.plots:
Plots in order to analyze our results.results:
Performance of different scenarios.report:
Final report in Greek.