Parallel Programming for GPUs - Matrix Multiplication

Dense Matrix Multiplication (DMM) is a core component of many scientific computations. In this repository, we implement DMM for GPUs in CUDA in four versions, each one improving on the performance of the previous.

Algorithms

Naive: Simple implementation where each thread computes one element of the output matrix.
Coalesced memory accesses of A: Load tiles of the input matrix A into shared memory.
Reduced memory accesses: Load tiles of both input matrices A and B into shared memory.
cuBLAS: Use the cuBLAS library.
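The first and third versions above can be sketched roughly as follows. This is a minimal illustration, not the code in the cuda directory: the kernel names, the `run_dmm` host helper, the `TILE` width of 16, and the row-major layout are all assumptions made for the example, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed tile width; the tiled kernel expects TILE x TILE thread blocks

// Naive version: one thread per element of C (M x N) = A (M x K) * B (K x N).
// All matrices are assumed to be row-major.
__global__ void dmm_naive(const float *A, const float *B, float *C,
                          int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];  // every operand comes from global memory
    C[row * N + col] = acc;
}

// Tiled version: each block stages one tile of A and one tile of B in shared
// memory, so each global element is loaded once per block instead of once per thread.
__global__ void dmm_tiled(const float *A, const float *B, float *C,
                          int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Out-of-range positions are padded with zeros so any matrix size works.
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // both tiles must be fully loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next iteration overwrites the tiles
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper: copies A and B to the device, runs one of the
// kernels, and returns C. Error checking is omitted in this sketch.
std::vector<float> run_dmm(const std::vector<float> &A, const std::vector<float> &B,
                           int M, int N, int K, bool tiled) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    if (tiled) dmm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);
    else       dmm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);

    std::vector<float> C(M * N);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

Calling `run_dmm(A, B, M, N, K, true)` from a host program should return the same product as the naive path. In the tiled kernel each block reads every needed element of A and B from global memory once and reuses it from shared memory up to `TILE` times, which is the reduction in memory traffic that the third version relies on.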

Brief results

All experiments were performed on an NVIDIA Tesla K40c (Kepler architecture, compute capability 3.5).

Total performance on 2048×2048 matrices

Choosing the optimal thread block size

Performance for different problem sizes

Project Structure

cuda: Source code for DMM.
common: Helper source code.
make: Scripts for compiling the source code.
plots: Plots used to analyze the results.
results: Performance measurements for the different scenarios.
report: Final report (in Greek).

PanosAntoniadis/cuda-exercises-ntua
