Compile and run main.cu, or open output.txt, to compare different implementations of matrix multiplication.
"native" - naive (straightforward) implementation
"modified native" - naive implementation with a modified traversal order and a small memory access optimization
"with shared memory" - implementation that uses CUDA shared memory
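
To illustrate the difference between the first and the last variant, here is a minimal sketch. It is not the code from main.cu. The kernel names, the TILE size of 16, the square row-major matrices and the helper max_diff_naive_vs_shared are all assumptions made for this example, and CUDA error checking is left out for brevity.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width, not taken from main.cu

// Naive variant: one thread per element of C = A * B (n x n, row-major).
// Every operand is read straight from global memory.
__global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Shared-memory variant: each block stages one TILE x TILE tile of A and of B
// in shared memory, then all threads of the block reuse the staged values.
__global__ void matmul_shared(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range positions are padded with zeros so n need not be a multiple of TILE.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten in the next iteration
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical helper: runs both kernels on the same input and returns the
// largest absolute difference between their results.
float max_diff_naive_vs_shared(int n) {
    assert(n > 0);
    size_t count = (size_t)n * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hNaive(count), hShared(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = (float)(i % 7) * 0.5f;
        hB[i] = (float)(i % 5) - 2.0f;
    }
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_naive<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hNaive.data(), dC, bytes, cudaMemcpyDeviceToHost);
    matmul_shared<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hShared.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float maxDiff = 0.0f;
    for (size_t i = 0; i < count; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(hNaive[i] - hShared[i]));
    return maxDiff;
}
```

Calling max_diff_naive_vs_shared(50) from a main function and building with nvcc should give a difference close to zero, since both kernels compute the same product. The shared-memory kernel is usually faster on large matrices because each value loaded from global memory is reused TILE times within the block, but the actual numbers depend on the GPU and the matrix size.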