
Welcome to the OpenCL-caffe wiki!

#OpenCL caffe Wiki

##Prerequisites

Please refer to the following page for instructions on setting up clBLAS and OpenCL (prerequisite steps b and c): https://github.com/amd/OpenCL-caffe/wiki/How-to-set-up-clBLAS-and-OpenCL

##Caffe dependency

The OpenCL caffe project is based on Berkeley's deep learning framework Caffe (https://github.com/BVLC/caffe), so first you need to install the dependencies by following Caffe's installation instructions at http://caffe.berkeleyvision.org/installation.html.

More information about Caffe can be found on the Caffe homepage (http://caffe.berkeleyvision.org).
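
If you are on Ubuntu, the dependencies can be installed roughly as follows. This is a sketch based on the upstream Caffe installation guide; check that guide for your distribution and Caffe version:

    # Caffe core dependencies (package names per the upstream Caffe Ubuntu guide)
    sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev \
         libopencv-dev libhdf5-serial-dev protobuf-compiler
    sudo apt-get install --no-install-recommends libboost-all-dev
    sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev libatlas-base-dev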

##Make instructions

Go to the root directory of OpenCL-caffe (referred to as CAFFEROOT below) and run:

    export CLBLAS_ROOT=/opt/clBLAS-*/        # point this at your own clBLAS path
    export AMDAPPSDKROOT=/opt/AMDAPPSDK-*/   # point this at your own AMD APP SDK path

    mkdir build
    cd build
    cmake ..
    make
    make runtest    # optional: run the tests
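
Before running cmake, it can help to sanity-check the environment. The checks below are a sketch: they assume the clinfo utility is installed and that the two variables point at directories containing the clBLAS and OpenCL headers:

    clinfo | grep -i "device name"       # confirm the GPU is visible to the OpenCL runtime
    ls $CLBLAS_ROOT/include/clBLAS.h     # confirm CLBLAS_ROOT points at a clBLAS install
    ls $AMDAPPSDKROOT/include/CL/cl.h    # confirm AMDAPPSDKROOT contains the OpenCL headers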

##Log files

The log files generated by Caffe are redirected to a log subdirectory. Under the Caffe root directory, issue mkdir log; this creates the directory that holds all generated log files.

For example, to check the training loss, issue grep -ni loss log/caffe.INFO
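
If your build does not pick up the log subdirectory automatically, note that Caffe logs through glog, which honours the GLOG_log_dir environment variable. A sketch, assuming you run from CAFFEROOT:

    mkdir -p log
    export GLOG_log_dir=./log            # glog writes caffe.INFO etc. into ./log
    grep -ni loss log/caffe.INFO         # after a training run, check the loss values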

##Data preparation

* CIFAR data: take CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) as an example. cd data/cifar10 and run ./get_cifar10.sh; after the download finishes, cd examples/cifar10 and run ./create_cifar10.sh (consolidated commands below).
* ImageNet data: please see the page "How to create ImageNet 2012 data".
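
For convenience, here are the CIFAR-10 steps above as one copy-pasteable block, assuming you start from CAFFEROOT:

    cd data/cifar10 && ./get_cifar10.sh && cd ../..          # download the raw CIFAR-10 files
    cd examples/cifar10 && ./create_cifar10.sh && cd ../..   # convert them into Caffe's database format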

##Model training

Under CAFFEROOT, choose the network you want to train, e.g. ./examples/imagenet/train_alexnet.sh
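
The training scripts are thin wrappers around the caffe binary. The direct invocation below is only illustrative; the solver path actually used by train_alexnet.sh is an assumption, so check the script itself:

    # direct equivalent of a training script (solver path is an assumption)
    ./build/tools/caffe train -solver=models/bvlc_alexnet/solver.prototxt -gpu 0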

##Switch between the original scheme and the batched scheme

This OpenCL caffe framework supports two schemes in the conv layers. The original scheme is a straightforward OpenCL port of the original Caffe: each inner conv loop computes one image of the minibatch. The batched scheme is an optimized implementation that packs multiple images into bigger matrices and calls sgemm on those, so each conv inner loop completes the computation of several images instead of one. The benefit of the batched scheme is that multiple skinny matrices are packed into one bigger, more regularly sized matrix, which BLAS libraries handle much more efficiently; in effect, with global_packing_N = 16, sixteen skinny per-image matrices are combined into one matrix roughly sixteen times the size, so a single sgemm call replaces sixteen. With the batched implementation, OpenCL caffe observes a 6x speedup in training and 4x in inference.

Instructions on how to switch:

1. For the original scheme, set the macro in include/caffe/common.hpp to zero: #define use_packing_scheme 0

2. For the batched scheme, set the macro in include/caffe/common.hpp to one: #define use_packing_scheme 1. You can further adjust the batch number, #define global_packing_N 16, to get optimal performance on the GPU you are using.

As a reference, global_packing_N=16 demonstrates the best performance on AMD FirePro Hawaii GPUs, while global_packing_N=32 is best on AMD Fiji GPUs.
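
Since both settings are compile-time macros, switching schemes means editing the header and rebuilding. A sketch using sed, with paths relative to CAFFEROOT (adjust the values to taste):

    sed -i 's/#define use_packing_scheme .*/#define use_packing_scheme 1/' include/caffe/common.hpp
    sed -i 's/#define global_packing_N .*/#define global_packing_N 32/' include/caffe/common.hpp
    cd build && make && cd ..    # rebuild so the new macro values take effect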

##Benchmark deployment time

    ./build/tools/caffe time -model=models/bvlc_alexnet/deploy.prototxt -gpu 0
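
For steadier numbers you can average over more iterations; -iterations is a standard flag of the caffe time command:

    ./build/tools/caffe time -model=models/bvlc_alexnet/deploy.prototxt -gpu 0 -iterations 50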

NOTE: Accurate timing info can be found in log/caffe.INFO