Welcome to the OpenCL-caffe wiki!
# OpenCL caffe Wiki

## Prerequisites
- a. System: Ubuntu >= 12.04 with an AMD GPU
- b. OpenCL runtime environment
- c. [clBLAS](https://github.com/clMathLibraries/clBLAS)
Please refer to the following page for instructions on steps b and c: https://github.com/amd/OpenCL-caffe/wiki/How-to-set-up-clBLAS-and-OpenCL
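Once the runtime is installed, one quick way to confirm that the GPU is visible to OpenCL is the standalone `clinfo` utility (assuming it is installed; it is not part of this repository):

```bash
# Lists all OpenCL platforms and devices; the AMD GPU should appear here
clinfo
```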
## Caffe dependency

The OpenCL caffe project is based on Berkeley's deep learning framework Caffe (https://github.com/BVLC/caffe), so first you need to install its dependencies by following Caffe's installation instructions at http://caffe.berkeleyvision.org/installation.html.
More information about Caffe can be found on the Caffe homepage (http://caffe.berkeleyvision.org/).
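On Ubuntu, the core dependencies can typically be installed as follows (this list is taken from Caffe's Ubuntu installation notes; treat that page as authoritative):

```bash
sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev \
    libopencv-dev libboost-all-dev libhdf5-serial-dev protobuf-compiler \
    libgflags-dev libglog-dev liblmdb-dev
```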
## How to build

Go to the root directory of OpenCL-Caffe (CAFFEROOT in the following):

```bash
export CLBLAS_ROOT=/opt/clBLAS-*/       # this should point to your own clBLAS path
export AMDAPPSDKROOT=/opt/AMDAPPSDK-*/  # this should point to your own AMD APP SDK path
mkdir build
cd build
cmake ..
make
make runtest   # to run the tests
```
The log files generated by caffe are redirected to a log subdirectory. Under the caffe directory, issue `mkdir log`. This creates the log dir for all the generated log files. For example, to check the training loss, issue `grep -ni loss log/caffe.INFO`.
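A matching line is a standard glog record; it would look roughly like this (illustrative values only; the exact timestamps and line numbers depend on your run):

```
I0915 10:30:02.112233  4321 solver.cpp:236] Iteration 100, loss = 2.30259
```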
## How to prepare data

- CIFAR data: take cifar10 (http://www.cs.toronto.edu/~kriz/cifar.html) as an example. `cd data/cifar10`, then run `./get_cifar10.sh`. After the download is finished, `cd examples/cifar10` and run `./create_cifar10.sh` (the full sequence is shown below).
- ImageNet data: please see the page "How to Create ImageNet 2012 data".
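Putting the CIFAR-10 steps together, starting from the repository root:

```bash
cd data/cifar10
./get_cifar10.sh          # download the CIFAR-10 archives
cd ../../examples/cifar10
./create_cifar10.sh       # convert the raw data into the database format Caffe reads
```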
## How to train

Under the directory CAFFEROOT, choose the network you want to train, e.g. `./examples/imagenet/train_alexnet.sh`.
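For example, to start AlexNet training from the repository root and then follow the loss in the log (per the log setup above):

```bash
./examples/imagenet/train_alexnet.sh
grep -ni loss log/caffe.INFO
```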
## Two schemes in conv layers

This OpenCL caffe framework supports two schemes in the conv layers. The original scheme is a straightforward OpenCL port of the original caffe: each inner conv loop computes one image of the minibatch. The batched scheme is an optimized implementation that packs multiple images into bigger matrices and calls sgemm on those bigger matrices, so each conv inner loop completes the computation of multiple images instead of one (the operand shapes are sketched below). The benefit of the batched scheme is that packing several skinny matrices into one bigger, more regularly shaped matrix is more favorable for BLAS performance. With the batched implementation, OpenCL caffe observes a 6x speedup in training and 4x in inference.
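As a rough illustration of why packing helps, here is a minimal sketch (not project code; M, K, and S are hypothetical AlexNet-conv1-like sizes) of how the batched scheme grows the sgemm operand shapes:

```cpp
#include <cstdio>

int main() {
    const int M = 96;                // output channels (example value)
    const int K = 3 * 11 * 11;       // input channels * kernel height * kernel width
    const int S = 55 * 55;           // output height * width of one image
    const int global_packing_N = 16; // images packed into one sgemm call

    // Original scheme: one sgemm per image, on a skinny K x S operand.
    std::printf("per-image sgemm: (%d x %d) * (%d x %d)\n", M, K, K, S);
    // Batched scheme: one sgemm covers global_packing_N images at once,
    // giving BLAS a larger, more regular matrix to work with.
    std::printf("batched sgemm:   (%d x %d) * (%d x %d)\n",
                M, K, K, global_packing_N * S);
    return 0;
}
```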
Instructions on how to switch:

- To use the original scheme, set the macro in include/caffe/common.hpp to zero: `#define use_packing_scheme 0`
- To use the batched scheme, set the macro in include/caffe/common.hpp to one: `#define use_packing_scheme 1`. You can further adjust the batched number to get optimal performance on the GPU you are using: `#define global_packing_N 16`
As a reference, on AMD FirePro Hawaii GPUs, global_packing_N = 16 demonstrates the best performance; on AMD Fiji GPUs, global_packing_N = 32 demonstrates the best performance.
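Putting the two macros together, the relevant excerpt of include/caffe/common.hpp would look like this (a sketch, here configured for the batched scheme with the Hawaii-tuned value):

```cpp
// include/caffe/common.hpp (excerpt)
#define use_packing_scheme 1  // 1 = batched scheme, 0 = original per-image scheme
#define global_packing_N 16   // images per sgemm call: 16 tuned for Hawaii, 32 for Fiji
```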
## How to check timing

To benchmark a model, run:

```bash
./build/tools/caffe time -model=models/bvlc_alexnet/deploy.prototxt -gpu 0
```
NOTE: Accurate timing info can be found in log/caffe.INFO.