
hi, will it support GPU, for example ARM Mali GPU? #23

Open
zif520 opened this issue Dec 22, 2015 · 146 comments

Comments

@zif520

zif520 commented Dec 22, 2015

hi sh1r0,
I am very interested in your project. Are there plans to support GPUs, for example the ARM Mali OpenCL 1.1 GPU?

@sh1r0
Owner

sh1r0 commented Dec 22, 2015

I would say that it's possible, but I'm not sure when. Currently, if you are interested in caffe w/ OpenCL support, you can refer to BVLC/caffe#2610.

@naibaf7

naibaf7 commented Dec 27, 2015

@sh1r0
It should be possible to get BVLC/caffe#2610 to work on Android. It can probably be done by replacing the Caffe used in this project with the https://github.com/naibaf7/caffe branch, plus some minor adaptations/fixes.

@sh1r0
Owner

sh1r0 commented Dec 27, 2015

@naibaf7
👍 But I took a look at your branch, and found that it carries too many commits to be applied to the upstream master branch like a simple patch. Would you like to rebase your branch?

@naibaf7

naibaf7 commented Dec 27, 2015

@sh1r0
Yes, and I guess you would need to use 32-bit indexing (pointers) instead of 64-bit indexing for Android devices.
What requirements would you have to be able to integrate this?

@sh1r0
Owner

sh1r0 commented Dec 27, 2015

@naibaf7
Yes, I guess so. 😛
I think a branch which is rebased to the latest master branch (BVLC/caffe@03a84bf) should be enough for me to have some trials.
Thanks.

@zif520
Author

zif520 commented Dec 28, 2015

@naibaf7 @sh1r0
Hi, I've run it by porting naibaf7/caffe into caffe-android-lib. It works well on CPU using EIGEN,
but it fails in greentea_memset() in GPU mode (Mali T880, OpenCL 1.1, 32-bit indexing).
The failure occurs in viennacl::ocl::enqueue(). I am not familiar with OpenCL, so I will learn enough about it to fix the problem later.
Could you give me some suggestions? Thanks!

@naibaf7

naibaf7 commented Dec 28, 2015

@zif520
Did you change both int_tp and int_tpc to 32-bit types, in both the OpenCL and the C++ parts of the code?

https://github.com/naibaf7/caffe/blob/master/include/caffe/definitions.hpp
and
https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/cl_headers/header.cl

However, it might break if you just change it, so I'll verify and fix that.
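As a rough illustration of what the 32-bit switch involves on the C++ side (a sketch only; the real type names come from definitions.hpp, and the OpenCL side in header.cl must be changed to match):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 32-bit index typedefs, mirroring the kind of change
// discussed above. On desktop builds these would be 64-bit types.
typedef int32_t  int_tp;
typedef uint32_t uint_tp;

// Any flattened indexing done with these types must now fit in 32 bits,
// which is fine for mobile-sized blobs but would overflow for very large ones.
inline int_tp flat_index(int_tp n, int_tp c, int_tp channels, int_tp spatial) {
    return (n * channels + c) * spatial;
}
```

The key constraint is that host and device agree on index widths, so the corresponding typedefs in the .cl headers must be switched in lockstep.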

I have a phone with an Adreno 330 GPU that should also be OpenCL-ready... I might try to fix it up myself :)... the great news is that OpenCL-Z (from the Play Store) reports a valid libOpenCL.so version 1.2 on that one!

@zif520
Author

zif520 commented Dec 29, 2015

@naibaf7
There are still some troubles in it; I will spend some days fixing it.
Also, amd/OpenCL-caffe#17 says clBLAS 2.4 supports OpenCL 1.1, so I will try that as well.
OpenCL-Z reports that my phone only supports OpenCL 1.1.

@naibaf7

naibaf7 commented Dec 29, 2015

@zif520
I am currently making my branch ready for 32-bit indexing again, so that both 64-bit and 32-bit work. Then it should be able to compile and run on OpenCL 1.1 Android devices.

It is not necessary to compile and use clBLAS with my branch; ViennaCL comes with a built-in BLAS that should work on mobile devices with OpenCL 1.1.

Can you share what you have done so far (adaptations, code, ...)? That would speed up the process.

@zif520
Author

zif520 commented Dec 29, 2015

@naibaf7
Yes! I will share it when it is completed. Running Caffe (or MXNet and so on) on phones is popular; many people want to do that.

@zif520 zif520 closed this as completed Dec 30, 2015
@naibaf7

naibaf7 commented Jan 1, 2016

@sh1r0
I currently don't have the time for a complete rebase - this has to wait a bit.

@zif520
What's the progress? Is it working with my latest updates?

@zif520
Author

zif520 commented Jan 1, 2016

@naibaf7
I am sorry, I went home for the new year. I will come back on 2016-01-04.

@sh1r0
Owner

sh1r0 commented Jan 2, 2016

@naibaf7
OK, that's fine. I tried to merge my branch onto yours for some early-stage trials. To see my progress, you can take a look at opencl_dev.

And there are some issues I found according to my tests:

  • CPU does not work as an OpenCL device (runtime error)
  • Running on the GPU is about 5 times slower than pure CPU mode (CPU_ONLY with OpenBLAS)

Note: my test device has a Qualcomm Snapdragon 801 (Qualcomm Krait 400 CPU and Qualcomm Adreno 330 GPU) and supports OpenCL 1.2.

I'm not quite sure if I missed anything I need to take care of, as I'm not familiar with OpenCL. :p

Thanks.

@sh1r0 sh1r0 reopened this Jan 2, 2016
@sh1r0 sh1r0 mentioned this issue Jan 2, 2016
@bhack

bhack commented Jan 2, 2016

@sh1r0 I don't know how well the AMD clBLAS or ViennaCL backends are optimized for this kind of device. Qualcomm has its own Snapdragon-optimized BLAS implementation, but it is still CPU-only.

@naibaf7

naibaf7 commented Jan 2, 2016

@sh1r0
Ok cool, at least you got it working!

Now, what is the runtime error that you get when using the CPU with OpenCL? I use a CPU BLAS with CPU devices instead of ViennaCL-BLAS or clBLAS, so that might cause issues here.

As for performance, it should definitely not be that slow. But to identify the culprit, I'd need some layer-wise timings to see what exactly runs slow. It's maybe something I can also look at, as I have an Adreno 330 as well.
Do you know how to do that quickly?

When you enumerate OpenCL devices, is the order the same as in OpenCL-Z?

@sh1r0
Owner

sh1r0 commented Jan 2, 2016

@naibaf7
Yes, it's really cool to have OpenCL working.

Sorry, I'm not sure what the problem might be, as I just got a segmentation fault when specifying CPU as the target device.

To get layer-wise timings, I think tools/caffe time is a good choice. However, with the OpenCL build, I failed to make any executable run successfully on Android. I got ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '' for classification (cpp_classification), and CANNOT LINK EXECUTABLE: cannot locate symbol "_ZN5caffe3NetIfEC1ERKSsNS_5PhaseEPKS1_" referenced by "./caffe"... for caffe. That's weird.
EDIT: For caffe, I now get Segmentation fault.

Yes, the order is consistent with that in OpenCL-Z.

@naibaf7

naibaf7 commented Jan 2, 2016

@sh1r0
Ok thanks, I'll try to work out what's going wrong.

Might it be that the binaries do not call set_mode and SetDevice properly? ViennaCL: FATAL ERROR: Could not find kernel 'fillbuffer_float' from program '' would imply the OpenCL kernels weren't compiled.

@zif520
Author

zif520 commented Jan 3, 2016

@sh1r0 If you refer to @naibaf7's code in caffe.cpp:test(), adding SetDevices() will fix it; the device has to be initialized first.

@naibaf7

naibaf7 commented Jan 3, 2016

@zif520
Yes, here, it is also important to mention that the device must be set before any solver or network is loaded. Knowledge of which device should be used is ultimately required to compile kernels, allocate memory and dispatch network/solver initialization.

It is even possible to have multiple devices work on multiple networks in parallel, but then the rules are as follows:

  • Caffe must be initialized with SetDevices on the main thread, providing a complete list of the devices to be used.
  • SelectDevice can be used to switch the device. When initializing networks on the main thread, select the device before creating a network or solver on that device.
  • The networks can be trained in parallel by using multiple host threads. In every thread, SelectDevice can switch to a different device. This selection will be thread local.
  • This threading feature should also work when being used in Android AsyncTasks, Java Threads or in Python Multithreading (without getting into GIL locks), making it very convenient to use.
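The thread-local selection behavior described by the rules above can be sketched with a small standalone example; the Caffe calls are replaced here by a plain thread_local variable (select_device and the device ids are stand-ins for illustration, not the real API surface):

```cpp
#include <cassert>
#include <thread>

// Stand-in for Caffe's per-thread device selection: thread_local gives
// each host thread its own copy, so a selection made on one thread
// does not leak into another.
thread_local int g_selected_device = 0;

void select_device(int id) { g_selected_device = id; }

// In real code, a network or solver would be created here after the
// selection, so it lands on the thread's chosen device.
int run_on_device(int id) {
    select_device(id);
    return g_selected_device;
}
```

Two host threads can call run_on_device(0) and run_on_device(1) concurrently and each sees only its own selection, which is the property that makes usage from AsyncTasks or Java threads safe.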

@sh1r0
Owner

sh1r0 commented Jan 3, 2016

@zif520
Thanks, I've got the CPU working as an OpenCL device. (I used SetDevice only before.)
But there might be other issues in tools/caffe, since it still does not work.

@naibaf7
I got some benchmark results; please refer to the link. time.cpp is basically caffe time. The number of iterations is 10 for CPU mode and 1 for GPU mode (as it takes ~6 minutes for a single iteration).
I found that there is little difference between using the CPU and the GPU as the OpenCL device. And as for forward timings, GPU mode (OpenCL) is ~70x slower than CPU mode.

@naibaf7

naibaf7 commented Jan 3, 2016

@sh1r0
I think now you benchmarked the OpenCL GPU twice:

    Caffe::SetDevices(gpus);
    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(gpus[0]);

should be either:

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(gpus[0]);

or:

    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevices(gpus);
    Caffe::SelectDevice(gpus[0], false);

Besides, the ViennaCL GEMM for convolution seems really unsuitable for the Adreno GPU. I don't know of any BLAS that is optimized for mobile GPUs. Better performance could probably be reached by implementing a simple direct convolution instead of using an explicit GEMM at all.
Maybe @karlrupp has an idea on this.
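For reference, "direct convolution" here means computing each output element straight from the input window, instead of materializing the im2col matrix and calling GEMM. A minimal single-channel sketch (stride 1, no padding; purely illustrative, not the kernel one would actually ship):

```cpp
#include <cassert>
#include <vector>

// Naive direct 2D convolution: one channel, stride 1, no padding.
// Avoids building the im2col matrix entirely, which is the alternative
// to the explicit-GEMM path discussed above.
std::vector<float> conv2d_direct(const std::vector<float>& in, int ih, int iw,
                                 const std::vector<float>& k, int kh, int kw) {
    int oh = ih - kh + 1, ow = iw - kw + 1;
    std::vector<float> out(oh * ow, 0.0f);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
            for (int ky = 0; ky < kh; ++ky)
                for (int kx = 0; kx < kw; ++kx)
                    out[y * ow + x] += in[(y + ky) * iw + (x + kx)] * k[ky * kw + kx];
    return out;
}
```

On a GPU this loop nest would be mapped so each work-item computes one output element, trading GEMM's memory blow-up for redundant input reads.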

@bhack

bhack commented Jan 3, 2016

@naibaf7 A tuning issue on Adreno was opened at clMathLibraries/clBLAS#136

@naibaf7

naibaf7 commented Jan 3, 2016

@bhack
Thanks, good to know. However, ViennaCL-BLAS (which is what we are currently using in this Android OpenCL experiment) seems to have optimization/tuning issues here as well.
It is a bit unfortunate: NVIDIA has a well-optimized cuBLAS for most of its devices, while other vendors have basically nothing to offer (yet).

@bhack

bhack commented Jan 3, 2016

@naibaf7 Have you experimented with https://github.com/ptillet/isaac? It could be an alternative path if clBLAS continues to not attract contributions from other vendors. /cc @ptillet

@bhack

bhack commented Jan 3, 2016

Google's Halide linear algebra apps (https://github.com/halide/Halide/tree/master/apps/linear_algebra) could also be benchmarked on Android.

@naibaf7

naibaf7 commented Jan 3, 2016

@bhack @zif520 @sh1r0
I added ISAAC compile support to the CMake and GNU Make builds on my branch, if anyone fancies trying it. It did not give a speedup on my GT650 or Intel HD4000; maybe it can work on mobile.

@bhack

bhack commented Jan 3, 2016

@ptillet What is the status?

@naibaf7

naibaf7 commented Sep 6, 2016

@blueardour
Kernel preloading might only help if compile time takes too long. Zero-copy is already implemented in Caffe where needed and available. In general, the whole stack already works on Mali/Android; only performance may sometimes be an issue, since all of the kernels are optimized for full-featured desktop graphics cards.

@zazd

zazd commented Sep 6, 2016

@naibaf7 @blueardour Yes, I ran a test using the simplest possible GEMM on the GPU (a triple loop, as on the CPU, without any optimizations such as memory tiling), and it takes about the same time as CLBlast on a Mali-T628. I also found that for small matrix sizes the simple kernel is a little faster than CLBlast, while for big matrix sizes CLBlast is much faster.

@naibaf7

naibaf7 commented Sep 6, 2016

@zazd
Yes, but ask @CNugteren about that, I think he has some tips on what to do on Mali. I thought he even implemented kernels for Mali.

@CNugteren

@zazd CLBlast is tunable, which means that if local memory usage is not beneficial on Mali, it simply won't use it. CLBlast already includes tuning results for ARM Mali, so it shouldn't be a factor of 20 slower than a 'simple gemm'.

However, GEMM performance on Mali with CLBlast is sub-optimal (off by roughly a factor of 2). This is because the ARM OpenCL compiler doesn't handle the style of kernel programming used. For example, inside the kernel I use these kinds of arrays and loops:

#define WPT 4 // or some other value: 1, 2, 4, 8, ...
float values[WPT];
for (int i = 0; i < WPT; ++i) { values[i] = 0.0f; }

I expect this array and this loop to be 'unrolled' into registers, but the ARM Mali compiler cannot do that, so it adds a lot of loads/stores and loop overhead/indexing. One solution is to write an entire Mali-specific kernel until this compiler issue is resolved. Someone is already doing that (CNugteren/CLBlast#81) in the form of an external kernel outside the normal library.
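The workaround amounts to replacing the array-plus-loop pattern with explicit scalar variables, so that no compiler unrolling is needed. A sketch in plain C++ (the real fix would live inside the OpenCL kernel; the scaling by 2 is just a placeholder computation):

```cpp
#include <cassert>

// Pattern that relies on the compiler promoting values[] to registers:
float sum_scaled_loop(const float* x) {
    const int WPT = 4;
    float values[4];
    for (int i = 0; i < WPT; ++i) { values[i] = x[i] * 2.0f; }
    return values[0] + values[1] + values[2] + values[3];
}

// Hand-unrolled variant: same arithmetic, but scalar registers only,
// which sidesteps the Mali compiler limitation described above.
float sum_scaled_unrolled(const float* x) {
    float v0 = x[0] * 2.0f, v1 = x[1] * 2.0f;
    float v2 = x[2] * 2.0f, v3 = x[3] * 2.0f;
    return v0 + v1 + v2 + v3;
}
```

Both functions compute the same result; the unrolled form just makes the register allocation explicit instead of hoping the compiler does it.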

@zazd

zazd commented Sep 17, 2016

@CNugteren Sorry, I made a mistake with the memory, so I got the wrong conclusion; I have changed it above. However, the simple GEMM is a little faster for small matrix sizes, about 2x-3x, but the absolute times there are short.

@ganesanramachandran

Any pointers on compiling Caffe for the Adreno GPU on a Snapdragon processor?

@neil1899

hi @naibaf7
Because the clBLAS and ViennaCL libraries are not appropriate for embedded platforms (such as Qualcomm),
I want to combine Greentea [OpenCL GPU mode] for some functions (such as
im2col.cl, math.cl, pooling.cl and so on) with the Qualcomm math library [similar to OpenBLAS] (CPU mode). Is that OK?

@naibaf7

naibaf7 commented Dec 29, 2016

@neil1899
Yes, sure. There are already multiple BLAS libraries, but you can easily add your own to Caffe.

@gfursin

gfursin commented Feb 3, 2017

Hi all. We just released a beta version of our tool to build the OpenCL branch of Caffe with all dependencies for Android via the Collective Knowledge framework. There is still a lot to be done, but you can try it as follows for an ARM64-based device such as the Samsung S7 (assuming you have the Android NDK and SDK installed):

$ sudo pip install ck
$ ck pull repo --url=https://github.com/dividiti/ck-caffe
$ ck install package:lib-caffe-bvlc-opencl-clblast-universal --target_os=android21-arm64

(Note that CK will download ~600 MB of sources including OpenCV, Boost, etc., and it may take 10-30 minutes to build all dependencies depending on your host machine ;) ...)

Then, if you connect your device via adb, you can do the following to compile and run classification example:
$ ck compile program:caffe-classification --target_os=android21-arm64
$ ck run program:caffe-classification --target_os=android21-arm64

At the moment, there is a problem with OpenCL kernel caching (it's very slow), but it's being worked on ...
In the meantime, you can slightly speed up execution via:
$ ck run program:caffe-classification --target_os=android21-arm64 --env.VIENNACL_CACHE_PATH=/data/local/tmp/viennacl_cache/ --env.CUDA_CACHE_DISABLE=1

More details about the CK-based installation are here: https://github.com/dividiti/ck-caffe/wiki/Installation

Hope it's of any help/use ...

@kindloaf

Hi @gfursin ,
I have a question regarding the compilation of ck-caffe. I would like to manually specify the path to the android toolchain (somehow the search script didn't find the compilers). How can I specify the path? Where can I find the file that stores the path to the compilers?

@gfursin

gfursin commented Apr 18, 2017

Hi @kindloaf. You can specify extra paths where CK will search for already installed software using the env var CK_DIRS. Just add the root directory of your Android NDK.
By the way, you can specify multiple paths; just separate them with : on Linux and ; on Windows ...
Please tell me if it works ...

@kindloaf

@gfursin Actually I managed to find the toolchain, so I'm not sure how to test the CK_DIRS feature.

@gfursin

gfursin commented Apr 18, 2017

@kindloaf you can help CK find an Android NDK in an "unusual" path as follows (on Linux):
$ export CK_DIRS=PATH_TO_YOUR_ANDROID_NDK ; ck detect soft:compiler.gcc.android.ndk --target_os=android21-arm64
You can then check whether CK successfully registered it via
$ ck show env

BTW, some more info about portable env/soft/package manager in the CK: https://github.com/ctuning/ck/wiki/Portable-workflows

@kindloaf

@gfursin I just tested with the following steps, and it worked perfectly:
(1) Ran ck detect, and the NDK was found.
(2) Moved the current NDK to a different folder.
(3) Ran ck detect again, and the NDK was not found.
(4) Ran ck detect with CK_DIRS set to the new NDK folder, and the NDK was found.

@gfursin

gfursin commented Apr 18, 2017

@kindloaf - cool! Glad that it worked! Thanks for your feedback!

@gfursin

gfursin commented Apr 18, 2017

By the way, you can now test if Caffe OpenCL (built via CK) can work on your mobile using this Android app: https://play.google.com/store/apps/details?id=openscience.crowdsource.video.experiments .

ARM64-based devices seem to be fine for medium-size models (SqueezeNet, GoogleNet), while ARMv7-a is OK only for small models like SqueezeNet. Performance results are available at http://cKnowledge.org/repo

@psyhtest & @fvella are now trying to improve/optimize CLBlast to speed up Caffe OpenCL for Android ...

@matthai

matthai commented Apr 26, 2017

@gfursin To be clear, are any of the measurements in http://cKnowledge.org/repo based on running models on mobile GPUs? I realize that some of the machines tested have Mali or Adreno GPUs, but I couldn't tell if the benchmark was run on the GPU or not.

Also, to summarize this discussion, what is currently the easiest way (if any) to run caffe on e.g. an Adreno or other decent mobile GPU?

@naibaf7

naibaf7 commented Apr 26, 2017

@matthai The easiest way is currently the one @gfursin described.
Performance is not optimal yet, though adding better support for ARM GPUs and CPUs is something that could be done relatively easily now: https://github.com/ARM-software/ComputeLibrary

@gfursin

gfursin commented Apr 26, 2017

@matthai, normally we run Caffe OpenCL scenarios on the GPU only. However, you can actually run Caffe on your Android mobile (with GPU) connected to your host machine (Linux or Windows) using CK. There you can select a different platform and device ID.

I described various ways to compile and run Caffe OpenCL for different Android devices here:

For example, if you have Android NDK and SDK installed, you can normally compile Caffe OpenCL and run classification example as following (for Samsung S7):
$ sudo pip install ck
$ ck pull repo --url=https://github.com/dividiti/ck-caffe
$ ck install package:lib-caffe-bvlc-opencl-clblast-universal --target_os=android21-arm64 --env.DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON
$ ck compile program:caffe-classification-opencl --speed --target_os=android21-arm64
$ ck run program:caffe-classification-opencl --target_os=android21-arm64

As @naibaf7 said, the performance is not yet optimal, and the new ARM Compute Library should provide impressive performance gains (@AnthonyARM). I started discussing with @psyhtest how to add this lib to the CK-Caffe workflow, but we just don't have time for that right now ...

@gfursin

gfursin commented Apr 26, 2017

@matthai Extra note: the Caffe and TensorFlow scenarios for cKnowledge.org/repo are compiled via CK as I showed above. The idea is to be able to reproduce these numbers from the command line and then use them as a reference for further optimization (via better library selection or compiler/OpenCL parameter auto-tuning) ...

@naibaf7

naibaf7 commented Apr 26, 2017

@gfursin We could team up adding support for ARM Compute Library if you want. Could go hand-in-hand with planned FP16 support that will come with AMD Vega.

@matthai

matthai commented Apr 26, 2017

@gfursin @naibaf7 Thanks for the quick and detailed discussion, guys.

To summarize, it sounds like integration of the ARM Compute Library is the final piece of the puzzle before GPU-accelerated Caffe is cleanly available on ARM/OpenCL platforms. Can't wait!

@naibaf7

naibaf7 commented Apr 26, 2017

@matthai If you want to speed it up, and help us develop it, you're also welcome to try integrating the ARM Compute Library within the BLAS multiplexer for OpenCL (greentea_math_functions.cpp)... it's rather simple to do :)

@gfursin

gfursin commented Apr 27, 2017

You are welcome, @matthai. With @naibaf7's OpenCL Caffe and the Collective Knowledge workflow framework, we should now have a cross-platform Caffe which can run on Linux, Windows, and Android with CUDA, OpenCL, and standard CPU support. This allows us to focus further on performance improvements ...

@naibaf7 - sure, we will be happy to team up to add the ARM Compute Library to Caffe. I need to finish a few urgent things within the next two weeks, but then we can sync about all of that ...

Will keep in touch!

@bhack

bhack commented May 1, 2017

Can this integration be done in upstream libdnn?

@naibaf7

naibaf7 commented May 1, 2017

@bhack
LibDNN does not rely on BLAS libraries. However, since the ARM Compute Library seems to be open source (and even for the parts that are not, analyzing what the kernels do is simple), I can use that knowledge to improve the LibDNN kernels for ARM GPUs this summer.
