Hi, will it support GPU, for example an ARM Mali GPU? #23
Comments
I would say that it's possible, but I'm not sure when. Currently, if you are interested in Caffe w/ OpenCL support, you can refer to BVLC/caffe#2610.
@sh1r0
@naibaf7
@sh1r0
@naibaf7
@naibaf7 @sh1r0
@zif520 https://github.com/naibaf7/caffe/blob/master/include/caffe/definitions.hpp However, it might break if you just change it, so I'll verify and fix that. I have a phone with an Adreno 330 GPU that should also be OpenCL-ready... I might try to fix it up myself :) ... and the great news is that OpenCL-Z (from the Play Store) reports a valid libOpenCL.so version 1.2 on that one!
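For context, a hedged sketch of the kind of index-type definitions that header controls (the `int_tp`/`uint_tp` names come from that branch, but the exact contents and the `USE_INDEX_64` switch shown here are assumptions, not the actual file):

```cpp
// Hedged sketch of a project-wide index type that can be narrowed for
// devices without good 64-bit support; exact definitions.hpp contents differ.
#include <cstdint>

#ifdef USE_INDEX_64
typedef int64_t  int_tp;   // 64-bit indexing (desktop GPUs)
typedef uint64_t uint_tp;
#else
typedef int32_t  int_tp;   // 32-bit indexing (many mobile OpenCL devices)
typedef uint32_t uint_tp;
#endif
```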
@naibaf7
@zif520 It is not necessary to compile and use clBLAS with my branch; ViennaCL comes with a built-in BLAS that should work on mobile devices with OpenCL 1.1. Can you share what you have done so far (adaptations, code, ...)? That would speed up the process.
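For illustration, a minimal sketch of what using ViennaCL's built-in BLAS looks like on the host side, with no clBLAS involved (sizes and fill values are illustrative; assumes ViennaCL headers are on the include path):

```cpp
// Minimal sketch: GEMM through ViennaCL's built-in BLAS (no clBLAS needed).
// Runs on whatever OpenCL device ViennaCL picks by default.
#include <vector>
#include <viennacl/matrix.hpp>
#include <viennacl/linalg/prod.hpp>

int main() {
  const std::size_t M = 64, K = 32, N = 48;
  std::vector<std::vector<float> > hostA(M, std::vector<float>(K, 1.0f));
  std::vector<std::vector<float> > hostB(K, std::vector<float>(N, 2.0f));

  viennacl::matrix<float> A(M, K), B(K, N);
  viennacl::copy(hostA, A);  // host -> device transfer
  viennacl::copy(hostB, B);

  viennacl::matrix<float> C = viennacl::linalg::prod(A, B);  // C = A * B
  return 0;
}
```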
@naibaf7
@naibaf7
@naibaf7 And there are some issues I found in my tests:
Note: my test device has a Qualcomm Snapdragon 801 (Qualcomm Krait 400 CPU and Qualcomm Adreno 330 GPU) and supports OpenCL 1.2. I'm not quite sure if I missed anything I need to take care of, as I'm not familiar with OpenCL. :p Thanks.
@sh1r0 I don't know how the AMD clBLAS or ViennaCL backends are optimized for this kind of device. Qualcomm has its own Snapdragon-optimized BLAS implementation, but it is still CPU-only.
@sh1r0 Now, what is the runtime error you get when using the CPU via OpenCL? I use a CPU BLAS with CPU devices instead of ViennaCL-BLAS or clBLAS, so that might cause issues here. As for performance, it should definitely not be that slow, but to identify where the culprit is, I'd need some layer-wise timings to see what exactly runs slow. Maybe something I can also have a look at, as I have an Adreno 330 as well. When you enumerate OpenCL devices, is the order the same as in OpenCL-Z?
@naibaf7 Sorry, I'm not sure what the problem might be, as I just got a segmentation fault when specifying the CPU as the target device. As for layer-wise timings, I'll see what I can get. Yes, the order is consistent with that in OpenCL-Z.
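For reference, Caffe's command-line tool has a `time` mode that prints per-layer forward/backward timings, which is one way to collect the numbers requested above (model path and device ID are illustrative):

```sh
# Per-layer timing with Caffe's built-in benchmark; paths/IDs illustrative.
./build/tools/caffe time --model=models/bvlc_reference_caffenet/deploy.prototxt \
  --gpu=0 --iterations=10
```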
@sh1r0 Might it be that the binaries do not call set_mode and SetDevice properly?
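A minimal sketch of what calling these properly looks like in host code (the device index is illustrative; on the OpenCL branch it selects from the enumerated OpenCL device list):

```cpp
// Minimal sketch: select the GPU code path and a specific device
// before creating or running any nets.
#include <caffe/caffe.hpp>

int main() {
  caffe::Caffe::set_mode(caffe::Caffe::GPU);  // GPU code path
  caffe::Caffe::SetDevice(0);                 // 0 = first enumerated device
  // ... load the net, run Forward(), etc. ...
  return 0;
}
```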
@zif520 It is even possible to have multiple devices work on multiple networks in parallel, but then the rules are as follows:
@zif520 @naibaf7
@sh1r0
should be either:
or:
Besides, I think the ViennaCL GEMM then seems really unsuitable for convolution on the Adreno GPU. I don't know of any BLAS that is optimized for mobile GPUs. Probably better performance could even be reached by implementing a simple direct convolution instead of using an explicit GEMM at all; see the sketch below.
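A minimal sketch of that direct-convolution idea in OpenCL C (single channel, no padding, unit stride; all names and the simplifications are illustrative, not code from the branch):

```c
// One work-item per output pixel; no im2col/GEMM, no local memory.
__kernel void direct_conv2d(__global const float* image,   // H x W
                            __global const float* filter,  // KH x KW
                            __global float* out,           // (H-KH+1) x (W-KW+1)
                            const int W, const int KH, const int KW) {
  const int ox = get_global_id(0);
  const int oy = get_global_id(1);
  const int OW = W - KW + 1;
  float acc = 0.0f;
  for (int ky = 0; ky < KH; ++ky)
    for (int kx = 0; kx < KW; ++kx)
      acc += image[(oy + ky) * W + (ox + kx)] * filter[ky * KW + kx];
  out[oy * OW + ox] = acc;
}
```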
@naibaf7 A tuning issue on Adreno was opened at clMathLibraries/clBLAS#136
@bhack
@naibaf7 Have you experimented with https://github.com/ptillet/isaac? It could probably be an alternative path if clBLAS continues to not attract contributions from other vendors. /cc @ptillet
Also, Google's https://github.com/halide/Halide/tree/master/apps/linear_algebra could be benchmarked on Android.
@ptillet What is the status?
@blueardour
@naibaf7 @blueardour Yes, I ran a test using the simplest possible GEMM on the GPU (a plain triple loop, as on the CPU, without any optimizations such as memory tiling), and on the Mali-T628 it takes about the same time as CLBlast. I also found that for small matrix sizes the simple kernel is a little faster than CLBlast, while for big matrix sizes CLBlast is much faster.
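For reference, the kind of "most simple GEMM" described is roughly this (OpenCL C; one work-item per output element, no tiling or local memory; names and row-major layout are illustrative):

```c
// Naive triple-loop GEMM: each work-item computes one element of C.
__kernel void naive_sgemm(__global const float* A,  // M x K
                          __global const float* B,  // K x N
                          __global float* C,        // M x N
                          const int M, const int N, const int K) {
  const int row = get_global_id(0);
  const int col = get_global_id(1);
  if (row >= M || col >= N) return;
  float acc = 0.0f;
  for (int k = 0; k < K; ++k)
    acc += A[row * K + k] * B[k * N + col];
  C[row * N + col] = acc;
}
```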
@zazd
@zazd CLBlast is tunable, which means that if local memory usage is not beneficial on Mali, it will simply not use it. CLBlast already includes tuning results for ARM Mali, so it shouldn't be a factor of 20 slower than a 'simple GEMM'. However, GEMM performance on Mali with CLBlast is sub-optimal (~ a factor of 2). This is because the ARM OpenCL compiler doesn't handle the style of kernel programming used. For example, inside the kernel I use these kinds of arrays and loops:
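A hedged sketch of that pattern (not the actual CLBlast source; `MWI` is an illustrative work-per-thread constant, though `apm` matches the array naming used in CLBlast's GEMM kernels):

```c
// A small private array indexed by compile-time-bounded loops, which a good
// compiler promotes entirely to registers.
#define MWI 4
__kernel void accumulate_wpt(__global const float* src, __global float* dst) {
  const int gid = get_global_id(0);
  float apm[MWI];              // expected to live in registers
  #pragma unroll
  for (int mi = 0; mi < MWI; ++mi) {
    apm[mi] = src[gid * MWI + mi];
  }
  float sum = 0.0f;
  #pragma unroll
  for (int mi = 0; mi < MWI; ++mi) {
    sum += apm[mi];            // on Mali this reportedly becomes real
  }                            // loads/stores plus loop indexing overhead
  dst[gid] = sum;
}
```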
I expect this array and this loop to be 'unrolled' into registers, but the ARM Mali compiler cannot do that, adding a lot of loads/stores and loop overhead/indexing. One solution is to write an entire Mali-specific kernel until this issue is resolved. Someone is already doing that (CNugteren/CLBlast#81) in the form of an external kernel outside the normal library.
@CNugteren Sorry, I made a mistake with the memory measurement and got the wrong conclusion; I have corrected it above. However, the simple GEMM is still a little faster for small matrix sizes, about 2x-3x, though the absolute times are short.
Any pointers on compiling Caffe for an Adreno GPU on a Snapdragon processor?
Hi @naibaf7
@neil1899
Hi all. We just released a beta version of our tool to build the OpenCL branch of Caffe with all dependencies for Android via the Collective Knowledge framework. There is still a lot to be done, but you can try it as follows for an ARM64-based device such as the Samsung S7 (assuming you have the Android NDK and SDK installed):
$ sudo pip install ck
Note that CK will download ~600 MB of sources (including OpenCV, Boost, etc.), and it may take 10-30 minutes to build all dependencies depending on your host machine ;) ... Then, if you connect your device via adb, you can do the following to compile and run the classification example:
At the moment there is a problem with OpenCL kernel caching (it's very slow), but it's being worked on ... More details about the CK-based installation are here: https://github.com/dividiti/ck-caffe/wiki/Installation
Hope it's of some help/use ...
Hi @gfursin,
Hi @kindloaf. You can specify extra paths where CK will search for already-installed software using the env var CK_DIRS. Just add the root directory of your Android NDK.
@gfursin Actually, I managed to find the toolchain, so I'm not sure how to test the CK_DIRS feature.
@kindloaf You can help CK find the Android NDK in an "unusual path" on Linux as sketched below. BTW, some more info about the portable env/soft/package manager in CK: https://github.com/ctuning/ck/wiki/Portable-workflows
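A hedged sketch of that setup, building on the CK_DIRS variable mentioned above (the NDK path and the detection tags are assumptions, not the exact command from this thread):

```sh
# Point CK's software search at an NDK in a non-standard location,
# then re-run software detection. Path and tags are illustrative.
export CK_DIRS=/opt/android-ndk-r13b
ck detect soft --tags=compiler,ndk
```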
@gfursin I just tested with the following steps, and it worked perfectly:
@kindloaf - cool! Glad that it worked! Thanks for your feedback!
By the way, you can now test whether Caffe OpenCL (built via CK) works on your mobile device using this Android app: https://play.google.com/store/apps/details?id=openscience.crowdsource.video.experiments. ARM64-based devices seem to be fine for medium-size models (SqueezeNet, GoogleNet), while ARMv7-a is OK only for small models like SqueezeNet. Performance results are available at http://cKnowledge.org/repo. @psyhtest & @fvella are now trying to improve/optimize CLBlast to speed up Caffe OpenCL for Android ...
@gfursin To be clear, are any of the measurements at http://cKnowledge.org/repo based on running models on mobile GPUs? I realize that some of the machines tested have Mali or Adreno GPUs, but I couldn't tell whether the benchmark was run on the GPU or not. Also, to summarize this discussion: what is currently the easiest way (if any) to run Caffe on, e.g., an Adreno or other decent mobile GPU?
@matthai The easiest way is currently the way @gfursin is saying.
@matthai Normally we run Caffe OpenCL scenarios on the GPU only. However, you can actually run Caffe on your Android mobile (with GPU) connected to your host machine (Linux or Windows) using CK; there you can select a different platform and device ID. I described various ways to compile and run Caffe OpenCL for different Android devices here:
For example, if you have the Android NDK and SDK installed, you can normally compile Caffe OpenCL and run the classification example as follows (for a Samsung S7):
As @naibaf7 said, the performance is not yet optimal, and the new ARM Compute Library should provide impressive performance gains (@AnthonyARM). I started discussing with @psyhtest how to add this lib to the CK-Caffe workflow, but we just don't have time for that right now ...
@matthai Extra note: the Caffe and TensorFlow scenarios for cKnowledge.org/repo are compiled via CK as I showed above. The idea is to be able to reproduce these numbers from the command line and then use them as a reference for further optimization (via better library selection or compiler/OpenCL parameter auto-tuning) ...
@gfursin We could team up on adding support for the ARM Compute Library if you want. It could go hand-in-hand with the planned FP16 support that will come with AMD Vega.
@matthai If you want to speed it up and help us develop it, you're also welcome to try integrating the ARM Compute Library within the BLAS multiplexer for OpenCL (...)
You are welcome, @matthai. With @naibaf7's OpenCL Caffe and the Collective Knowledge workflow framework, we should now have a cross-platform Caffe which can run on Linux, Windows, and Android with CUDA, OpenCL, and standard CPU support. This allows us to focus further on performance improvements ... @naibaf7 - sure, we will be happy to team up to add the ARM Compute Library to Caffe. I need to finish a few urgent things within the next two weeks, but we can sync about all that then ... Will keep in touch!
Can this integration be done in upstream libdnn? |
@bhack
Hi sh1r0,
I am very interested in your project. Are there plans to support GPU, for example an ARM Mali OpenCL 1.1 GPU?