Does opencl-caffe support fp16? #55

Open
sixsamuraisoldier opened this issue Jul 24, 2016 · 12 comments
sixsamuraisoldier commented Jul 24, 2016

Hi there, I'm trying to figure out if this branch supports fp16 compute for the RX 480.
Thanks in advance
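
For what it's worth, whether a given driver exposes fp16 arithmetic at all can be probed by looking for the cl_khr_fp16 extension; a minimal host-side sketch using only the standard OpenCL C API (illustrative, not from this branch):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* fp16 arithmetic in kernels requires the cl_khr_fp16 extension. */
    size_t size = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &size);
    char *ext = (char *)malloc(size);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, ext, NULL);
    printf("cl_khr_fp16 %s\n",
           strstr(ext, "cl_khr_fp16") ? "supported" : "not supported");
    free(ext);
    return 0;
}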


naibaf7 commented Jul 24, 2016

@sixsamuraisoldier
Only NVIDIA has an experimental FP16 branch at the moment.
However, I will add FP16 support with OpenCL to that branch in the near future: https://github.com/BVLC/caffe/tree/opencl
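
For context, fp16 in OpenCL kernels goes through the cl_khr_fp16 extension; a minimal sketch of what such a kernel looks like (illustrative only, not code from either branch):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* Illustrative only: with cl_khr_fp16 enabled, 'half' becomes a full
   arithmetic type, so this AXPY runs in 16-bit floating point. */
__kernel void axpy_half(const float alpha,
                        __global const half *x,
                        __global half *y) {
    int i = get_global_id(0);
    y[i] = (half)alpha * x[i] + y[i];
}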


naibaf7 commented Jul 24, 2016

@sixsamuraisoldier
Note that this branch hasn't been updated in over 7 months. I am not sure it is still maintained by anyone, as @gujunli and the other developers of this branch have since left AMD, and AMD is focusing on HIP/ROCm/SPIR-V based approaches instead.
The official BVLC Caffe OpenCL branch is over at https://github.com/BVLC/caffe/tree/opencl


gstoner commented Jul 24, 2016

As naibaf7 stated, this was an experimental branch of Caffe. We recommend you use the BVLC version with OpenCL support moving forward.

One correction to naibaf7's comment: we are still going to support OpenCL. All future work from the Radeon Compute Team will be in support of the work up at BVLC. Right now the team is working on a solver for machine learning that is more optimized for the AMD GPU architecture than a generic math library.

ROCm is a new driver foundation for Linux compute that supports multiple languages:

  • Single-source C++ via HCC

  • HIP for device-focused C++ with a C-style runtime to simplify CUDA porting (see the sketch after this list)

  • OpenCL: this will be out this fall. We are working on a new foundation which supports a much richer set of capabilities, and at the same time we are bringing RX 480 support to ROCm as well.

I will be changing the readme to point people to the OpenCL port at BVLC.
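
To make the HIP point concrete, here is a minimal sketch of the CUDA-style porting model; hipMalloc, hipMemset and hipLaunchKernelGGL are the public HIP runtime API, while the kernel itself is purely illustrative:

#include <hip/hip_runtime.h>

// Illustrative kernel: the same shape as its CUDA counterpart.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main() {
    const int n = 1024;
    float *d = nullptr;
    hipMalloc(&d, n * sizeof(float));   // drop-in analogue of cudaMalloc
    hipMemset(d, 0, n * sizeof(float));
    hipLaunchKernelGGL(scale, dim3(n / 256), dim3(256), 0, 0, d, 2.0f, n);
    hipDeviceSynchronize();
    hipFree(d);
    return 0;
}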


naibaf7 commented Jul 24, 2016

@gstoner
Hey! Nice to hear from you :) great to have some signs from AMD again!
Did you see my latest email to you about 2-3 months back?


gstoner commented Jul 24, 2016

Right now we are heads down working on bringing out new capabilities. For example, the following family of solutions supports single-rate Float16:

  • Fiji class hardware: Radeon R9 Nano, R9 Fury, R9 Fury X, FirePro S9300 x2
  • Tonga: R9 380X
  • Polaris family: RX 480, RX 470, RX 460

Here are examples of some of the instructions supported; in the new GCN native ISA compiler we are working hard to expose Float16 (a source-level sketch follows the list):

• V_FREXP_EXP_I16_F16 Returns exponent of half precision float input, such that the original single float = significand * (2 ** exponent).

• V_CVT_F16_F32 Float32 to Float16.

• V_ADD_F16 D.f16 = S0.f16 + S1.f16. Supports denormals, round mode, exception flags, saturation.

• V_SUB_F16 D.f16 = S0.f16 - S1.f16. Supports denormals, round mode, exception flags, saturation. SQ translates to V_ADD_F16.

• V_MAC_F16 16-bit floating-point multiply-accumulate.

• V_FMA_F16 Fused half-precision multiply-add.

• V_MAD_F16 Floating point multiply-add (MAD). Gives same result as ADD after MUL_IEEE. Uses IEEE rules for 0*anything.

• V_MADAK_F16 16-bit floating-point multiply-add with constant add operand.

• V_MADMK_F16 16-bit floating-point multiply-add with multiply operand immediate.

• V_COS_F16 Cosine function.

• V_SIN_F16 Sine function.

• V_EXP_F16 Base2 exponent function

• V_LOG_F16 Base2 log function.

• V_SQRT_F16 if(S0.f16 == 1.0f) D.f16 = 1.0f; else D.f16 = ApproximateSqrt(S0.f16).

• V_FRACT_F16 Floating point ‘fractional’ part of S0.f.

• V_RCP_F16 if (S0.f16 == 1.0f), D.f16 = 1.0f; else D.f16 = ApproximateRecip(S0.f16).

• V_RSQ_F16 if(S0.f16 == 1.0f) D.f16 = 1.0f; else D.f16 = ApproximateRecipSqrt(S0.f16).

• V_RNDNE_F16 Floating-point Round-to-Nearest-Even Integer.

• V_TRUNC_F16 Floating point ‘integer’ part of S0.f. D.f16 = trunc(S0.f16). Round-to-zero semantics.

• V_LDEXP_F16 D.f16 = S0.f16 * (2 ** S1.i16).

• V_CEIL_F16 Floating point ceiling function.

• V_FLOOR_F16 Floating-point floor function.

• V_MAX_F16 D.f16 = max(S0.f16, S1.f16). IEEE compliant. Supports denormals, round mode, exception flags, saturation.

• V_MIN_F16 D.f16 = min(S0.f16, S1.f16). IEEE compliant. Supports denormals, round mode, exception flags, saturation.

• V_CVT_PKRTZ_F16_F32 Convert two float 32 numbers into a single register holding two packed 16-bit floats.

• V_DIV_FIXUP_F16 Given a numerator, denominator, and quotient from a divide, this opcode detects and applies special case numerics, modifies the quotient if necessary. This opcode also generates invalid, denorm, and divide by zero exceptions caused by the division.

• V_SUBREV_F16 D.f16 = S1.f16 - S0.f16. Supports denormals, round mode, exception flags, saturation. SQ translates to V_ADD_F16.
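
From OpenCL C these surface as ordinary half arithmetic; here is a hedged sketch of a kernel a GCN 3 compiler could lower to the V_*_F16 instructions above (the actual instruction selection is, of course, the compiler's choice):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* Illustrative: each line has a natural f16 instruction counterpart,
   but which opcode gets emitted is up to the compiler. */
__kernel void f16_math(__global const half *a,
                       __global const half *b,
                       __global half *out) {
    int i = get_global_id(0);
    half acc = fma(a[i], b[i], out[i]);  /* V_FMA_F16 / V_MAC_F16 territory */
    out[i] = sqrt(acc);                  /* V_SQRT_F16 territory */
}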

Also, did you know the GCN 3 architecture supports 32-bit and 16-bit integer math? (A short source-level sketch follows this list.)

• V_ADD_U16 D.u16 = S0.u16 + S1.u16. Supports saturation (unsigned 16-bit integer domain).

• V_SUB_U16 D.u16 = S0.u16 - S1.u16. Supports saturation (unsigned 16-bit integer domain).

• V_MAD_I16 Signed integer muladd.

• V_MAD_U16 Unsigned integer muladd.

• V_SAD_U16 Sum of absolute differences with accumulation.

• V_MAX_I16 D.i[15:0] = max(S0.i[15:0], S1.i[15:0]).

• V_MAX_U16 D.u[15:0] = max(S0.u[15:0], S1.u[15:0]).

• V_MIN_I16 D.i[15:0] = min(S0.i[15:0], S1.i[15:0]).

• V_MIN_U16 D.u[15:0] = min(S0.u[15:0], S1.u[15:0]).

• V_MUL_LO_U16 D.u16 = S0.u16 * S1.u16. Supports saturation (unsigned 16-bit integer domain).

• V_CVT_F16_U16 D.f16 = uint16_to_flt16(S.u16). Supports denormals, rounding, exception flags and saturation.

• V_CVT_F16_I16 D.f16 = int16_to_flt16(S.i16). Supports denormals, rounding, exception flags and saturation.

• V_SUBREV_U16 D.u16 = S1.u16 - S0.u16. Supports saturation (unsigned 16-bit integer domain). SQ translates this to V_SUB_U16 with reversed operands.
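
The same idea in the 16-bit integer domain; an illustrative ushort kernel whose abs_diff and accumulate map naturally onto the 16-bit instructions above:

/* Illustrative sum-of-absolute-differences on unsigned 16-bit data;
   abs_diff is a standard OpenCL C built-in. */
__kernel void u16_sad(__global const ushort *a,
                      __global const ushort *b,
                      __global uint *out) {
    int i = get_global_id(0);
    out[i] += abs_diff(a[i], b[i]);  /* V_SAD_U16-style accumulation */
}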

You can find out more on Float16 in the GCN 3 ISA manual: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf

Also, we have now added disassembler/assembler support to the compiler, and soon inline assembly support, so you will be able to tune your code even further.


sixsamuraisoldier (Author) commented Jul 24, 2016

Thanks everyone for the information, I will post this question on the OpenCL branch of Caffe.

sixsamuraisoldier (Author) commented Jul 24, 2016

@gstoner
One quick question: does Polaris (the RX 480) support fp16 at a 2:1 ratio?
Thanks



gstoner commented Jul 24, 2016

I guess I should have bolded the rate: it is 1x rate for this generation of GPUs. Remember, the base instructions are part of the GFX8 GPU family. We have more stuff coming.
The following family of solutions supports single-rate Float16:

  • Fiji class hardware: Radeon R9 Nano, R9 Fury, R9 Fury X, FirePro S9300 x2
  • Tonga: R9 380X
  • Polaris family: RX 480, RX 470, RX 460



naibaf7 commented Jul 24, 2016

@gstoner
Excited for the next generation then :)

Follow-up on BVLC Caffe here: BVLC/caffe#4515


gstoner commented Oct 29, 2016

Our development branch of the LLVM AMDGPU compiler will support native Float16 and Int16 instructions, instead of emulating FP16/Int16 with up-convert and down-convert instructions that go from FP16/Int16 to Float and back. We are now plumbing this through the tools.

These are f16 tests on Fiji hardware successfully executing a matrix multiplication with half types, first with conversion and then with native instructions.

Original, conversion-based:
flat_load_ushort v8, v[6:7]
flat_load_ushort v9, v[4:5]
v_cvt_f32_f16_e32 v8, v8
v_cvt_f32_f16_e32 v9, v9
v_mac_f32_e32 v3, v9, v8

New, native Float16:
flat_load_ushort v8, v[6:7]
flat_load_ushort v9, v[4:5]
v_mac_f16_e32 v3, v9, v8
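
Both sequences can come from the same source. The assembly above is actual output; the kernel below is an assumed, illustrative reconstruction of the kind of half multiply-accumulate that produces it:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* Assumed source shape: with conversion, each 16-bit load is widened via
   v_cvt_f32_f16 and accumulated in f32 (v_mac_f32); with native support,
   the multiply-accumulate stays in f16 (v_mac_f16). */
__kernel void mac_f16(__global const half *a,
                      __global const half *b,
                      __global half *c) {
    int i = get_global_id(0);
    c[i] += a[i] * b[i];
}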

One more thing: Eigen has been ported over to AMD GPUs via HIP.


gstoner commented Nov 13, 2016

Native Float16 and Int16 for GFX8.x-based GPUs is in the LLVM 4.0 source tree: llvm-mirror/llvm@9027123
