
Implement Variants of RoPE #176

Open · wants to merge 34 commits into base: main
Conversation

@ruanjm (Contributor) commented Mar 5, 2025

Includes:

  • Rotate style: supports both NEOX and GPT-J styles; the backward pass is implemented for both.
  • The hidden-dim size of freqs/sin/cos may be half that of the input/output tensor.
  • NoPE first: rotates only the latter half of the tensor.
  • Two channels: handles two inputs at once.
  • In-place mode: input and output share the same tensor, with optimizations specific to this mode.
  • Computes rotation angles or cos/sin from positions and offsets.

Compared with the legacy implementation, average fp16/bf16 latency is reduced to about 73.5% of baseline (ranging from 37.1% in the best case to 142.5% in the worst). More functionality is supported as well.
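To clarify the two rotate styles the PR supports, here is a minimal pure-Python reference sketch (function names are illustrative, not this kernel's API): GPT-J style rotates adjacent interleaved pairs `(x[2i], x[2i+1])`, while NEOX style pairs each element with its counterpart in the other half of the hidden dim, `(x[i], x[i + d/2])`. In both styles the cos/sin tables cover only half the hidden dim, which is the "freqs can be half of input" point above.

```python
import math

def rope_gptj(x, cos, sin):
    """GPT-J (interleaved) style: rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i in range(len(x) // 2):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * cos[i] - b * sin[i]
        out[2 * i + 1] = a * sin[i] + b * cos[i]
    return out

def rope_neox(x, cos, sin):
    """NEOX (half-rotated) style: rotate pairs (x[i], x[i + d/2])."""
    d2 = len(x) // 2
    out = list(x)
    for i in range(d2):
        a, b = x[i], x[i + d2]
        out[i] = a * cos[i] - b * sin[i]
        out[i + d2] = a * sin[i] + b * cos[i]
    return out

def angles(pos, dim, base=10000.0):
    """theta_i = pos / base**(2i/dim); cos/sin tables span half the hidden dim."""
    freqs = [pos / base ** (2 * i / dim) for i in range(dim // 2)]
    return [math.cos(t) for t in freqs], [math.sin(t) for t in freqs]
```

Both styles apply the same 2-D rotations, only the pairing of lanes differs, so at position 0 each reduces to the identity; the real kernel fuses these loops (and the in-place and two-channel variants) on the GPU rather than iterating element-wise.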

ruanjm added 30 commits March 5, 2025 08:27
fix script bug
@ruanjm ruanjm force-pushed the amd/dev/jruan/rope_support_vllm branch from 5a560f6 to d04b656 Compare March 5, 2025 08:27
@ruanjm ruanjm requested a review from valarLip March 5, 2025 08:49