##############################
out: max 3.046875, mean 0.04052734375
lse: max 9.003575325012207, mean 7.748651027679443
out diff:
[0] max 0.0009765625, mean 2.372264862060547e-05
[1] max 0.001953125, mean 6.031990051269531e-05
lse diff:
[0] max 1.9073486328125e-06, mean 1.4424823291392386e-07
[1] max 1.9073486328125e-06, mean 2.30388167210549e-07
##############################
backward:
##############################
load_dq:
[0] max 2.828125, mean 0.04736328125
[1] max 0.3828125, mean 0.0308837890625
dq diff:
[0] max 0.001953125, mean 2.002716064453125e-05
[1] max 0.001953125, mean 5.555152893066406e-05
load_dk:
[0] max 2.578125, mean 0.04052734375
[1] max 0.31640625, mean 0.02197265625
dk0 diff:
[0] max 0.015625, mean 6.103515625e-05
[1] max 0.001953125, mean 3.814697265625e-05
load_dv:
[0] max 4.84375, mean 0.042236328125
[1] max 0.287109375, mean 0.02197265625
dv diff:
[0] max 0.03125, mean 5.984306335449219e-05
[1] max 0.0009765625, mean 3.528594970703125e-05
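The statistics above appear to be max/mean absolute differences between two implementations' outputs and gradients (the `[0]`/`[1]` indices presumably index two ranks or output chunks). A minimal sketch, with assumed tensor names, of how such numbers could be produced:

```python
import torch

def diff_stats(label: str, a: torch.Tensor, b: torch.Tensor) -> None:
    """Print max/mean absolute difference between two tensors, matching the log format above."""
    d = (a - b).abs()
    print(f"{label}: max {d.max().item()}, mean {d.mean().item()}")

# Hypothetical usage: compare a reference output against the ring-attention output, per chunk
# diff_stats("out diff [0]", out_ref.chunk(2, dim=1)[0], out_ring.chunk(2, dim=1)[0])
```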
lucidrains & zhuzilin have been hard at work over the last few days and have completed the following two ring-attention implementations:
Create a test setup that verifies correctness and compares the performance of both solutions (a rough sketch of such a setup follows after these tasks).
Phil decided to use a custom Triton kernel. Find out why this kernel is used and whether it is indeed faster than the CUDA FlashAttention-2 kernel (see the benchmark sketch at the end).
Please generate a short report of your findings, either as a markdown file or an ipynb notebook.
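A minimal sketch of one possible test setup, assuming each implementation can be wrapped as a `(q, k, v) -> out` callable; the commented-out import names are guesses rather than the repos' confirmed entry points, and a real ring-attention run additionally needs a `torch.distributed` process group (e.g. launched with `torchrun`), which is omitted here:

```python
import time
import torch
import torch.nn.functional as F

def reference_attention(q, k, v):
    # Plain single-device attention as ground truth for the correctness check.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def check_correctness(impl, name, q, k, v, atol=2e-3):
    out_ref = reference_attention(q, k, v)
    out = impl(q, k, v)  # assumed call signature
    diff = (out - out_ref).abs()
    print(f"{name}: max diff {diff.max().item():.3e}, mean diff {diff.mean().item():.3e}")
    assert diff.max().item() <= atol, f"{name} deviates from the reference"

def benchmark(impl, name, q, k, v, iters=20):
    # Warm up, then time forward passes; synchronize so CUDA timings are real.
    for _ in range(3):
        impl(q, k, v)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        impl(q, k, v)
    torch.cuda.synchronize()
    print(f"{name}: {(time.perf_counter() - start) / iters * 1e3:.2f} ms/iter")

if __name__ == "__main__":
    torch.manual_seed(0)
    # (batch, heads, seq, head_dim) layout assumed; adjust to each repo's convention.
    q, k, v = (torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))
    # Assumed imports -- check the repos for the actual entry points:
    # from ring_attention_pytorch import ring_flash_attn   # lucidrains
    # from ring_flash_attn import ring_flash_attn_func     # zhuzilin
    # for name, impl in [("lucidrains", ...), ("zhuzilin", ...)]:
    #     check_correctness(impl, name, q, k, v)
    #     benchmark(impl, name, q, k, v)
```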
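For the Triton-vs-CUDA question, a micro-benchmark along these lines could separate the kernels themselves from the ring communication; `flash_attn_func` (from the flash-attn package) and `triton.testing.do_bench` are real APIs, while the import path for the custom Triton kernel is only a placeholder for wherever it lives in lucidrains' repo:

```python
import torch
from triton.testing import do_bench
from flash_attn import flash_attn_func  # CUDA FlashAttention-2

# Placeholder: import the custom Triton kernel from lucidrains' repo here.

# flash_attn_func expects (batch, seqlen, nheads, headdim) in fp16/bf16 on CUDA.
q, k, v = (torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

cuda_ms = do_bench(lambda: flash_attn_func(q, k, v, causal=True))
print(f"CUDA FlashAttention-2 forward: {cuda_ms:.3f} ms")
# triton_ms = do_bench(lambda: <triton kernel call>(q, k, v))  # fill in the real call
# print(f"Custom Triton kernel forward: {triton_ms:.3f} ms")
```

The `lse` values in the log above suggest the implementations carry log-sum-exp statistics across ring steps; whether the custom Triton kernel exists to expose that accumulation (something the stock CUDA kernel's public interface may not) is exactly the kind of question the report should answer.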