The Based architecture seems to have been updated: https://arxiv.org/abs/2402.18668. Any insights into how it compares with ReBased?

At this point, the updated arXiv version of Based reads more like follow-up research on subquadratic architectures than a simple upgrade. The new version combines linear attention with sliding-window attention, which is orthogonal to the choice of linear attention kernel studied in our paper. We do not yet have evaluations of a ReBased kernel combined with sliding-window attention.

Hi, I've just finished training the small 124M model, and it seems that replacing the conv1d with sliding-window attention is orthogonal to the Based/ReBased comparison, as we achieve a slightly better loss. We will update our preprint, and we plan to release the training pipeline and weights. Stay tuned!
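In case it helps to visualize the discussion, here is a minimal PyTorch sketch of what pairing a ReBased-style linear-attention kernel with causal sliding-window attention could look like. Everything here (the `HybridAttention` module, the `window` size, summing the two branches) is a hypothetical illustration, not code from the Based or ReBased repositories.

```python
# Hypothetical illustration only: not the actual Based/ReBased implementation.
import torch
import torch.nn as nn


def causal_linear_attention(q, k, v, scale, shift):
    """Causal linear attention with a ReBased-like learnable quadratic
    feature map phi(x) = (scale * x + shift)^2, in the naive cumsum form."""
    phi_q = (scale * q + shift) ** 2
    phi_k = (scale * k + shift) ** 2
    # Running prefix sums over the sequence give causality without softmax.
    kv = torch.einsum("bld,ble->blde", phi_k, v).cumsum(dim=1)
    k_sum = phi_k.cumsum(dim=1)
    num = torch.einsum("bld,blde->ble", phi_q, kv)
    den = torch.einsum("bld,bld->bl", phi_q, k_sum).clamp(min=1e-6)
    return num / den.unsqueeze(-1)


def sliding_window_attention(q, k, v, window):
    """Standard softmax attention restricted to a causal local window."""
    L, d = q.shape[1], q.shape[-1]
    i = torch.arange(L, device=q.device)
    # Token i attends only to tokens in [i - window + 1, i].
    band = (i[None, :] <= i[:, None]) & (i[None, :] > i[:, None] - window)
    scores = torch.einsum("bld,bmd->blm", q, k) / d**0.5
    scores = scores.masked_fill(~band, float("-inf"))
    return torch.einsum("blm,bme->ble", scores.softmax(dim=-1), v)


class HybridAttention(nn.Module):
    """Sketch of one block: sliding-window attention handles local context,
    the linear-attention kernel handles long-range context; outputs are summed."""

    def __init__(self, dim, window=128):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
        self.window = window

    def forward(self, x):  # x: (batch, length, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        local = sliding_window_attention(q, k, v, self.window)
        global_ = causal_linear_attention(q, k, v, self.scale, self.shift)
        return self.out(local + global_)
```

For example, `HybridAttention(64)(torch.randn(2, 256, 64))` returns a tensor of the same shape. The actual Based architecture interleaves different mixers across layers rather than summing them inside one block, so treat this purely as a shape-level sketch of the idea.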