I'm looking through the Differential Transformer paper and code, and I found that the GitHub version is based on FlashAttention and rotary embeddings.
I wonder whether there is any plan to upload a simple example of a transformer using Diff attention, together with example arguments (e.g. how to adjust num_heads relative to the original transformer's, or how to use other positional embeddings).
Thanks
I found several implementations on GitHub by searching for Differential Transformer, and I'm looking for an implementation with a static kv_cache and torch.compile for faster inference.
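For the static kv_cache + torch.compile part, here is a rough sketch of the usual pattern (pre-allocated buffers updated in place so tensor shapes stay constant during decoding). The class and argument names are illustrative, not from the official repo:

```python
# Minimal sketch of a static (pre-allocated) KV cache. Fixed shapes are the
# main thing torch.compile needs to capture a fast decode step; this is a
# generic pattern, not tied to the official DIFF Transformer code.
import torch

class StaticKVCache:
    def __init__(self, batch_size, num_heads, max_seq_len, head_dim,
                 dtype=torch.float16, device="cuda"):
        shape = (batch_size, num_heads, max_seq_len, head_dim)
        # Fixed-size buffers; positions beyond the current length stay zero.
        self.k_cache = torch.zeros(shape, dtype=dtype, device=device)
        self.v_cache = torch.zeros(shape, dtype=dtype, device=device)

    def update(self, pos: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # Write new key/value states in place at the given token positions
        # (pos is a 1-D index tensor, e.g. torch.arange(start, end)).
        self.k_cache.index_copy_(2, pos, k)
        self.v_cache.index_copy_(2, pos, v)
        # Return the full-length buffers so downstream shapes stay static;
        # unfilled positions are excluded via the attention mask.
        return self.k_cache, self.v_cache
```

During decoding you pass `pos = torch.tensor([current_position], device=device)` each step and mask out unfilled positions in the attention mask; keeping every shape constant is what lets torch.compile (e.g. with `mode="reduce-overhead"`) capture the decode step efficiently.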
Hi @DevKiHyun
You can refer to Section 3.1 and Appendix D in our paper for the detailed configurations of our models. You can also directly use the configs of open-sourced LLMs and change their model code to turn them into the Diff architecture.
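For anyone wanting a plain-PyTorch starting point, below is a minimal sketch of differential attention without flash-attention or rotary embeddings. It follows the formulation in Section 3.1 (two softmax maps subtracted with a learnable λ), but the module/parameter names and the simple per-head RMS normalization are my own simplification, so please check it against the paper rather than treating it as the official code. It uses half the number of heads of the baseline so each head holds the two query/key groups:

```python
import math
import torch
import torch.nn as nn

class DiffAttention(nn.Module):
    """Sketch of multi-head differential attention (causal, no KV cache)."""
    def __init__(self, d_model: int, num_heads: int, lambda_init: float = 0.8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads // 2  # two q/k groups per head
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # Learnable reparameterization of lambda (one set per layer here).
        self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_init = lambda_init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h, hd = self.num_heads, self.head_dim
        # Project and split queries/keys into two groups per head.
        q = self.q_proj(x).view(b, t, h, 2, hd).transpose(1, 2)   # (b, h, t, 2, hd)
        k = self.k_proj(x).view(b, t, h, 2, hd).transpose(1, 2)
        v = self.v_proj(x).view(b, t, h, 2 * hd).transpose(1, 2)  # (b, h, t, 2*hd)
        q1, q2 = q[..., 0, :], q[..., 1, :]
        k1, k2 = k[..., 0, :], k[..., 1, :]

        scale = 1.0 / math.sqrt(hd)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=x.device), 1)
        a1 = torch.softmax(q1 @ k1.transpose(-2, -1) * scale + causal, dim=-1)
        a2 = torch.softmax(q2 @ k2.transpose(-2, -1) * scale + causal, dim=-1)

        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)
        out = (a1 - lam * a2) @ v                                  # (b, h, t, 2*hd)

        # Headwise normalization (simplified per-token RMS), then rescale.
        out = out * torch.rsqrt(out.pow(2).mean(-1, keepdim=True) + 1e-6)
        out = out.transpose(1, 2).reshape(b, t, d) * (1 - self.lambda_init)
        return self.out_proj(out)
```

To match the baseline transformer's parameter count, pass `num_heads` equal to half the original head count (so each head's dimension doubles to hold the two groups), which is roughly what "adjust num_heads" comes down to.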