Question about the definitions in block sparse attention #519
Comments
Yes, that is correct. Tokens within a block always attend to each other, no matter whether attention is uni- or bidirectional. However, if you look at a local window: in the unidirectional case you can consider that tokens within a block attend only to tokens in the blocks before them in the same local window, while in the bidirectional case all tokens in the local window (no matter which block they are in) attend to each other.
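To make that concrete, here is a small toy sketch (not DeepSpeed's actual implementation; the values of `block`, `num_local_blocks`, and the way windows are grouped are assumptions based on the description above). It expands a block-level layout into a token-level connectivity matrix, so you can see that the diagonal blocks stay dense in both modes, while the unidirectional mode only keeps earlier blocks within each local window:

```python
import numpy as np

# Toy illustration only: block-local sparsity layout.
# Assumed setup: block = 2 tokens per block, num_local_blocks = 2 blocks per window.
block = 2
num_local_blocks = 2
seq_len = 8
num_blocks = seq_len // block

def block_layout(unidirectional: bool) -> np.ndarray:
    """Block-level layout: entry [i, j] == 1 means block i attends to block j."""
    layout = np.zeros((num_blocks, num_blocks), dtype=int)
    for i in range(num_blocks):
        window_start = (i // num_local_blocks) * num_local_blocks
        for j in range(window_start, window_start + num_local_blocks):
            if unidirectional and j > i:
                continue  # causal case: a block never attends to later blocks
            layout[i, j] = 1
    return layout

def token_mask(layout: np.ndarray) -> np.ndarray:
    """Expand the block layout to a token-level mask; intra-block entries are dense."""
    return np.kron(layout, np.ones((block, block), dtype=int))

for mode in (True, False):
    print("unidirectional" if mode else "bidirectional")
    print(token_mask(block_layout(mode)))
```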
So if we want to use it for machine translation, how should we set these hyperparameters? For machine translation, tokens within a block cannot attend to each other in the decoder.
You can use the attention mask to neutralize it; in that case, the attention mask has dimensions [leading dimensions, S, S], where S is the sequence length.
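For illustration, here is a minimal sketch of that idea (the additive-mask convention and the tensor shapes here are assumptions, not DeepSpeed's exact API): build a causal mask of shape [S, S] and add it to the attention scores, so that a token inside a block cannot attend to later positions even though the sparse layout keeps the block dense.

```python
import torch

S = 16  # sequence length
# Causal mask: 0 on and below the diagonal, -inf strictly above it.
causal_mask = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)

# Hypothetical attention scores from the sparse kernel: [batch, heads, S, S].
scores = torch.randn(2, 4, S, S)
probs = (scores + causal_mask).softmax(dim=-1)  # mask broadcasts over leading dims
```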
I see. That is indeed one option.
Hi, I have a question regarding block sparse attention.
If I understand the API description correctly, `block` is the block size (i.e., the number of tokens in a block), while `num_local_blocks` denotes the number of blocks in a local window (#tokens_per_window = block * num_local_blocks). So no matter which value (`unidirectional` or `bidirectional`) I choose for `attention`, the tokens within a block will attend to each other?