Question about the definitions in block sparse attention #519
Comments
Yes, that is correct. Tokens within a block always attend to each other, no matter whether attention is uni- or bidirectional. However, if you look at a local window: in the unidirectional case you can consider that tokens within a block attend only to tokens in the blocks before them in the same local window, while in the bidirectional case all tokens in the local window (no matter which block they are in) attend to each other.
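To make that concrete, here is a small toy sketch (not DeepSpeed's actual implementation; the values of `block`, `num_local_blocks`, and the way windows are grouped are assumptions based on the description above). It expands a block-level layout into a token-level connectivity matrix, so you can see that the diagonal blocks stay dense in both modes, while the unidirectional mode only keeps earlier blocks within each local window:

```python
import numpy as np

# Toy illustration only: block-local sparsity layout.
# Assumed setup: block = 2 tokens per block, num_local_blocks = 2 blocks per window.
block = 2
num_local_blocks = 2
seq_len = 8
num_blocks = seq_len // block

def block_layout(unidirectional: bool) -> np.ndarray:
    """Block-level layout: entry [i, j] == 1 means block i attends to block j."""
    layout = np.zeros((num_blocks, num_blocks), dtype=int)
    for i in range(num_blocks):
        window_start = (i // num_local_blocks) * num_local_blocks
        for j in range(window_start, window_start + num_local_blocks):
            if unidirectional and j > i:
                continue  # causal case: a block never attends to later blocks
            layout[i, j] = 1
    return layout

def token_mask(layout: np.ndarray) -> np.ndarray:
    """Expand the block layout to a token-level mask; intra-block entries are dense."""
    return np.kron(layout, np.ones((block, block), dtype=int))

for mode in (True, False):
    print("unidirectional" if mode else "bidirectional")
    print(token_mask(block_layout(mode)))
```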
So if we want to use it for machine translation, how should we set these hyperparameters? For machine translation, tokens within a block cannot attend to each other in the decoder.
You can use the attention mask to neutralize it; in that case, the attention mask has dimensions [leading dimensions, S, S], where S is the sequence length.
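For illustration, here is a minimal sketch of that idea (the additive-mask convention and the tensor shapes here are assumptions, not DeepSpeed's exact API): build a causal mask of shape [S, S] and add it to the attention scores, so that a token inside a block cannot attend to later positions even though the sparse layout keeps the block dense.

```python
import torch

S = 16  # sequence length
# Causal mask: 0 on and below the diagonal, -inf strictly above it.
causal_mask = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)

# Hypothetical attention scores from the sparse kernel: [batch, heads, S, S].
scores = torch.randn(2, 4, S, S)
probs = (scores + causal_mask).softmax(dim=-1)  # mask broadcasts over leading dims
```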
I see. That is indeed one option.
Hi, I have a question regarding block sparse attention.
If I understand the API description correctly, `block` is the block size (i.e., the number of tokens in a block), while `num_local_blocks` denotes the number of blocks in a local window (#tokens_per_window = block * num_local_blocks). So no matter which value (`unidirectional` or `bidirectional`) I choose for `attention`, the tokens within a block will attend to each other?