Support "quiet softmax" (Attention Is Off By One) #691

Closed
wbrickner opened this issue Aug 24, 2023 · 2 comments
Labels
feature The feature request

Comments

@wbrickner
Contributor

Feature description

  • Softmax activation that adds 1 to the denominator: exp(x_i) / [ 1 + ∑ exp(x_j) ] (see the sketch after this list),
  • TransformerEncoder / MultiHeadAttention attention implementation that can conditionally use this activation instead of the default softmax.
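
For concreteness, here is a minimal sketch of the proposed activation on a plain slice (plain Rust, not tied to any particular tensor API; the max-shift for numerical stability is my addition, with 0 included in the max so the +1 term is handled consistently):

```rust
/// Minimal sketch of the proposed quiet softmax:
///   quiet_softmax(x)_i = exp(x_i) / (1 + sum_j exp(x_j))
/// The extra 1 acts like an additional logit pinned at 0, so the outputs can
/// sum to (nearly) 0 when every logit is very negative, i.e. the head is
/// allowed to "say nothing".
fn quiet_softmax(logits: &[f32]) -> Vec<f32> {
    // Shift by max(0, max logit) for numerical stability. The constant 1 in the
    // denominator is exp(0), so 0 must be included in the max for the shift to
    // leave the result unchanged.
    let m = logits.iter().copied().fold(0.0_f32, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - m).exp()).collect();
    let denom = (-m).exp() + exps.iter().sum::<f32>();
    exps.iter().map(|&e| e / denom).collect()
}

fn main() {
    // Strongly negative logits: weights sum to almost 0 instead of 1.
    println!("{:?}", quiet_softmax(&[-20.0, -18.0, -19.0]));
    // Ordinary logits: close to the standard softmax.
    println!("{:?}", quiet_softmax(&[1.0, 2.0, 3.0]));
}
```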

Feature motivation

TLDR:

  • Attention heads are forced (by the normal softmax implementation) to deposit information even when no useful information exists in a head for the present sequence. This is bad. It also results in large variance in the weights, which makes quantization and compression difficult.
  • This is fixed by adding 1 to the denominator.

Suggest a Solution

  • I propose a quiet_softmax activation function.
  • I propose a use_quiet_softmax configuration field for MultiHeadAttention (on MultiHeadAttentionConfig), exposed as well by all layers that use MultiHeadAttention internally (like TransformerEncoder); a rough sketch follows this list.
  • There is a case to be made that this should be the default softmax implementation.
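
As a rough illustration of the configuration side, a hypothetical sketch (struct and field names taken from this issue, not from the current API; the actual shape would be decided in the PR):

```rust
/// Hypothetical configuration sketch. `use_quiet_softmax` is the field proposed
/// above; defaulting it to false keeps existing models unchanged.
pub struct MultiHeadAttentionConfig {
    pub d_model: usize,
    pub n_heads: usize,
    /// When true, attention weights are computed with quiet_softmax
    /// instead of the standard softmax.
    pub use_quiet_softmax: bool,
}
```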
@wbrickner wbrickner changed the title Support "quiet attention" (Attention Is Off By One) Support "quiet softmax" (Attention Is Off By One) Aug 24, 2023
@wbrickner
Contributor Author

PR submitted which resolves this: #692

@antimora antimora added the feature The feature request label Nov 20, 2023
@wbrickner
Contributor Author

Merged, closed.
