Feature description

Add `1` to the denominator of the attention softmax, i.e. `exp(x_i) / [ 1 + ∑_j exp(x_j) ]`, and let the `TransformerEncoder`/`MultiHeadAttention` attention implementation conditionally use this activation instead of the default `softmax` (a minimal sketch is given below).

Feature motivation

TLDR: attention heads are forced (by the normal softmax implementation) to deposit information even when no useful information exists in a head for the present sequence. This is bad. It also results in a large variance in the weights, which makes quantization and compression difficult. This is fixed by adding `1` to the denominator.
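For concreteness, here is a minimal, numerically stable sketch of the proposed activation in plain Rust (standard library only; this is not Burn's actual implementation). The second example in `main` shows the behaviour described above: when every score is very negative, the weights collapse towards zero instead of being forced to sum to 1.

```rust
/// Sketch of quiet_softmax(x)_i = exp(x_i) / (1 + ∑_j exp(x_j)).
/// For numerical stability, subtract m = max(0, max_j x_j) from every logit;
/// the 0 included in the max stands for the implicit extra logit whose exp is the "+1".
fn quiet_softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(0.0_f64, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    // exp(-m) is the rescaled "+1" term of the denominator.
    let denom = (-m).exp() + exps.iter().sum::<f64>();
    exps.iter().map(|&e| e / denom).collect()
}

fn main() {
    // One dominant score: close to the standard softmax result.
    println!("{:?}", quiet_softmax(&[4.0, 1.0, 0.5]));
    // All scores very negative: every weight is ~0, so the head can deposit
    // (almost) nothing instead of being forced to distribute a total of 1.
    println!("{:?}", quiet_softmax(&[-20.0, -21.0, -19.5]));
}
```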
Suggest a Solution

I propose:

- a `quiet_softmax` activation function;
- a `use_quiet_softmax` configuration field for `MultiHeadAttention` (on `MultiHeadAttentionConfig`), provided as well for all the layers that use `MultiHeadAttention` internally (like `TransformerEncoder`), as sketched below.

There is a case to be made that this should be the default softmax implementation.
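A hypothetical sketch of how such a flag could select between the two activations. The struct and method names below are illustrative only (not Burn's actual API), and it reuses the `quiet_softmax` function from the sketch above.

```rust
/// Illustrative config, not Burn's actual `MultiHeadAttentionConfig`.
struct AttentionConfigSketch {
    /// When true, apply quiet_softmax to the attention scores;
    /// otherwise apply the standard softmax.
    use_quiet_softmax: bool,
}

impl AttentionConfigSketch {
    /// Apply the configured activation to one row of attention scores.
    fn activate(&self, scores: &[f64]) -> Vec<f64> {
        if self.use_quiet_softmax {
            quiet_softmax(scores) // defined in the sketch above
        } else {
            softmax(scores)
        }
    }
}

/// Standard softmax for comparison: exp(x_i) / ∑_j exp(x_j).
fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    let denom: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / denom).collect()
}
```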