Feature description

Add `1` to the denominator of the attention softmax, i.e. `exp(x_i) / [ 1 + ∑_j exp(x_j) ]`, and let the `TransformerEncoder`/`MultiHeadAttention` attention implementation conditionally use this activation instead of the default `softmax` (a minimal sketch is given below).

Feature motivation

TLDR: attention heads are forced (by the normal softmax implementation) to deposit information even when no useful information exists in a head for the present sequence. This is bad. It also results in a large variance in the weights, which makes quantization and compression difficult. This is fixed by adding `1` to the denominator.
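For concreteness, here is a minimal, numerically stable sketch of the proposed activation in plain Rust (standard library only; this is not Burn's actual implementation). The second example in `main` shows the behaviour described above: when every score is very negative, the weights collapse towards zero instead of being forced to sum to 1.

```rust
/// Sketch of quiet_softmax(x)_i = exp(x_i) / (1 + ∑_j exp(x_j)).
/// For numerical stability, subtract m = max(0, max_j x_j) from every logit;
/// the 0 included in the max stands for the implicit extra logit whose exp is the "+1".
fn quiet_softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(0.0_f64, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    // exp(-m) is the rescaled "+1" term of the denominator.
    let denom = (-m).exp() + exps.iter().sum::<f64>();
    exps.iter().map(|&e| e / denom).collect()
}

fn main() {
    // One dominant score: close to the standard softmax result.
    println!("{:?}", quiet_softmax(&[4.0, 1.0, 0.5]));
    // All scores very negative: every weight is ~0, so the head can deposit
    // (almost) nothing instead of being forced to distribute a total of 1.
    println!("{:?}", quiet_softmax(&[-20.0, -21.0, -19.5]));
}
```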
Suggest a Solution

I propose:

- a `quiet_softmax` activation function;
- a `use_quiet_softmax` configuration field for `MultiHeadAttention` (on `MultiHeadAttentionConfig`), provided as well for all the layers that use `MultiHeadAttention` internally (like `TransformerEncoder`), as sketched below.

There is a case to be made that this should be the default softmax implementation.
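A hypothetical sketch of how such a flag could select between the two activations. The struct and method names below are illustrative only (not Burn's actual API), and it reuses the `quiet_softmax` function from the sketch above.

```rust
/// Illustrative config, not Burn's actual `MultiHeadAttentionConfig`.
struct AttentionConfigSketch {
    /// When true, apply quiet_softmax to the attention scores;
    /// otherwise apply the standard softmax.
    use_quiet_softmax: bool,
}

impl AttentionConfigSketch {
    /// Apply the configured activation to one row of attention scores.
    fn activate(&self, scores: &[f64]) -> Vec<f64> {
        if self.use_quiet_softmax {
            quiet_softmax(scores) // defined in the sketch above
        } else {
            softmax(scores)
        }
    }
}

/// Standard softmax for comparison: exp(x_i) / ∑_j exp(x_j).
fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    let denom: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / denom).collect()
}
```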