The Transformer model is a neural network architecture that was introduced in the paper "Attention Is All You Need". It was designed to address some of the limitations of previous sequence-to-sequence models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs).
The Transformer model uses a novel self-attention mechanism that allows it to process input sequences in parallel, rather than in a sequential manner like RNNs. This makes the model highly parallelizable and much faster than traditional sequence models.
Another key feature of the Transformer is its ability to handle variable-length input sequences. Because the model contains no recurrence, it represents token order by adding positional encodings to the input embeddings, so sequences of any length can be processed.
The Transformer has achieved state-of-the-art results on a wide range of natural language processing tasks such as machine translation, text summarization, and language modeling. Its success has led to its widespread adoption in the research community and industry.
In this repository, we provide an implementation of the Transformer model in PyTorch. The implementation includes a class that simplifies the training process, making it easy for users to understand how the model works.
Most competitive neural sequence transduction models have an
encoder-decoder structure (cite). Here, the encoder maps an input sequence of symbol representations $(x_1, ..., x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, ..., z_n)$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, ..., y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of the figure above, respectively.
The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
We employ a residual connection
(cite) around each of the two
sub-layers, followed by layer normalization
(cite).
That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself.
To facilitate these residual connections, all sub-layers in the
model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.
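As a concrete illustration, this residual-plus-layer-normalization pattern can be wrapped in a small module. The sketch below is one way to write it in PyTorch; the class name SublayerConnection and the dropout placement are illustrative assumptions, not necessarily identical to the classes in this repository.

import torch.nn as nn


class SublayerConnection(nn.Module):
    "A residual connection followed by layer normalization (illustrative sketch)."

    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Implements LayerNorm(x + Sublayer(x)), with dropout on the sub-layer output.
        return self.norm(x + self.dropout(sublayer(x)))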
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
While the linear transformations are the same across different
positions, they use different parameters from layer to
layer. Another way of describing this is as two convolutions with
kernel size 1. The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.
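A minimal sketch of this position-wise feed-forward network in PyTorch, using the dimensions above as defaults; the class and attribute names are illustrative.

import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    "FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to each position independently."

    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # nn.Linear acts on the last dimension, so every position is transformed
        # separately and with the same parameters.
        return self.w_2(self.dropout(self.w_1(x).relu()))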
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
We call our particular attention "Scaled Dot-Product Attention".
The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
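In code, the matrix form above can be computed with a small helper. The sketch below is one way to write it in PyTorch; the optional mask and dropout arguments are assumptions added for later reuse, not part of the formula itself.

import math

import torch


def attention(query, key, value, mask=None, dropout=None):
    "Compute Scaled Dot-Product Attention: softmax(Q K^T / sqrt(d_k)) V."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where the mask is 0 are excluded from attention.
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn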
The two most commonly used attention functions are additive
attention (cite), and dot-product
(multiplicative) attention. Dot-product attention is identical to
our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of $d_k$ (cite). We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
We also modify the self-attention sub-layer in the decoder stack to
prevent positions from attending to subsequent positions. This
masking, combined with the fact that the output embeddings are offset by
one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.
import torch


def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    # The upper triangle above the diagonal marks the future positions;
    # comparing with 0 flips it so that allowed positions are True.
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0
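For example, subsequent_mask(4) yields a lower-triangular boolean pattern in which row i allows attention only to positions up to i (illustrative usage; the printed values follow from the definition above):

print(subsequent_mask(4)[0])
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])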
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head_1}, ..., \mathrm{head_h})W^O \quad \text{where } \mathrm{head_i} = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

where the projections are parameter matrices $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.
In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{\text{model}}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
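A compact sketch of multi-head attention with these settings; it reuses the attention helper sketched earlier, and the class name and tensor layout are illustrative assumptions rather than a definitive implementation.

import torch.nn as nn


class MultiHeadedAttention(nn.Module):
    def __init__(self, h=8, d_model=512, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h  # d_k = d_v = d_model / h = 64 for the base model
        self.h = h
        # Projections for Q, K, V plus the final output projection W^O.
        self.linears = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(4)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # the same mask is applied to every head
        nbatches = query.size(0)
        # 1) Project, then reshape (batch, seq, d_model) -> (batch, h, seq, d_k).
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        # 2) Apply scaled dot-product attention to all heads in parallel.
        x, _ = attention(query, key, value, mask=mask, dropout=self.dropout)
        # 3) Concatenate the heads and apply the final linear projection W^O.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)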
Similarly to other sequence transduction models, we use learned
embeddings to convert the input tokens and output tokens to vectors
of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
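A sketch of such an embedding layer, including the multiplication by $\sqrt{d_{\text{model}}}$; the class name Embeddings and the attribute lut (lookup table) are illustrative.

import math

import torch.nn as nn


class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # Scale the learned embeddings by sqrt(d_model) before the positional
        # encodings are added.
        return self.lut(x) * math.sqrt(self.d_model)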
Since our model contains no recurrence and no convolution, in order
for the model to make use of the order of the sequence, we must
inject some information about the relative or absolute position of
the tokens in the sequence. To this end, we add "positional
encodings" to the input embeddings at the bottoms of the encoder and
decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed.
In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{\text{model}}})$$

$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{\text{model}}})$$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid, and the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.
In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For
the base model, we use a rate of $P_{drop} = 0.1$.
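Putting the sinusoidal encodings and the dropout together, here is a sketch of a positional-encoding module; the class name, the max_len default, and the log-space computation of the frequencies are illustrative choices.

import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute the sinusoidal table once; frequencies are built in log space
        # for numerical stability.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # Add the fixed positional encodings to the embeddings, then apply dropout.
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)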