Try to create an addition-based ViT instead of a matrix-multiplication-based one.
In recent years, neural network computation has relied heavily on matrix multiplication, which is a computationally expensive operation. Since the dot product at the heart of matrix multiplication essentially computes a dot-product similarity (an unnormalized cosine similarity), this report explores whether a cheaper alternative to that similarity can be found, with the goal of accelerating inference in neural networks.
There are already some works that have explored similar ideas. For instance, AdderNet replaces the multiplications in convolution with additions, using the negative L1 distance between the input features and the filters as the output response.
However, all of the aforementioned methods were originally applied to CNNs. We aim to adapt these techniques to Transformer-based models.
- Multi: $Y_{L,D_{out}} = X_{L,D_{in}} \cdot W_{D_{in},D_{out}} + b_{D_{out}}$, in which $y_{l,d}=\sum\limits_{i=1}^{D_{in}} x_{l,i} \times w_{i,d}$
- Adder: $Y_{L,D_{out}} = X_{L,D_{in}} \bigoplus W_{D_{in},D_{out}} + b_{D_{out}}$, in which $y_{l,d}=-\sum\limits_{i=1}^{D_{in}} | x_{l,i} - w_{i,d} |$
- Euclid: $Y_{L,D_{out}} = X_{L,D_{in}} \bigotimes W_{D_{in},D_{out}} + b_{D_{out}}$, in which $y_{l,d}=-\sum\limits_{i=1}^{D_{in}} ( x_{l,i} - w_{i,d} )^2$
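For concreteness, the three operations can be written as plain PyTorch functions. This is a minimal sketch that ignores the bias term $b$ and any CUDA-level optimization; shapes follow the formulas above, with $X$ of shape $L \times D_{in}$ and $W$ of shape $D_{in} \times D_{out}$.

```python
import torch

def multi(x, w):
    """Multi: y[l, d] = sum_i x[l, i] * w[i, d] (ordinary matrix multiplication)."""
    return x @ w

def adder(x, w):
    """Adder: y[l, d] = -sum_i |x[l, i] - w[i, d]| (negative L1 distance)."""
    return -(x.unsqueeze(-1) - w.unsqueeze(0)).abs().sum(dim=1)

def euclid(x, w):
    """Euclid: y[l, d] = -sum_i (x[l, i] - w[i, d])^2 (negative squared L2 distance)."""
    return -(x.unsqueeze(-1) - w.unsqueeze(0)).pow(2).sum(dim=1)

x = torch.randn(196, 384)   # L x D_in
w = torch.randn(384, 384)   # D_in x D_out
print(multi(x, w).shape, adder(x, w).shape, euclid(x, w).shape)  # all (196, 384)
```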
We compared these methods within a Simple_ViT model. The Multi and Euclid operations are differentiable, so backpropagation can be applied directly. The Adder operation, however, runs into trouble during backpropagation because its exact derivative is a sign function. Following the AdderNet approach, we mitigated this by substituting the full-precision gradient for the sign gradient (for details, refer to AdderNet).
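As an illustration of this gradient substitution, below is a minimal, unoptimized sketch of an Adder linear layer written as a custom `torch.autograd.Function`. The full-precision weight gradient and the HardTanh-clipped input gradient follow the AdderNet paper; materializing the full $(L, D_{in}, D_{out})$ difference tensor is for readability only and is far from memory-efficient.

```python
import torch

class AdderLinearFn(torch.autograd.Function):
    """y[l, d] = -sum_i |x[l, i] - w[i, d]|, with AdderNet-style gradients:
    full-precision (x - w) for the weight, HardTanh-clipped for the input."""

    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        # (L, D_in, 1) - (1, D_in, D_out) -> (L, D_in, D_out), reduce over D_in
        return -(x.unsqueeze(-1) - w.unsqueeze(0)).abs().sum(dim=1)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        diff = x.unsqueeze(-1) - w.unsqueeze(0)            # x - w, shape (L, D_in, D_out)
        # dY/dw ~ (x - w): full-precision gradient instead of sign(x - w)
        grad_w = (grad_out.unsqueeze(1) * diff).sum(dim=0)
        # dY/dx ~ HardTanh(w - x): clipped to avoid gradient magnitude explosion
        grad_x = (grad_out.unsqueeze(1) * (-diff).clamp(-1, 1)).sum(dim=-1)
        return grad_x, grad_w
```

It can then be called as `AdderLinearFn.apply(x, w)` wherever `x @ w` would normally be used.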
However, we encountered a training failure on large datasets, as depicted in the chart below: both training and validation accuracies declined. It is essential to note that the backpropagation method described above treats the gradients of the Adder operation for the input $X$ and the weight $W$ differently (clipped versus full-precision).
We therefore modified the backpropagation method used for the Adder operation.
By adopting this method, AdderViT can be trained properly on all of the datasets we tried. (We also experimented with sign derivatives, but they resulted in a loss of accuracy.)
However, there is another serious problem: the derivative of the Adder operation depends on both the input $X$ and the weight $W$, so the backward pass requires far more computation and memory than the simple gradients of matrix multiplication.
Methods | train time | inference time | train memory |
---|---|---|---|
Multi | | | |
Adder | | | |
optim-Adder | | | |
We also thought about how radial distances (distances under a given norm) behave in space: all points on the same level surface around a weight vector $W$ map to the same output value. When many inputs fall near the same surface, their outputs become nearly identical, causing a loss of feature diversity.
We aim to apply a transformation to the input that stretches the shape of the distribution in space, thereby mitigating this issue. Note that this transformation is a point-wise multiplication, so it introduces little extra computation.
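As one possible illustration of such a stretch, a learnable per-channel scale applied before the Adder operation could look like the following sketch; this particular parameterization is an assumption rather than the exact transformation used here.

```python
import torch
import torch.nn as nn

class StretchedAdder(nn.Module):
    """Hypothetical sketch: stretch the input with a learnable per-channel scale
    (a point-wise multiplication) before the Adder operation, so that inputs that
    would otherwise fall on the same distance surface are spread apart."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim_in))            # point-wise stretch factors
        self.weight = nn.Parameter(torch.randn(dim_in, dim_out) * 0.02)

    def forward(self, x):                                        # x: (L, D_in)
        x = x * self.scale                                       # cheap elementwise multiply
        return -(x.unsqueeze(-1) - self.weight.unsqueeze(0)).abs().sum(dim=1)
```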
Residual connections are a very common operation. The basic idea is that if a layer cannot learn anything useful, it can effectively be discarded: the information from the previous layer is passed on directly, without loss. With the Adder operation, however, some information is always added to the feature map, even when it is useless or harmful.
The essence of this issue lies in the inability to find a weight $W$ for which the Adder operation becomes an identity mapping, the way $W = I$ (with $b = 0$) does for matrix multiplication.
We introduced another soft-mask transformation, denoted as $F$, to selectively mask some of the output values. To achieve an identity mapping, one can simply set all elements of $F$ so that the Adder output is fully masked, leaving only the residual path to pass the input through unchanged.
We use a dedicated function for forward propagation and corresponding formulas for backpropagation of this soft-masked Adder layer.
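As a rough sketch of the idea (assuming $F$ is a learnable sigmoid gate that multiplies the Adder output before the residual addition; the actual forward and backward formulas may differ):

```python
import torch
import torch.nn as nn

class SoftMaskedAdder(nn.Module):
    """Hypothetical sketch of a soft-masked Adder layer with a residual path.
    F = sigmoid(mask_logits) gates the Adder output; driving F toward 0 masks the
    Adder branch entirely, and the layer reduces to an identity mapping."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.mask_logits = nn.Parameter(torch.zeros(dim))        # F = sigmoid(mask_logits)

    def forward(self, x):                                        # x: (L, D)
        y = -(x.unsqueeze(-1) - self.weight.unsqueeze(0)).abs().sum(dim=1)
        f = torch.sigmoid(self.mask_logits)                      # soft mask in (0, 1)
        return f * y + x                                         # masked Adder branch + residual
```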
Unfortunately, this AdderViT currently encounters other unidentified issues and does not yield satisfactory results. It is possible that the limited expressive range of the Adder operation is simply not sufficient to match multiplication in ViT models.
In this group of experiments, we used weak data augmentation, which causes Multi to overfit; with proper data augmentation, Multi may actually be better than Adder.
qkv
: Replace the linear transformation to_qkv with the Adder operation (see the sketch after this list).
1ffn
: Replace 2 layers of feed-forward with 1 layer of the Adder operation.
2ffn
: Replace 2 layers of feed-forward with 2 layers of the Adder operation.
adder
: Implement the original Adder operation from AdderNet for both FFN and to_qkv.
multi
: The control group using the original multiplication operation.
noupdtm
: Excludes adaptive learning rate from AdderNet.
euclider
: Utilizes the Euclid operation (negative squared L2 distance) instead of Adder.
wopout
: Attention without the project_out layer.
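For concreteness, a rough sketch of the `qkv` variant is shown below, reusing the `AdderLinearFn` sketch from earlier to replace the `to_qkv` projection inside a vit-pytorch-style attention block; the head configuration and the choice to keep `project_out` as an ordinary linear layer here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdderLinear(nn.Module):
    """Drop-in replacement for a bias-free nn.Linear, using the AdderLinearFn sketch above."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim_in, dim_out) * 0.02)

    def forward(self, x):                              # x: (..., dim_in)
        lead = x.shape[:-1]
        out = AdderLinearFn.apply(x.reshape(-1, x.shape[-1]), self.weight)
        return out.reshape(*lead, -1)

class AdderAttention(nn.Module):
    """Attention block where only the to_qkv projection uses the Adder operation ("qkv" variant)."""
    def __init__(self, dim, heads=8, dim_head=64):
        super().__init__()
        self.heads, self.scale = heads, dim_head ** -0.5
        inner = heads * dim_head
        self.to_qkv = AdderLinear(dim, inner * 3)      # Adder instead of nn.Linear
        self.project_out = nn.Linear(inner, dim)       # kept as an ordinary linear layer here

    def forward(self, x):                              # x: (batch, tokens, dim)
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.project_out(out)
```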
Due to the absence of optimization techniques tailored to the Adder operation, training is considerably slow. As a result, the model was trained on ImageNet for only 25 epochs, during which it exhibited a substantial performance gap: addition performed significantly worse than multiplication.
Across all models, Adder consistently lags behind Multi by a substantial margin.
Due to the training challenges caused by the lack of Adder-specific optimization, together with other unidentified issues, we regrettably had to halt this work.
- The three problems mentioned above and their possible solutions:
  - Incorrect backpropagation and a huge amount of computation.
  - Diversity loss caused by the L1 distance.
  - Invalid residual connections.
- We ran a series of experiments to test some of our ideas.
- Adder is more difficult to train than Multi and can easily lead to unstable training.
- There are still unknown obstacles to applying addition in fully connected layers or attention, especially on large datasets.
We leveraged the pycuda package to optimize the new addition operation, which proved to be faster than performing it directly in Python, although not significantly.
We made use of the adder_cuda package, a CUDA optimization of AdderNet. Note that you may encounter installation issues with this package, so patience is advised.
We remain optimistic about the potential of the Adder operation as an alternative to traditional matrix multiplication. If you have any research insights on this topic or questions, please don't hesitate to reach out to us.