Apply Transformer Models to Computer Vision Tasks
For the implementation of relative position embedding, refer to:
https://theaisummer.com/positional-embeddings/: the BoT position embedding method (see BoT_Position_Embedding.png and BoT_Position_Embedding(2).png)
For the Swin Transformer position embedding, see Swin_Transformer_Position_Embedding.png.
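As a quick illustration of the Swin-style mechanism shown in that image, here is a minimal sketch (variable names are mine; in the real model the table is a learnable parameter):

```python
import torch

window_size, num_heads = 7, 4
# one bias per relative offset, per head (learnable in practice)
table = torch.zeros((2 * window_size - 1) ** 2, num_heads)

coords = torch.stack(torch.meshgrid(torch.arange(window_size),
                                    torch.arange(window_size),
                                    indexing='ij'))            # (2, Wh, Ww)
coords = coords.flatten(1)                                     # (2, N)
rel = coords[:, :, None] - coords[:, None, :]                  # (2, N, N)
rel = rel.permute(1, 2, 0).contiguous()                        # (N, N, 2)
rel[:, :, 0] += window_size - 1                                # shift to >= 0
rel[:, :, 1] += window_size - 1
rel[:, :, 0] *= 2 * window_size - 1                            # row-major index
index = rel.sum(-1)                                            # (N, N)

# gathered bias, one entry per (query, key) pair, added to attention logits
bias = table[index.view(-1)].view(window_size ** 2, window_size ** 2, num_heads)
```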
For the implementation details of the Swin Transformer, refer to:
https://zhuanlan.zhihu.com/p/361366090
One Swin Transformer block corresponds to the Swin_Transformer module, and one stage is built as:
Swin_Transformer_Stage = Swin_Transformer + Patch_Merge
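A minimal sketch of this composition (the class and argument names here are illustrative, not the repo's actual identifiers):

```python
import torch.nn as nn

class SwinStage(nn.Module):
    """One stage: a stack of Swin blocks followed by patch merging."""
    def __init__(self, blocks: nn.ModuleList, patch_merge: nn.Module):
        super().__init__()
        self.blocks = blocks            # alternating W-MSA / SW-MSA blocks
        self.patch_merge = patch_merge  # 2x spatial downsample, doubles channels

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.patch_merge(x)
```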
The Pyramid Vision Transformer v2 code is finished (2021-08-04); see PVTv2_Block.
I changed some parts of PVTv2 relative to the officially released version (https://github.com/whai362/PVT/blob/v2/classification/pvt_v2.py):
The first difference is in the patch embedding parameters:

```python
kernel_size = patch_size
stride = math.ceil(kernel_size / 2)
padding = math.floor(stride / 2)
```

According to the paper, I believe these values are the correct ones.
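A sketch of an overlapping patch embedding using these values (the class name follows the official PVTv2 code; the rest is illustrative):

```python
import math
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7):
        super().__init__()
        kernel_size = patch_size
        stride = math.ceil(kernel_size / 2)
        padding = math.floor(stride / 2)
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size,
                              stride=stride, padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H', W')
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, H'*W', embed_dim)
        return self.norm(x), H, W
```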
The second difference:
The reshape operation is slow and does not always arrange elements in the intended order, so I changed it to rearrange; matrix multiplication is performed with einsum.
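For example, a plain multi-head attention written in this style (shapes are toy values, not the repo's configuration):

```python
import torch
from einops import rearrange

B, N, heads, head_dim = 2, 196, 8, 64
q = torch.randn(B, N, heads * head_dim)
k = torch.randn(B, N, heads * head_dim)
v = torch.randn(B, N, heads * head_dim)

# rearrange makes the intended axis ordering explicit
q = rearrange(q, 'b n (h d) -> b h n d', h=heads)
k = rearrange(k, 'b n (h d) -> b h n d', h=heads)
v = rearrange(v, 'b n (h d) -> b h n d', h=heads)

# einsum replaces the usual transpose + matmul chain
attn = torch.einsum('bhid,bhjd->bhij', q, k) * head_dim ** -0.5
attn = attn.softmax(dim=-1)
out = torch.einsum('bhij,bhjd->bhid', attn, v)
out = rearrange(out, 'b h n d -> b n (h d)')
```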
The third difference:
I implemented a StageModule class, as in Swin Transformer, so that the module can be inserted into any model architecture; a sketch is given below.
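A sketch of such a PVTv2-style stage (the names and composition here are my assumptions based on the description above, not the repo's actual code):

```python
import torch.nn as nn

class StageModule(nn.Module):
    def __init__(self, patch_embed: nn.Module, blocks: nn.ModuleList):
        super().__init__()
        self.patch_embed = patch_embed   # e.g. the OverlapPatchEmbed above
        self.blocks = blocks             # e.g. a stack of PVTv2_Block

    def forward(self, x):                # x: (B, C, H, W)
        x, H, W = self.patch_embed(x)    # tokens: (B, H*W, C')
        for block in self.blocks:
            x = block(x, H, W)           # PVTv2 blocks take (H, W) for SRA
        # back to a feature map, so stages can be chained like CNN stages
        return x.transpose(1, 2).reshape(x.size(0), -1, H, W)
```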
The building block of CoTNet is implemented (paper: Contextual Transformer Networks for Visual Recognition).
I made several changes to this model (see the sketch after this list):
- I used AdaptiveAvgPool2d to replace LocalConv to reduce the computation cost; this part is similar to PVTv2.
- The original CoTNet performs no pixel-index attention: K2 = Q * V is only channel-index attention. So I added an MLP-mixer to perform pixel-index attention in the code.
- The reason for using SK attention to fuse K1 and K2 is unclear (the paper does not explain it), so I removed the SK attention and added a shortcut connection in its place.
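A sketch of a block with these three changes applied. The class name, the pooled size `pool_size`, and the exact layer shapes are my assumptions for illustration, not the repo's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoTBlockModified(nn.Module):
    def __init__(self, dim, kernel_size=3, pool_size=7):
        super().__init__()
        # static context K1: a k x k grouped convolution over the keys
        # (dim assumed divisible by 4 here)
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                      groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True))
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(dim))
        # change 1: AdaptiveAvgPool2d replaces LocalConv (similar to PVTv2)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        # channel-index attention weights from concat(K1, x)
        self.attn_embed = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=1))
        # change 2: a token-mixing MLP adds pixel-index attention
        self.pixel_mix = nn.Sequential(
            nn.Linear(pool_size ** 2, pool_size ** 2),
            nn.GELU(),
            nn.Linear(pool_size ** 2, pool_size ** 2))

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        k1 = self.key_embed(x)                       # static context K1
        v = self.pool(self.value_embed(x))           # pooled values
        attn = self.pool(self.attn_embed(torch.cat([k1, x], dim=1)))
        k2 = attn.softmax(dim=1) * v                 # channel-index attention
        k2 = self.pixel_mix(k2.flatten(2)).reshape_as(v)
        k2 = F.interpolate(k2, size=(H, W), mode='bilinear',
                           align_corners=False)
        # change 3: shortcut connection replaces the SK-attention fusion
        return k1 + k2
```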
Modifications to outlook attention (outlook attention improves fine-grained feature generation); a combined sketch follows this list:
- The output shape of the Unfold operation should be:

  ```python
  new_H = math.floor((H + 2 * self.padding - self.kernel_size + self.stride) / self.stride)
  new_W = math.floor((W + 2 * self.padding - self.kernel_size + self.stride) / self.stride)
  ```

- Conv2d(..., kernel_size=1, stride=1, padding=0, ...) is used to generate v and attn, which saves the cost of permute operations.
- LayerNorm is replaced with BatchNorm2d: since the weights of attn, v, and the MLP are generated by Conv2d, BatchNorm2d is the more appropriate normalization.
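Putting the three modifications together, a sketch (class name and defaults are illustrative; it assumes an odd kernel with padding = (kernel_size - 1) // 2, as in VOLO):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    def __init__(self, dim, num_heads=8, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.scale = (dim // num_heads) ** -0.5
        # modification 2: 1x1 Conv2d generates v and attn directly on the
        # (B, C, H, W) layout, avoiding extra permutes
        self.v = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.attn = nn.Conv2d(dim, num_heads * kernel_size ** 4,
                              kernel_size=1, stride=1, padding=0)
        self.pool = nn.AvgPool2d(stride, stride, ceil_mode=True)
        # modification 3: BatchNorm2d instead of LayerNorm, since v and attn
        # come from Conv2d
        self.norm = nn.BatchNorm2d(dim)
        self.unfold = nn.Unfold(kernel_size, padding=padding, stride=stride)

    def forward(self, x):                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        k, s, p = self.kernel_size, self.stride, self.padding
        h = self.num_heads
        # modification 1: the Unfold output shape
        new_H = math.floor((H + 2 * p - k + s) / s)
        new_W = math.floor((W + 2 * p - k + s) / s)
        N = new_H * new_W
        v = self.unfold(self.v(x))           # (B, C*k*k, N)
        v = v.reshape(B, h, C // h, k * k, N)
        attn = self.attn(self.pool(x))       # (B, h*k^4, new_H, new_W)
        attn = attn.reshape(B, h, k * k, k * k, N)
        attn = (attn * self.scale).softmax(dim=3)
        # weighted sum over each k*k window, then fold back to (H, W)
        out = torch.einsum('bhijn,bhcjn->bhcin', attn, v)
        out = F.fold(out.reshape(B, C * k * k, N), output_size=(H, W),
                     kernel_size=k, padding=p, stride=s)
        return self.norm(out)
```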
The implementation of Stand_Alone_Self_Attention (SASA) is finished!!!
The difference from the official code lies in the relative position encoding. I followed the implementation in the Swin-T code: https://github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py. However, in SASA the origin point is fixed, so the position encoding has shape (kernel_size^2,). This differs from Swin-T, where each point can be the origin, so the position encoding has shape (window_size^2, window_size^2).
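A minimal sketch of the two bias shapes (toy sizes, illustrative names). In SASA the query is the fixed window centre, so one bias per key position suffices; in Swin-T every (query, key) pair in the window needs a bias:

```python
import torch
import torch.nn as nn

kernel_size, window_size, num_heads = 7, 7, 4
sasa_bias = nn.Parameter(torch.zeros(num_heads, kernel_size ** 2))
swin_bias = nn.Parameter(torch.zeros(num_heads,
                                     window_size ** 2, window_size ** 2))

# both are simply added to the attention logits before the softmax
sasa_logits = torch.randn(1, num_heads, kernel_size ** 2) + sasa_bias
swin_logits = torch.randn(1, num_heads,
                          window_size ** 2, window_size ** 2) + swin_bias
```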
The implementation of QnA (Learned Queries for Efficient Local Attention) is finished!!!
The official QnA code is written in JAX (https://github.com/moabarar/qna); I re-implemented it in PyTorch based on the paper. It is very similar to SASA, except for how the queries are generated.
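A sketch of the learned-queries idea (all names are illustrative, not the repo's code): unlike SASA, the query is a learned parameter shared across all windows instead of being projected from the input:

```python
import torch
import torch.nn as nn

class QnALocalAttention(nn.Module):
    def __init__(self, dim, num_heads=8, kernel_size=3):
        super().__init__()
        head_dim = dim // num_heads
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.scale = head_dim ** -0.5
        # the learned query: one vector per head, shared over all windows
        self.q = nn.Parameter(torch.randn(num_heads, head_dim))
        self.kv = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W), odd kernel
        B, C, H, W = x.shape
        k, h = self.kernel_size, self.num_heads
        kv = self.unfold(self.kv(x))            # (B, 2C*k*k, H*W)
        kv = kv.reshape(B, 2, h, C // h, k * k, H * W)
        keys, vals = kv[:, 0], kv[:, 1]         # (B, h, d, k*k, H*W)
        # attention of the learned query against each local window
        attn = torch.einsum('hd,bhdjn->bhjn', self.q, keys) * self.scale
        attn = attn.softmax(dim=2)              # softmax over window positions
        out = torch.einsum('bhjn,bhdjn->bhdn', attn, vals)
        return out.reshape(B, C, H, W)
```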
I hope this repository is of some help to you, and best wishes to every researcher working in this field.
Finally, if you have any questions, please email me: