GuanRunwei/Awesome-Vision-Transformer-Collection

Awesome Vision Transformer Collection

Variants of Vision Transformer and Vision Transformer for Downstream Tasks

author: Runwei Guan

affiliation: University of Liverpool / JITRI-Institute of Deep Perception Technology

email: [email protected] / [email protected] / [email protected]

Image Backbone

  • Vision Transformer paper code
  • Swin Transformer paper code
  • Swin Transformer V2: Scaling Up Capacity and Resolution paper code
  • DVT paper code
  • PVT paper code
  • Lite Vision Transformer: LVT paper
  • PiT paper code
  • Twins paper code
  • TNT paper code
  • Mobile-ViT paper code
  • Cross-ViT paper code
  • LeViT paper code
  • ViT-Lite paper
  • Refiner paper code
  • DeepViT paper code
  • CaiT paper code
  • LV-ViT paper code
  • DeiT paper code
  • CeiT paper code
  • BoTNet paper
  • ViTAE paper
  • Visformer: The Vision-Friendly Transformer paper code
  • Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training paper
  • AdaViT: Adaptive Tokens for Efficient Vision Transformer paper
  • Improved Multiscale Vision Transformers for Classification and Detection paper
  • Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding paper
  • Towards End-to-End Image Compression and Analysis with Transformers paper
  • MPViT: Multi-Path Vision Transformer for Dense Prediction paper
  • PolyViT: Co-training Vision Transformers on Images, Videos and Audio paper
  • MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation paper
  • ELSA: Enhanced Local Self-Attention for Vision Transformer paper
  • Vision Transformer for Small-Size Datasets paper
  • SimViT: Exploring a Simple Vision Transformer with sliding windows paper
  • SPViT: Enabling Faster Vision Transformers via Soft Token Pruning paper
  • Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space paper
  • Vision Transformer with Deformable Attention paper code
  • PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture paper
  • QuadTree Attention for Vision Transformers paper code
  • TerViT: An Efficient Ternary Vision Transformer paper
  • BViT: Broad Attention based Vision Transformer paper
  • CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction paper
  • EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers paper
  • Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention paper
  • Coarse-to-Fine Vision Transformer paper
  • ViT-P: Rethinking Data-efficient Vision Transformers from Locality paper
  • Event Transformer paper
  • DaViT: Dual Attention Vision Transformers paper
  • LightViT: Towards Light-Weight Convolution-Free Vision Transformers paper
  • UniNet: Unified Architecture Search with Convolution, Transformer, and MLP paper
  • Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning paper
  • EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications paper

Multi-label Classification

  • Graph Attention Transformer Network for Multi-Label Image Classification paper

Point Cloud Processing

  • Point Cloud Transformer paper
  • Point Transformer paper
  • Fast Point Transformer paper
  • Adaptive Channel Encoding Transformer for Point Cloud Analysis paper
  • PTTR: Relational 3D Point Cloud Object Tracking with Transformer paper
  • Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction paper
  • LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling paper
  • Geometric Transformer for Fast and Robust Point Cloud Registration paper
  • HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud paper

Video Processing

  • Video Transformers: A Survey paper
  • ViViT: A Video Vision Transformer paper
  • Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos paper
  • LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach paper
  • Video Joint Modelling Based on Hierarchical Transformer for Co-summarization paper
  • InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer paper
  • TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers paper
  • UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning paper
  • Multiview Transformers for Video Recognition paper
  • MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition paper
  • Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval paper
  • A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection paper
  • Learning Trajectory-Aware Transformer for Video Super-Resolution paper
  • Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer paper

Model Compression

  • A Unified Pruning Framework for Vision Transformers paper
  • Multi-Dimensional Model Compression of Vision Transformer paper
  • Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression paper

Transfer Learning & Pretraining

  • Pre-Trained Image Processing Transformer paper code
  • UP-DETR: Unsupervised Pre-training for Object Detection with Transformers paper code
  • BEVT: BERT Pretraining of Video Transformers paper
  • Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text paper
  • On Efficient Transformer and Image Pre-training for Low-level Vision paper
  • Pre-Training Transformers for Domain Adaptation paper
  • RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training paper
  • Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification paper
  • DiT: Self-supervised Pre-training for Document Image Transformer paper
  • Underwater Image Enhancement Using Pre-trained Transformer paper

Multi-Modal

  • Multi-Modal Fusion Transformer for End-to-End Autonomous Driving paper
  • Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval paper
  • LAVT: Language-Aware Vision Transformer for Referring Image Segmentation paper
  • MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection paper
  • Visual-Semantic Transformer for Scene Text Recognition paper
  • Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text paper
  • LaTr: Layout-Aware Transformer for Scene-Text VQA paper
  • Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding paper
  • Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation paper
  • Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg paper
  • On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering paper
  • DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers paper
  • CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers paper
  • VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer paper
  • Knowledge Amalgamation for Object Detection with Transformers paper
  • Are Multimodal Transformers Robust to Missing Modality? paper
  • Self-supervised Vision Transformers for Joint SAR-optical Representation Learning paper
  • Video Graph Transformer for Video Question Answering paper

Detection

  • YOLOS: You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection paper code
  • WB-DETR: Transformer-Based Detector without Backbone paper
  • UP-DETR: Unsupervised Pre-training for Object Detection with Transformers paper
  • TSP: Rethinking Transformer-based Set Prediction for Object Detection paper
  • DETR paper code
  • Deformable DETR paper code
  • DN-DETR: Accelerate DETR Training by Introducing Query DeNoising paper code
  • End-to-End Object Detection with Adaptive Clustering Transformer paper
  • An End-to-End Transformer Model for 3D Object Detection paper
  • End-to-End Human Object Interaction Detection with HOI Transformer paper code
  • Adaptive Image Transformer for One-Shot Object Detection paper
  • Improving 3D Object Detection With Channel-Wise Transformer paper
  • TransPose: Keypoint Localization via Transformer paper
  • Voxel Transformer for 3D Object Detection paper
  • Embracing Single Stride 3D Object Detector with Sparse Transformer paper
  • OW-DETR: Open-world Detection Transformer paper
  • A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation paper
  • Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence paper
  • Short Range Correlation Transformer for Occluded Person Re-Identification paper
  • TransVPR: Transformer-based place recognition with multi-level attention aggregation paper
  • Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond paper
  • Arbitrary Shape Text Detection using Transformers paper
  • A high-precision underwater object detection based on joint self-supervised deblurring and improved spatial transformer network paper
  • A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection paper
  • Knowledge Amalgamation for Object Detection with Transformers paper
  • SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection paper
  • POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition paper
  • PSTR: End-to-End One-Step Person Search With Transformers paper
  • Scaling Novel Object Detection with Weakly Supervised Detection Transformers paper
  • OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers paper
  • Exploring Plain Vision Transformer Backbones for Object Detection paper

Segmentation

  • Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation paper code
  • Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention paper code
  • MaX-DeepLab: End-to-End Panoptic Segmentation With Mask Transformers paper code
  • Line Segment Detection Using Transformers without Edges paper
  • VisTR: End-to-End Video Instance Segmentation with Transformers paper code
  • SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers paper code
  • Segmenter: Transformer for Semantic Segmentation paper
  • Fully Transformer Networks for Semantic Image Segmentation paper
  • SOTR: Segmenting Objects with Transformers paper code
  • GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation paper
  • Masked-attention Mask Transformer for Universal Image Segmentation paper
  • A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation paper
  • iSegFormer: Interactive Image Segmentation with Transformers paper
  • SOIT: Segmenting Objects with Instance-Aware Transformers paper
  • SeMask: Semantically Masked Transformers for Semantic Segmentation paper
  • Siamese Network with Interactive Transformer for Video Object Segmentation paper
  • Pyramid Fusion Transformer for Semantic Segmentation paper
  • Swin transformers make strong contextual encoders for VHR image road extraction paper
  • Transformers in Action: Weakly Supervised Action Segmentation paper
  • Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation paper
  • Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers paper
  • Contextual Attention Network: Transformer Meets U-Net paper
  • TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation paper

Pose Estimation

  • Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation paper
  • HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation paper
  • End-to-End Human Pose and Mesh Reconstruction with Transformers paper code
  • PE-former: Pose Estimation Transformer paper
  • Pose Recognition with Cascade Transformers paper code
  • Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer code
  • Geometry-Contrastive Transformer for Generalized 3D Pose Transfer paper
  • Temporal Transformer Networks with Self-Supervision for Action Recognition paper
  • Co-training Transformer with Videos and Images Improves Action Recognition paper
  • DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer paper
  • Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition paper
  • Motion-Aware Transformer For Occluded Person Re-identification paper
  • HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders paper
  • ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers paper
  • Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding paper
  • Spatial Transformer Network on Skeleton-based Gait Recognition paper

Tracking and Trajectory Prediction

  • Transformer Tracking paper code
  • Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking paper code
  • MOTR: End-to-End Multiple-Object Tracking with TRansformer paper code
  • SwinTrack: A Simple and Strong Baseline for Transformer Tracking paper
  • Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network paper
  • PTTR: Relational 3D Point Cloud Object Tracking with Transformer paper
  • Efficient Visual Tracking with Exemplar Transformers paper
  • TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer paper

Generative Model and Denoising

  • 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds paper
  • Spatial-Temporal Transformer for Dynamic Scene Graph Generation paper
  • THUNDR: Transformer-Based 3D Human Reconstruction With Markers paper
  • DoodleFormer: Creative Sketch Drawing with Transformers paper
  • Image Transformer paper
  • Taming Transformers for High-Resolution Image Synthesis paper code
  • TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up code
  • U2-Former: A Nested U-shaped Transformer for Image Restoration paper
  • Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers paper
  • SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers paper
  • StyleSwin: Transformer-based GAN for High-resolution Image Generation paper
  • Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction paper
  • SGTR: End-to-end Scene Graph Generation with Transformer paper
  • Flow-Guided Sparse Transformer for Video Deblurring paper
  • Spherical Transformer paper
  • MaskGIT: Masked Generative Image Transformer paper
  • Entroformer: A Transformer-based Entropy Model for Learned Image Compression paper
  • UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation paper
  • Stripformer: Strip Transformer for Fast Image Deblurring paper
  • Vision Transformers for Single Image Dehazing paper
  • Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer paper

Self-Supervised Learning

  • Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning paper code
  • iGPT paper code
  • An Empirical Study of Training Self-Supervised Vision Transformers paper code
  • Self-supervised Video Transformer paper
  • TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning paper
  • TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning paper
  • Transformers in Action: Weakly Supervised Action Segmentation paper
  • Motion-Aware Transformer For Occluded Person Re-identification paper
  • Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics paper
  • Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut paper
  • Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers paper
  • Multi-class Token Transformer for Weakly Supervised Semantic Segmentation paper
  • Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers paper
  • DiT: Self-supervised Pre-training for Document Image Transformer paper
  • Self-supervised Vision Transformers for Joint SAR-optical Representation Learning paper
  • DILEMMA: Self-Supervised Shape and Texture Learning with Transformers paper

Depth and Height Estimation

  • Disentangled Latent Transformer for Interpretable Monocular Height Estimation paper
  • Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics paper
  • SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification paper

Explainable

  • Development and testing of an image transformer for explainable autonomous driving systems paper
  • Transformer Interpretability Beyond Attention Visualization paper code
  • How Do Vision Transformers Work? paper
  • eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation paper

Robustness

  • Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding paper

Deep Reinforcement Learning

  • Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels paper

Calibration

  • CTRL-C: Camera Calibration TRansformer With Line-Classification paper code

Radar

  • Learning class prototypes from Synthetic InSAR with Vision Transformers paper
  • Radar Transformer paper

Traffic

  • SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers paper

AI Medicine

  • Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer paper
  • 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis paper
  • Hformer: Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks paper
  • MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification paper
  • MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer paper
  • Generalized Wasserstein Dice Loss, Test-time Augmentation, and Transformers for the BraTS 2021 challenge paper
  • D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation paper
  • RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark paper
  • Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images paper
  • Swin Transformer for Fast MRI paper code
  • Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are? paper
  • ViTBIS: Vision Transformer for Biomedical Image Segmentation paper
  • SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation paper
  • Improving Across-Dataset Brain Tissue Segmentation Using Transformer paper
  • Brain Cancer Survival Prediction on Treatment-naive MRI using Deep Anchor Attention Learning with Vision Transformer paper
  • Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers paper
  • AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation paper
  • Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification paper
  • Characterizing Renal Structures with 3D Block Aggregate Transformers paper
  • Multimodal Transformer for Nursing Activity Recognition paper
  • RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment paper
  • Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays paper

Hardware

  • VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer paper