Vision Model Implementations

This repository contains a growing collection of computer vision models implemented in PyTorch. I am mostly focusing on recent-ish models, especially those based on attention. My goal is to write each implementation myself from the paper, while also learning tricks and checking my work against existing implementations.

Models

  • MLP Mixer: An all-MLP architecture from Google Research, which shows that you don't need convolutions or attention to get good performance on computer vision tasks.
  • Vision Transformer: Attention-based architecture, adapted from the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The authors split images into patches and feed the resulting sequence into a transformer encoder (see the patch-embedding sketch after this list).
  • Masked Autoencoder Vision Transformer: Using the Vision Transformer architecture above, the authors of "Masked Autoencoders Are Scalable Vision Learners" adopt a self-supervised pretraining objective similar to masked language modeling (Devlin et al., 2018): a large fraction of image patches is masked out before encoding, and a lightweight decoder must reconstruct the missing patches.
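
All three models start from the same step: slicing an image into non-overlapping patches and projecting each patch to a token. The sketch below is illustrative only (the module and dimension names are mine, not the code in this repo) and uses the einops rearrange trick referenced further down.

```python
# Illustrative patch-embedding sketch (hypothetical names, not this repo's code).
import torch
import torch.nn as nn
from einops import rearrange

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=4, in_channels=3, dim=256):
        super().__init__()
        self.patch_size = patch_size
        # Each flattened patch (patch_size * patch_size * channels values)
        # is projected to a `dim`-dimensional token.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)

    def forward(self, images):
        # images: (batch, channels, height, width)
        patches = rearrange(
            images,
            "b c (h p1) (w p2) -> b (h w) (p1 p2 c)",
            p1=self.patch_size,
            p2=self.patch_size,
        )
        return self.proj(patches)  # (batch, num_patches, dim)

x = torch.randn(8, 3, 32, 32)   # e.g. a CIFAR-sized batch
tokens = PatchEmbedding()(x)
print(tokens.shape)             # torch.Size([8, 64, 256])
```

From here the models diverge: ViT adds position embeddings and runs the token sequence through a transformer encoder, MLP Mixer mixes the tokens with MLPs instead of attention, and MAE masks most of the patches before encoding and reconstructs them with a lightweight decoder.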

Data & Training

My initial tests for these models are still in progress and involve training them on CIFAR-100. I will release code and results from these experiments soon.
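
For reference, a minimal sketch of the kind of CIFAR-100 input pipeline such an experiment could use via torchvision (hypothetical batch size and augmentations; this is not the released training code):

```python
# Hypothetical CIFAR-100 data pipeline sketch using torchvision.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Standard light augmentation for 32x32 images.
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR100(
    root="./data", train=True, download=True, transform=transform
)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
```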

References

Papers Implemented

  • MLP-Mixer: An All-MLP Architecture for Vision (Tolstikhin et al.)
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al.)
  • Masked Autoencoders Are Scalable Vision Learners (He et al.)

Code References

  • Andrej Karpathy's mingpt: Referenced for some tricks related to implementation of multi-head attention.
  • Einops Documentation: Referenced for more multi-head attention tricks, namely using Einstein notation to rearrange tensors.
  • Phil Wang's ViT repository: Referenced for the Vision Transformer and masked autoencoder, along with modifications/improvements proposed later in the literature, including replacing the CLS token with average pooling after the final transformer block. I borrowed the elegant approach of wrapping the attention and FFN blocks in a "PreNorm" layer that handles normalization, tweaking it slightly to also apply the residual connection; this makes the transformer block implementation much cleaner (see the sketch after this list).
  • Google Research MLP Mixer & ViT implementations: Referenced for my MLP Mixer and Vision Transformer implementations.
  • Facebook Research MAE implementation: Referenced for my Masked Autoencoder implementation.
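
A rough sketch of the PreNorm-plus-residual wrapper described in the Phil Wang reference above (illustrative; the class name is mine, not the repo's):

```python
# Illustrative PreNorm + residual wrapper (hypothetical name, not this repo's code).
import torch.nn as nn

class PreNormResidual(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn  # e.g. an attention block or a feed-forward block

    def forward(self, x, **kwargs):
        # Normalize, apply the wrapped block, then add the residual connection.
        return self.fn(self.norm(x), **kwargs) + x
```

With this wrapper, a transformer block reduces to composing two such layers, one around the attention block and one around the feed-forward network.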
