This repository contains a growing list of computer vision models implemented in PyTorch. I am mostly focusing on recent-ish models, especially those based on attention. My goal is to write my own implementations from the papers, while also picking up tricks and checking my work against existing implementations.
- MLP Mixer: All-MLP architecture from a Google Research paper showing that you don't need convolutions or attention to get good performance on computer vision tasks; mixing is done by alternating token-mixing and channel-mixing MLPs (see the Mixer block sketch after this list).
- Vision Transformer: Attention-based architecture, adapted from the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The authors split images into patches and feed the resulting patch sequence into a transformer encoder (see the patch-embedding sketch after this list).
- Masked Autoencoder Vision Transformer: Using the Vision Transformer architecture above, the authors of "Masked Autoencoders Are Scalable Vision Learners" adopt a self-supervised pretraining objective similar to masked language modeling (Devlin et al., 2018): a large fraction of image patches is masked out before the sequence is fed to the transformer encoder, and a lightweight decoder must reconstruct the masked patches (see the masking sketch after this list).
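
For reference, here is a minimal sketch of a single Mixer block as I understand it from the paper: a token-mixing MLP applied across patches, followed by a channel-mixing MLP applied across features. Module and hyperparameter names are illustrative, not the repository's actual code.

```python
import torch
from torch import nn


class MixerBlock(nn.Module):
    """One Mixer block: token-mixing MLP across patches, then channel-mixing MLP across features."""

    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing MLP acts on the patch dimension (applied independently per channel).
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_patches),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel-mixing MLP acts on the feature dimension (applied independently per patch).
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x):                          # x: (batch, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)  # token mixing + residual
        x = x + self.channel_mlp(self.norm2(x))    # channel mixing + residual
        return x


out = MixerBlock(num_patches=64, dim=192)(torch.randn(8, 64, 192))
```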
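Similarly, a minimal sketch of the patch-to-sequence step for the Vision Transformer, assuming a small CIFAR-sized configuration. Names and sizes are illustrative, and the encoder here is PyTorch's stock `nn.TransformerEncoder` rather than this repository's implementation.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch to an embedding."""

    def __init__(self, image_size=32, patch_size=4, in_channels=3, dim=192):
        super().__init__()
        assert image_size % patch_size == 0
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, images):                # images: (batch, 3, H, W)
        x = self.proj(images)                 # (batch, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)      # (batch, num_patches, dim)
        return x + self.pos_embed


# The resulting patch sequence can be fed straight into a standard transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True), num_layers=6
)
tokens = PatchEmbedding()(torch.randn(8, 3, 32, 32))  # (8, 64, 192)
out = encoder(tokens)
```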
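And a sketch of the per-sample random masking step the masked autoencoder relies on: keep a random subset of patch tokens and ask the decoder to reconstruct the rest. The `random_masking` helper, masking ratio, and shapes are illustrative; the actual encoder and decoder are omitted.

```python
import torch


def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample; return kept tokens, mask, and restore order."""
    batch, num_patches, dim = tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    noise = torch.rand(batch, num_patches, device=tokens.device)  # random score per patch
    shuffle = noise.argsort(dim=1)                                # patches with low scores are kept
    restore = shuffle.argsort(dim=1)                              # inverse permutation

    keep_idx = shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))

    # Binary mask in the original patch order: 1 = masked (to be reconstructed), 0 = visible.
    mask = torch.ones(batch, num_patches, device=tokens.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, restore)
    return visible, mask, restore


visible, mask, restore = random_masking(torch.randn(8, 64, 192))
# The encoder sees only `visible`; the decoder appends mask tokens and unshuffles them with `restore`.
```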
My initial tests of these models are still in progress and involve training them on CIFAR-100. I will release code and results from these experiments soon.
- MLP Mixer: Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., ... & Dosovitskiy, A. (2021). MLP-Mixer: An All-MLP Architecture for Vision. Advances in Neural Information Processing Systems, 34, 24261-24272.
- Vision Transformer: Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Masked Autoencoder: He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000-16009).
- Andrej Karpathy's minGPT: Referenced for some tricks related to the implementation of multi-head attention.
- Einops Documentation: Referenced for more multi-head attention tricks, namely Einstein-notation tensor rearranging (see the attention sketch after this list).
- Phil Wang's ViT repository: Referenced for the Vision Transformer and masked autoencoder, as well as modifications proposed in later literature, such as replacing the CLS token with average pooling after the final transformer block. I borrowed the elegant approach of wrapping the attention and FFN blocks in a "PreNorm" layer that handles normalization, tweaking it slightly to also include the residual connection; this makes the transformer block implementation much cleaner (see the block sketch after this list).
- Google Research MLP Mixer & ViT implementations: Referenced for my MLP Mixer and Vision Transformer implementations.
- Facebook Research MAE implementation: Referenced for my Masked Autoencoder implementation.
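
To illustrate the einops trick mentioned above, here is a rough sketch of multi-head self-attention where splitting and merging heads is done with a single `rearrange` each way. Class and parameter names are illustrative, not taken from any of the referenced repositories.

```python
import torch
from einops import rearrange
from torch import nn


class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention where head splitting/merging is expressed with einops.rearrange."""

    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (batch, tokens, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # One rearrange per tensor replaces the usual view/permute dance.
        q, k, v = (rearrange(t, "b n (h d) -> b h n d", h=self.heads) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = rearrange(attn @ v, "b h n d -> b n (h d)")    # merge heads back together
        return self.to_out(out)


out = MultiHeadSelfAttention()(torch.randn(8, 64, 192))
```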
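And a sketch of the PreNorm-plus-residual wrapper idea described above. The module names here differ from both my code and Phil Wang's repository, so treat this as an illustration of the pattern rather than either implementation.

```python
import torch
from torch import nn


class PreNormResidual(nn.Module):
    """Wraps a sub-layer as x -> x + fn(LayerNorm(x)), so blocks need no explicit norms or residuals."""

    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x):
        return x + self.fn(self.norm(x))


class SelfAttention(nn.Module):
    """Thin wrapper so nn.MultiheadAttention can be dropped into PreNormResidual."""

    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class TransformerBlock(nn.Module):
    """Attention and FFN become one-liners once normalization and residuals live in the wrapper."""

    def __init__(self, dim=192, heads=3, mlp_dim=384):
        super().__init__()
        self.attn = PreNormResidual(dim, SelfAttention(dim, heads))
        self.ffn = PreNormResidual(
            dim, nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
        )

    def forward(self, x):
        return self.ffn(self.attn(x))


out = TransformerBlock()(torch.randn(8, 64, 192))
```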