History of computer vision architectures. A focus on Classification, Segmentation and Object detection networks.

Paper Date Description
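Neocognitron 1979 A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position
ConvNet 1989 Used back-propagation to learn the convolution kernel coefficients directly from images of handwritten digits.
LeNet December 1998 Applied a convolutional network trained end to end with back-propagation to document recognition, alternating convolution and subsampling layers before fully connected layers.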
Alex Net September 2012 Introduced ReLU activation and Dropout to CNNs. Winner ILSVRC 2012.
ZFNet 2013 ZFNet is a classic convolutional neural network whose design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to AlexNet, the filter sizes and the stride of the convolutions are reduced.
GoogLeNet 2014 A 22-layer deep network built from Inception modules, submitted to ILSVRC 2014 and evaluated on classification and detection.
VGG September 2014 Stacked many small (3×3) convolution filters to build deeper networks that learn complex features. Placed first in localization and second in classification at ILSVRC 2014.
Inception Net September 2014 Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales.
HighwayNet 2015 Introduced a new architecture designed to ease gradient-based training of very deep networks
Inception Net v2 / Inception Net v3 December 2015 Design Optimizations of the Inception Modules which improved performance and accuracy.
Res Net December 2015 Introduced residual connections, which are shortcuts that bypass one or more layers in the network. Winner ILSVRC 2015.
Inception Net v4 / Inception ResNet February 2016 Hybrid approach combining Inception Net and ResNet.
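Dense Net August 2016 Each layer receives input from all the previous layers, creating a dense network of connections between the layers and allowing the network to reuse features and learn more diverse ones.
DarkNet 2016 A family of convolutional backbones for the YOLO detectors: Darknet-19 is the backbone of YOLOv2 and Darknet-53 is the backbone of YOLOv3.
Xception October 2016 Based on InceptionV3 but uses depthwise separable convolutions instead of Inception modules.
Res Next November 2016 Built over ResNet, uses grouped convolutions, where the filters in a convolutional layer are divided into multiple groups, and treats the number of groups (cardinality) as a design dimension alongside depth and width.
FractalNet 2017 Builds ultra-deep networks without residual connections by repeatedly applying a simple expansion rule, producing a self-similar (fractal) layout of parallel paths of different lengths. Presented as an alternative to ResNet.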
Capsule Networks 2017 Proposed to improve the performance of CNNs, especially in terms of spatial hierarchies and rotation invariance.
WideResNet 2017 Decreases the depth and increases the width (number of channels) of residual blocks, showing that wide, shallower residual networks can match or outperform very deep, thin ones.
PolyNet 2017 Introduces PolyInception modules, polynomial compositions of Inception units, to increase structural diversity in very deep networks rather than only adding depth or width.
Pyramidal Net 2017 A PyramidNet is a type of convolutional network where the key idea is to concentrate on the feature map dimension by increasing it gradually instead of by increasing it sharply at each residual unit with downsampling. In addition, the network architecture works as a mixture of both plain and residual networks by using zero-padded identity-mapping shortcut connections when increasing the feature map dimension.
Squeeze and Excitation Nets 2017 Focus on the channel relationship and propose a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. These blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
Mobile Net V1 April 2017 Uses depthwise separable convolutions to reduce the number of parameters and computation required.
CMPE-SE 2018 Competitive squeeze and excitation networks
RAN 2018 Residual Attention Network. Built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as the layers go deeper.
CB-CNN 2018 Channel Boosted CNN. Channel boosting exploits both the channel dimension of a CNN (learning from multiple input channels) and transfer learning (TL). TL is used at two different stages: channel generation and channel exploitation.
CBAM 2018 Convolutional Block Attention Module, a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement.
Mobile Net V2 January 2018 Built upon the MobileNetv1 architecture, uses inverted residuals and linear bottlenecks.
Mobile Net V3 May 2019 Combines hardware-aware network architecture search (NAS) with the NetAdapt algorithm, and adds squeeze-and-excitation blocks and the hard-swish activation.
Efficient Net May 2019 Uses a compound scaling method to scale the network's depth, width, and resolution to achieve a high accuracy with a relatively low computational cost.
NoisyStudent 2020 Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, an EfficientNet model is first trained on labeled images and then used as a teacher to generate pseudo labels for 300M unlabeled images.
Vision Transformer October 2020 Images are split into fixed-size patches, which are treated as tokens; the sequence of linear embeddings of these patches is input to a Transformer.
SwAV 2020 Self-supervised learning approach for image classification
ResNeSt 2020 Split-attention networks. Adds a split-attention block, which applies channel-wise attention across groups of feature maps, to a ResNet-style architecture.
DeiT December 2020 A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens.
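Swin Transformer March 2021 A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision.
CaiT 2021 Class-Attention in Image Transformers. Makes deeper vision transformers trainable with LayerScale and adds class-attention layers that process the class token separately from the patch tokens.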
T2T-ViT 2021 Improved transformer-based vision models with token-to-token vision transformers.
TNT 2021 Transformer in Transformer architecture for better hierarchical feature learning
BEiT June 2021 Uses a masked image modeling task inspired by BERT, in which image patches are masked and the model predicts the corresponding visual tokens, to pretrain vision Transformers.
MobileViT October 2021 A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs.
Masked AutoEncoder November 2021 An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision.
CoAtNet 2021 CoAtNets (Convolution and Self-Attention Networks) are hybrid models that stack convolution stages followed by self-attention stages.
ConvNeXt 2021 A design that adopts a transformer-like architecture while being a convolutional network. It improves upon the designs of earlier CNNs.
NFNet 2021 High-Performance Large-Scale Image Recognition Without Normalization
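MLP-Mixer 2021 Replaces convolutions and self-attention with mixer layers built only from MLPs, alternating token-mixing and channel-mixing.
gMLP 2021 An MLP-based architecture with gating, in which a spatial gating unit mixes information across tokens without self-attention.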
Conv Mixer January 2022 Processes image patches using standard convolutions for mixing spatial and channel dimensions.
MViT 2022 A multiview vision transformer, designed for processing videos, providing a way to integrate information from different frames efficiently.
Shuffle Transformer 2022 Combined shuffle units with transformer blocks for efficient processing
BEiT 2022 Introduces a BERT-style pre-training approach for image recognition, using masked image modeling.
CrossViT 2022 A dual-branch vision transformer that processes image patches of two different sizes and fuses the branches with cross-attention.
Masked Autoencoders (MAE) 2022 A self-supervised learning method where the model learns to reconstruct images from partial inputs, improving efficiency and performance.
RegNet 2020 Introduced a design space exploration approach to neural network architecture search, producing efficient and high-performing models for image classification and other tasks.
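
Several entries above (Xception, Mobile Net V1) rely on depthwise separable convolutions to cut parameters. The sketch below is only an illustration of that arithmetic, not code from any of the papers: the function names and the example layer sizes (3×3 kernel, 128 input channels, 256 output channels) are assumptions chosen for the example, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = sep / std  # equals 1 / c_out + 1 / k**2

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.3f}")
```

For a 3×3 kernel the ratio approaches 1/9 as the number of output channels grows, which is where most of the saving comes from.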

Object Detection

Paper Date Description
RCNN November 2013 Uses selective search for region proposals, CNNs for feature extraction, SVM for classification followed by box offset regression.
SPPNet 2014 Spatial Pyramid Pooling Network.
Fast RCNN April 2015 Processes entire image through CNN, employs RoI Pooling to extract feature vectors from ROIs, followed by classification and BBox regression.
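Faster RCNN June 2015 A region proposal network (RPN) and a Fast R-CNN detector collaboratively predict object regions by sharing convolutional features.
YOLOv1 2015 You Only Look Once V1. Frames detection as a single regression problem, predicting bounding boxes and class probabilities from the full image in one pass.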
SSD December 2015 Discretizes bounding box outputs over a span of various scales and aspect ratios per feature map.
RFCN 2016 Region-based Fully Convolutional Networks.
YOLOv2 2016 You Only Look Once V2.
Feature Pyramid Network December 2016 Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids.
Mask RCNN March 2017 Extends Faster R-CNN to solve instance segmentation tasks, by adding a branch for predicting an object mask in parallel with the existing branch.
Focal Loss August 2017 Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples.
RetinaNet 2017 A one-stage object detection model that utilizes a focal loss function to address class imbalance during training.
Cascade RCNN 2018 A multi-stage object detection architecture, the Cascade R-CNN, consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector.
YOLOv3 2018 You Only Look Once V3.
EfficientDet 2019 Systematically studies design choices for detector architectures and proposes a weighted bi-directional feature pyramid network (BiFPN) together with a compound scaling method for the backbone, feature network, and prediction networks.
CenterNet 2019 Builds on the keypoint-based detector CornerNet and detects each object as a triplet of keypoints (a center and two corners) rather than a pair, exploring the visual patterns within each cropped region at minimal cost.
DETR 2020 Detection Transformer (End-to-End Object Detection with Transformers). Views object detection as a direct set prediction problem.
YOLOv4 2020 You Only Look Once V4.
YOLOv5 2020 You Only Look Once V5.
YOLOv6 2022 You Only Look Once V6.
YOLOv7 2022 You Only Look Once V7.
YOLOv8 2023 You Only Look Once V8.
YOLO-NAS 2023 A YOLO architecture found through neural architecture search, reported to offer a strong accuracy and latency tradeoff for object detection.
RT-DETR 2023 A cutting-edge end-to-end object detector that provides real-time performance while maintaining high accuracy. It leverages the power of Vision Transformers (ViT) to efficiently process multiscale features by decoupling intra-scale interaction and cross-scale fusion. RT-DETR is highly adaptable, supporting flexible adjustment of inference speed using different decoder layers without retraining. The model excels on accelerated backends like CUDA with TensorRT, outperforming many other real-time object detectors.
SAM 2023 The Segment Anything Model, or SAM, is a cutting-edge image segmentation model that allows for promptable segmentation, providing unparalleled versatility in image analysis tasks. SAM forms the heart of the Segment Anything initiative, a groundbreaking project that introduces a novel model, task, and dataset for image segmentation.
Fast-SAM 2023 FastSAM is designed to address the limitations of the Segment Anything Model (SAM), a heavy Transformer model with substantial computational resource requirements. The FastSAM decouples the segment anything task into two sequential stages: all-instance segmentation and prompt-guided selection. The first stage uses YOLOv8-seg to produce the segmentation masks of all instances in the image. In the second stage, it outputs the region-of-interest corresponding to the prompt.
Mobile-SAM 2023 Mobile Segment Anything (MobileSAM).
YOLOv9 2024 You Only Look Once V9.
YOLO-World 2024 YOLO-World tackles the challenges faced by traditional Open-Vocabulary detection models, which often rely on cumbersome Transformer models requiring extensive computational resources. These models' dependence on pre-defined object categories also restricts their utility in dynamic scenarios. YOLO-World revitalizes the YOLOv8 framework with open-vocabulary detection capabilities, employing vision-language modeling and pre-training on expansive datasets to excel at identifying a broad array of objects in zero-shot scenarios with unmatched efficiency.
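
The Focal Loss and RetinaNet entries above describe down-weighting the loss of well-classified examples. A minimal sketch of the binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), might look like the following. The function names are hypothetical, the defaults γ = 2 and α = 0.25 are the values commonly quoted from the paper, and the input probability is assumed to lie strictly between 0 and 1.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction (sketch; assumes 0 < p < 1)."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # well-classified positive: loss is strongly down-weighted
hard = focal_loss(0.1, 1)  # misclassified positive: loss stays comparatively large
print(f"easy: {easy:.5f}, hard: {hard:.5f}")
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted cross-entropy, which is a quick sanity check on the implementation.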