ViT Without Flattening

An experiment in replacing the flattening operation of ViT with convolutional layers.

The inputs to the ViT are 2-D feature maps rather than 1-D vectors. This differs from the approach in the paper "CvT: Introducing Convolutions to Vision Transformers".
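As a rough illustration of the difference, the sketch below compares the shape a standard ViT produces when it flattens patches with the shape a convolutional stem produces when the 2-D layout is kept. It is not the code in this repo: the function names are hypothetical, and the image size (224), patch size (16), and channel count (768) are illustrative assumptions, not this project's settings.

```python
def flattened_patch_tokens(height, width, channels, patch):
    """Shape of the token sequence in a standard ViT: (num_tokens, token_dim)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    num_tokens = (height // patch) * (width // patch)
    token_dim = patch * patch * channels  # each patch is flattened to a 1-D vector
    return num_tokens, token_dim


def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_feature_map(height, width, out_channels, kernel, stride, padding=0):
    """Shape of a conv stem's output, kept as a 2-D map: (channels, rows, cols)."""
    return (
        out_channels,
        conv_output_size(height, kernel, stride, padding),
        conv_output_size(width, kernel, stride, padding),
    )


# Illustrative values only (assumed, not taken from this repo).
tokens = flattened_patch_tokens(224, 224, 3, patch=16)
feature_map = conv_feature_map(224, 224, out_channels=768, kernel=16, stride=16)
print(tokens)       # (196, 768)
print(feature_map)  # (768, 14, 14)
```

Both paths carry the same number of spatial positions (196 = 14 x 14). The flattened version stores them as a sequence, while the convolutional version keeps the row and column layout.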

My ViT performed poorly on my small dataset (3k train, 1k test). A CNN preserves the 2-D structure of the image and performs well on this same dataset, so I want to preserve the 2-D structure of the ViT's input as well.


Current Progress: Fixing the gradient issue
