Yuxin Fang<sup>2,1</sup>, Quan Sun<sup>1</sup>, Xinggang Wang<sup>2</sup>, Tiejun Huang<sup>1</sup>, Xinlong Wang<sup>1</sup>, Yue Cao<sup>1</sup>
We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling.
With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets.
Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0% fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, EVA-02-CLIP reaches up to 80.4% zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 the parameters and ~1/6 the image-text training data.
We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance.
We hope our efforts enable a broader range of the research community to advance the field in a more efficient, affordable and equitable manner.
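To make the pre-training objective above concrete, here is a minimal PyTorch-style sketch of MIM with a frozen CLIP teacher: masked patch features predicted by the student are regressed onto the corresponding CLIP vision features. The function name, tensor shapes, and the cosine-similarity loss are illustrative assumptions, not the exact EVA-02 implementation.

```python
import torch
import torch.nn.functional as F

def mim_feature_distillation_loss(student_feats, teacher_feats, mask):
    """Regress masked student patch features onto frozen CLIP teacher features.

    student_feats: (B, N, D) patch features predicted by the student ViT
    teacher_feats: (B, N, D) patch features from a frozen CLIP vision encoder
    mask:          (B, N) boolean mask, True where the patch was masked out

    Shapes and names are illustrative; the actual EVA-02 code may differ.
    """
    # Normalize both sides so the objective is a cosine-similarity regression.
    s = F.normalize(student_feats[mask], dim=-1)
    t = F.normalize(teacher_feats[mask], dim=-1)
    # Negative cosine similarity, averaged over all masked patches.
    return -(s * t).sum(dim=-1).mean()

# Toy usage with random tensors standing in for real model outputs.
B, N, D = 2, 196, 1024
student_feats = torch.randn(B, N, D, requires_grad=True)
with torch.no_grad():
    teacher_feats = torch.randn(B, N, D)   # frozen CLIP teacher output
mask = torch.rand(B, N) < 0.4              # ~40% of patches masked out
loss = mim_feature_distillation_loss(student_feats, teacher_feats, mask)
loss.backward()
```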
- Pre-training
- Image Classification
- Object Detection & Instance Segmentation
- Semantic Segmentation
- CLIP
- If you would like to use / fine-tune EVA-02 in your project, please start with a shorter schedule & a smaller learning rate (compared with the baseline setting).
- Using EVA-02 as a feature extractor: #56.
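For reference, below is a minimal sketch of using EVA-02 as a frozen feature extractor via timm. The model name `eva02_base_patch14_224.mim_in22k` is an assumption; check the timm model registry or issue #56 for the exact checkpoints you need.

```python
import timm
import torch

# Create an EVA-02 backbone without a classification head (num_classes=0),
# so forward() returns pooled features. The model name below is illustrative;
# list available variants with timm.list_models('eva02*').
model = timm.create_model('eva02_base_patch14_224.mim_in22k',
                          pretrained=True, num_classes=0).eval()

# For real images, build the matching preprocessing with
# timm.data.create_transform(**timm.data.resolve_model_data_config(model)).
dummy = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    feats = model(dummy)            # (1, feature_dim) pooled features
print(feats.shape)
```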
@article{eva02,
title={EVA-02: A Visual Representation for Neon Genesis},
author={Fang, Yuxin and Sun, Quan and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
journal={Image and Vision Computing},
pages={105171},
year={2024},
publisher={Elsevier}
}
EVA-02 is built upon EVA-01, BEiT, BEiTv2, CLIP, MAE, timm, DeepSpeed, Apex, xFormers, detectron2, mmcv, mmdet, mmseg, ViT-Adapter, detrex, and rotary-embedding-torch.
For help, issues, or bug reports related to EVA-02, please open a GitHub issue with the label EVA-02. Let's build a better & stronger EVA-02 together :)
We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns. If you are interested in working with us on foundation models, self-supervised learning, and multimodal learning, please contact Yue Cao ([email protected]) and Xinlong Wang ([email protected]).