license |
---|
mit |
Yuxin Fang2,1, Wen Wang3,1, Binhui Xie4,1, Quan Sun1, Ledell Wu1, Xinggang Wang2, Tiejun Huang1, Xinlong Wang1, Yue Cao1
We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data and academic resources. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features (i.e., CLIP features) conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks.
EVA is the first open-sourced billion-scale vision foundation model that achieves state-of-the-art performance on a broad range of downstream tasks.
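The pretext task described above can be illustrated with a small, dependency-free sketch. It is not the training code from this repo: the choice of negative cosine similarity as the regression loss, the uniform random masking, the 0.5 mask ratio, and all function names (`random_patch_mask`, `masked_feature_loss`) are assumptions made for illustration, and the toy vectors merely stand in for CLIP features.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean list with True at the patches hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    hidden = set(rng.sample(range(num_patches), num_masked))
    return [i in hidden for i in range(num_patches)]


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity, computed on masked patches only."""
    hidden = [i for i, is_masked in enumerate(mask) if is_masked]
    if not hidden:
        raise ValueError("mask selects no patches")
    total = sum(cosine_similarity(predicted[i], target[i]) for i in hidden)
    return -total / len(hidden)


# Toy data: 6 patches with 4-dimensional target features (stand-ins for CLIP features).
rng = random.Random(0)
target = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
mask = random_patch_mask(num_patches=6, mask_ratio=0.5, rng=rng)

perfect = [row[:] for row in target]               # prediction equal to the target
opposite = [[-x for x in row] for row in target]   # prediction pointing the other way

print(masked_feature_loss(perfect, target, mask))   # close to -1.0
print(masked_feature_loss(opposite, target, mask))  # close to +1.0
```

Only the masked positions contribute to the loss, so the model is scored on features it had to infer from the visible patches rather than on features it could simply copy.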
Table of Contents
- Image Classification
- Video Classification
- Object Detection & Instance Segmentation
- Semantic Segmentation
- EVA-CLIP
- Citation
- License
- Contact
We provide all pre-trained & fine-tuned EVAs for the community. The following table summarizes the basic statistics of MIM pre-trained EVA and image classification EVA.
model name | #param. | pre-training epochs on merged-30M | intermediate fine-tuning epochs on IN-21K | fine-tuning epochs on IN-1K | IN-1K top-1 acc. | weight |
---|---|---|---|---|---|---|
eva_psz14 | 1.0B | 150 | - | - | - | 🤗 HF link (2GB) |
eva_psz14to16 | 1.0B | 150 | - | - | - | 🤗 HF link (2GB) |
eva_21k_224px_psz14 | 1.0B | 150 | 60 | - | - | 🤗 HF link (2GB) |
eva_21k_1k_336px_psz14_ema | 1.0B | 150 | 60 | 10 | 89.6 | 🤗 HF link (4GB) |
eva_21k_1k_560px_psz14_ema | 1.0B | 150 | 60 | 15 | 89.7 | 🤗 HF link (4GB) |
- The `eva_psz14to16` model interpolates the kernel size of `patch_embed` from `14x14` to `16x16`. This is useful for object detection, instance segmentation & semantic segmentation, etc. See `interpolate_patch_14to16.py` for implementation details.
- For MIM pre-trained EVA and EVA-CLIP, we use `deepspeed` `fp16` format. IN-1K fine-tuned EVA weights are larger (`4GB` vs. `2GB`) because EMA updates models in `fp32` format. The weights for other downstream tasks are also in `fp32` format.
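The 2GB vs. 4GB difference follows directly from the storage width of each format. A rough back-of-the-envelope sketch, assuming the checkpoint holds only the raw weights (no optimizer state or metadata) and treating the rounded "1.0B" figure as exact; the helper name `checkpoint_size_gb` is made up for this example:

```python
# Bytes used to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}


def checkpoint_size_gb(num_params, precision):
    """Approximate size in GB (10**9 bytes) of the raw weights alone."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


num_params = 1.0e9  # "1.0B" from the table above, a rounded figure

print(checkpoint_size_gb(num_params, "fp16"))  # 2.0
print(checkpoint_size_gb(num_params, "fp32"))  # 4.0
```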
dataset | model name | init. weight | acc@1 | config | weight | logs |
---|---|---|---|---|---|---|
Kinetics722 | eva_video_k722 | eva_psz14 | - | config | 🤗 HF link (4.8GB) | ft_k722 |
Kinetics400 | eva_video_k400 | eva_video_k722 | 89.7 | config | 🤗 HF link (4.8GB) | ft_k400 |
Kinetics600 | eva_video_k600 | eva_video_k722 | 89.8 | config | 🤗 HF link (4.8GB) | ft_k600 |
Kinetics700 | eva_video_k700 | eva_video_k722 | 82.9 | config | 🤗 HF link (4.8GB) | ft_k700 |
model name | #param. | pre-training iterations on Objects365 | weight |
---|---|---|---|
eva_o365 | 1.1B | 380k | 🤗 HF link (4GB) |
init. model weight | batch size | iter | AP box | AP mask | config | model weight |
---|---|---|---|---|---|---|
eva_o365 | 64 | 35k | 64.2 | 53.9 | config | 🤗 HF link (4GB) |
eva_o365 | 64 | 45k | 63.9 | 55.0 | config | 🤗 HF link (4GB) |
init. model weight | batch size | iter | AP box | AP mask | config | model weight |
---|---|---|---|---|---|---|
eva_o365 | 64 | 75k | 62.2 | 55.0 | config | 🤗 HF link (4GB) |
init. model weight | batch size | iter | crop size | mIoU (ss) | config | seg model weight | logs |
---|---|---|---|---|---|---|---|
eva_psz14to16 | 32 | 60k | 896 | 53.4 | config | 🤗 HF link | training / evaluation |
init. model weight | batch size | iter | crop size | mIoU | config | seg model weight | logs |
---|---|---|---|---|---|---|---|
eva_sem_seg_coco | 64 | 20k | 896 | 61.5 (ss) / 62.3 (ms) | config | 🤗 HF link | training / evaluation |
model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
---|---|---|---|---|---|---|
eva_clip_psz14 | 1.3B | fp16 | LAION-400M | 41K | 78.5 | 🤗 HF link (2GB) |
The ImageNet-1K zero-shot classification performance is higher than reported in our paper (`78.5` vs. `78.2`) because of longer training.
We choose to train a 1.3B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glimpse of the challenges in training very large CLIP models.
To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model evaluated via zero-shot classification performance. We will update the results in our paper soon. For more details on EVA-CLIP, please refer to Section 2.3.5 of our paper.
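For readers unfamiliar with the metric, zero-shot classification with a CLIP-style model can be sketched as follows: each class name is embedded by the text encoder, the image is embedded by the vision encoder, and the class with the highest cosine similarity is predicted. The sketch below uses hand-written 3-dimensional toy embeddings in place of real encoder outputs and does not use the EVA-CLIP API; the names `zero_shot_classify` and `top1_accuracy` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class whose text embedding is most similar to the image embedding."""
    image = normalize(image_embedding)
    scores = {}
    for name, text_embedding in class_text_embeddings.items():
        text = normalize(text_embedding)
        scores[name] = sum(a * b for a, b in zip(image, text))
    return max(scores, key=scores.get)


def top1_accuracy(predictions, labels):
    """Percentage of samples whose predicted class matches the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy stand-ins for text-encoder outputs, one per class name.
class_text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}

# Toy stand-ins for image-encoder outputs, paired with ground-truth labels.
samples = [
    ([0.8, 0.2, 0.1], "cat"),
    ([0.2, 0.7, 0.1], "dog"),
    ([0.6, 0.5, 0.1], "dog"),  # closer to "cat", so this one is misclassified
    ([0.1, 0.0, 0.9], "car"),
]

predictions = [zero_shot_classify(image, class_text_embeddings) for image, _ in samples]
labels = [label for _, label in samples]

print(predictions)                         # ['cat', 'dog', 'cat', 'car']
print(top1_accuracy(predictions, labels))  # 75.0
```

No classifier is trained on the target dataset; the class text embeddings act as the classifier weights, which is why the metric is called zero-shot.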
We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc.
If you find our work helpful, please star this repo and cite the related articles. Thanks for your support!
@article{EVA,
title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
journal={arXiv preprint arXiv:2211.07636},
year={2022}
}
The content of this project itself is licensed under the MIT License.
For help or issues using EVA, please open a GitHub issue.
We are hiring at all levels at BAAI Vision Team, including full-time researchers, engineers and interns.
If you are interested in working with us on foundation models, self-supervised learning and multimodal learning, please contact Yue Cao ([email protected]) and Xinlong Wang ([email protected]).