We provide a thorough evaluation of EVA-CLIP on 35 popular zero-shot benchmarks: 27 image classification benchmarks, 4 video classification benchmarks, and 2 retrieval benchmarks evaluated in both text-to-image and image-to-text directions (2×2). The evaluation testbed is heavily based on CLIP Benchmark. Thanks for their awesome work.
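As a reference point, the sketch below scores a single zero-shot classification benchmark with `open_clip`. The model tag, dataset, and prompt template are illustrative placeholders, not the exact testbed configuration, which follows the CLIP Benchmark protocol:

```python
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

# Illustrative model tag; swap in the checkpoint you want to evaluate.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

dataset = CIFAR10(root="./data", train=False, download=True, transform=preprocess)
loader = DataLoader(dataset, batch_size=256, num_workers=4)

# Build the zero-shot classifier from class-name prompts.
prompts = [f"a photo of a {c}" for c in dataset.classes]
with torch.no_grad():
    text_feat = model.encode_text(tokenizer(prompts))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    correct, total = 0, 0
    for images, labels in loader:
        img_feat = model.encode_image(images)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        preds = (img_feat @ text_feat.T).argmax(dim=-1)  # nearest text prompt
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"zero-shot acc@1: {100 * correct / total:.2f}")
```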
| model | model size | precision | training data | samples seen | avg. acc. |
|---|---|---|---|---|---|
| OpenAI CLIP-L | 430M | fp16 | WIT-400M | 12B | 69.18 |
| Open CLIP-H | 1.0B | pytorch amp bf16 | LAION-2B | 32B | 72.39 |
| Open CLIP-g | 1.3B | pytorch amp bf16 | LAION-2B | 12B | 70.74 |
| EVA CLIP-g | 1.1B | deepspeed fp16 | LAION-400M | 11B | 71.43 |
EVA-CLIP shows very promising sample efficiency, and we believe that scaling up the training data can further boost its performance.
| Dataset (27 in total) | Model | acc@1 | acc@5 | mean_per_class_recall |
|---|---|---|---|---|
| ImageNet-1K | OpenAI CLIP-L | 75.55 | 94.57 | 75.55 |
| | Open CLIP-H | 77.96 | 95.23 | 77.93 |
| | Open CLIP-g | 76.65 | 94.84 | 76.66 |
| | EVA CLIP-g | 78.53 | 95.51 | 78.51 |
| ImageNet-Adversarial | OpenAI CLIP-L | 70.76 | 90.76 | 67.88 |
| | Open CLIP-H | 59.33 | 85.64 | 58.18 |
| | Open CLIP-g | 57.19 | 83.41 | 56.55 |
| | EVA CLIP-g | 73.59 | 90.93 | 69.97 |
| ImageNet-Rendition | OpenAI CLIP-L | 87.83 | 97.11 | 86.44 |
| | Open CLIP-H | 89.33 | 97.36 | 88.1 |
| | Open CLIP-g | 88.69 | 96.96 | 87.51 |
| | EVA CLIP-g | 92.5 | 98.24 | 91.19 |
| ImageNet-Sketch | OpenAI CLIP-L | 59.58 | 84.25 | 59.61 |
| | Open CLIP-H | 66.58 | 88.12 | 66.57 |
| | Open CLIP-g | 65.17 | 87.46 | 65.21 |
| | EVA CLIP-g | 67.31 | 89.07 | 67.31 |
| ImageNet-V2 | OpenAI CLIP-L | 69.86 | 90.91 | 69.85 |
| | Open CLIP-H | 70.87 | 91.67 | 70.92 |
| | Open CLIP-g | 69.56 | 90.86 | 69.61 |
| | EVA CLIP-g | 71.52 | 92.11 | 71.56 |
| ObjectNet | OpenAI CLIP-L | 68.98 | 88.06 | 67.37 |
| | Open CLIP-H | 69.71 | 87.74 | 68.45 |
| | Open CLIP-g | 67.53 | 86.7 | 66.56 |
| | EVA CLIP-g | 72.33 | 89.37 | 70.88 |
| SUN397 | OpenAI CLIP-L | 67.57 | 93.69 | 68.3 |
| | Open CLIP-H | 75.2 | 96.08 | 75.15 |
| | Open CLIP-g | 75.41 | 96.17 | 75.28 |
| | EVA CLIP-g | 74.15 | 95.52 | 73.27 |
| VOC2007 | OpenAI CLIP-L | 78.27 | 96.88 | 86.45 |
| | Open CLIP-H | 77.68 | 94.22 | 84.97 |
| | Open CLIP-g | 81.07 | 96.57 | 85.75 |
| | EVA CLIP-g | 83.23 | 96.94 | 88.7 |
| Birdsnap | OpenAI CLIP-L | 40.52 | 64.81 | 40.12 |
| | Open CLIP-H | 52.92 | 73.47 | 52.91 |
| | Open CLIP-g | 48.68 | 71.1 | 48.61 |
| | EVA CLIP-g | 50 | 70.97 | 50.07 |
| Caltech101 | OpenAI CLIP-L | 86.67 | 96.85 | 93.27 |
| | Open CLIP-H | 88.24 | 88.24 | 94.56 |
| | Open CLIP-g | 88.21 | 97.24 | 94.13 |
| | EVA CLIP-g | 87.72 | 95.68 | 94.81 |
| Stanford Cars | OpenAI CLIP-L | 77.86 | 98.4 | 77.84 |
| | Open CLIP-H | 93.4 | 99.89 | 93.41 |
| | Open CLIP-g | 92.92 | 99.88 | 93.45 |
| | EVA CLIP-g | 91.71 | 99.76 | 91.66 |
| CIFAR10 | OpenAI CLIP-L | 95.6 | 99.63 | 95.6 |
| | Open CLIP-H | 97.45 | 99.93 | 97.45 |
| | Open CLIP-g | 97.05 | 99.93 | 97.06 |
| | EVA CLIP-g | 98.31 | 99.96 | 98.29 |
| CIFAR100 | OpenAI CLIP-L | 75.81 | 92.76 | 75.81 |
| | Open CLIP-H | 84.73 | 97.34 | 84.73 |
| | Open CLIP-g | 83.91 | 97.31 | 83.92 |
| | EVA CLIP-g | 88.66 | 88.71 | 88.65 |
| Country211 | OpenAI CLIP-L | 31.86 | 59.36 | 31.87 |
| | Open CLIP-H | 29.88 | 55.76 | 29.86 |
| | Open CLIP-g | 28.8 | 54.24 | 28.82 |
| | EVA CLIP-g | 28.63 | 55.37 | 28.64 |
| Describable Textures | OpenAI CLIP-L | 55.43 | 84.15 | 55.48 |
| | Open CLIP-H | 67.82 | 92.45 | 67.82 |
| | Open CLIP-g | 68.03 | 92.39 | 68.09 |
| | EVA CLIP-g | 61.33 | 87.5 | 61.38 |
| EuroSAT | OpenAI CLIP-L | 62.4 | 95.14 | 63.72 |
| | Open CLIP-H | 72.7 | 95.09 | 72.91 |
| | Open CLIP-g | 63.22 | 98.1 | 63.71 |
| | EVA CLIP-g | 73.57 | 98.75 | 74.39 |
| Facial Emotion Recognition 2013 | OpenAI CLIP-L | 49.89 | 97.23 | 49.33 |
| | Open CLIP-H | 52.01 | 96.55 | 50.68 |
| | Open CLIP-g | 47.16 | 94.6 | 48.44 |
| | EVA CLIP-g | 52.17 | 94.79 | 48.57 |
| FGVC Aircraft | OpenAI CLIP-L | 31.44 | 78.04 | 31.48 |
| | Open CLIP-H | 42.75 | 83.74 | 42.65 |
| | Open CLIP-g | 37.71 | 79.9 | 37.61 |
| | EVA CLIP-g | 32.37 | 73.75 | 32.29 |
| Oxford Flowers 102 | OpenAI CLIP-L | 79.2 | 92.16 | 79.32 |
| | Open CLIP-H | 80.11 | 92.91 | 79.92 |
| | Open CLIP-g | 77.41 | 90.6 | 77.92 |
| | EVA CLIP-g | 74.47 | 90.65 | 74.29 |
| Food101 | OpenAI CLIP-L | 93.05 | 99.3 | 93.06 |
| | Open CLIP-H | 92.74 | 99.22 | 92.73 |
| | Open CLIP-g | 91.55 | 99.07 | 91.55 |
| | FLIP-L | 89.3 | - | - |
| | EVA CLIP-g | 93.46 | 99.32 | 93.46 |
| GTSRB | OpenAI CLIP-L | 50.55 | 76.08 | 43.96 |
| | Open CLIP-H | 58.36 | 82.1 | 54.32 |
| | Open CLIP-g | 49.8 | 76.88 | 46.77 |
| | EVA CLIP-g | 49.12 | 84.56 | 47.08 |
| MNIST | OpenAI CLIP-L | 76.35 | 93.53 | 75.88 |
| | Open CLIP-H | 72.86 | 94.21 | 73.65 |
| | Open CLIP-g | 68.57 | 95.15 | 68.97 |
| | EVA CLIP-g | 62.34 | 90.81 | 62.35 |
| Oxford-IIIT Pets | OpenAI CLIP-L | 93.49 | 99.78 | 93.45 |
| | Open CLIP-H | 94.55 | 99.86 | 94.51 |
| | Open CLIP-g | 94.36 | 99.81 | 94.35 |
| | EVA CLIP-g | 94.22 | 99.86 | 94.2 |
| STL10 | OpenAI CLIP-L | 99.36 | 100 | 99.36 |
| | Open CLIP-H | 98.48 | 99.99 | 98.48 |
| | Open CLIP-g | 98.65 | 99.98 | 98.68 |
| | EVA CLIP-g | 98.89 | 100 | 98.89 |
| RESISC45 | OpenAI CLIP-L | 64.64 | 93.21 | 64.68 |
| | Open CLIP-H | 70.54 | 96.03 | 70.55 |
| | Open CLIP-g | 72.5 | 96.12 | 72.54 |
| | EVA CLIP-g | 70.3 | 94.66 | 70.3 |
| PatchCamelyon | OpenAI CLIP-L | 51.98 | - | 51.97 |
| | Open CLIP-H | 54.24 | - | 54.22 |
| | Open CLIP-g | 56.11 | - | 56.11 |
| | EVA CLIP-g | 49.88 | - | 49.86 |
| Rendered SST2 | OpenAI CLIP-L | 68.86 | - | 68.88 |
| | Open CLIP-H | 64.25 | - | 64.27 |
| | Open CLIP-g | 64.14 | - | 64.16 |
| | EVA CLIP-g | 58.38 | - | 58.41 |
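The metrics reported above (acc@1, acc@5, mean_per_class_recall) can be computed from the image-to-text similarity logits as in the minimal sketch below; the function and variable names are our own, not part of the testbed:

```python
import torch

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true class is among the k highest logits."""
    topk = logits.topk(k, dim=-1).indices                 # (N, k)
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

def mean_per_class_recall(logits, labels):
    """Average of per-class acc@1; robust to class imbalance."""
    preds = logits.argmax(dim=-1)
    recalls = []
    for c in labels.unique():
        mask = labels == c
        recalls.append((preds[mask] == c).float().mean())
    return torch.stack(recalls).mean().item()

# Toy example with random logits for a 10-class problem.
logits = torch.randn(1000, 10)
labels = torch.randint(0, 10, (1000,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
print(mean_per_class_recall(logits, labels))
```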
| Dataset | Model | acc@1 | acc@5 | mean(acc@1, acc@5) |
|---|---|---|---|---|
| UCF101 | OpenAI CLIP-L | 76.39 | 94.86 | 85.63 |
| | Open CLIP-H | 78.16 | 95.02 | 86.59 |
| | Open CLIP-g | 77.73 | 94.98 | 86.36 |
| | EVA CLIP-g | 76.05 | 93.64 | 84.84 |
| Kinetics400 | OpenAI CLIP-L | 52.88 | 76.06 | 64.47 |
| | Open CLIP-H | 51.63 | 74.49 | 63.06 |
| | Open CLIP-g | 50.35 | 73.03 | 61.69 |
| | EVA CLIP-g | 54.04 | 76.42 | 65.23 |
| Kinetics600 | OpenAI CLIP-L | 52.41 | 76 | 64.21 |
| | Open CLIP-H | 52.25 | 74.92 | 63.58 |
| | Open CLIP-g | 50.79 | 73.53 | 62.16 |
| | EVA CLIP-g | 52.76 | 75.99 | 64.38 |
| Kinetics700 | OpenAI CLIP-L | 45.73 | 69.63 | 57.68 |
| | Open CLIP-H | 44.64 | 67.54 | 56.09 |
| | Open CLIP-g | 43.6 | 66.39 | 54.99 |
| | EVA CLIP-g | 46.65 | 70.16 | 58.4 |
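For the video benchmarks, a common zero-shot protocol (an assumption on our part; the testbed's exact frame sampling and prompt setup may differ) is to encode several frames, mean-pool the normalized frame features over time, and classify the pooled feature against class-prompt text embeddings. A sketch, reusing the `model` / `tokenizer` objects from the classification example above:

```python
import torch

@torch.no_grad()
def classify_video(model, tokenizer, frames, class_prompts):
    """frames: (T, 3, H, W) tensor of preprocessed frames from one clip.
    Encodes each frame, mean-pools the normalized features over time,
    and scores the pooled feature against class-prompt text embeddings."""
    feat = model.encode_image(frames)                   # (T, D)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    video = feat.mean(dim=0, keepdim=True)              # temporal mean pool
    video = video / video.norm(dim=-1, keepdim=True)
    text = model.encode_text(tokenizer(class_prompts))  # (C, D)
    text = text / text.norm(dim=-1, keepdim=True)
    return (video @ text.T).argmax(dim=-1)              # predicted class index
```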
T2I = text-to-image retrieval, I2T = image-to-text retrieval.

| Dataset | Model | T2I R@1 | T2I R@5 | T2I R@10 | I2T R@1 | I2T R@5 | I2T R@10 |
|---|---|---|---|---|---|---|---|
| Flickr30k | OpenAI CLIP-L | 65.18 | 87.28 | 92 | 85.2 | 97.3 | 99 |
| | Open CLIP-H | 77.78 | 94.14 | 96.62 | 90.8 | 99.3 | 99.7 |
| | Open CLIP-g | 76.52 | 93.62 | 96.28 | 90.8 | 99.1 | 99.8 |
| | EVA CLIP-g | 72.64 | 91.6 | 95.12 | 88.3 | 98.3 | 99.3 |
| MSCOCO | OpenAI CLIP-L | 36.51 | 61.01 | 71.11 | 56.34 | 79.32 | 86.66 |
| | Open CLIP-H | 49.47 | 73.4 | 81.53 | 65.96 | 86.06 | 91.9 |
| | Open CLIP-g | 47.99 | 72.37 | 80.75 | 64.96 | 85.3 | 91.46 |
| | EVA CLIP-g | 44.07 | 68.5 | 77.33 | 61.76 | 83.28 | 89.96 |
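R@k for both retrieval directions can be derived from the image-text similarity matrix. The sketch below assumes a one-to-one pairing of images and captions for simplicity; Flickr30k and MSCOCO actually provide multiple captions per image, in which case the standard protocol counts a hit if any ground-truth caption or image appears in the top-k:

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim: (queries, candidates) similarity matrix where the ground-truth
    candidate for query i sits at index i (one-to-one pairing assumed).
    Returns the fraction of queries whose match is ranked in the top-k."""
    topk = sim.topk(k, dim=-1).indices                    # (Q, k)
    gt = torch.arange(sim.size(0)).unsqueeze(-1)          # (Q, 1)
    return (topk == gt).any(dim=-1).float().mean().item()

# Toy example with random, L2-normalized features.
img_feat = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
txt_feat = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
sim_t2i = txt_feat @ img_feat.T    # text queries against image candidates
for k in (1, 5, 10):
    print(f"T2I R@{k}: {recall_at_k(sim_t2i, k):.4f}  "
          f"I2T R@{k}: {recall_at_k(sim_t2i.T, k):.4f}")
```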
The zero-shot retrieval performance of EVA-CLIP is relatively inferior to that of its Open CLIP-H / -g counterparts. We speculate there are two main reasons:

- The language tower of EVA-CLIP is much smaller and weaker than that of Open CLIP-H / -g (124M vs. 354M parameters), and retrieval tasks depend more on the capacity of the language branch than classification tasks do.
- Retrieval tasks seem to benefit more from the size of the training dataset (LAION-2B used by Open CLIP), while we only leverage LAION-400M for EVA-CLIP training.

Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and the training data to improve retrieval performance.
- Updates (Feb 2023): We are training an improved version, EVA-CLIP+ (WIP), which currently achieves ~79.5 zero-shot top-1 accuracy on IN-1K and outperforms the previous best CLIP by ~0.5% in zero-shot retrieval. We will update the details soon and release the full EVA-CLIP+ suite in the future. Please stay tuned.