# EVA-CLIP Zero-shot Evaluation Results

We provide a thorough evaluation of EVA-CLIP on 35 popular zero-shot benchmarks: 27 image classification benchmarks, 4 video classification benchmarks, and 2 retrieval benchmarks, each evaluated in both text-to-image and image-to-text directions (2×2 retrieval tasks). The evaluation testbed is heavily based on CLIP Benchmark. Thanks for their awesome work.

## Table of Contents

- [Zero-shot Image Classification Evaluation](#zero-shot-image-classification-evaluation)
- [Zero-shot Video Action Recognition Evaluation](#zero-shot-video-action-recognition-evaluation)
- [Zero-shot Retrieval Evaluation](#zero-shot-retrieval-evaluation)

## Zero-shot Image Classification Evaluation

Averaged performance over all 27 benchmarks:

| model | model size | precision | training data | samples seen | avg. acc. |
|---|---|---|---|---|---|
| OpenAI CLIP-L | 430M | fp16 | WIT-400M | 12B | 69.18 |
| Open CLIP-H | 1.0B | pytorch amp bf16 | LAION-2B | 32B | 72.39 |
| Open CLIP-g | 1.3B | pytorch amp bf16 | LAION-2B | 12B | 70.74 |
| EVA CLIP-g | 1.1B | deepspeed fp16 | LAION-400M | 11B | 71.43 |

EVA-CLIP shows very promising sample efficiency, and we believe that scaling up the training data further can boost its performance even more.
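
For readers who want to reproduce this kind of number, the sketch below shows a minimal zero-shot classification loop in the spirit of the CLIP Benchmark testbed. It is an illustrative sketch, not the exact evaluation code: the Open CLIP model name and pretrained tag, the toy label set, and the single prompt template are assumptions (the real testbed typically uses per-dataset prompt ensembles).

```python
# Minimal zero-shot classification sketch with open_clip (illustrative only;
# the model tag, label set, and prompt template are assumptions).
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")   # assumed Open CLIP-H checkpoint
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model = model.to(device).eval()

class_names = ["airplane", "automobile", "bird", "cat", "deer"]  # toy label set
prompts = [f"a photo of a {c}." for c in class_names]

with torch.no_grad():
    # Build the zero-shot classifier: one normalized text embedding per class.
    text_feats = model.encode_text(tokenizer(prompts).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def classify(pil_images):
    """Return the predicted class index for each PIL image."""
    images = torch.stack([preprocess(im) for im in pil_images]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        logits = img_feats @ text_feats.T          # cosine similarity as logits
    return logits.argmax(dim=-1)
```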

### Detailed results

| Dataset (27 in total) | Model | acc@1 | acc@5 | mean_per_class_recall |
|---|---|---|---|---|
| ImageNet-1K | OpenAI CLIP-L | 75.55 | 94.57 | 75.55 |
| | Open CLIP-H | 77.96 | 95.23 | 77.93 |
| | Open CLIP-g | 76.65 | 94.84 | 76.66 |
| | EVA CLIP-g | 78.53 | 95.51 | 78.51 |
| ImageNet-Adversarial | OpenAI CLIP-L | 70.76 | 90.76 | 67.88 |
| | Open CLIP-H | 59.33 | 85.64 | 58.18 |
| | Open CLIP-g | 57.19 | 83.41 | 56.55 |
| | EVA CLIP-g | 73.59 | 90.93 | 69.97 |
| ImageNet-Rendition | OpenAI CLIP-L | 87.83 | 97.11 | 86.44 |
| | Open CLIP-H | 89.33 | 97.36 | 88.1 |
| | Open CLIP-g | 88.69 | 96.96 | 87.51 |
| | EVA CLIP-g | 92.5 | 98.24 | 91.19 |
| ImageNet-Sketch | OpenAI CLIP-L | 59.58 | 84.25 | 59.61 |
| | Open CLIP-H | 66.58 | 88.12 | 66.57 |
| | Open CLIP-g | 65.17 | 87.46 | 65.21 |
| | EVA CLIP-g | 67.31 | 89.07 | 67.31 |
| ImageNet-V2 | OpenAI CLIP-L | 69.86 | 90.91 | 69.85 |
| | Open CLIP-H | 70.87 | 91.67 | 70.92 |
| | Open CLIP-g | 69.56 | 90.86 | 69.61 |
| | EVA CLIP-g | 71.52 | 92.11 | 71.56 |
| ObjectNet | OpenAI CLIP-L | 68.98 | 88.06 | 67.37 |
| | Open CLIP-H | 69.71 | 87.74 | 68.45 |
| | Open CLIP-g | 67.53 | 86.7 | 66.56 |
| | EVA CLIP-g | 72.33 | 89.37 | 70.88 |
| SUN397 | OpenAI CLIP-L | 67.57 | 93.69 | 68.3 |
| | Open CLIP-H | 75.2 | 96.08 | 75.15 |
| | Open CLIP-g | 75.41 | 96.17 | 75.28 |
| | EVA CLIP-g | 74.15 | 95.52 | 73.27 |
| VOC2007 | OpenAI CLIP-L | 78.27 | 96.88 | 86.45 |
| | Open CLIP-H | 77.68 | 94.22 | 84.97 |
| | Open CLIP-g | 81.07 | 96.57 | 85.75 |
| | EVA CLIP-g | 83.23 | 96.94 | 88.7 |
| Birdsnap | OpenAI CLIP-L | 40.52 | 64.81 | 40.12 |
| | Open CLIP-H | 52.92 | 73.47 | 52.91 |
| | Open CLIP-g | 48.68 | 71.1 | 48.61 |
| | EVA CLIP-g | 50 | 70.97 | 50.07 |
| Caltech101 | OpenAI CLIP-L | 86.67 | 96.85 | 93.27 |
| | Open CLIP-H | 88.24 | 88.24 | 94.56 |
| | Open CLIP-g | 88.21 | 97.24 | 94.13 |
| | EVA CLIP-g | 87.72 | 95.68 | 94.81 |
| Stanford Cars | OpenAI CLIP-L | 77.86 | 98.4 | 77.84 |
| | Open CLIP-H | 93.4 | 99.89 | 93.41 |
| | Open CLIP-g | 92.92 | 99.88 | 93.45 |
| | EVA CLIP-g | 91.71 | 99.76 | 91.66 |
| CIFAR10 | OpenAI CLIP-L | 95.6 | 99.63 | 95.6 |
| | Open CLIP-H | 97.45 | 99.93 | 97.45 |
| | Open CLIP-g | 97.05 | 99.93 | 97.06 |
| | EVA CLIP-g | 98.31 | 99.96 | 98.29 |
| CIFAR100 | OpenAI CLIP-L | 75.81 | 92.76 | 75.81 |
| | Open CLIP-H | 84.73 | 97.34 | 84.73 |
| | Open CLIP-g | 83.91 | 97.31 | 83.92 |
| | EVA CLIP-g | 88.66 | 88.71 | 88.65 |
| Country211 | OpenAI CLIP-L | 31.86 | 59.36 | 31.87 |
| | Open CLIP-H | 29.88 | 55.76 | 29.86 |
| | Open CLIP-g | 28.8 | 54.24 | 28.82 |
| | EVA CLIP-g | 28.63 | 55.37 | 28.64 |
| Describable Textures | OpenAI CLIP-L | 55.43 | 84.15 | 55.48 |
| | Open CLIP-H | 67.82 | 92.45 | 67.82 |
| | Open CLIP-g | 68.03 | 92.39 | 68.09 |
| | EVA CLIP-g | 61.33 | 87.5 | 61.38 |
| EuroSAT | OpenAI CLIP-L | 62.4 | 95.14 | 63.72 |
| | Open CLIP-H | 72.7 | 95.09 | 72.91 |
| | Open CLIP-g | 63.22 | 98.1 | 63.71 |
| | EVA CLIP-g | 73.57 | 98.75 | 74.39 |
| Facial Emotion Recognition 2013 | OpenAI CLIP-L | 49.89 | 97.23 | 49.33 |
| | Open CLIP-H | 52.01 | 96.55 | 50.68 |
| | Open CLIP-g | 47.16 | 94.6 | 48.44 |
| | EVA CLIP-g | 52.17 | 94.79 | 48.57 |
| FGVC Aircraft | OpenAI CLIP-L | 31.44 | 78.04 | 31.48 |
| | Open CLIP-H | 42.75 | 83.74 | 42.65 |
| | Open CLIP-g | 37.71 | 79.9 | 37.61 |
| | EVA CLIP-g | 32.37 | 73.75 | 32.29 |
| Oxford Flowers 102 | OpenAI CLIP-L | 79.2 | 92.16 | 79.32 |
| | Open CLIP-H | 80.11 | 92.91 | 79.92 |
| | Open CLIP-g | 77.41 | 90.6 | 77.92 |
| | EVA CLIP-g | 74.47 | 90.65 | 74.29 |
| Food101 | OpenAI CLIP-L | 93.05 | 99.3 | 93.06 |
| | Open CLIP-H | 92.74 | 99.22 | 92.73 |
| | Open CLIP-g | 91.55 | 99.07 | 91.55 |
| | FLIP-L | 89.3 | - | - |
| | EVA CLIP-g | 93.46 | 99.32 | 93.46 |
| GTSRB | OpenAI CLIP-L | 50.55 | 76.08 | 43.96 |
| | Open CLIP-H | 58.36 | 82.1 | 54.32 |
| | Open CLIP-g | 49.8 | 76.88 | 46.77 |
| | EVA CLIP-g | 49.12 | 84.56 | 47.08 |
| MNIST | OpenAI CLIP-L | 76.35 | 93.53 | 75.88 |
| | Open CLIP-H | 72.86 | 94.21 | 73.65 |
| | Open CLIP-g | 68.57 | 95.15 | 68.97 |
| | EVA CLIP-g | 62.34 | 90.81 | 62.35 |
| Oxford-IIIT Pets | OpenAI CLIP-L | 93.49 | 99.78 | 93.45 |
| | Open CLIP-H | 94.55 | 99.86 | 94.51 |
| | Open CLIP-g | 94.36 | 99.81 | 94.35 |
| | EVA CLIP-g | 94.22 | 99.86 | 94.2 |
| STL10 | OpenAI CLIP-L | 99.36 | 100 | 99.36 |
| | Open CLIP-H | 98.48 | 99.99 | 98.48 |
| | Open CLIP-g | 98.65 | 99.98 | 98.68 |
| | EVA CLIP-g | 98.89 | 100 | 98.89 |
| RESISC45 | OpenAI CLIP-L | 64.64 | 93.21 | 64.68 |
| | Open CLIP-H | 70.54 | 96.03 | 70.55 |
| | Open CLIP-g | 72.5 | 96.12 | 72.54 |
| | EVA CLIP-g | 70.3 | 94.66 | 70.3 |
| PatchCamelyon | OpenAI CLIP-L | 51.98 | - | 51.97 |
| | Open CLIP-H | 54.24 | - | 54.22 |
| | Open CLIP-g | 56.11 | - | 56.11 |
| | EVA CLIP-g | 49.88 | - | 49.86 |
| Rendered SST2 | OpenAI CLIP-L | 68.86 | - | 68.88 |
| | Open CLIP-H | 64.25 | - | 64.27 |
| | Open CLIP-g | 64.14 | - | 64.16 |
| | EVA CLIP-g | 58.38 | - | 58.41 |
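
For clarity, here is a minimal sketch of how the three reported metrics can be computed from model logits and ground-truth labels. It follows the standard definitions rather than the exact CLIP Benchmark implementation; the function name and tensor layout are assumptions.

```python
# Sketch of acc@1, acc@5, and mean_per_class_recall from raw logits and labels
# (standard definitions; not the exact CLIP Benchmark code).
import torch

def zeroshot_metrics(logits: torch.Tensor, labels: torch.Tensor):
    num_classes = logits.shape[-1]
    k = min(5, num_classes)                 # acc@5 is undefined for datasets with
    topk = logits.topk(k, dim=-1).indices   # fewer than 5 classes, hence the "-"
    pred1 = topk[:, 0]                      # entries for the binary benchmarks above

    acc1 = (pred1 == labels).float().mean().item()
    acc5 = (topk == labels.unsqueeze(-1)).any(dim=-1).float().mean().item()

    # mean per-class recall: top-1 recall of each class, averaged over classes
    per_class = [(pred1[labels == c] == c).float().mean()
                 for c in range(num_classes) if (labels == c).any()]
    mean_per_class_recall = torch.stack(per_class).mean().item()

    return acc1, acc5, mean_per_class_recall
```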

## Zero-shot Video Action Recognition Evaluation

| Dataset | Model | acc@1 | acc@5 | mean(acc@1, acc@5) |
|---|---|---|---|---|
| UCF101 | OpenAI CLIP-L | 76.39 | 94.86 | 85.63 |
| | Open CLIP-H | 78.16 | 95.02 | 86.59 |
| | Open CLIP-g | 77.73 | 94.98 | 86.36 |
| | EVA CLIP-g | 76.05 | 93.64 | 84.84 |
| Kinetics400 | OpenAI CLIP-L | 52.88 | 76.06 | 64.47 |
| | Open CLIP-H | 51.63 | 74.49 | 63.06 |
| | Open CLIP-g | 50.35 | 73.03 | 61.69 |
| | EVA CLIP-g | 54.04 | 76.42 | 65.23 |
| Kinetics600 | OpenAI CLIP-L | 52.41 | 76 | 64.21 |
| | Open CLIP-H | 52.25 | 74.92 | 63.58 |
| | Open CLIP-g | 50.79 | 73.53 | 62.16 |
| | EVA CLIP-g | 52.76 | 75.99 | 64.38 |
| Kinetics700 | OpenAI CLIP-L | 45.73 | 69.63 | 57.68 |
| | Open CLIP-H | 44.64 | 67.54 | 56.09 |
| | Open CLIP-g | 43.6 | 66.39 | 54.99 |
| | EVA CLIP-g | 46.65 | 70.16 | 58.4 |

## Zero-shot Retrieval Evaluation

| Dataset | Model | Text-to-Image R@1 | R@5 | R@10 | Image-to-Text R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|
| Flickr30k | OpenAI CLIP-L | 65.18 | 87.28 | 92 | 85.2 | 97.3 | 99 |
| | Open CLIP-H | 77.78 | 94.14 | 96.62 | 90.8 | 99.3 | 99.7 |
| | Open CLIP-g | 76.52 | 93.62 | 96.28 | 90.8 | 99.1 | 99.8 |
| | EVA CLIP-g | 72.64 | 91.6 | 95.12 | 88.3 | 98.3 | 99.3 |
| MSCOCO | OpenAI CLIP-L | 36.51 | 61.01 | 71.11 | 56.34 | 79.32 | 86.66 |
| | Open CLIP-H | 49.47 | 73.4 | 81.53 | 65.96 | 86.06 | 91.9 |
| | Open CLIP-g | 47.99 | 72.37 | 80.75 | 64.96 | 85.3 | 91.46 |
| | EVA CLIP-g | 44.07 | 68.5 | 77.33 | 61.76 | 83.28 | 89.96 |

The zero-shot retrieval performance of EVA-CLIP is relatively inferior to its Open CLIP-H / -g counterparts. We speculate there are two main reasons:

- The language tower of EVA-CLIP is much smaller and weaker than those of Open CLIP-H and Open CLIP-g (124M vs. 354M parameters), and retrieval tasks depend more on the capacity of the language branch than classification tasks do.
- Retrieval tasks seem to benefit more from a larger training dataset (LAION-2B for Open CLIP), while EVA-CLIP is trained on LAION-400M only.

Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and the training data to improve retrieval performance.
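
For reference, the sketch below shows how recall@k numbers like those above can be computed from a caption-image similarity matrix, assuming each caption has exactly one ground-truth image while an image may have several captions (as in Flickr30k and MSCOCO). The function name and tensor layout are illustrative assumptions, not the exact CLIP Benchmark code.

```python
# Sketch of text-to-image and image-to-text recall@k from a similarity matrix.
# `sim[i, j]` scores caption i against image j; `txt2img[i]` is the index of
# caption i's ground-truth image.
import torch

def retrieval_recall(sim: torch.Tensor, txt2img: torch.Tensor, ks=(1, 5, 10)):
    num_images = sim.shape[1]

    # Text-to-image: for each caption, rank all images by similarity and check
    # whether its ground-truth image appears in the top-k.
    t2i_rank = sim.argsort(dim=1, descending=True)                 # (T, I)
    t2i = {k: (t2i_rank[:, :k] == txt2img.unsqueeze(1)).any(1).float().mean().item()
           for k in ks}

    # Image-to-text: for each image, rank all captions; it is a hit if any of
    # the image's own captions appears in the top-k.
    i2t_rank = sim.t().argsort(dim=1, descending=True)             # (I, T)
    gt_imgs = txt2img[i2t_rank]                                    # image id of each retrieved caption
    img_ids = torch.arange(num_images, device=sim.device).unsqueeze(1)
    i2t = {k: (gt_imgs[:, :k] == img_ids).any(1).float().mean().item() for k in ks}

    return t2i, i2t
```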

- **Updates (Feb 2023):** We are training an improved version, EVA-CLIP+ (WIP), which currently achieves ~79.5 zero-shot top-1 accuracy on IN-1K and outperforms the previous best CLIP by ~0.5% in zero-shot retrieval. We will update the details soon and release the full EVA-CLIP+ suite in the future. Please stay tuned.