<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:image" content="https://tau-vailab.github.io/hierarcaps/assets/teaser.png" />
<meta property="og:title" content="Emergent Visual-Semantic Hierarchies in Image-Text Representations" />
<meta property="og:description"
content="We find that foundation VLMs like CLIP model visual-semantic hierarchies." />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css"
integrity="sha384-B0vP5xmATw1+K9KRQjQERJvTumQW0nPEzvF6L/Z6nronJ3oUOFUFpCjEUQouq2+l" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="web/style.css">
<title>Emergent Visual-Semantic Hierarchies in Image-Text Representations</title>
</head>
<body class="container" style="max-width:840px">
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"
integrity="sha384-DfXdz2htPH0lsSSs5nCTpuj/zy4C+OGpamoFVy38MVBnE+IbbVYUew+OrCXaRkfj"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js"
integrity="sha384-Piv4xVNRyMGpqkS2by6br4gNJ7DXjqk09RmUpJ8jgGtD7zP9yug3goQfGII0yAns"
crossorigin="anonymous"></script>
<!-- heading -->
<div>
<!-- title -->
<div class='row mt-5 mb-3'>
<div class='col text-center'>
<p class="h2 font-weight-normal">Emergent Visual-Semantic Hierarchies<br /> in Image-Text Representations</p>
</div>
</div>
<!-- authors -->
<div class="col text-center h6 font-weight-bold mb-2 ">
<span><a class="col-md-4 col-xs-6 pb-2" href="https://morrisalp.github.io/">Morris Alper</a></span>
<span><a class="col-md-4 col-xs-6 pb-2" href="https://www.elor.sites.tau.ac.il/">Hadar
Averbuch-Elor</a></span>
</div>
<!-- affiliations -->
<div class='row mb-1'>
<div class='col text-center'>
<p class="h6">
<a href="https://english.tau.ac.il/"><span>Tel Aviv University</span></a>
</p>
</div>
</div>
<div class='row mt-2 mb-3'>
<div class='col text-center'>
<p class="h3 font-weight-normal">ECCV 2024 <small>(Oral Presentation)</small></p>
</div>
</div>
<!-- links -->
<div class='row mb-4'>
<div class='col text-center'>
<a href="https://arxiv.org/abs/2407.08521" target="_blank" class="btn btn-outline-primary" role="button">
<i class="ai ai-arxiv"></i>
arXiv
</a>
<a href="https://github.com/TAU-VAILab/hierarcaps" target="_blank" class="btn btn-outline-primary"
role="button">
<i class="fa fa-github"></i>
Code
</a>
<a href="https://github.com/TAU-VAILab/hierarcaps/tree/main/data" target="_blank"
class="btn btn-outline-primary" role="button">
<i class="fa fa-database"></i>
Data
</a>
<a href="web/viz.html" target="_blank" class="btn btn-outline-primary" role="button">
<i class="fa fa-eye"></i>
Interactive Visualization
</a>
</div>
</div>
<!-- teaser -->
<div class='row justify-content-center'>
<div class="card teaser teaser_img_card">
<img src="assets/teaser.png" class="img-fluid rounded mx-auto d-block teaser_img">
</div>
<div class='text-center col-md-12 col-sm-12 col-xs-12 align-middle mt-1'>
<p class='h6'>
<em>TL;DR: Foundation VLMs like CLIP model <b>visual-semantic hierarchies</b> like the one shown above.</em>
</p>
<hr>
</div>
</div>
<!-- abstract -->
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<p class="h4 font-weight-bold title">Abstract</p>
<p class="abstract"><!--style="line-height: 1;">-->
While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing text and images in
a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts which may
describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly
training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation
models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent
understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose
the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the
HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image–text
representations, constructed automatically via large language models. Our results show that foundation VLMs
exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed
for this purpose. Furthermore, we show that foundation models may be better aligned to hierarchical reasoning
via a text-only fine-tuning phase, while retaining pretraining knowledge.
</p>
<hr>
</div>
</div>
<!-- method -->
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<p class="h4 font-weight-bold title">Probing and Optimizing with the RE Framework</p>
<p>
Our <i>Radial Embedding</i> (RE) framework defines geometric relations between embeddings in VLMs such as CLIP,
which we find effectively encode hierarchical knowledge. The RE framework is designed to flexibly adapt to the
emergent geometry of such VLMs, in contrast to prior approaches that train hierarchical models from scratch
under stricter geometric assumptions. In addition to zero-shot probing, RE can be used to <i>align</i> VLMs
via a lightweight fine-tuning stage that enhances hierarchical understanding, using the RE loss illustrated
below:
</p>
<div class="row justify-content-center">
<div class="card pipe_card noborder">
<img src="assets/re.png" class="img-fluid rounded mx-auto d-block re_img">
</div>
</div>
<p>
This is a contrastive loss based on the learnable root embedding <b>r</b> and the triplet (<b>e</b>,
<b>e'</b>, <b>e''</b>) of caption text embeddings selected to include logical entailment and contradiction relations.
See our paper for further technical details.
</p>
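<p>
For intuition, the sketch below shows one way such a triplet-based objective over a learnable root embedding
could be written in PyTorch. It is an illustration only: the function name, margin, and exact form are
simplifying assumptions of ours and differ from the loss used in the paper; see the paper for the precise
formulation.
</p>
<pre><code>import torch
import torch.nn.functional as F

def re_style_loss(r, e, e_entail, e_contra, margin=0.2):
    # r: learnable root embedding, shape (D,)
    # e, e_entail: embeddings of a caption and a more specific caption that entails it
    # e_contra: embedding of a contradicting caption
    d_e = F.normalize(e - r, dim=-1)              # direction radiating out from the root
    d_pos = F.normalize(e_entail - r, dim=-1)
    d_neg = F.normalize(e_contra - r, dim=-1)
    pos = 1.0 - (d_e * d_pos).sum(-1)             # pull the entailed caption toward the same radial direction
    neg = F.relu((d_e * d_neg).sum(-1) - margin)  # push the contradicting caption away, up to a margin
    return (pos + neg).mean()

# Usage with random embeddings (batch of 8, dimension 512):
r = torch.nn.Parameter(torch.zeros(512))
e, e_p, e_c = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
loss = re_style_loss(r, e, e_p, e_c)</code></pre>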
<hr>
</div>
</div>
<!-- dataset -->
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<p class="h4 font-weight-bold title">The HierarCaps Dataset</p>
<p>
We propose the <a href="https://github.com/TAU-VAILab/hierarcaps/tree/main/data">HierarCaps</a> dataset,
consisting of images paired with ground-truth caption hierarchies, as shown below:
</p>
<div class="row justify-content-center">
<div class="card pipe_card noborder">
<img src="assets/hierarcaps.png" class="img-fluid rounded mx-auto d-block pipe pipe4 pipe_img">
</div>
</div>
<p>
As existing image captioning datasets provide only a single caption (or unrelated reference captions) for each
image, we leverage existing paired image-caption datasets along with an LLM- and NLI-based pipeline to generate
logical caption hierarchies.
Our training set consists of over 70K paired images and captions, and we manually curate 1K items as a clean
test set. We also contribute quantitative metrics for hierarchical understanding on HierarCaps.
</p>
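<p>
To make the structure concrete, the snippet below shows a hypothetical HierarCaps-style entry. The image path,
captions, and field names are invented for illustration and are not the dataset's actual format; see the
<a href="https://github.com/TAU-VAILab/hierarcaps/tree/main/data">data repository</a> for the released files.
</p>
<pre><code># Hypothetical example of an image paired with a caption hierarchy,
# ordered from generic to specific; each caption is entailed by the next one.
example = {
    "image": "example.jpg",  # placeholder path
    "captions": [
        "an animal",
        "a dog",
        "a dog playing outdoors",
        "a golden retriever catching a frisbee in a park",
    ],
}

for general, specific in zip(example["captions"], example["captions"][1:]):
    print(f"'{specific}' entails '{general}'")</code></pre>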
<hr>
</div>
</div>
<!-- code -->
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<p class="h4 font-weight-bold title">Code, Trained Models, and Results</p>
<p>
We release our
<a href="https://github.com/TAU-VAILab/hierarcaps">code</a> and <a
href="https://github.com/TAU-VAILab/hierarcaps">trained models</a>, anticipating further research on
visual-semantic hierarchical understanding. We also provide an <a href="web/viz.html">interactive
visualization</a> of model results on the HierarCaps test set, as well as on a random subset of the HierarCaps
training set.
</p>
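<p>
As a starting point for experimenting with such representations, the sketch below embeds a small caption
hierarchy with an off-the-shelf CLIP checkpoint via the HuggingFace transformers library. This is not our
released code or fine-tuned model; the checkpoint name and captions are placeholders chosen for illustration.
</p>
<pre><code>import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Captions ordered from generic to specific (placeholders)
captions = ["an animal", "a dog", "a dog playing outdoors in a park"]
inputs = processor(text=captions, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)  # one embedding per caption
text_emb = F.normalize(text_emb, dim=-1)
print(text_emb.shape)  # e.g. torch.Size([3, 512]) for this checkpoint</code></pre>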
<hr>
</div>
</div>
<!-- ack -->
<div>
<div class="row">
<div class='col-md-12 col-sm-12 col-xs-12'>
<p class='h4 font-weight-bold title'>Acknowledgements</p>
<p class="ack">
We thank Yotam Elor, Roi Livni, Guy Tevet, Chen Dudai, and Rinon Gal for
providing helpful feedback. This work was partially supported by ISF (grant
number 2510/23).
</p>
</div>
</div>
<hr>
</div>
<!-- citation -->
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<p class="h4 font-weight-bold title">Citation</p>
<pre><code>@InProceedings{alper2024hierarcaps,
author = {Morris Alper and Hadar Averbuch-Elor},
title = {Emergent Visual-Semantic Hierarchies in Image-Text Representations},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024}
}</code></pre>
</div>
</div>
</body>
</html>