Arxiv : https://arxiv.org/abs/2205.02870
Pre-trained language models are able to extract a significant amount of contextual information. We can encode this information with encoder-heavy models such as BERT, which capture not only semantic information but also the context in which words are used.
They split the queries into clusters in three different ways
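As a concrete illustration, a single query's [CLS] vector can be pulled out of BERT roughly as in the sketch below; the model name and the choice of the [CLS] token as the pooled representation are assumptions for the example, not details taken from the paper.

```python
# Minimal sketch: contextual [CLS] embedding of a query via Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

query = "what is the capital of france"
inputs = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token (position 0) gives one vector that summarises the whole query.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```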
- Semantic Clustering - use K-Means to generate 5 clusters
- WH-Word Queries - split the train and test sets by manually building clusters based on queries related to definitions, instructions, and more general questions
- Short and Long Queries - split the train set into groups of short and long queries, with 6 being the cut-off (a small sketch follows this list)
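A minimal sketch of the short/long split, assuming the cut-off of 6 counts whitespace-separated words (the notes do not say whether the unit is words or tokens):

```python
# Toy stand-in queries; in the paper this would be the MS MARCO train set.
queries = [
    "what is a cat",
    "how do i replace the battery in a 2015 honda civic key fob",
]

short_queries = [q for q in queries if len(q.split()) <= 6]
long_queries = [q for q in queries if len(q.split()) > 6]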
Semantic clustering was done as follows
- K-Means on the [CLS] representations of the MS MARCO queries to build 100 initial clusters
- Select the five clusters that maximize the sum of pairwise L2 distances
- Expand the five clusters until they have no overlap
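A rough sketch of these three steps, assuming sklearn K-Means over [CLS] embeddings (random vectors stand in for the real MS MARCO query embeddings). The greedy `pick_spread_clusters` helper and the nearest-centroid "expansion" are my approximations of the selection and expansion steps, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 768))  # stand-in for [CLS] embeddings of MS MARCO queries

# Step 1: 100 initial clusters.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_

def pick_spread_clusters(centroids, k=5):
    """Greedily pick k centroids that (approximately) maximize summed pairwise L2 distance."""
    d = np.linalg.norm(centroids[:, None] - centroids[None], axis=-1)
    chosen = list(np.unravel_index(d.argmax(), d.shape))  # start from the farthest pair
    while len(chosen) < k:
        rest = [i for i in range(len(centroids)) if i not in chosen]
        chosen.append(max(rest, key=lambda i: d[i, chosen].sum()))
    return chosen

# Step 2: five well-separated clusters.
chosen = pick_spread_clusters(centroids, k=5)

# Step 3: "expand" the five clusters so they cover every query with no overlap,
# here by reassigning each query to its nearest chosen centroid.
dists = np.linalg.norm(X[:, None, :] - centroids[chosen][None, :, :], axis=-1)
expanded = np.array(chosen)[dists.argmin(axis=1)]
```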
They investigate query styles and goals through question words. This results in three manual clusters being built (a toy bucketing function is sketched after the list)
- Definition Queries : What, definition
- Instruction Queries : How
- Person, Location and Context Queries : Who, When, Where, Which
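A toy bucketing function along these lines might look as follows; the exact matching rules (first-word matching, the handling of "definition") are guesses, not the paper's implementation.

```python
def wh_bucket(query: str) -> str:
    """Assign a query to one of the three manual WH-word clusters."""
    first = query.lower().split()[0]
    if first == "what" or "definition" in query.lower():
        return "definition"
    if first == "how":
        return "instruction"
    if first in {"who", "when", "where", "which"}:
        return "person_location_context"
    return "other"

print(wh_bucket("what is a transformer"))      # definition
print(wh_bucket("how to train a bi-encoder"))  # instruction
print(wh_bucket("who invented bm25"))          # person_location_context
```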
This then creates three train-test splits, each with roughly 10M training and 3,500 test queries.
They benchmark five different models
- A standard bi-encoder, which embeds queries and documents into the same embedding space (using the CLS embedding)
- A late-interaction ColBERT model (a toy comparison of these two scoring schemes follows this list)
- A sparse bi-encoder, SPLADE
- BM25
- A monoBERT cross-encoder used to re-rank BM25
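As referenced above, here is a toy sketch contrasting the bi-encoder and ColBERT scoring schemes: the bi-encoder scores a query-document pair with one dot product between pooled vectors, while ColBERT's late interaction sums, over query tokens, the maximum similarity to any document token. The embeddings are random stand-ins, not real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(8, 128))   # per-token query embeddings (8 tokens)
d_tokens = rng.normal(size=(50, 128))  # per-token document embeddings (50 tokens)

# Bi-encoder: one pooled vector per side (the paper's setup uses the CLS embedding;
# mean pooling here is just a stand-in), scored with a single dot product.
bi_score = float(q_tokens.mean(axis=0) @ d_tokens.mean(axis=0))

# ColBERT-style late interaction (MaxSim): for every query token, take its maximum
# similarity against all document tokens, then sum over query tokens.
sim = q_tokens @ d_tokens.T            # (8, 50) token-token similarities
colbert_score = float(sim.max(axis=1).sum())
```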
The metrics they use to benchmark each model are MRR and Atomised Search Length. ColBERT and monoBERT were both fine-tuned on ~5M samples for 150k training iterations with a batch size of 32.
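For reference, MRR can be computed as in the short sketch below; the ranking list is made up, and Atomised Search Length (the paper's second metric) is not reproduced here.

```python
def mrr(first_relevant_ranks, k=10):
    """Mean Reciprocal Rank: for each query, take the 1-based rank of the first
    relevant document (None if none appears in the top k) and average 1/rank."""
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

print(mrr([1, 3, None, 2], k=10))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```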
They then evaluate each model on the individual clusters it was not fine-tuned on. They conclude that dense models are the most impacted by the shifts across clusters, followed by SPLADE, then ColBERT, with monoBERT the least affected.