Much recent work has focused on sparse learned indexes that use deep neural architectures to significantly improve retrieval quality while keeping the efficiency benefits of the inverted index. While such sparse learned structures achieve effectiveness far beyond that of traditional inverted index-based rankers, there is still a gap in effectiveness relative to the best dense retrievers, or even to sparse methods that leverage more expensive optimizations such as query expansion and query term weighting. We focus on narrowing this gap by revisiting and optimizing DeepImpact, a sparse retrieval approach that uses DocT5Query for document expansion followed by a BERT language model to learn impact scores for document terms. We first reinvestigate the expansion process and find that the recently proposed Doc2Query-- query filtration does not enhance retrieval quality when used with DeepImpact. Instead, substituting T5 with a fine-tuned Llama 2 model for query prediction yields a considerable improvement. We then study training strategies that have proven effective for other models, in particular the use of hard negatives, distillation, and initialization from a pre-trained CoCondenser model. Our results significantly narrow the effectiveness gap with the most effective versions of SPLADE.
To install the required dependencies, run the following command:
pip install -r requirements.txt
Check out the notebook for a detailed example of how to use DeeperImpact, from expansions to running inference.
To run expansions on a collection of documents, use the following command:
python -m src.llama2.generate \
--llama_path <path | HuggingFaceHub link> \
--collection_path <path> \
--collection_type [msmarco | beir] \
--output_path <path> \
--batch_size <batch_size> \
--max_tokens 512 \
--num_return_sequences 80 \
--max_new_tokens 50 \
--top_k 50 \
--top_p 0.95 \
--peft_path soyuj/llama2-doc2query
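For example, an MS MARCO expansion run might look like the following; the base Llama 2 checkpoint, batch size, and file paths are illustrative placeholders, while soyuj/llama2-doc2query is the fine-tuned adapter referenced above:

python -m src.llama2.generate \
--llama_path meta-llama/Llama-2-7b-hf \
--collection_path data/msmarco/collection.tsv \
--collection_type msmarco \
--output_path data/msmarco/expansions.jsonl \
--batch_size 8 \
--max_tokens 512 \
--num_return_sequences 80 \
--max_new_tokens 50 \
--top_k 50 \
--top_p 0.95 \
--peft_path soyuj/llama2-doc2query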
This will generate a jsonl file with expansions for each document in the collection. To append the unique expansion terms to the original collection, use the following command:
python -m src.llama2.merge \
--collection_path <path> \
--collection_type [msmarco | beir] \
--queries_path <jsonl file generated above> \
--output_path <path>
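Conceptually, the merge step appends to each document only those predicted query terms that are not already present, so the expanded collection grows by the unique new terms. Below is a minimal Python sketch of that idea; it is an illustration only, not the repository's implementation, and it assumes plain whitespace tokenization (the actual code may handle tokenization, casing, and separators differently):

import json

# Illustrative sketch: append only expansion terms that the passage does not already contain.
def merge_document(passage: str, predicted_queries: list[str]) -> str:
    seen = set(passage.lower().split())
    new_terms = []
    for query in predicted_queries:
        for term in query.lower().split():
            if term not in seen:
                seen.add(term)
                new_terms.append(term)
    # Unique expansion terms are appended after the original passage text.
    return passage + " " + " ".join(new_terms)

print(merge_document("sparse retrieval with learned impacts",
                     ["what is a learned sparse index", "impact score retrieval"]))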
To train DeeperImpact, use the following command:
torchrun --standalone --nproc_per_node=gpu -m src.deep_impact.train \
--dataset_path <cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz> \
--queries_path <queries.train.tsv> \
--collection_path <expanded_collection_path> \
--checkpoint_dir <checkpoint_dir_path> \
--batch_size <batch_size> \
--save_every <n> \
--distil_kl \
--max_length 300 \
--lr 1e-6 \
--seed 42 \
--gradient_accumulation_steps 1

Flag notes:
- --start_with <specific_checkpoint_path>: optional; start training from a specific checkpoint.
- --distil_kl: train with distillation using KL-divergence loss on the cross-encoder scores. To train with triples and cross-entropy loss instead, pass triples in --dataset_path and exclude this flag.
- --max_length: maximum token length for each document.

Experimental options:
- --in_batch_negatives: triples and cross-entropy loss with in-batch negatives; pass triples in --dataset_path and exclude --distil_kl.
- --distil_mse with --qrels_path qrels.train.tsv: distillation using MarginMSELoss instead of KL-divergence loss; pass the same cross-encoder dataset and exclude --distil_kl.
- --cross_encoder: train a cross-encoder DeepImpact model; pass triples and exclude --distil_kl.
This distributes training across all GPUs on the machine; the batch_size is per GPU. To restrict training to specific GPUs, set the CUDA_VISIBLE_DEVICES environment variable.
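For example, to train on only the first two GPUs (device indices are illustrative):

CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=gpu -m src.deep_impact.train ...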
To run inference on a collection of documents, use the following command:
python -m src.deep_impact.index \
--collection_path <expanded_collection.tsv> \
--output_file_path <path> \
--model_checkpoint_path <model_checkpoint_path> \
--num_processes <n> \
--process_batch_size <process_batch_size> \
--model_batch_size <model_batch_size>
This distributes inference across all GPUs on the machine. To restrict inference to specific GPUs, set the CUDA_VISIBLE_DEVICES environment variable.
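For example, to run inference on GPUs 2 and 3 only (device indices are illustrative):

CUDA_VISIBLE_DEVICES=2,3 python -m src.deep_impact.index ...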
To quantize the generated impact scores, use the following command:
python -m src.deep_impact.indexing.quantize \
-i <deep_impact_collection_path> \
-o <quantized_deep_impact_collection_path>
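Quantization maps the real-valued impact scores onto a small range of integers so they can be stored in a standard inverted index. Below is a minimal sketch of linear quantization, purely to illustrate the idea; the repository's exact scheme, scaling, and number of bits may differ:

# Illustrative linear quantization of per-term impact scores (not the repo's exact scheme).
def quantize(term_scores: dict[str, float], global_max: float, bits: int = 8) -> dict[str, int]:
    levels = (1 << bits) - 1
    # Scale each score by the collection-wide maximum score into [1, levels],
    # keeping at least 1 so no term is dropped from the posting lists.
    return {t: max(1, round(s / global_max * levels)) for t, s in term_scores.items()}

print(quantize({"sparse": 4.7, "retrieval": 2.1, "index": 0.3}, global_max=5.0))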
You can then use Anserini to build the inverted index and export it in CIFF format, which can be processed directly with PISA.
For quick experimentation, you can also use a custom implementation of an inverted index:
python -m src.deep_impact.inverted_index.create \
-i <quantized_deep_impact_collection_path> \
-o <inverted_index_dir_path>

To rank:

python -m src.deep_impact.rank \
--index_path <inverted_index_dir_path> \
--queries_path <queries_to_rank> \
--output_path <run_file_path> \
--dataset_type [msmarco | beir] \
--num_workers <n> \
--qrels_path <qrels_path>

The --qrels_path flag is optional; if a qrels file is specified, only the queries in the qrels file are ranked.

To evaluate:

python -m src.deep_impact.evaluate \
--run_file_path <run_file_path> \
--qrels_path <qrels_path>
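As a sanity check, MRR@10 (the standard MS MARCO passage metric) can also be computed directly from a run file with a few lines of Python. The sketch below makes assumptions about the file layouts: a tab-separated run file with qid, docid, rank columns and a whitespace-separated qrels file with qid, 0, docid, relevance columns; adjust the parsing if your files differ:

from collections import defaultdict

# Hypothetical MRR@10 computation over MS MARCO-style run and qrels files (column layouts assumed).
def mrr_at_10(run_path: str, qrels_path: str) -> float:
    relevant = defaultdict(set)
    with open(qrels_path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            if int(rel) > 0:
                relevant[qid].add(docid)

    best_rank = {}  # per query: smallest rank of a relevant passage within the top 10
    with open(run_path) as f:
        for line in f:
            qid, docid, rank = line.rstrip("\n").split("\t")[:3]
            rank = int(rank)
            if rank <= 10 and docid in relevant.get(qid, set()):
                best_rank[qid] = min(rank, best_rank.get(qid, rank))

    # Average reciprocal rank over all judged queries (queries with no hit contribute 0).
    return sum(1.0 / r for r in best_rank.values()) / max(len(relevant), 1)

print(mrr_at_10("run.tsv", "qrels.dev.small.tsv"))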
For any questions or comments, please reach out to us via email: [email protected]
Please cite our work as:
- DeeperImpact: Optimizing Sparse Learned Index Structures
@misc{basnet2024deeperimpact,
title={DeeperImpact: Optimizing Sparse Learned Index Structures},
author={Soyuj Basnet and Jerry Gou and Antonio Mallia and Torsten Suel},
year={2024},
eprint={2405.17093},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (paper)
- Document Expansion by Query Prediction (paper)
- From doc2query to docTTTTTquery (paper)
- Doc2Query--: When Less is More (paper)
- Passage Re-ranking with BERT (paper)
- Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval (paper)
- Context-Aware Document Term Weighting for Ad-Hoc Search (paper)
- Efficiency Implications of Term Weighting for Passage Retrieval (paper)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (paper)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (paper)
- Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation (paper)