Created by: Krzysztof Jankowski, Michał Janik, Michał Grotkowski, Antoni Hanke. The research was presented at the Data Science Summit ML Edition 2024; earlier parts were presented at ML in PL 2024 and Warsaw.ai episode XX.
The repository contains code developed for experimenting with information retrieval and question answering systems. By combining various retrievers, rerankers, and other techniques, we conduct an in-depth analysis of how to build the most performant pipelines. The repository uses the following models:

Retrievers:
- BM25 and ElasticSearch BM25
- Dragon
- Snowflake Arctic-embed-m
Rerankers:
- BGE-reranker-large
- Rank Zephyr
- Custom reranker: Mistral 7B with a dedicated prompt that compares a question against two passages (see the sketch after this list)
- Hybrid rerankers: flexible code for combining rerankers into pipelines or splitting the retriever results across several rerankers
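To illustrate the pairwise-comparison idea behind the custom reranker, here is a minimal sketch. The prompt wording, model id, and function name are assumptions for illustration; the repository's actual prompt and loading code may differ.

```python
from transformers import pipeline

# Hypothetical sketch of pairwise LLM reranking; not the repository's actual code.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def compare_passages(question: str, passage_a: str, passage_b: str) -> str:
    """Ask the LLM which of two passages better answers the question."""
    prompt = (
        f"Question: {question}\n\n"
        f"Passage A: {passage_a}\n\n"
        f"Passage B: {passage_b}\n\n"
        "Which passage answers the question better? Answer with a single letter: A or B."
    )
    result = generator(prompt, max_new_tokens=4, do_sample=False)[0]["generated_text"]
    continuation = result[len(prompt):].strip()  # pipeline returns prompt + completion
    return "A" if continuation.upper().startswith("A") else "B"
```

A comparator like this can then drive a tournament- or sort-style reordering of the retrieved passages.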
Other models can be easily integrated through Hydra configs.
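As a minimal sketch of such an integration (the file path, fields, and class below are illustrative, not the repository's actual schema), a new retriever could be registered as a Hydra config group entry:

```yaml
# src/cfgs/retrievers/my_retriever.yaml -- hypothetical file and fields
_target_: retrievers.my_retriever.MyRetriever  # class that Hydra instantiates
model_name: my-org/my-embedding-model          # illustrative model id
top_k: 100                                     # passages to retrieve per query
```

The `_target_` key follows Hydra's standard object-instantiation pattern, so swapping models amounts to pointing the config at a different class.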
Technical paper coming soon.
The experiments were run on a Kubernetes cluster. We provide the instructions below:
- Kubernetes Job deployment configuration files are available in the Kubernetes jobs directory. To use them, some fields (e.g. volumeMounts) need to be modified, as they are specific to our cluster; an illustrative sketch is shown below.
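For illustration only (the job name, image, volume, and paths are placeholders, not the repository's actual manifest), a minimal Job highlighting the cluster-specific fields could look like:

```yaml
# Hypothetical Job manifest; replace the image, claim name, and mount path
# with values valid for your cluster.
apiVersion: batch/v1
kind: Job
metadata:
  name: retrieval-experiment
spec:
  template:
    spec:
      containers:
        - name: experiment
          image: registry.example.com/awesome-ir:latest  # placeholder image
          command: ["python", "main.py"]
          volumeMounts:
            - name: data            # cluster-specific: adjust to your setup
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-pvc       # placeholder claim name
      restartPolicy: Never
```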
To set up and run the project locally, follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/jankowskichristopher/awesome-information-retrieval.git
   cd awesome-information-retrieval
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Change to the source directory and run the main script:

   ```bash
   cd src
   python main.py
   ```

Make sure you have Python 3.x installed on your system before following these steps.
The repository uses Hydra to efficiently manage different configs and override parameters. The conducted experiments and their results are reported alongside plots and useful scripts for visualization and data processing. More information about visualization can be found in a separate README.
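For example, Hydra's command-line overrides let you swap pipeline components without editing any files. The config group and option names below are illustrative; the actual ones live in the files under `src/cfgs/`:

```bash
# Hypothetical config groups and options; see src/cfgs/ for the real names
python main.py retriever=bm25 reranker=bge_reranker retriever.top_k=100
```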
The source code is organized in a structure that enables relatively easy modification.
```
.
├── src/
│   ├── cfgs/              # Hydra configs
│   ├── dataset/
│   │   ├── beir/          # BEIR dataset for retrieval evaluation
│   │   └── qa/            # Question answering datasets for evaluating generators
│   ├── evaluation/        # Retrieval and generation evaluation
│   ├── experiments/       # Code for conducting various experiments
│   ├── generators/        # LLM generators used in generation and LLM reranking
│   ├── rerankers/         # Rerankers, e.g. embedding-based and LLM-based
│   └── retrievers/        # Retrievers, e.g. BM25, Dragon, Arctic
├── experiments/
│   ├── plots/             # Plots for visualization
│   └── README.md          # More information about visualization
├── constants.py           # Useful constants
├── utils.py               # Useful utils
└── README.md
```