JankowskiChristopher/awesome-information-retrieval

Awesome Information Retrieval

License: MIT Code style: black

Created by: Krzysztof Jankowski, Michał Janik, Michał Grotkowski, Antoni Hanke. The research was presented at the Data Science Summit ML Edition 2024; earlier parts were presented at ML in PL 2024 and Warsaw.ai episode XX.

About

The repository contains code for experimenting with information retrieval and question answering systems. By combining various retrievers, rerankers and other techniques, we conduct an in-depth analysis of how to build the best-performing pipelines. The repository uses the following models:

Retrievers:

Rerankers:

  • BGE-reranker-large
  • Rank Zephyr
  • Custom reranker: Mistral 7B with a dedicated prompt that compares the question against 2 passages
  • Hybrid rerankers: flexible code to combine rerankers into pipelines or to split the retriever's results across several rerankers
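
To illustrate the idea of combining rerankers into a pipeline, here is a minimal sketch. It is not the repository's code: the names `overlap_reranker`, `pairwise_reranker`, `cascade` and the toy judge are hypothetical, and the token-overlap scorer and keyword judge merely stand in for real models such as an embedding reranker or a pairwise LLM prompt. Only the pipeline case is shown, not the splitting of results across rerankers.

```python
from functools import cmp_to_key
from typing import Callable, List, Sequence, Tuple

# A reranker maps (query, passages) to the same passages in a new order.
Reranker = Callable[[str, Sequence[str]], List[str]]


def overlap_reranker(query: str, passages: Sequence[str]) -> List[str]:
    """Toy pointwise reranker: order by number of tokens shared with the query."""
    q_tokens = set(query.lower().split())
    return sorted(
        passages,
        key=lambda p: len(q_tokens & set(p.lower().split())),
        reverse=True,
    )


def pairwise_reranker(prefer_first: Callable[[str, str, str], bool]) -> Reranker:
    """Build a reranker from a pairwise judge (a stand-in for an LLM prompt
    that compares the question with 2 passages)."""

    def rerank(query: str, passages: Sequence[str]) -> List[str]:
        def cmp(a: str, b: str) -> int:
            if prefer_first(query, a, b):
                return -1
            if prefer_first(query, b, a):
                return 1
            return 0  # no preference: the stable sort keeps the incoming order

        return sorted(passages, key=cmp_to_key(cmp))

    return rerank


def cascade(stages: Sequence[Tuple[Reranker, int]]) -> Reranker:
    """Chain rerankers; each stage passes only its top_k passages onward."""

    def rerank(query: str, passages: Sequence[str]) -> List[str]:
        current = list(passages)
        for reranker, top_k in stages:
            current = reranker(query, current)[:top_k]
        return current

    return rerank


def toy_judge(query: str, a: str, b: str) -> bool:
    """Hypothetical judge: strictly prefer the passage that mentions 'sparse'."""
    return ("sparse" in a.lower()) and ("sparse" not in b.lower())


pipeline = cascade([(overlap_reranker, 3), (pairwise_reranker(toy_judge), 2)])
passages = [
    "cats purr",
    "dense retrieval uses embeddings",
    "BM25 is a sparse retrieval function",
    "retrieval",
]
ranked = pipeline("what is sparse retrieval", passages)
```

A cheap first stage that trims the candidate list before a costlier pairwise stage keeps the number of expensive comparisons small, which is one common motivation for this kind of cascade.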

Other models can be easily integrated through Hydra configs.
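
As a rough illustration of why config-driven integration is convenient: Hydra configs can name a class under a `_target_` key, and `hydra.utils.instantiate` builds the object from it. The sketch below imitates that pattern with the standard library only, so it runs without Hydra installed. It is not the repository's code; `instantiate_from_config` is a hypothetical helper, and `difflib.SequenceMatcher` is used purely as a stand-in for a model class.

```python
import importlib
from typing import Any, Dict


def instantiate_from_config(cfg: Dict[str, Any]) -> Any:
    """Build an object from a dict with a dotted `_target_` path.

    Simplified imitation of the `_target_` convention; the remaining
    keys are passed to the constructor as keyword arguments.
    """
    kwargs = dict(cfg)
    target = kwargs.pop("_target_")
    module_name, _, attr = target.rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(**kwargs)


# Hypothetical config: swapping the scorer means changing only this dict
# (or, with Hydra, the corresponding YAML file).
scorer_cfg = {"_target_": "difflib.SequenceMatcher", "autojunk": False}
scorer = instantiate_from_config(scorer_cfg)

scorer.set_seqs("sparse retrieval", "sparse retrieval")
similarity = scorer.ratio()
```

With this pattern, adding a new model amounts to writing a class and pointing a config entry at it, without touching the code that consumes the object.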

Technical paper coming soon.

Running

The experiments were run on a Kubernetes cluster. We provide the instructions in:

Standard Python Installation and Execution

To set up and run the project locally, follow these steps:

  1. Clone the repository:

    git clone https://github.com/jankowskichristopher/awesome-information-retrieval.git
    cd awesome-information-retrieval
    
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
  3. Install the required dependencies:

    pip install -r requirements.txt
    
  4. Change to the source directory and run the main script:

    cd src
    python main.py
    

Make sure you have Python 3.x installed on your system before following these steps.

Project structure

The repository uses Hydra to manage configs and override parameters. Experiment results are reported alongside plots and scripts for visualization and data processing. More information about visualization is available in a separate README.

The source code is organized into the following structure, which keeps modifications relatively easy.

.
├── src/
│   ├── cfgs/                 # Hydra configs
│   ├── dataset/
│   │   ├── beir/             # BEIR dataset for retrieval evaluation
│   │   └── qa/               # Different question answering datasets for evaluation of generators
│   ├── evaluation/           # For the retrieval and generation evaluation
│   ├── experiments/          # Code for conducting various experiments
│   ├── generators/           # Various LLM generators used in generation and LLM reranking
│   ├── rerankers/            # Various rerankers e.g. embedding and LLM
│   └── retrievers/           # Various retrievers e.g. BM25, Dragon, Arctic
├── experiments/
│   ├── plots/                # Plots for visualization
│   └── README.md             # More information about visualization
├── constants.py              # Useful constants
├── utils.py                  # Useful utils
├── constants.py              # Weights and Biases logging
└── README.md
