DIS-Project-1: Document Retrieval

Authors: Harkeerat Singh Sawhney, Jiayi Li, Alexandre Sallinen

Overview

This repo contains three implementations of a document retrieval system. To reproduce our best results, run the first implementation. The other two are experiments with alternative strategies. The implementations are as follows:

1. BM25 Ranking (Best Implementation)

Filename: bm25_ranking.ipynb
Score: 0.77599

2. Cosine Similarity + BM25 Ranking

Filename: cosine_similarity_bm25_reranking.py
Score: 0.68688

3. Cosine Similarity

Filename: cosine_similarity.py
Score: 0.27647

Implementations:

1. BM25 Ranking

This implementation gave us our best result, a score of $0.77599$ on Kaggle. It is implemented in the notebook bm25_ranking.ipynb. The main sections of this notebook are:

- Loading and preprocessing the corpus and datasets: loads the corpus and the training and test datasets, then preprocesses them with tokenization, lemmatization, and stopword removal.
- Building BM25 models for each language: builds one BM25 model per language from the preprocessed corpus.
- Evaluating Recall@10 on the validation set: computes Recall@10 on the validation set for each language.
- Retrieving documents for test queries: retrieves documents for the test queries using the BM25 models.
- Generating a submission file: writes the retrieval results for the test queries to a submission file.
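The steps above can be illustrated with a small, dependency-free sketch of BM25 scoring and Recall@k. It is not the notebook's code: the `tokenize` placeholder, the `BM25Index` class, the `recall_at_k` helper, the parameter values `k1=1.5` and `b=0.75`, and the Lucene-style IDF are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder preprocessing (assumption): lowercase and split on whitespace.
    # The notebook additionally applies lemmatization and stopword removal.
    return text.lower().split()


class BM25Index:
    """Hypothetical minimal BM25 index over a dict of doc_id -> raw text."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_ids = list(docs)
        self.tfs = [Counter(tokenize(docs[d])) for d in self.doc_ids]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lengths) / len(self.lengths)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(self.doc_ids)
        # Lucene-style IDF, which is never negative.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query_terms, i):
        tf, dl = self.tfs[i], self.lengths[i]
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query_terms
            if t in tf
        )

    def top_k(self, query, k=10):
        terms = tokenize(query)
        scored = [(self.score(terms, i), d) for i, d in enumerate(self.doc_ids)]
        scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [d for _, d in scored[:k]]


def recall_at_k(ranked_lists, positives, k=10):
    # Fraction of queries whose single relevant document appears in the top k.
    hits = sum(pos in ranked[:k] for ranked, pos in zip(ranked_lists, positives))
    return hits / len(positives)
```

In a per-language setup, one such index would be built for each language and each query routed to the index matching its language.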

How to run the code:

Place the following three files in the same directory as the notebook: corpus.json, train.csv, test.csv
Run the notebook bm25_ranking.ipynb. It generates the intermediate files listed under Generated Files as well as the submission.csv file.

Requirements

Python 3.x
Required Libraries: pandas, numpy, scikit-learn, nltk, transformers, tqdm, joblib, konlpy

Generated Files:

Preprocessed Corpus:
- Filename: preprocessed_corpus.pkl
- Content: Pickle file containing the preprocessed corpus texts.
BM25 Models and Document IDs:
BM25 Model Files:
- bm25_model_{lang}.joblib (e.g., bm25_model_en.joblib)
Document ID Files:
- doc_ids_{lang}.pkl (e.g., doc_ids_en.pkl)
Validation Queries:
- preprocessed_val_queries.pkl
- val_query_langs.pkl
Test Queries:
- preprocessed_test_queries.pkl
- test_query_langs.pkl
Submission File:
- Filename: submission.csv
- Content: CSV file containing the retrieval results for test queries.

You do not need to regenerate these files once the notebook has run to completion.
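Reusing generated files usually follows a load-or-build pattern. The sketch below shows one way to do it with pickle; the helper name `load_or_build` is hypothetical, and whether the notebook checks for existing files in exactly this way is an assumption.

```python
import os
import pickle


def load_or_build(path, build_fn):
    """Return the object pickled at `path`; build and save it if the file is absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

For example, `load_or_build("preprocessed_corpus.pkl", preprocess)` would only call the (hypothetical) `preprocess` function on the first run.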

2. Cosine Similarity with BM25 Re-ranking

This script implements a document retrieval system that combines cosine similarity with BM25 re-ranking. The system preprocesses text data, computes term frequency (TF) and document frequency (DF) dictionaries, calculates inverse document frequency (IDF) scores, and ranks documents using cosine similarity and BM25.
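The two-stage idea can be sketched as follows: select candidates by cosine similarity, then re-order them with a second scorer. The function names, the sparse `{term: weight}` vector representation, and the `n_candidates` cutoff are assumptions for illustration and may differ from the script.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, n_candidates=100, k=10):
    # Stage 1: keep the n_candidates documents closest to the query by cosine.
    candidates = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:n_candidates]
    # Stage 2: re-order the candidates with a second scorer, e.g. a BM25 score
    # for the same query. rerank_score maps a doc ID to a number.
    return sorted(candidates, key=rerank_score, reverse=True)[:k]
```

Keeping the second scorer as a plain callable means the first stage stays cheap while the re-ranking function can be swapped without touching candidate selection.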

Requirements

Python 3.x
Required Libraries:
pandas
numpy
nltk
torch
transformers
tqdm

Directory Setup

The whole dataset should be placed in a folder named data in the same directory as the script. The data folder should contain the following files:

./corpus.json/corpus.json: JSON file containing the corpus data.
./train.csv: CSV file containing the training data.
./test.csv: CSV file containing the test data.
./dev.csv: CSV file containing the validation data.

Functions

compute_tf_df_and_avgdl(corpus_df, path_to_saved_file)
- Computes the term frequency (TF), document frequency (DF), average document length (avgdl), and number of documents.
compute_idf(df_dict, num_docs)
- Computes the inverse document frequency (IDF) scores.
load_and_preprocess_queries(path_to_test_query)
- Loads and preprocesses the test queries.
rank_documents_with_cosine_similarity_and_bm25(corpus_df, test_query_df, tf_dict, idf_dict, avgdl)
- Ranks the documents for each test query using cosine similarity and BM25.
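The statistics these functions compute can be sketched on already-tokenized documents. The simplified signatures below are hypothetical and do not match the script's (which take a DataFrame and a save path), and the smoothed IDF formula is an assumption, since the exact variant used by the script is not documented here.

```python
import math
from collections import Counter


def compute_stats(tokenized_docs):
    """tokenized_docs: dict of doc_id -> list of tokens."""
    # Term frequency: per-document token counts.
    tf = {d: Counter(tokens) for d, tokens in tokenized_docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for counts in tf.values() for term in counts)
    num_docs = len(tokenized_docs)
    avgdl = sum(len(tokens) for tokens in tokenized_docs.values()) / num_docs
    return tf, df, avgdl, num_docs


def idf_scores(df, num_docs):
    # Smoothed IDF (assumption): ln((1 + N) / (1 + df)) + 1.
    return {t: math.log((1 + num_docs) / (1 + n)) + 1 for t, n in df.items()}
```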

Output Ranked Documents CSV:

Filename: submission.csv
Content: CSV file containing the query IDs and the ranked document IDs.
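Writing such a file can be sketched with the standard csv module. The column names `id` and `docids` and the space-separated encoding of the ranked list are hypothetical; the layout actually expected by the competition should be checked before use.

```python
import csv


def write_submission(rankings, path="submission.csv"):
    """rankings: dict of query_id -> list of document IDs, best first."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "docids"])  # hypothetical column names
        for query_id, doc_ids in rankings.items():
            # Hypothetical encoding: ranked IDs joined by single spaces.
            writer.writerow([query_id, " ".join(doc_ids)])
```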

3. Document Retrieval with Cosine Similarity and Language Embedding

This project implements a document retrieval system using cosine similarity and language embeddings. The system preprocesses text data, builds an inverted index, retrieves candidate documents, and re-ranks them using multilingual embeddings.
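The re-ranking step can be sketched independently of any particular model by passing the embedding function in as a parameter. The function name `rerank_by_embedding` and the `embed` callable are hypothetical; in the script, the embedding would come from the pre-trained multilingual model.

```python
import math


def rerank_by_embedding(query, candidates, embed, k=10):
    """candidates: dict of doc_id -> text; embed: callable text -> list of floats."""

    def unit(vec):
        # Scale to unit length so the dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    q = unit(embed(query))
    scores = {}
    for doc_id, text in candidates.items():
        d = unit(embed(text))
        scores[doc_id] = sum(a * b for a, b in zip(q, d))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Only the candidate documents are embedded at query time here; embedding the corpus once and caching the vectors would avoid repeated model calls.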

Required Libraries:

pandas
numpy
nltk
torch
transformers
tqdm
fast_langdetect

Functions

1. split_into_sentences(text)

Splits the given text into sentences.
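A simple way to split sentences is a regular expression on end-of-sentence punctuation. This is only a sketch under that assumption; the name `split_sentences` is hypothetical, and the script may use a tokenizer-based splitter instead.

```python
import re

# Split on whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
```

This rule also splits after abbreviations such as "e.g." and does not handle languages that omit whitespace between sentences.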

2. preprocess_text(text)

Preprocesses the text by detecting the language, removing stop words, and tokenizing.

3. build_inverted_index(docs)

Builds an inverted index from the given documents.
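An inverted index maps each term to the documents that contain it, so candidate retrieval only touches documents sharing at least one term with the query. The names `build_index` and `candidates` and the `{term: {doc_id: count}}` layout below are assumptions for illustration.

```python
from collections import defaultdict


def build_index(tokenized_docs):
    """tokenized_docs: dict of doc_id -> list of tokens.

    Returns a mapping term -> {doc_id: term count in that document}.
    """
    index = defaultdict(dict)
    for doc_id, tokens in tokenized_docs.items():
        for tok in tokens:
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    return index


def candidates(index, query_tokens):
    """Return the set of doc IDs containing at least one query token."""
    found = set()
    for tok in query_tokens:
        found.update(index.get(tok, {}))
    return found
```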

4. get_original_docid(sentence_id)

Extracts the original document ID from a sentence ID.

5. preprocess_query(query)

Preprocesses a query using the same preprocessing steps as for documents.

6. score_documents_for_query(args)

Scores documents for a given query using the BM25 algorithm.

7. get_embedding(text)

Generates embeddings for the given text using a pre-trained multilingual model.
