Lucene Search Engine for Cranfield Dataset

This project implements a search engine using Apache Lucene for indexing and querying the Cranfield dataset. It supports different analyzers (Standard, English, Whitespace) and scoring models (VSM, BM25, Boolean, LMD). The project also provides evaluation using TREC Eval to measure the effectiveness of the search engine.

Requirements

Java (JDK 11 or later)
Apache Maven
Python (for result conversion)
TREC Eval (for evaluation)

Project Structure

The project consists of three main components:

ParseCranfield.java: Parses the Cranfield dataset for indexing.
CreateIndex.java: Creates an index using different analyzers.
QueryIndex.java: Queries the index with different similarity models and analyzers.

How to Run the Project

Step 1: Index the Cranfield Dataset

To create the index, use the CreateIndex class with the desired analyzer. The available analyzers are standard, whitespace, and english.

mvn exec:java -Dexec.mainClass="com.assignment.lucene.CreateIndex" -Dexec.args="standard"

Replace standard with whitespace or english depending on the desired analyzer. Step 2: Query the Index

You can query the index using different similarity models (vsm, bm25, boolean, lmd) and analyzers (standard, whitespace, english). You can run the query in batch mode (for running all queries) or interactive mode (for manual input).

mvn exec:java -Dexec.mainClass="com.assignment.lucene.QueryIndex" -Dexec.args="batch vsm standard"

Replace batch with interactive for interactive mode.
Replace vsm with bm25, boolean, or lmd for the similarity model.
Replace standard with whitespace or english for the analyzer.

Step 3: Convert Results for TREC Eval

After running queries, you can convert the results files using the provided Python script (script.py):

This will process the results files for evaluation using TREC Eval.

Step 4: Evaluate with TREC Eval

Once the results are converted, you can evaluate the performance using TREC Eval. From within the trec_eval folder, use the following command:

./trec_eval <qrels_file> <results_file>

Replace <qrels_file> with the path to the qrels file. Replace <results_file> with the path to the results file.

For example:

./trec_eval -q ../cranqrel_converted ../lucene-search-engine/results/standard_vsm_results.txt

Project Structure

.
├── src
│   ├── main
│   │   └── java
│   │       └── com
│   │           └── assignment
│   │               └── lucene
│   │                   ├── ParseCranfield.java
│   │                   ├── CreateIndex.java
│   │                   └── QueryIndex.java
│   └── resources              
│       ├── cran.all.1400       # Cranfield documents file
│       ├── cran.qry            # Query file
│       ├── cranqrel            # Qrels (relevance judgments) file
│       └── cranqrel.readme     # Readme explaining qrel format
├── english_index
│   └── # index files generated using english analyser
├── whitespace_index
│   └── # index files generated using whitespace analyser
├── standard_index
│   └── # index files generated using standard analyser
├── results
│   └── (output result files)
├── script.py                   # Python script to convert results for TREC Eval
└── trec_eval
    └── (TREC Eval executable)

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
plotted_results		plotted_results
src		src
target		target
.gitignore		.gitignore
README.md		README.md
pom.xml		pom.xml
script.py		script.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lucene Search Engine for Cranfield Dataset

Requirements

Project Structure

How to Run the Project

Step 1: Index the Cranfield Dataset

Step 3: Convert Results for TREC Eval

Step 4: Evaluate with TREC Eval

Project Structure

Results

Mean Average Precision (MAP) Scores plotted for each combination of Analyser + Similarity Model

Precision-Recall Graph of Best Performing Combination (English Analyser with BM25 Similarity)

About

Releases

Packages

Languages

abdullahzia-6464/lucene-cranfield-search

Folders and files

Latest commit

History

Repository files navigation

Lucene Search Engine for Cranfield Dataset

Requirements

Project Structure

How to Run the Project

Step 1: Index the Cranfield Dataset

Step 3: Convert Results for TREC Eval

Step 4: Evaluate with TREC Eval

Project Structure

Results

Mean Average Precision (MAP) Scores plotted for each combination of Analyser + Similarity Model

Precision-Recall Graph of Best Performing Combination (English Analyser with BM25 Similarity)

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages