This project aims to develop a search engine from scratch. It consists of two main stages: building an inverted index from a set of text documents (the MSMARCO Passages collection), and processing queries over that inverted index.
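
To make the first stage concrete, here is a minimal in-memory sketch of an inverted index. It is not the project's implementation: the class name `InvertedIndexSketch`, the whitespace/punctuation tokenizer, and the in-memory maps are all assumptions chosen for illustration, whereas the real indexer works on disk with intermediate postings.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: a tiny in-memory inverted index.
public class InvertedIndexSketch {
    // term -> (docId -> term frequency); TreeMap keeps docIds sorted
    private final Map<String, TreeMap<Integer, Integer>> index = new TreeMap<>();

    /** Tokenizes the text naively and records one posting per distinct term. */
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            index.computeIfAbsent(token, k -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns the posting list (docId -> term frequency) of a term, or an empty map. */
    public Map<Integer, Integer> postings(String term) {
        TreeMap<Integer, Integer> list = index.get(term);
        return list == null ? Collections.<Integer, Integer>emptyMap() : list;
    }
}
```

Query processing then only needs to read the posting lists of the query terms instead of scanning every document.
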
sudo apt install maven
mvn compile
mkdir data/intermediate_postings/
mkdir data/intermediate_postings/index/
mkdir data/intermediate_postings/lexicon/
mkdir data/intermediate_postings/doc_index/
cd data
chmod +x cleanup.sh
./cleanup.sh
cd ..
mvn -e exec:java -Dexec.mainClass="org.offline_phase.MainClass" -Dexec.args="-p -c"
[-p] apply stemming and stopword removal
[-c] index compression
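
This README does not say which compression scheme `-c` enables. As an illustration of how posting lists are commonly compressed, the sketch below stores docId gaps with variable-byte encoding. The class `VarByteSketch` and its methods are hypothetical, and the byte layout (high bit set on the last byte of each number) is one of several conventions.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: gap + variable-byte encoding of sorted docIds.
public class VarByteSketch {
    /** Encodes non-negative ints, 7 payload bits per byte; the high bit marks a number's last byte. */
    public static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while (v >= 128) {
                out.write(v & 0x7F); // low 7 bits, more bytes follow
                v >>>= 7;
            }
            out.write(v | 0x80);     // last byte of this number
        }
        return out.toByteArray();
    }

    /** Inverse of encode. */
    public static int[] decode(byte[] bytes) {
        List<Integer> result = new ArrayList<>();
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {   // terminator byte: number complete
                result.add(value);
                value = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Replaces sorted docIds by differences to the previous docId, so most numbers are small. */
    public static int[] toGaps(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return gaps;
    }

    /** Inverse of toGaps: a running sum restores the docIds. */
    public static int[] fromGaps(int[] gaps) {
        int[] ids = new int[gaps.length];
        int sum = 0;
        for (int i = 0; i < gaps.length; i++) {
            sum += gaps[i];
            ids[i] = sum;
        }
        return ids;
    }
}
```

Small gaps fit in one byte instead of four, which is the general reason a compressed index takes less space at the cost of some extra encoding time.
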
mvn -e exec:java -Dexec.mainClass="org.online_phase.MainClass" -Dexec.args="-p -c -k=20 -s=bm25"
[-p] apply stemming and stopword removal
[-c] index compression
[-k=20] retrieve the top 20 documents
[-s=bm25] use BM25 scoring function (otherwise TFIDF will be applied)
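
To illustrate the difference between the two scoring functions selected by `-s`, here is a sketch of the textbook per-term formulas. The class `ScoringSketch`, the base-10 logarithm, and the parameters `K1 = 1.2` and `B = 0.75` are assumptions (common defaults); the values and exact variants used by this project are not stated here.

```java
// Hypothetical illustration only: textbook TFIDF and BM25 term weights.
public class ScoringSketch {
    static final double K1 = 1.2;  // assumed term-frequency saturation parameter
    static final double B = 0.75;  // assumed document-length normalization parameter

    /** Inverse document frequency: rarer terms get a larger weight. */
    public static double idf(long numDocs, long docFreq) {
        return Math.log10((double) numDocs / docFreq);
    }

    /** TFIDF weight of one term in one document. */
    public static double tfidf(int tf, long numDocs, long docFreq) {
        if (tf <= 0) {
            return 0.0;
        }
        return (1.0 + Math.log10(tf)) * idf(numDocs, docFreq);
    }

    /** BM25 weight: like TFIDF, but tf saturates and long documents are penalized. */
    public static double bm25(int tf, int docLen, double avgDocLen, long numDocs, long docFreq) {
        if (tf <= 0) {
            return 0.0;
        }
        double norm = K1 * ((1.0 - B) + B * docLen / avgDocLen) + tf;
        return (tf / norm) * idf(numDocs, docFreq);
    }
}
```

A document's score for a query is the sum of these weights over the query terms it contains. Unlike TFIDF, BM25 takes the document length into account, so the same term frequency counts for less in a longer passage.
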
mvn -e exec:java -Dexec.mainClass="org.evaluation.MainClass" -Dexec.args="-p -c -k=20 -s=bm25 -mode=d"
[-p] apply stemming and stopword removal
[-c] index compression
[-k=20] retrieve the top 20 documents
[-s=bm25] use BM25 scoring function (otherwise TFIDF will be applied)
[-mode=c] use DAAT in conjunctive mode
[-mode=d] use DAAT in disjunctive mode (if no mode is specified, MaxScore is used)
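
The two DAAT (document-at-a-time) modes differ only in which documents qualify: conjunctive mode keeps documents containing all query terms, disjunctive mode keeps documents containing at least one. The sketch below shows the idea on plain sorted docId arrays. `DaatSketch` is a hypothetical class; scoring, top-k selection, skipping, and MaxScore pruning are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: document-at-a-time matching over sorted posting lists.
public class DaatSketch {
    /**
     * postings[t] holds the sorted docIds of query term t.
     * Returns the matching docIds in increasing order.
     */
    public static List<Integer> traverse(int[][] postings, boolean conjunctive) {
        int[] pos = new int[postings.length]; // one cursor per posting list
        List<Integer> matches = new ArrayList<>();
        while (true) {
            // Find the smallest docId currently under any cursor.
            int current = Integer.MAX_VALUE;
            int exhausted = 0;
            for (int t = 0; t < postings.length; t++) {
                if (pos[t] >= postings[t].length) {
                    exhausted++;
                } else {
                    current = Math.min(current, postings[t][pos[t]]);
                }
            }
            // Stop when nothing is left, or (conjunctive) when any list has ended.
            if (exhausted == postings.length || (conjunctive && exhausted > 0)) {
                break;
            }
            // Advance every list positioned on the current docId.
            int hits = 0;
            for (int t = 0; t < postings.length; t++) {
                if (pos[t] < postings[t].length && postings[t][pos[t]] == current) {
                    hits++; // a real engine would add this term's score here
                    pos[t]++;
                }
            }
            if (!conjunctive || hits == postings.length) {
                matches.add(current);
            }
        }
        return matches;
    }
}
```

For the lists `{1, 3, 5, 7}` and `{3, 4, 7, 9}`, conjunctive mode yields documents 3 and 7, while disjunctive mode yields 1, 3, 4, 5, 7 and 9.
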
All experiments and benchmarks reported here were run on the same machine (MSI Prestige 14 Evo A11M) to ensure a consistent, well-defined environment.
- OS: Microsoft Windows 11 Home 64 bit, Ver. 2009 (OS build 22000.675)
- CPU: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
- Memory: 16 GB @ 2133 MHz, 8 × 2048 MB, LPDDR4-4267
- Graphics: Intel(R) Iris(R) Xe Graphics, 1024 MB
- Disk: SSD, SAMSUNG MZVL2512HCJQ-00B00, 476.94 GB
Flags | # Terms | Index (MB) | Lexicon (MB) | DocIndex (MB) | Time Elapsed (hh:mm:ss) |
---|---|---|---|---|---|
none | 1,369,123 | 2690 | 54.8 | 101 | 00:15:09 |
-c | 1,369,123 | 1340 | 54.8 | 101 | 00:17:38 |
-p | 1,170,498 | 1430 | 46.3 | 101 | 00:06:30 |
-p -c | 1,170,498 | 738 | 46.3 | 101 | 00:07:35 |
mode | score | MAP | P@20 | NDCG | Time Elapsed (s) |
---|---|---|---|---|---|
DAAT Conj. | TFIDF | 0.139 | 0.329 | 0.262 | 0.023 |
DAAT Conj. | BM25 | 0.141 | 0.351 | 0.266 | 0.021 |
DAAT Disj. | TFIDF | 0.132 | 0.367 | 0.257 | 0.034 |
DAAT Disj. | BM25 | 0.182 | 0.460 | 0.323 | 0.035 |
MaxScore | TFIDF | 0.132 | 0.367 | 0.257 | 0.028 |
MaxScore | BM25 | 0.182 | 0.460 | 0.323 | 0.026 |
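
As a reminder of what two of the reported metrics measure, the sketch below computes precision at k and average precision for a single query with binary relevance judgments; MAP is the mean of the average precision over all queries. The class `MetricsSketch` is hypothetical, and the exact evaluation setup used for the table (cutoffs, relevance grades, tooling) is not described here.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: P@k and average precision for one query.
public class MetricsSketch {
    /** Fraction of the top-k results that are relevant. */
    public static double precisionAtK(List<Integer> ranked, Set<Integer> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    /** Mean of the precision values measured at the rank of each relevant result. */
    public static double averagePrecision(List<Integer> ranked, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size(); // relevant documents never retrieved contribute 0
    }
}
```

For the ranking `[10, 20, 30, 40]` with relevant documents `{10, 30, 50}`, P@2 is 1/2 and the average precision is (1/1 + 2/3) / 3.
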
For more information about the results and performance, please take a look at the final report.