An Information Retrieval system that is capable of accommodating
- Boolean search queries
- Positional search queries
- Free text queries
- Naive Bayes Classification
- K Nearest Neighbor Classification
The Naive Bayes classifier classifies documents into 5 class values namely
- Business
- Entertainment
- Politics
- Technology
- Sports
On using the search engine, only the first 100 characters of each found
document will be printed in the command line. The entire output will be
written to a text file called query_results.txt
.
All the dependencies required can be installed using pip
- Pandas
- Numpy
- Nltk
- After installing NLTK, uncomment 3rd line in program to download essential dependent files.
- Pickle
Parameter | Feature |
---|---|
--bs | Boolean Search using Inverted Index |
--ps | Positional Search using Inverted Index |
--vsm | Free text search using Vector Space Model |
--nb | Naive Bayes Classification (Multinomial) |
--knn | K Nearest Neighbor Classification |
Currently this is a command line application and the following commands can be used to utilize the features of the system.
python3 search_engine.py --bs *query
python3 search_engine.py --ps *query
python3 search_engine.py --vsm *query
python3 search_engine.py --nb document_path
python3 search_engine.py --knn document_path
A set of test queries that can be used to check the functionality of the system are given below.
AND Queries
- Jackie Ballard Parliament
- Research and Development
Positional Queries
- flattered and honoured
- printed with registered trademarks
Free text Queries
- Cyber computers government
- Actress actor theatre
Test Classification Documents
A set of test documents are available in testing_documents
folder that
can be used to test the functionality of the naive bayes classifier. These
documents were not used to train the model and are from the testing set of
documents.
Some examples are
- politics_1.txt
- sport_1.txt
- entertainment_1.txt
- business_1.txt
- tech_1.txt
[https://drive.google.com/drive/folders/1tqJWz7wUYaVPhLadqFEhh4Rq8rt3kkkM?usp=sharing]
python3 search_engine.py --bs Anderson country
The above command will search for documents containing both "Anderson" and "country".
python3 search_engine.py --ps New York
The above command will search for documents that contain "New" and "York" in the same order.
python3 search_engine.py --vsm Today in politics
The above command will search and provides top 10 documents that are similar to the query.
python3 search_engine.py --knn test_documents/tech/tech_1.txt
It classifies the tech_1.txt document using K nearest neighbors classifier.
python3 search_engine.py --nb test_documents/tech/tech_2.txt
It classifies the tech_1.txt document using Naive Bayes classifier.