This project is a document search and similarity engine built with Django. Users can upload files to an existing corpus and run search queries against it. The engine preprocesses each document by tokenizing it, removing stop words, and stemming with the Porter Stemmer. It then scores each file against the query using TF-IDF vectors and cosine similarity, and displays the score for each file.
- File Upload: Users can upload new documents to an existing corpus.
- Search Queries: Users can search for queries within the corpus.
- Document Processing: The engine tokenizes documents, removes stop words, and stems the remaining tokens. Supported formats are PDF, plain text, and DOCX.
- Similarity Calculation: Scores each document against the query using TF-IDF vectors and cosine similarity.
- Results Display: Displays the calculated similarity measures for each file in the corpus.
- Clone the repository:
  git clone https://github.com/Samuel-K95/Search-engine.git
  cd Search-engine
- Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install the required dependencies:
  pip install -r requirements.txt
- Run the development server:
  python manage.py runserver
- Access the application:
  Open your web browser and go to http://127.0.0.1:8000.
- Upload a File:
  - Navigate to the upload page.
  - Choose a file from your local system.
  - Click the "Upload" button to add the file to the corpus.
- Search for a Query:
  - Go to the search page.
  - Enter your search query in the search bar.
  - Click the "Search" button to get the results.
- View Results:
  - The results page displays the calculated similarity measure for each file in the corpus, sorted by relevance.
The document processing steps include:
- Tokenization: Splitting the text into individual tokens (words).
- Stop Word Removal: Removing common stop words (e.g., "the", "and", "is") that do not carry significant meaning.
- Stemming: Reducing words to their root form using the Porter Stemmer (e.g., "running" becomes "run").
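The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `preprocess` function, its `stem` parameter, and the short `STOP_WORDS` set are hypothetical names introduced here, and the stemmer defaults to a no-op so the sketch runs without NLTK installed.

```python
import re
from typing import Callable, FrozenSet, List

# Small illustrative subset (an assumption); real stop word lists are much longer.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
)


def preprocess(
    text: str,
    stem: Callable[[str], str] = lambda token: token,  # no-op unless a stemmer is passed
    stop_words: FrozenSet[str] = STOP_WORDS,
) -> List[str]:
    """Tokenize, drop stop words, then apply the supplied stemming function."""
    # Tokenization: lowercase the text and keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop word removal followed by stemming of the surviving tokens.
    return [stem(token) for token in tokens if token not in stop_words]


print(preprocess("The engine is fast and the corpus is small"))
# ['engine', 'fast', 'corpus', 'small']
```

To include the stemming step, pass a real stemmer as `stem`, for example `nltk.stem.PorterStemmer().stem` when NLTK is available.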
The similarity between the query and documents in the corpus is calculated using the following techniques:
- TF-IDF Vectorizer: Converts the text into numerical vectors based on term frequency-inverse document frequency.
- Cosine Similarity: Measures the cosine of the angle between two vectors, indicating the similarity between the query and each document.
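The scoring described above can be illustrated without scikit-learn. The sketch below is a hand-rolled stand-in, not the project's implementation: the function names (`build_idf`, `vectorize`, `cosine`, `rank`) are hypothetical, documents are assumed to be already preprocessed into token lists, and the smoothed IDF formula is an assumption chosen for illustration, so the absolute scores may differ from what the application reports.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def build_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def vectorize(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(token for token in tokens if token in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    return dot / (norm_a * norm_b)


def rank(query: List[str], docs: List[List[str]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    idf = build_idf(docs)
    query_vec = vectorize(query, idf)
    scores = [(i, cosine(query_vec, vectorize(doc, idf))) for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus = [
    ["django", "search", "engine"],
    ["cook", "pasta", "recipe"],
    ["search", "query", "rank"],
]
for index, score in rank(["search", "engine"], corpus):
    print(index, round(score, 3))
```

Because cosine similarity depends only on the angle between vectors, the ranking is unaffected by document length; a document sharing no terms with the query scores 0.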
- Django
- scikit-learn
- NLTK (Natural Language Toolkit)
- PyPDF2 (for PDF file handling)
- Fork the repository
- Create a new branch (`git checkout -b feature-branch`)
- Commit your changes (`git commit -m 'Add new feature'`)
- Push to the branch (`git push origin feature-branch`)
- Create a new Pull Request
This project is licensed under the MIT License. See the LICENSE file for details.