This repository has been archived by the owner on Jun 6, 2020. It is now read-only.

Royal Institute of Technology KTH - Stockholm

Simple Search Engine


A simple search engine that indexes a corpus of documents and searches for words with specific query parameters. This project is part of the course ID1020 Algorithms and Data Structures.

This repository contains code written during the fall semester of 2016 by Simone Stefani.

Structure

[Image: structure diagram]

Description

  • Index: a HashMap that maps each indexed word to its list of postings (word-list_of_postings key-value pairs).
  • ResultDocument: an object that links a word (or a set of words) with a document that contains it. It refers to a specific document and carries properties related to the words, such as hits, popularity and relevance (as tf-idf).
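The two structures above can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the class name `IndexSketch`, the field names and the `insert` and `postingsFor` methods are all assumptions made for the example, and the linear scan inside `insert` is used only to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and signatures are assumptions, not the project's API.
class IndexSketch {

    // A posting: ties a word to one document and carries per-word statistics.
    static class ResultDocument {
        final String documentName;
        int hits;           // occurrences of the word in this document
        double popularity;  // popularity score of the document
        double relevance;   // tf-idf, left at 0.0 in this sketch

        ResultDocument(String documentName, double popularity) {
            this.documentName = documentName;
            this.popularity = popularity;
        }
    }

    // word -> list of postings
    private final Map<String, List<ResultDocument>> index = new HashMap<>();

    // Records one occurrence of `word` in `documentName`.
    void insert(String word, String documentName, double popularity) {
        List<ResultDocument> postings = index.computeIfAbsent(word, k -> new ArrayList<>());
        for (ResultDocument posting : postings) {
            if (posting.documentName.equals(documentName)) {
                posting.hits++;  // word seen again in the same document
                return;
            }
        }
        ResultDocument posting = new ResultDocument(documentName, popularity);
        posting.hits = 1;
        postings.add(posting);
    }

    // Returns the postings for `word`, or an empty list if the word is not indexed.
    List<ResultDocument> postingsFor(String word) {
        return index.getOrDefault(word, new ArrayList<>());
    }
}
```

Inserting the same word twice for one document increments `hits` on the existing posting instead of creating a second one, so each document appears at most once per word.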

The search engine also contains two other HashMaps:

  • DocumentsLength: keeps track of the length of each processed document.
  • Cache: contains cached queries.
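As a rough illustration of how these two maps might be used, the sketch below computes a tf-idf score from the stored document lengths and memoizes query results. The README does not state the exact formula, so the common form tf = hits / document length and idf = log10(N / documents containing the word) is assumed here; the class `AuxiliaryMapsSketch` and all of its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: the formula and all names are assumptions.
class AuxiliaryMapsSketch {
    // document name -> number of words processed for that document
    private final Map<String, Integer> documentsLength = new HashMap<>();
    // query string -> previously computed result
    private final Map<String, List<String>> cache = new HashMap<>();
    int evaluations = 0;  // counts how often a query was actually evaluated

    // Adds `wordCount` words to the recorded length of `documentName`.
    void addWords(String documentName, int wordCount) {
        documentsLength.merge(documentName, wordCount, Integer::sum);
    }

    // Assumed tf-idf variant; expects `documentName` to have been recorded
    // and `documentsWithWord` to be at least 1.
    double tfIdf(int hits, String documentName, int documentsWithWord) {
        double tf = (double) hits / documentsLength.get(documentName);
        double idf = Math.log10((double) documentsLength.size() / documentsWithWord);
        return tf * idf;
    }

    // Returns the cached result for `query`, evaluating it only on a cache miss.
    List<String> search(String query, Function<String, List<String>> evaluate) {
        return cache.computeIfAbsent(query, q -> {
            evaluations++;
            return evaluate.apply(q);
        });
    }
}
```

With ten documents of 100 words each, a word that occurs 5 times in one document and appears in no other document gets tf = 0.05 and idf = 1 under this assumed formula.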

The postings (ResultDocuments) for each word are kept sorted as they are inserted. Consequently, they can be retrieved through binary search.
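One way to keep a posting list sorted at insertion is to locate the insertion point with a binary search. The sketch below assumes the postings are ordered by document name, which the README does not specify; the `Posting` class and the `insert` and `find` helpers are hypothetical.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: the sort key (document name) is an assumption.
class SortedPostingsSketch {
    static class Posting {
        final String documentName;
        int hits;

        Posting(String documentName) {
            this.documentName = documentName;
        }
    }

    static final Comparator<Posting> BY_NAME =
            Comparator.comparing((Posting p) -> p.documentName);

    // Inserts a posting at its sorted position, or bumps `hits` if it already exists.
    static void insert(List<Posting> postings, String documentName) {
        Posting probe = new Posting(documentName);
        int pos = Collections.binarySearch(postings, probe, BY_NAME);
        if (pos >= 0) {
            postings.get(pos).hits++;
        } else {
            probe.hits = 1;
            // binarySearch returns -(insertionPoint) - 1 when the key is absent
            postings.add(-(pos + 1), probe);
        }
    }

    // Looks up a posting by document name; returns null if it is absent.
    static Posting find(List<Posting> postings, String documentName) {
        int pos = Collections.binarySearch(postings, new Posting(documentName), BY_NAME);
        return pos >= 0 ? postings.get(pos) : null;
    }
}
```

With an `ArrayList`, the search takes logarithmic time, although the insertion itself still shifts the elements after the insertion point.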

When the user's query string is processed, a parsedQuery is returned in the form of nested sub-query objects. Consequently, a complex query can be analysed recursively: the fundamental queries are resolved first and their results are then combined with operators.
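The recursive evaluation described above might look like the following sketch. The README does not list the supported operators, so `+` (intersection), `|` (union) and `-` (difference) are assumptions, as are the `ParsedQuery` fields and the simplified index that maps a word directly to a set of document names.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: operators and class layout are assumptions.
class QuerySketch {

    // A parsed query is either a single word (leaf) or an operator over two sub-queries.
    static class ParsedQuery {
        final String word;     // non-null only for a leaf
        final char operator;   // '+', '|' or '-' for an inner node
        final ParsedQuery left;
        final ParsedQuery right;

        ParsedQuery(String word) {
            this.word = word;
            this.operator = 0;
            this.left = null;
            this.right = null;
        }

        ParsedQuery(char operator, ParsedQuery left, ParsedQuery right) {
            this.word = null;
            this.operator = operator;
            this.left = left;
            this.right = right;
        }
    }

    // Resolves leaves against the index, then combines sub-results bottom-up.
    static Set<String> evaluate(ParsedQuery query, Map<String, Set<String>> index) {
        if (query.word != null) {
            Set<String> documents = index.getOrDefault(query.word, Collections.<String>emptySet());
            return new TreeSet<>(documents);  // copy so the index is never modified
        }
        Set<String> result = evaluate(query.left, index);
        Set<String> other = evaluate(query.right, index);
        switch (query.operator) {
            case '+': result.retainAll(other); break;  // documents matching both
            case '|': result.addAll(other); break;     // documents matching either
            case '-': result.removeAll(other); break;  // left matches without right matches
            default: throw new IllegalArgumentException("Unknown operator: " + query.operator);
        }
        return result;
    }
}
```

Because every sub-query is itself a `ParsedQuery`, arbitrarily deep nesting is handled by the same method without special cases.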