The purpose of this project is to collect reviews of major delivery companies operating in the UK and apply NLP tasks to them, analyzing aspects such as sentiment, the most common words, probability distributions across word sequences, and more.
graph LR
A[Build a tool to connect to web sources APIs] -->|Get reviews from web| B[Clean reviews]
B --> D[Knowledge Graphs]
B --> F[Unsupervised Clustering]
B --> C(Sentiment Analysis)
B --> |Identify topic of review| E[Topic Extraction]
E --> |Train Model| I[Assign Topic to new instances]
C --> |Train Model| J[Sentiment Classifier]
I --> K[Build UI]
J --> K[Build UI]
To get reviews from the TrustPilot website, we use a custom-made web scraping tool. The tool iterates over the pages of the website, collects the reviews and any other relevant information, and stores the output in CSV files.
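As a rough illustration of the "store the output in CSV files" step, the sketch below writes a sequence of review records to CSV using only the standard library. The column names in `FIELDNAMES` and the function `write_reviews_csv` are hypothetical, chosen for this example; the actual schema produced by the scraper may differ.

```python
import csv
import io

# Hypothetical column names; the real scraper's output schema may differ.
FIELDNAMES = ["company", "page", "rating", "title", "body"]


def write_reviews_csv(reviews, out_file):
    """Write an iterable of review dicts to an open text file as CSV.

    Returns the number of review rows written (the header is not counted).
    """
    # extrasaction="ignore" drops keys that are not listed in FIELDNAMES.
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for review in reviews:
        # Keys missing from a review are written as empty cells.
        writer.writerow(review)
        count += 1
    return count


if __name__ == "__main__":
    # Made-up sample records, only to show the call shape.
    sample = [
        {"company": "acme", "page": 1, "rating": 5, "title": "Great", "body": "On time"},
        {"company": "acme", "page": 2},
    ]
    buffer = io.StringIO()
    write_reviews_csv(sample, buffer)
    print(buffer.getvalue())
```

Writing to a file object rather than a path keeps the function easy to test; for real output, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass that in.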
- Set up the appropriate configuration in config.json. The config needs to be populated with the following metadata:
  - source_url: Main domain URL
  - starting_page: Domain subpath to a specific reviews page
  - steps: Number of pages to iterate over
  - company: Company/service of interest
- Execute the Python retriever script:
python data_retriever.py
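
To make the configuration keys above more concrete, here is a minimal sketch of how a script might load and validate config.json and derive the list of pages to visit. The helper names `load_config` and `page_urls`, the example values, and the `?page=N` pagination scheme are all assumptions made for illustration; they are not taken from data_retriever.py.

```python
import json
from urllib.parse import urljoin

# The four keys described in the configuration step.
REQUIRED_KEYS = ("source_url", "starting_page", "steps", "company")


def load_config(text):
    """Parse the JSON text of config.json and check the expected keys."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {', '.join(missing)}")
    if not isinstance(config["steps"], int) or config["steps"] < 1:
        raise ValueError("steps must be a positive integer")
    return config


def page_urls(config):
    """Build one URL per page to iterate over.

    Assumption: pages are addressed with a ?page=N query parameter,
    which may not match how the real site paginates.
    """
    base = urljoin(config["source_url"], config["starting_page"])
    return [f"{base}?page={n}" for n in range(1, config["steps"] + 1)]


if __name__ == "__main__":
    # Placeholder values, not a real configuration.
    example = """
    {
        "source_url": "https://example.com",
        "starting_page": "/review/acme",
        "steps": 2,
        "company": "acme"
    }
    """
    for url in page_urls(load_config(example)):
        print(url)
```

Validating the config up front means a typo in a key name fails immediately with a clear message instead of partway through a scraping run.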