Tech Industry News Analyzer App [workflow status badge]

App to collect, analyze, and display tech industry news

Overview

This independent project applied big data software architecture principles and machine learning techniques to analyze recent tech industry news articles, automatically extract common themes, and sort the articles into groups by topic. The primary aim was to help readers quickly identify current trends in media reporting on the tech industry and focus on the topics of most interest to them.

System Requirements

The following were the requirements for the project. Links to the corresponding code used to test the requirements are given in parentheses.

  1. Users can view a web page displaying titles of articles grouped by topic (integration test)
  2. Users can click a link to navigate to the full article on the publisher website (integration test)
  3. Users can click a link to view all article titles assigned to a topic (integration test)
  4. Users can view a list of all the topics (integration test)
  5. Users can view a page summarizing the purpose of the website (integration test)
  6. Users can receive the data via a REST API (integration test)
  7. The system can collect the data from external APIs (unit test; a collection sketch follows this list)
  8. The system can persist the data (unit test)
  9. The system can update the data as new data becomes available (unit test)
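
As an illustration of requirement 7, the sketch below shows one way a collector might pull technology headlines from newsapi.org. It is a minimal sketch assuming a Ktor HTTP client; the function name and query parameters are illustrative, and the project's actual collector lives in applications/data-collector.

    import io.ktor.client.HttpClient
    import io.ktor.client.engine.cio.CIO
    import io.ktor.client.request.get
    import io.ktor.client.request.parameter
    import io.ktor.client.statement.bodyAsText

    // Sketch only: fetch recent technology headlines from newsapi.org.
    suspend fun fetchTechHeadlines(apiKey: String): String =
        HttpClient(CIO).use { client ->
            client.get("https://newsapi.org/v2/top-headlines") {
                parameter("category", "technology") // newsapi.org category filter
                parameter("apiKey", apiKey)         // key from secrets/news_api_key.txt
            }.bodyAsText()                          // raw JSON article metadata
        }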

Architecture

[architecture diagram]

The software architecture consists of three microservices which interact via a message queue broker, as illustrated in the above diagram. First, (starting on the left side of the diagram) a data collector microservice collects news article data daily from an external internet source (newsapi.org), stores the data in a database, and publishes it to a message queue. Next, upon receiving the data from the message queue, a data analyzer microservice stores it in a database, applies Latent Dirichlet Allocation (LDA) to discover common topics, and publishes the results to another message queue. Finally, a web microservice receives the data, stores it in a database, and presents the articles sorted by topic to the end user via web pages and a REST API service.
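
To make the topic-discovery step concrete, here is a minimal sketch of applying LDA, presumably via the MLlib API of the Spark 3.3.2 installation required below; the column names, number of topics, and iteration count are assumptions, not the analyzer's actual settings.

    import org.apache.spark.ml.clustering.LDA
    import org.apache.spark.ml.feature.CountVectorizer
    import org.apache.spark.ml.feature.Tokenizer
    import org.apache.spark.sql.Dataset
    import org.apache.spark.sql.Row

    // Sketch of topic discovery: tokenize article text, build term-count vectors,
    // and fit an LDA model. Column names and hyperparameters are assumptions.
    fun discoverTopics(articles: Dataset<Row>): Dataset<Row> {
        val words = Tokenizer()
            .setInputCol("text")
            .setOutputCol("words")
            .transform(articles)
        val counts = CountVectorizer()
            .setInputCol("words")
            .setOutputCol("features")  // LDA expects a vector column of term counts
            .fit(words)
            .transform(words)
        val model = LDA().setK(10).setMaxIter(20).fit(counts) // 10 topics, illustrative
        return model.describeTopics(5) // top 5 keywords per topic, since data are unlabeled
    }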

Design

The project design was guided by current best practices in software development and big data architecture. Through the use of containerized pods with delivery-confirmed message queues and data persistence, service interruptions to the end user are minimized and the system remains robust to temporary partitions between the services. Because the collected data is well-structured, relational databases are used to efficiently store, process, and retrieve the data. In addition, test doubles and mock external services were used to implement efficient unit and integration tests in an automated continuous integration and deployment workflow. Furthermore, online metrics and visualizations permit real-time monitoring of system performance.
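
As a sketch of the delivery-confirmed messaging described above, the snippet below assumes an AMQP broker such as RabbitMQ behind components/mq-support (the actual broker, queue names, and wiring may differ): a durable queue, persistent messages, and publisher confirms together keep data flowing across temporary service partitions.

    import com.rabbitmq.client.ConnectionFactory
    import com.rabbitmq.client.MessageProperties

    // Sketch only, assuming a RabbitMQ-style AMQP broker; names are invented.
    fun publishWithConfirmation(body: ByteArray) {
        val factory = ConnectionFactory().apply { host = "localhost" }
        factory.newConnection().use { connection ->
            connection.createChannel().use { channel ->
                channel.confirmSelect() // enable publisher confirms
                channel.queueDeclare("analyzed.articles", true, false, false, null) // durable
                channel.basicPublish(
                    "", "analyzed.articles",
                    MessageProperties.PERSISTENT_TEXT_PLAIN, // survives broker restarts
                    body
                )
                channel.waitForConfirmsOrDie(5_000) // block until the broker acks
            }
        }
    }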

Tech Stack

The following technology tools were used to implement the project: Kotlin with Gradle on Java 17, Apache Spark (for LDA topic modeling), PostgreSQL, a message queue broker, Docker, Kubernetes (Google Kubernetes Engine) with Helm, Prometheus, Grafana, and GitHub Actions.

Technical Features

This table summarizes the required technical features and practices implemented in the project and the corresponding code.

Feature: Corresponding code
Web application: applications/web-server
Data collection: applications/data-collector
Data analyzer: applications/data-analyzer
Unit tests: web-server/src/test, data-collector/src/test, data-analyzer/src/test, data-support/src/test, mq-support/src/test
Data persistence: components/data-support (see the sketch after this table)
REST API endpoint: web-server/src/main/kotlin/io/newsanalyzer/webserver/plugins/Routing.kt
Production environment: deployment
Integration tests: web-server/src/test, data-collector/src/test, data-analyzer/src/test
Test doubles: components/test-support
Continuous integration: github/workflows
Monitoring: monitoring
Event collaboration messaging: components/mq-support
Continuous delivery: github/workflows
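
As a sketch of the persistence idea behind components/data-support, the snippet below assumes the JetBrains Exposed library over PostgreSQL; the table schema, database name, and credential handling are invented for illustration.

    import org.jetbrains.exposed.sql.Database
    import org.jetbrains.exposed.sql.SchemaUtils
    import org.jetbrains.exposed.sql.Table
    import org.jetbrains.exposed.sql.insert
    import org.jetbrains.exposed.sql.transactions.transaction

    // Invented schema; the real one lives in components/data-support.
    object Articles : Table("articles") {
        val id = integer("id").autoIncrement()
        val title = varchar("title", 512)
        val topicId = integer("topic_id")
        override val primaryKey = PrimaryKey(id)
    }

    fun persistArticle(articleTitle: String, topic: Int) {
        // Connection details are assumptions; the password comes from the secrets file.
        Database.connect(
            url = "jdbc:postgresql://localhost:5432/newsanalyzer",
            driver = "org.postgresql.Driver",
            user = "postgres",
            password = System.getenv("POSTGRES_PASSWORD") ?: ""
        )
        transaction {
            SchemaUtils.create(Articles) // no-op if the table already exists
            Articles.insert {
                it[title] = articleTitle
                it[topicId] = topic
            }
        }
    }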

Testing

Gradle is used to implement unit and integration tests, and these tests are incorporated into the continuous integration/continuous deployment workflow. Using test doubles and mock external services, the unit tests check each element of the system (e.g., database operations, message queue, data processing, etc.), and the integration tests check that these elements function together at the app level as expected: that the data can be reliably (1) collected, (2) stored in the collector database, (3) transferred to the data analyzer, (4) processed with unsupervised machine learning (LDA), (5) stored in the analyzer database, (6) passed to the web server, (7) stored in the web-server database, and (8) displayed to the end user in reverse chronological order grouped by topic. (It is worth noting that because unsupervised machine learning is being applied to unlabeled data in this project, the article topics can only be identified by common keywords.)
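
For a sense of the integration-test shape, here is a minimal sketch using Ktor's test host; the route and the expected page content are assumptions, and the real tests live in web-server/src/test.

    import io.ktor.client.request.get
    import io.ktor.client.statement.bodyAsText
    import io.ktor.http.HttpStatusCode
    import io.ktor.server.testing.testApplication
    import kotlin.test.Test
    import kotlin.test.assertContains
    import kotlin.test.assertEquals

    // Illustrative shape only; route and expected content are assumptions.
    class WebPageIntegrationTest {
        @Test
        fun homePageGroupsArticlesByTopic() = testApplication {
            val response = client.get("/") // test client against the app module
            assertEquals(HttpStatusCode.OK, response.status)
            assertContains(response.bodyAsText(), "topic", ignoreCase = true)
        }
    }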

Production Environment

The project was deployed on Google Kubernetes Engine using Helm. [homepage screenshot]

REST API

The REST API entry point is ./api. HATEOAS principles are applied to facilitate hypermedia-driven discovery of the endpoints within the API. [API homepage screenshot]
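
A hedged sketch of what such a hypermedia-driven entry point can look like in Ktor routing follows; the actual link relations are defined in web-server/src/main/kotlin/io/newsanalyzer/webserver/plugins/Routing.kt, and the names below are assumptions.

    import io.ktor.server.application.call
    import io.ktor.server.response.respond
    import io.ktor.server.routing.Route
    import io.ktor.server.routing.get

    // Hypothetical link relations; assumes a JSON content negotiation plugin is installed.
    fun Route.apiEntryPoint() {
        get("/api") {
            // Responses carry links to related resources, so clients discover
            // endpoints by following links instead of hard-coding paths.
            call.respond(
                mapOf(
                    "_links" to mapOf(
                        "self" to "/api",
                        "topics" to "/api/topics",
                        "articles" to "/api/articles"
                    )
                )
            )
        }
    }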

Monitoring

Production monitoring was accomplished by scraping metrics with Prometheus and visualizing with Google Cloud Monitoring.
[monitoring screenshot]

Local development environment monitoring is implemented with Prometheus and Grafana instances running in Docker containers. [dev monitoring screenshot]
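
For reference, the sketch below shows one way a Ktor service can expose a /metrics endpoint for Prometheus to scrape via Micrometer; the project's actual wiring lives under monitoring and may differ.

    import io.ktor.server.application.Application
    import io.ktor.server.application.call
    import io.ktor.server.application.install
    import io.ktor.server.metrics.micrometer.MicrometerMetrics
    import io.ktor.server.response.respondText
    import io.ktor.server.routing.get
    import io.ktor.server.routing.routing
    import io.micrometer.prometheus.PrometheusConfig
    import io.micrometer.prometheus.PrometheusMeterRegistry

    // Sketch: record request and JVM metrics with Micrometer and expose them
    // in Prometheus text format.
    fun Application.configureMonitoring() {
        val prometheusRegistry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)
        install(MicrometerMetrics) {
            registry = prometheusRegistry
        }
        routing {
            get("/metrics") {
                call.respondText(prometheusRegistry.scrape()) // scraped by Prometheus
            }
        }
    }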

Local Setup

To run the app locally, you can either (A) run the app fully containerized on a local machine or (B) run each service (web, data collector, data analyzer) from separate terminals with only the database and message queue in Docker containers. Either way, you must perform the preliminary environment setup first. (The following commands are for a Linux/Unix environment.)

Preliminary Environment Setup

  1. Install Docker Engine.
  2. In a bash shell, clone the git repository and change to the project directory.
git clone https://github.com/tyknkd/news-analyzer.git
cd news-analyzer
  3. Run the following bash commands to create a secrets file for the PostgreSQL database password (changing the yourPasswordGoesHere string as desired).
mkdir secrets
echo yourPasswordGoesHere > secrets/postgres_password.txt
  4. Create a secrets file for the Grafana admin password (changing the yourPasswordGoesHere string as desired).
echo yourPasswordGoesHere > secrets/grafana_password.txt
  5. Obtain an API key from newsapi.org and save it to a separate secrets file with the following bash command, replacing the yourNewsApiKeyGoesHere string with your newly obtained key. (NB: You can run the tests with a fake key, but an exception will be thrown if you attempt to run the app locally without a valid key, because the real news data cannot be collected without it.)
echo yourNewsApiKeyGoesHere > secrets/news_api_key.txt

A. Fully Containerized Setup

  1. Perform the preliminary environment setup above.
  2. Load the sensitive environment variables.
source sensitive.env
  3. Optional: Run all tests in containers. (This will take several minutes.)
docker compose up test
  4. Build and start the Docker containers. (This will take several minutes the first time.)
docker compose up
  5. In a web browser, open http://localhost:8888
  6. Optional: View the Grafana monitoring dashboard at localhost:3000/d/cdk5654bbrvnkf/news-analyzer-dashboard?orgId=1 (Note: The Grafana username is admin, and the password is as set in the environment setup above.)
  7. To stop all containers, press CTRL+C in the bash shell from which docker compose was started.

B. Run Microservices from Separate Terminals

  1. Perform the preliminary environment setup above.
  2. Install Java 17.
  3. Install Spark 3.3.2 (Scala 2.13 version).
  4. Load the sensitive environment variables.
source sensitive.env
  5. Start only the database and message queue containers. (If monitoring is desired, append prometheus and grafana to the command.)
docker compose up db mq
  6. In a separate bash shell, load the environment variables, then build and test the project.
source .env && source sensitive.env
./gradlew build
  7. In a separate terminal, load the environment variables and start the web server first.
source .env && source sensitive.env
./gradlew applications:web-server:run
  8. In a separate bash shell, load the environment variables and start the data analyzer server second.
source .env && source sensitive.env
./gradlew applications:data-analyzer:run
  9. In a separate shell, load the environment variables and start the data collector server last.
source .env && source sensitive.env
./gradlew applications:data-collector:run
  10. In a web browser, open http://localhost:8888
  11. Optional: If the Prometheus and Grafana containers are running (see Step 5), view the monitoring dashboard at localhost:3000/d/cdk5654bbrvnkf/news-analyzer-dashboard?orgId=1 (Note: The Grafana username is admin, and the password is as set in the environment setup above.)
  12. To stop the servers and Docker containers, press CTRL+C in the bash shells from which they were started.