This independent project applied big data software architecture principles and machine learning techniques to analyze recent tech industry news articles, automatically extract common themes, and sort the articles into groups by topic. The primary aim was to help readers quickly identify current trends in media reporting on the tech industry and focus on the topics of most interest to them.
The following were the requirements for the project. Links to the corresponding code used to test the requirements are given in parentheses.
- Users can view a web page displaying titles of articles grouped by topic (integration test; a sketch of such a test follows this list)
- Users can click a link to navigate to the full article on the publisher website (integration test)
- Users can click a link to view all article titles assigned to a topic (integration test)
- Users can view a list of all the topics (integration test)
- Users can view a page summarizing the purpose of the website (integration test)
- Users can receive the data via a REST API (integration test)
- The system can collect the data from external APIs (unit test)
- The system can persist the data (unit test)
- The system can update the data as new data becomes available (unit test)
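As an illustration of how such requirements are verified, the sketch below shows what a Ktor integration test for the first requirement could look like; the route, page content, class name, and assertions are hypothetical stand-ins for the project's actual test code.

```kotlin
// Minimal sketch of an integration test run against a Ktor test engine.
// The stub route and response body stand in for the project's real web module.
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import io.ktor.server.application.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.ktor.server.testing.*
import org.junit.Assert.assertEquals
import org.junit.Assert.assertTrue
import org.junit.Test

class ArticlePageTest {
    @Test
    fun `articles page lists titles grouped by topic`() = testApplication {
        // A stub route stands in for the production web module in this sketch.
        application {
            routing {
                get("/") { call.respondText("Articles grouped by topic") }
            }
        }
        // Request the page and check the response.
        val response = client.get("/")
        assertEquals(HttpStatusCode.OK, response.status)
        assertTrue(response.bodyAsText().contains("topic"))
    }
}
```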
The software architecture consists of three microservices that interact via a message queue broker, as illustrated in the diagram above. First (starting on the left side of the diagram), a data collector microservice collects news article data daily from an external internet source (newsapi.org), stores the data in a database, and publishes it to a message queue. Next, upon receiving the data from the message queue, a data analyzer microservice stores it in a database, applies Latent Dirichlet Allocation (LDA) to discover common topics, and publishes the results to another message queue. Finally, a web microservice receives the data, stores it in a database, and presents the articles sorted by topic to the end user via web pages and a REST API.
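To illustrate the message-queue hand-off between services, the sketch below publishes collected article data to RabbitMQ with publisher confirms, using the RabbitMQ Java client; the queue name, connection settings, and function are assumptions rather than the project's actual code.

```kotlin
// Minimal sketch of publishing collected article data to a durable RabbitMQ
// queue with delivery confirmation. Names and settings are illustrative only.
import com.rabbitmq.client.ConnectionFactory
import com.rabbitmq.client.MessageProperties

fun publishArticles(json: String, queueName: String = "articles") {
    val factory = ConnectionFactory().apply { host = "localhost" } // assumed broker host
    factory.newConnection().use { connection ->
        connection.createChannel().use { channel ->
            channel.confirmSelect() // enable publisher confirms from the broker
            channel.queueDeclare(queueName, true, false, false, null) // durable, non-exclusive queue
            channel.basicPublish("", queueName, MessageProperties.PERSISTENT_TEXT_PLAIN, json.toByteArray())
            channel.waitForConfirmsOrDie(5_000) // block until the broker confirms delivery (or fail)
        }
    }
}
```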
The project design was guided by current best practices in software development and big data architecture. Through the use of containerized pods with delivery-confirmed message queues and data persistence, service interruptions to the end user are minimized and the system remains robust to temporary partitions between the services. Because the collected data is well-structured, relational databases are used to efficiently store, process, and retrieve the data. In addition, test doubles and mock external services were used to implement efficient unit and integration tests in an automated continuous integration and deployment workflow. Furthermore, online metrics and visualizations permit real-time monitoring of system performance.
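As a concrete illustration of this persistence approach, the sketch below defines a hypothetical article table with Exposed and connects to PostgreSQL through a HikariCP connection pool; the schema, column names, and connection details are assumptions, not the project's actual data model.

```kotlin
// Minimal sketch of relational persistence with Exposed over a HikariCP pool.
// Table schema and connection settings are illustrative only.
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import org.jetbrains.exposed.sql.Database
import org.jetbrains.exposed.sql.SchemaUtils
import org.jetbrains.exposed.sql.Table
import org.jetbrains.exposed.sql.insert
import org.jetbrains.exposed.sql.transactions.transaction

// Hypothetical table for collected articles.
object Articles : Table("articles") {
    val id = integer("id").autoIncrement()
    val title = varchar("title", 512)
    val url = varchar("url", 2048)
    override val primaryKey = PrimaryKey(id)
}

// Connect Exposed to PostgreSQL through a HikariCP connection pool.
fun connectDatabase(dbPassword: String): Database {
    val config = HikariConfig().apply {
        jdbcUrl = "jdbc:postgresql://localhost:5432/newsanalyzer" // assumed connection details
        username = "postgres"
        password = dbPassword
        maximumPoolSize = 5
    }
    return Database.connect(HikariDataSource(config))
}

// Persist one article (table creation would normally happen once at startup).
fun saveArticle(articleTitle: String, articleUrl: String) = transaction {
    SchemaUtils.create(Articles)
    Articles.insert {
        it[title] = articleTitle
        it[url] = articleUrl
    }
}
```

Here, `transaction { }` runs against the default database registered by `Database.connect`, so `connectDatabase` would need to be called once at startup.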
The following technology tools were used to implement the project.
- Ubuntu v.22.04.4: Operating system
- Kotlin v.1.9.22: Programming language
- Java Virtual Machine v.17.0.10: Compilation and libraries
- Gradle v.8.7: Build tool
- Ktor v.2.3.8: Kotlin application framework
- Netty v.4.1.106: Web server
- Apache FreeMarker v.2.3.32: Dynamic webpage templating
- PostgreSQL v.16.2: Relational database
- Exposed v.0.48.0: Object relational mapping framework
- HikariCP v.5.1.0: Database connection pooling
- Apache Spark v.3.3.2: Data analytics
- Kotlin for Apache Spark v.1.2.4: Kotlin-Spark compatibility API
- RabbitMQ (Java client v.5.21.0): Messaging broker
- JUnit v.4.13.2: Testing
- Kover v.0.7.5: Test code coverage measurement
- Micrometer v.1.6.8: Application metrics interface
- Prometheus v.2.51.2: Performance metrics and monitoring storage
- Grafana v.10.4.2: Performance metrics visualization
- Docker Engine v.25.0.3: Containerization
- Kubernetes v.1.30.0: Deployment container orchestrator
- Kompose v.1.33.0: Docker Compose to Kubernetes conversion tool
- Helm v.3.14.4: Kubernetes package manager
- GitHub: Version control, continuous integration and deployment
- Google Kubernetes Engine v.1.28.8: Cloud computing service
The following table summarizes the required technical features and practices implemented in the project, along with the corresponding code.
| Feature | Code |
| --- | --- |
| Web application | applications/web-server |
| Data collection | applications/data-collector |
| Data analyzer | applications/data-analyzer |
| Unit tests | web-server/src/test, data-collector/src/test, data-analyzer/src/test, data-support/src/test, mq-support/src/test |
| Data persistence | components/data-support |
| REST API endpoint | web-server/src/main/kotlin/io/newsanalyzer/webserver/plugins/Routing.kt |
| Production environment | deployment |
| Integration tests | web-server/src/test, data-collector/src/test, data-analyzer/src/test |
| Test doubles | components/test-support |
| Continuous integration | github/workflows |
| Monitoring | monitoring |
| Event collaboration messaging | components/mq-support |
| Continuous delivery | github/workflows |
Gradle is used to implement unit and integration tests, and these tests are incorporated into the continuous integration/continuous deployment workflow. Using test doubles and mock external services, the unit tests check each element of the system (e.g., database operations, message queue, data processing, etc.), and the integration tests check that these elements function together at the app level as expected: that the data can be reliably (1) collected, (2) stored in the collector database, (3) transferred to the data analyzer, (4) processed with unsupervised machine learning (LDA), (5) stored in the analyzer database, (6) passed to the web server, (7) stored in the web-server database, and (8) displayed to the end user in reverse chronological order grouped by topic. (It is worth noting that because unsupervised machine learning is being applied to unlabeled data in this project, the article topics can only be identified by common keywords.)
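For illustration, the sketch below shows how topic discovery with Spark MLlib LDA can look; it uses the plain Spark API rather than the Kotlin for Apache Spark compatibility layer listed above, and the sample text, column names, and parameter values are assumptions rather than the project's actual pipeline.

```kotlin
// Minimal sketch of LDA topic discovery over article text with Spark MLlib.
// Sample data, column names, and parameters are illustrative only.
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

fun main() {
    val spark = SparkSession.builder().appName("lda-sketch").master("local[*]").getOrCreate()

    // Hypothetical input: one string per article; the real pipeline would read
    // article text from the analyzer database.
    val articles = spark.createDataset(
        listOf("AI chip startup raises new funding", "Latest smartphone launch announced"),
        Encoders.STRING()
    ).toDF("text")

    // Tokenize the text and build term-count vectors.
    val tokens = RegexTokenizer().setInputCol("text").setOutputCol("words").transform(articles)
    val vectors = CountVectorizer().setInputCol("words").setOutputCol("features")
        .fit(tokens).transform(tokens)

    // Fit LDA; because the data are unlabeled, each discovered topic can only be
    // characterized by its most heavily weighted keywords.
    val ldaModel = LDA().setK(5).setMaxIter(10).fit(vectors)
    ldaModel.describeTopics(3).show(false)

    spark.stop()
}
```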
The project was deployed on Google Kubernetes Engine using Helm.
The REST API entry point is `./api`. HATEOAS principles are applied in order to facilitate hypermedia-driven discovery of the endpoints within the API.
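As an illustration of this hypermedia style, an API root handler might return link relations like those in the sketch below; the data classes, serialization setup, and all endpoint paths other than `/api` are hypothetical, not the project's actual response format.

```kotlin
// Minimal sketch of a hypermedia-style API root in Ktor: the response carries
// links that let clients discover related endpoints. Shapes and paths are
// illustrative only.
import io.ktor.serialization.kotlinx.json.*
import io.ktor.server.application.*
import io.ktor.server.plugins.contentnegotiation.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import kotlinx.serialization.Serializable

// Hypothetical link and root-resource shapes.
@Serializable
data class Link(val rel: String, val href: String)

@Serializable
data class ApiRoot(val links: List<Link>)

// Hypothetical module exposing the API root with discoverable links.
fun Application.apiModule() {
    install(ContentNegotiation) { json() }
    routing {
        get("/api") {
            call.respond(
                ApiRoot(
                    links = listOf(
                        Link(rel = "self", href = "/api"),
                        Link(rel = "topics", href = "/api/topics"),
                        Link(rel = "articles", href = "/api/articles")
                    )
                )
            )
        }
    }
}
```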
Production monitoring was accomplished by scraping metrics with Prometheus and visualizing them with Google Cloud Monitoring.
Local development environment monitoring is implemented with Prometheus and Grafana instances run in Docker containers.
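For the local setup, the metrics wiring might look roughly like the sketch below, which assumes Ktor's Micrometer plugin and a Prometheus registry scraped from a `/metrics` route; the port and route are assumptions rather than the project's actual configuration.

```kotlin
// Minimal sketch of exposing application metrics to Prometheus from a Ktor
// server via Micrometer. Port and endpoint path are illustrative only.
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.metrics.micrometer.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

fun main() {
    // Registry that accumulates metrics in the Prometheus exposition format.
    val prometheusRegistry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

    embeddedServer(Netty, port = 8888) {
        // Record standard HTTP server metrics via Micrometer.
        install(MicrometerMetrics) {
            registry = prometheusRegistry
        }
        routing {
            // Prometheus scrapes this endpoint; Grafana (or Google Cloud
            // Monitoring in production) visualizes the stored series.
            get("/metrics") {
                call.respondText(prometheusRegistry.scrape())
            }
        }
    }.start(wait = true)
}
```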
To run the app locally, you can either (A) run it fully containerized on a local machine or (B) run each service (web server, data collector, data analyzer) from a separate terminal with only the database and message queue in Docker containers. Either way, you must perform the preliminary environment setup first. (The following commands are for a Linux/Unix environment.)
- Install Docker Engine.
- In a bash shell, clone the git repository and change to the project directory.
  ```bash
  git clone https://github.com/tyknkd/news-analyzer.git
  cd news-analyzer
  ```
- Run the following bash commands to create a secrets file for the PostgreSQL database password (changing the `yourPasswordGoesHere` string as desired).
  ```bash
  mkdir secrets
  echo yourPasswordGoesHere > secrets/postgres_password.txt
  ```
- Create a secrets file for the Grafana admin password (changing the `yourPasswordGoesHere` string as desired).
  ```bash
  echo yourPasswordGoesHere > secrets/grafana_password.txt
  ```
- Obtain an API key from newsapi.org and save it to a separate secrets file with the following bash command, replacing the `yourNewsApiKeyGoesHere` string with your newly obtained key. (NB: You can run the tests with a fake key, but an exception will be thrown if you attempt to run the app locally without a valid key because the real news data cannot be collected without it.)
  ```bash
  echo yourNewsApiKeyGoesHere > secrets/news_api_key.txt
  ```
To run the app fully containerized (option A):
1. Perform the preliminary environment setup above.
2. Load the sensitive environment variables.
   ```bash
   source sensitive.env
   ```
3. Optional: Run all tests in containers. (This will take several minutes.)
   ```bash
   docker compose up test
   ```
4. Build and start the Docker containers. (This will take several minutes the first time.)
   ```bash
   docker compose up
   ```
5. In a web browser, open http://localhost:8888
6. Optional: View the Grafana monitoring dashboard at localhost:3000/d/cdk5654bbrvnkf/news-analyzer-dashboard?orgId=1 (Note: The Grafana username is `admin` and the password is as set in the environment setup above.)
7. To stop all containers, press `CTRL+C` in the bash shell from which it was started.
To run each service from a separate terminal (option B):
1. Perform the preliminary environment setup above.
2. Install Java 17.
3. Install Spark 3.3.2 (Scala 2.13 version).
4. Load the sensitive environment variables.
   ```bash
   source sensitive.env
   ```
5. Start only the database and message queue containers. (If monitoring is desired, append `prometheus` and `grafana` to the command.)
   ```bash
   docker compose up db mq
   ```
6. In a separate bash shell, load the environment variables, then build and test the project.
   ```bash
   source .env && source sensitive.env
   ./gradlew build
   ```
7. In a separate terminal, load the environment variables and start the web server first.
   ```bash
   source .env && source sensitive.env
   ./gradlew applications:web-server:run
   ```
8. In a separate bash shell, load the environment variables and start the data analyzer server second.
   ```bash
   source .env && source sensitive.env
   ./gradlew applications:data-analyzer:run
   ```
9. In a separate shell, load the environment variables and start the data collector server last.
   ```bash
   source .env && source sensitive.env
   ./gradlew applications:data-collector:run
   ```
10. In a web browser, open http://localhost:8888
11. Optional: If the Prometheus and Grafana containers are running (see Step 5), view the monitoring dashboard at localhost:3000/d/cdk5654bbrvnkf/news-analyzer-dashboard?orgId=1 (Note: The Grafana username is `admin` and the password is as set in the environment setup above.)
12. To stop the servers and Docker containers, press `CTRL+C` in the bash shells from which they were started.