This independent project applied big data software architecture principles and machine learning techniques to analyze recent tech industry news articles, automatically extract common themes, and sort the articles into groups by topic. The primary aim was to help readers quickly identify current trends in media reporting on the tech industry and focus on the topics of most interest to them.
The following were the requirements for the project. Links to the corresponding code used to test the requirements are given in parentheses.
- Users can view a web page displaying titles of articles grouped by topic (integration test; a sketch of such a test follows this list)
- Users can click a link to navigate to the full article on the publisher website (integration test)
- Users can click a link to view all article titles assigned to a topic (integration test)
- Users can view a list of all the topics (integration test)
- Users can view a page summarizing the purpose of the website (integration test)
- Users can receive the data via a REST API (integration test)
- The system can collect the data from external APIs (unit test)
- The system can persist the data (unit test)
- The system can update the data as new data becomes available (unit test)
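As an illustration of how such requirements are verified, the sketch below shows what a Ktor integration test for the first requirement could look like; the route, page content, class name, and assertions are hypothetical stand-ins for the project's actual test code.

```kotlin
// Minimal sketch of an integration test run against a Ktor test engine.
// The stub route and response body stand in for the project's real web module.
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import io.ktor.server.application.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.ktor.server.testing.*
import org.junit.Assert.assertEquals
import org.junit.Assert.assertTrue
import org.junit.Test

class ArticlePageTest {
    @Test
    fun `articles page lists titles grouped by topic`() = testApplication {
        // A stub route stands in for the production web module in this sketch.
        application {
            routing {
                get("/") { call.respondText("Articles grouped by topic") }
            }
        }
        // Request the page and check the response.
        val response = client.get("/")
        assertEquals(HttpStatusCode.OK, response.status)
        assertTrue(response.bodyAsText().contains("topic"))
    }
}
```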
The software architecture consists of three microservices that interact via a message queue broker, as illustrated in the diagram above. First (starting on the left side of the diagram), a data collector microservice collects news article data daily from an external internet source (newsapi.org), stores the data in a database, and publishes it to a message queue. Next, upon receiving the data from the message queue, a data analyzer microservice stores it in a database, applies Latent Dirichlet Allocation (LDA) to discover common topics, and publishes the results to another message queue. Finally, a web microservice receives the data, stores it in a database, and presents the articles sorted by topic to the end user via web pages and a REST API.
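To illustrate the message-queue hand-off between services, the sketch below publishes collected article data to RabbitMQ with publisher confirms, using the RabbitMQ Java client; the queue name, connection settings, and function are assumptions rather than the project's actual code.

```kotlin
// Minimal sketch of publishing collected article data to a durable RabbitMQ
// queue with delivery confirmation. Names and settings are illustrative only.
import com.rabbitmq.client.ConnectionFactory
import com.rabbitmq.client.MessageProperties

fun publishArticles(json: String, queueName: String = "articles") {
    val factory = ConnectionFactory().apply { host = "localhost" } // assumed broker host
    factory.newConnection().use { connection ->
        connection.createChannel().use { channel ->
            channel.confirmSelect() // enable publisher confirms from the broker
            channel.queueDeclare(queueName, true, false, false, null) // durable, non-exclusive queue
            channel.basicPublish("", queueName, MessageProperties.PERSISTENT_TEXT_PLAIN, json.toByteArray())
            channel.waitForConfirmsOrDie(5_000) // block until the broker confirms delivery (or fail)
        }
    }
}
```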
The project design was guided by current best practices in software development and big data architecture. Through the use of containerized pods with delivery-confirmed message queues and data persistence, service interruptions to the end user are minimized and the system remains robust to temporary partitions between the services. Because the collected data is well-structured, relational databases are used to efficiently store, process, and retrieve the data. In addition, test doubles and mock external services were used to implement efficient unit and integration tests in an automated continuous integration and deployment workflow. Furthermore, online metrics and visualizations permit real-time monitoring of system performance.
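As a concrete illustration of this persistence approach, the sketch below defines a hypothetical article table with Exposed and connects to PostgreSQL through a HikariCP connection pool; the schema, column names, and connection details are assumptions, not the project's actual data model.

```kotlin
// Minimal sketch of relational persistence with Exposed over a HikariCP pool.
// Table schema and connection settings are illustrative only.
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import org.jetbrains.exposed.sql.Database
import org.jetbrains.exposed.sql.SchemaUtils
import org.jetbrains.exposed.sql.Table
import org.jetbrains.exposed.sql.insert
import org.jetbrains.exposed.sql.transactions.transaction

// Hypothetical table for collected articles.
object Articles : Table("articles") {
    val id = integer("id").autoIncrement()
    val title = varchar("title", 512)
    val url = varchar("url", 2048)
    override val primaryKey = PrimaryKey(id)
}

// Connect Exposed to PostgreSQL through a HikariCP connection pool.
fun connectDatabase(dbPassword: String): Database {
    val config = HikariConfig().apply {
        jdbcUrl = "jdbc:postgresql://localhost:5432/newsanalyzer" // assumed connection details
        username = "postgres"
        password = dbPassword
        maximumPoolSize = 5
    }
    return Database.connect(HikariDataSource(config))
}

// Persist one article (table creation would normally happen once at startup).
fun saveArticle(articleTitle: String, articleUrl: String) = transaction {
    SchemaUtils.create(Articles)
    Articles.insert {
        it[title] = articleTitle
        it[url] = articleUrl
    }
}
```

Here, `transaction { }` runs against the default database registered by `Database.connect`, so `connectDatabase` would need to be called once at startup.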
The following technology tools were used to implement the project.
- Ubuntu v.22.04.4: Operating system
- Kotlin v.1.9.22: Programming language
- Java Virtual Machine v.17.0.10: Compilation and libraries
- Gradle v.8.7: Build tool
- Ktor v.2.3.8: Kotlin application framework
- Netty v.4.1.106: Web server
- Apache FreeMarker v.2.3.32: Dynamic webpage templating
- PostgreSQL v.16.2: Relational database
- Exposed v.0.48.0: Object relational mapping framework
- HikariCP v.5.1.0: Database connection pooling
- Apache Spark v.3.3.2: Data analytics
- Kotlin for Apache Spark v.1.2.4: Kotlin-Spark compatibility API
- RabbitMQ (Java client v.5.21.0): Messaging broker
- JUnit v.4.13.2: Testing
- Kover v.0.7.5: Test code coverage measurement
- Micrometer v.1.6.8: Application metrics interface
- Prometheus v.2.51.2: Performance metrics and monitoring storage
- Grafana v.10.4.2: Performance metrics visualization
- Docker Engine v.25.0.3: Containerization
- Kubernetes v.1.30.0: Deployment container orchestrator
- Kompose v.1.33.0: Docker Compose to Kubernetes conversion tool
- Helm v.3.14.4: Kubernetes package manager
- GitHub: Version control, continuous integration and deployment
- Google Kubernetes Engine v.1.28.8: Cloud computing service
The following table summarizes the required technical features and practices implemented in the project, along with the corresponding code.
| Feature | Code |
| --- | --- |
| Web application | applications/web-server |
| Data collection | applications/data-collector |
| Data analyzer | applications/data-analyzer |
| Unit tests | web-server/src/test, data-collector/src/test, data-analyzer/src/test, data-support/src/test, mq-support/src/test |
| Data persistence | components/data-support |
| REST API endpoint | web-server/src/main/kotlin/io/newsanalyzer/webserver/plugins/Routing.kt |
| Production environment | deployment |
| Integration tests | web-server/src/test, data-collector/src/test, data-analyzer/src/test |
| Test doubles | components/test-support |
| Continuous integration | github/workflows |
| Monitoring | monitoring |
| Event collaboration messaging | components/mq-support |
| Continuous delivery | github/workflows |
Gradle is used to implement unit and integration tests, and these tests are incorporated into the continuous integration/continuous deployment workflow. Using test doubles and mock external services, the unit tests check each element of the system (e.g., database operations, message queue, data processing, etc.), and the integration tests check that these elements function together at the app level as expected: that the data can be reliably (1) collected, (2) stored in the collector database, (3) transferred to the data analyzer, (4) processed with unsupervised machine learning (LDA), (5) stored in the analyzer database, (6) passed to the web server, (7) stored in the web-server database, and (8) displayed to the end user in reverse chronological order grouped by topic. (It is worth noting that because unsupervised machine learning is being applied to unlabeled data in this project, the article topics can only be identified by common keywords.)
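For illustration, the sketch below shows how topic discovery with Spark MLlib LDA can look; it uses the plain Spark API rather than the Kotlin for Apache Spark compatibility layer listed above, and the sample text, column names, and parameter values are assumptions rather than the project's actual pipeline.

```kotlin
// Minimal sketch of LDA topic discovery over article text with Spark MLlib.
// Sample data, column names, and parameters are illustrative only.
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

fun main() {
    val spark = SparkSession.builder().appName("lda-sketch").master("local[*]").getOrCreate()

    // Hypothetical input: one string per article; the real pipeline would read
    // article text from the analyzer database.
    val articles = spark.createDataset(
        listOf("AI chip startup raises new funding", "Latest smartphone launch announced"),
        Encoders.STRING()
    ).toDF("text")

    // Tokenize the text and build term-count vectors.
    val tokens = RegexTokenizer().setInputCol("text").setOutputCol("words").transform(articles)
    val vectors = CountVectorizer().setInputCol("words").setOutputCol("features")
        .fit(tokens).transform(tokens)

    // Fit LDA; because the data are unlabeled, each discovered topic can only be
    // characterized by its most heavily weighted keywords.
    val ldaModel = LDA().setK(5).setMaxIter(10).fit(vectors)
    ldaModel.describeTopics(3).show(false)

    spark.stop()
}
```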
The project was deployed on Google Kubernetes Engine using Helm.
The REST API entry point is `./api`. HATEOAS principles are applied in order to facilitate hypermedia-driven discovery of the endpoints within the API.
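As an illustration of this hypermedia style, an API root handler might return link relations like those in the sketch below; the data classes, serialization setup, and all endpoint paths other than `/api` are hypothetical, not the project's actual response format.

```kotlin
// Minimal sketch of a hypermedia-style API root in Ktor: the response carries
// links that let clients discover related endpoints. Shapes and paths are
// illustrative only.
import io.ktor.serialization.kotlinx.json.*
import io.ktor.server.application.*
import io.ktor.server.plugins.contentnegotiation.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import kotlinx.serialization.Serializable

// Hypothetical link and root-resource shapes.
@Serializable
data class Link(val rel: String, val href: String)

@Serializable
data class ApiRoot(val links: List<Link>)

// Hypothetical module exposing the API root with discoverable links.
fun Application.apiModule() {
    install(ContentNegotiation) { json() }
    routing {
        get("/api") {
            call.respond(
                ApiRoot(
                    links = listOf(
                        Link(rel = "self", href = "/api"),
                        Link(rel = "topics", href = "/api/topics"),
                        Link(rel = "articles", href = "/api/articles")
                    )
                )
            )
        }
    }
}
```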
Production monitoring was accomplished by scraping metrics with Prometheus and visualizing them with Google Cloud Monitoring.
Local development environment monitoring is implemented with Prometheus and Grafana instances run in Docker containers.
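For the local setup, the metrics wiring might look roughly like the sketch below, which assumes Ktor's Micrometer plugin and a Prometheus registry scraped from a `/metrics` route; the port and route are assumptions rather than the project's actual configuration.

```kotlin
// Minimal sketch of exposing application metrics to Prometheus from a Ktor
// server via Micrometer. Port and endpoint path are illustrative only.
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.metrics.micrometer.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

fun main() {
    // Registry that accumulates metrics in the Prometheus exposition format.
    val prometheusRegistry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

    embeddedServer(Netty, port = 8888) {
        // Record standard HTTP server metrics via Micrometer.
        install(MicrometerMetrics) {
            registry = prometheusRegistry
        }
        routing {
            // Prometheus scrapes this endpoint; Grafana (or Google Cloud
            // Monitoring in production) visualizes the stored series.
            get("/metrics") {
                call.respondText(prometheusRegistry.scrape())
            }
        }
    }.start(wait = true)
}
```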
To run the app locally, you can either (A) run it fully containerized on a local machine or (B) run each service (web server, data collector, data analyzer) from a separate terminal with only the database and message queue in Docker containers. Either way, you must perform the preliminary environment setup first. (The following commands are for a Linux/Unix environment.)
- Install Docker Engine.
- In a bash shell, clone the git repository and change to the project directory.
  ```bash
  git clone https://github.com/tyknkd/news-analyzer.git
  cd news-analyzer
  ```
- Run the following bash commands to create a secrets file for the PostgreSQL database password (changing the `yourPasswordGoesHere` string as desired).
  ```bash
  mkdir secrets
  echo yourPasswordGoesHere > secrets/postgres_password.txt
  ```
- Create a secrets file for the Grafana admin password (changing the `yourPasswordGoesHere` string as desired).
  ```bash
  echo yourPasswordGoesHere > secrets/grafana_password.txt
  ```
- Obtain an API key from newsapi.org and save it to a separate secrets file with the following bash command, replacing the `yourNewsApiKeyGoesHere` string with your newly obtained key. (NB: You can run the tests with a fake key, but an exception will be thrown if you attempt to run the app locally without a valid key because the real news data cannot be collected without it.)
  ```bash
  echo yourNewsApiKeyGoesHere > secrets/news_api_key.txt
  ```
To run the app fully containerized (option A):
1. Perform the preliminary environment setup above.
2. Load the sensitive environment variables.
   ```bash
   source sensitive.env
   ```
3. Optional: Run all tests in containers. (This will take several minutes.)
   ```bash
   docker compose up test
   ```
4. Build and start the Docker containers. (This will take several minutes the first time.)
   ```bash
   docker compose up
   ```
5. In a web browser, open http://localhost:8888
6. Optional: View the Grafana monitoring dashboard at localhost:3000/d/cdk5654bbrvnkf/news-analyzer-dashboard?orgId=1 (Note: The Grafana username is `admin` and the password is as set in the environment setup above.)
7. To stop all containers, press `CTRL+C` in the bash shell from which it was started.
To run each service from a separate terminal (option B):
1. Perform the preliminary environment setup above.
2. Install Java 17.
3. Install Spark 3.3.2 (Scala 2.13 version).
4. Load the sensitive environment variables.
   ```bash
   source sensitive.env
   ```
5. Start only the database and message queue containers. (If monitoring is desired, append `prometheus` and `grafana` to the command.)
   ```bash
   docker compose up db mq
   ```
6. In a separate bash shell, load the environment variables, then build and test the project.
   ```bash
   source .env && source sensitive.env
   ./gradlew build
   ```
7. In a separate terminal, load the environment variables and start the web server first.
   ```bash
   source .env && source sensitive.env
   ./gradlew applications:web-server:run
   ```
8. In a separate bash shell, load the environment variables and start the data analyzer server second.
   ```bash
   source .env && source sensitive.env
   ./gradlew applications:data-analyzer:run
   ```
9. In a separate shell, load the environment variables and start the data collector server last.
   ```bash
   source .env && source sensitive.env
   ./gradlew applications:data-collector:run
   ```
10. In a web browser, open http://localhost:8888
11. Optional: If the Prometheus and Grafana containers are running (see Step 5), view the monitoring dashboard at localhost:3000/d/cdk5654bbrvnkf/news-analyzer-dashboard?orgId=1 (Note: The Grafana username is `admin` and the password is as set in the environment setup above.)
12. To stop the servers and Docker containers, press `CTRL+C` in the bash shells from which they were started.