We ingest a large volume of streaming data from the NewsAPI. A Java Spring Boot application stores this requested data in a Kafka cluster; we then process it with Spark Streaming and persist the results in HBase and Hive.
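As a rough illustration of the collector half of this pipeline, here is a minimal sketch assuming spring-web's `RestTemplate` and spring-kafka's `KafkaTemplate`; the query, polling interval, and topic name (`news-articles`) are illustrative assumptions, not the project's actual values:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class NewsCollector {

    private final RestTemplate rest = new RestTemplate();

    @Autowired
    private KafkaTemplate<String, String> kafka;

    // Poll the NewsAPI on a fixed interval and forward the raw JSON to Kafka.
    // Requires @EnableScheduling on the Spring Boot application class.
    @Scheduled(fixedRate = 60_000)
    public void collect() {
        String url = "https://newsapi.org/v2/everything?q=bigdata&apiKey=YOUR_KEY";
        String json = rest.getForObject(url, String.class);
        kafka.send("news-articles", json); // topic name is an assumption
    }
}
```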
- collector: Fetches articles from the NewsAPI and publishes them to the Kafka cluster (see the sketch above)
- consumer: Reads the messages from Kafka, processes them with Spark Streaming, and saves the results to Hive & HBase (see the sketch after this list)
- start.sh: Script to start the project
- test.sh: Script to run the tests in both projects
- scripts.sh: Scripts to manage Kafka and stop the servers
- config.txt: Configuration to apply in the project (query to run, time intervals...)
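The consumer side could look roughly like the following, assuming Spark 3.1+ Structured Streaming with the spark-sql-kafka connector; the bootstrap servers, topic, checkpoint path, and table name are assumptions, and the HBase write is omitted for brevity:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class NewsConsumer {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("news-consumer")
                .enableHiveSupport() // needed to write into Hive tables
                .getOrCreate();

        // Read the raw JSON messages that the collector published to Kafka.
        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "news-articles")
                .load()
                .selectExpr("CAST(value AS STRING) AS json");

        // Append each micro-batch to a table; a real job would also parse
        // the JSON and write to HBase, which is not shown here.
        StreamingQuery query = raw.writeStream()
                .option("checkpointLocation", "/tmp/news-ckpt") // required by Spark
                .toTable("news_articles");                      // hypothetical table name

        query.awaitTermination();
    }
}
```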
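config.txt might look something like this; the key names below are placeholders, not the project's real ones:

```
query=bitcoin
interval.seconds=60
```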
Get into the project folder and run:
./test.sh
./start.sh
- Documentation v1 (updated): Google Slides
- Documentation v2: Google Doc