Project Description

abhijeet-13 edited this page Aug 9, 2018 · 1 revision

Tiny Disclaimer: This work was done as the final project for a course on Big-Data Analytics at the University of Texas at Dallas. The code may not be copied in its entirety without the express approval of one of the project contributors (listed below). It may, however, be used to understand the general concepts or as a source of inspiration, provided you understand that it is provided as is, without any warranty, express or implied, from the project contributors. The contributors of the project were:

  • Abhijeet Singh (Email: axs175531 [AT] utdallas.edu)
  • Kartikey Gupta (Email: kxg173430 [AT] utdallas.edu)
  • Marissa Miller

The project uses the 2016 United States Presidential Election Tweet Ids dataset (available here), a massive dataset containing the Tweet Ids of almost 280 million tweets related to the 2016 US Presidential Election. The idea is to identify the people who influenced the online Twitter community at the time, as evident from the retweet patterns in the data. This phase is followed by streaming tweets from the detected influencers and performing LDA (Latent Dirichlet Allocation) on them to detect topics that may currently be trending.

The project is divided into three package modules:

  1. download: This package module obtains the complete tweet information for the tweet Ids, given as big integers, in the dataset files. The Scala singleton object TweetDownloader requires the Twitter API keys and uses the twitter4j Java library to download the corresponding tweets, parse them, and check whether each one is a retweet; if so, it adds an edge of the form (retweeting user, retweeted user) to the output file.

  2. process: This package module processes the edges obtained in the previous step. The InfluencerDetector Scala singleton builds a graph from those edges and runs the PageRank algorithm, with the help of the Spark GraphFrames library, to detect the influencers within the community. The BotDetector singleton filters this list of influencers by analyzing the graph for patterns such as high out-degree to in-degree ratios and clique motifs.

  3. analyze: This package module analyzes the expected trends based on what the influencers discuss in the online community. The TweetListener module takes the list of influencers detected in the previous step and uses the Twitter Streaming APIs to stream the content of their tweets in real time to a remote Kafka cluster. The TopicAnalyzer module uses Spark MLlib to perform LDA with K = 3 topics and sends the resulting topics back to the Kafka cluster under different Kafka topics. These may then be forwarded to Kibana through Elasticsearch and Logstash for visualization.
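
To make the process step more concrete, here is a minimal sketch of ranking users from (retweeting user, retweeted user) edges and then dropping accounts with a suspiciously high out-degree to in-degree ratio. It is not the project's code: the project uses Spark GraphFrames, whereas this sketch swaps in a plain-Scala power iteration so that it runs without a cluster. The object name InfluencerSketch, the function names, the damping factor, the iteration count, and the ratio threshold are all assumptions made for illustration, and the clique-motif check is not covered.

```scala
object InfluencerSketch {
  // One edge per retweet: (retweeting user, retweeted user).
  type Edge = (String, String)

  // Power-iteration PageRank over a directed edge list.
  // This is a stand-in for the GraphFrames PageRank call, not the same API.
  def pageRank(edges: Seq[Edge], damping: Double = 0.85, iterations: Int = 30): Map[String, Double] = {
    val nodes = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val n = nodes.size
    val outLinks: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    var ranks: Map[String, Double] = nodes.map(node => node -> 1.0 / n).toMap

    for (_ <- 1 to iterations) {
      // Rank held by users who never retweet anyone is spread evenly over all users.
      val danglingMass = nodes.filterNot(outLinks.contains).map(ranks).sum
      val base = (1.0 - damping) / n + damping * danglingMass / n
      // Each user splits its damped rank equally among the users it retweeted.
      val received: Map[String, Double] = outLinks.toSeq
        .flatMap { case (src, targets) =>
          targets.map(dst => dst -> damping * ranks(src) / targets.size)
        }
        .groupBy(_._1)
        .map { case (node, cs) => node -> cs.map(_._2).sum }
      ranks = nodes.map(node => node -> (base + received.getOrElse(node, 0.0))).toMap
    }
    ranks
  }

  // Out-degree / in-degree per user; users who were never retweeted get +Infinity.
  def outInRatio(edges: Seq[Edge]): Map[String, Double] = {
    val out = edges.groupBy(_._1).map { case (user, es) => user -> es.size }
    val in = edges.groupBy(_._2).map { case (user, es) => user -> es.size }
    (out.keySet ++ in.keySet).map { user =>
      val o = out.getOrElse(user, 0).toDouble
      val i = in.getOrElse(user, 0).toDouble
      user -> (if (i == 0) Double.PositiveInfinity else o / i)
    }.toMap
  }

  // Top-k users by rank, skipping those whose ratio exceeds the (hypothetical) threshold.
  def topInfluencers(edges: Seq[Edge], k: Int, maxRatio: Double): Seq[String] = {
    val ratios = outInRatio(edges)
    pageRank(edges).toSeq
      .sortBy { case (_, rank) => -rank }
      .collect { case (user, _) if ratios(user) <= maxRatio => user }
      .take(k)
  }
}
```

Calling InfluencerSketch.topInfluencers(edges, k, maxRatio) on a small edge list returns the highest-ranked users that pass the ratio filter. Because edges point from the retweeter to the retweeted user, rank accumulates on the accounts that get retweeted, which is the sense of "influence" described above. A real threshold would have to be tuned on the data, since ordinary users who only retweet also have a high ratio.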

Presented below is a Kibana snapshot showing what the three currently trending topics could be: Legal, Politics, and Health, as evident from the pie chart of the top topics for a chosen period:

Topic 1: Legal

Topic 2: Politics

Topic 3: Health