The course projects of Columbia Big Data Analytics
Homework 1: Clustering and Binary Classification
- Implement and run in Spark
- Process data with Spark DataFrames and perform graph analysis
The goals of this assignment are to
(1) understand how to implement the K-means clustering algorithm in Spark using transformations and actions,
(2) understand the impact of using different distance measurements and initialization strategies in clustering,
(3) learn how to use the built-in Spark MLlib library to conduct supervised and unsupervised learning,
(4) gain experience processing data with ML Pipeline and DataFrame.
In the first question, you will conduct document clustering. The dataset is a set of vectorized text documents. Document clustering has many practical applications. For example, Flipboard uses LDA topic modeling, approximate nearest neighbor search, and clustering to build its “similar stories / read more” recommendation feature. You can learn more by reading this blog post. To conduct document clustering, you will implement the classic iterative K-means algorithm in Spark with different distance functions, and compare it with the implementation in Spark MLlib.
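As a rough illustration of the iterative K-means loop with a pluggable distance function, here is a minimal plain-Python sketch. It is not the assignment's reference solution: the function names (`closest`, `kmeans_step`, `kmeans`), the choice of Euclidean and Manhattan distances, and the use of a mean update for both distances are assumptions made for the example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def closest(point, centroids, dist):
    """Index of the centroid nearest to `point` under `dist`."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def kmeans_step(points, centroids, dist):
    """One iteration: assign every point, then average each cluster."""
    sums = {}  # cluster index -> (coordinate-wise sum, point count)
    for p in points:
        i = closest(p, centroids, dist)                   # assignment ("map") step
        s, n = sums.get(i, ([0.0] * len(p), 0))
        sums[i] = ([a + b for a, b in zip(s, p)], n + 1)  # aggregation ("reduce") step
    # A cluster that received no points keeps its previous centroid.
    return [
        [v / sums[i][1] for v in sums[i][0]] if i in sums else list(c)
        for i, c in enumerate(centroids)
    ]

def kmeans(points, centroids, dist=euclidean, max_iter=20):
    """Repeat kmeans_step until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        new_centroids = kmeans_step(points, centroids, dist)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
print(kmeans(points, [[0, 0], [10, 0]], dist=manhattan))
```

In a Spark version, the assignment step would typically become an RDD `map` that emits `(cluster_index, (point, 1))` pairs, and the aggregation step a `reduceByKey` that sums points and counts. The initial centroids are passed in explicitly so that different initialization strategies can be compared.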
In the second question, you will load data into a Spark DataFrame and perform binary classification with Spark MLlib. We will use a logistic regression model as our classifier, one of the foundational and most widely used tools for classification. For example, Facebook uses logistic regression as one of the components in its online advertising system. You can read more in a publication here.
Homework 2: “People You Might Know” Social Network
Write a Spark program that implements a simple “People You Might Know” social network friendship recommendation algorithm. The key idea is that if two people have a lot of mutual friends, then the system should recommend that they connect with each other.
Input:
The input file contains the adjacency list, with one line per user. Each line holds a unique integer ID corresponding to a unique user, followed by a comma-separated list of the unique IDs of that user's friends. Note that friendships are mutual (i.e., edges are undirected): if A is a friend of B, then B is also a friend of A. The data provided is consistent with that rule, as there is an explicit entry for each side of each edge.
Algorithm:
We use a simple algorithm: for each user U, recommend the N = 10 users who are not already friends with U but have the largest number of mutual friends with U.
Output:
The output should contain one line per user. Each line holds the unique ID of a user, followed by a list of the unique IDs of the people the algorithm recommends that user might know, ordered by decreasing number of mutual friends. If a user has fewer than 10 second-degree friends, output all of them in decreasing order of the number of mutual friends. If a user has no friends, provide an empty list of recommendations. If recommended users have the same number of mutual friends, output those user IDs in numerically ascending order.
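The recommendation rule and its ordering requirements can be sketched in plain Python as follows. This is a minimal illustration rather than the required Spark program: `parse_line` and `recommend` are hypothetical helper names, and the parser assumes the user ID and the friend list are separated by whitespace, which the input description does not confirm.

```python
from collections import Counter

def parse_line(line):
    """Split one adjacency line into (user, set of friend IDs).

    Assumption: the ID and the comma-separated friend list are
    separated by whitespace; a user with no friends has only the ID.
    """
    parts = line.split()
    user = int(parts[0])
    friends = set()
    if len(parts) > 1:
        friends = {int(f) for f in parts[1].split(",") if f}
    return user, friends

def recommend(adjacency, n=10):
    """Map each user to at most n recommended user IDs.

    adjacency: dict of user -> set of friends (undirected graph).
    """
    result = {}
    for user, friends in adjacency.items():
        mutual = Counter()
        for friend in friends:
            for candidate in adjacency.get(friend, ()):
                # Skip the user and anyone who is already a friend.
                if candidate != user and candidate not in friends:
                    mutual[candidate] += 1
        # Most mutual friends first; ties broken by ascending user ID.
        ranked = sorted(mutual, key=lambda c: (-mutual[c], c))
        result[user] = ranked[:n]
    return result

lines = ["0\t1,2,3", "1\t0,2", "2\t0,1,4", "3\t0,4", "4\t2,3", "5"]
adjacency = dict(parse_line(line) for line in lines)
for user, recs in sorted(recommend(adjacency).items()):
    print(user, recs)
```

In Spark, the inner loops would usually be expressed as a `flatMap` that emits candidate pairs for each user's friend list, followed by a `reduceByKey` to count mutual friends, with the final sort applied per user.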
Run Connected Components and PageRank with GraphFrames. You can refer to the GraphFrames documentation.
PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.
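To make the endorsement idea concrete, here is a small plain-Python power-iteration sketch of PageRank. It is illustrative only and is not how GraphFrames computes its result internally: the `pagerank` function, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from vertices without out-edges are all assumptions for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    edges: list of (u, v) pairs, meaning u endorses (e.g. follows) v.
    Returns a dict of vertex -> rank; ranks sum to 1.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for u, v in edges:
        out_links[u].append(v)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for u in nodes:
            for v in out_links[u]:
                # u passes an equal share of its rank to each vertex it endorses.
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank

# "c" is followed by both "a" and "b", so it ends up ranked highest.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(sorted(ranks, key=ranks.get, reverse=True))
```

GraphFrames provides this computation through its `pageRank` method, whose output values may be scaled differently from the normalized ranks in this sketch.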
Homework 3: Twitter data analysis with Spark Streaming
In this assignment, you will implement a streaming analysis process. The architecture is as follows. A socket requests data from the Twitter API and sends it to the Spark Streaming process. Spark reads the real-time data and analyzes it, saving intermediate streaming results to Google Storage. After the streaming process terminates, the final data is read from Google Storage and saved to BigQuery, and the data in Storage is then cleaned up.
The streaming operation should be:
Homework 4: Data Visualization
Use d3.js to create visualizations