Big Data Analytics at Columbia

The course projects for Big Data Analytics at Columbia University

Homework 1: Clustering and Binary Classification

  1. Implement and run algorithms in Spark
  2. Process data with Spark DataFrames and perform graph analysis

The goals of this assignment are to

(1) understand how to implement the K-means clustering algorithm in Spark by utilizing transformations and actions,

(2) understand the impact of using different distance measurements and initialization strategies in clustering,

(3) learn how to use the built-in Spark MLlib library to conduct supervised and unsupervised learning,

(4) gain experience processing data with ML Pipelines and DataFrames.

In the first question, you will conduct document clustering. The dataset we’ll be using is a set of vectorized text documents. In today’s world, you can see applications of document clustering almost everywhere. For example, Flipboard uses LDA topic modelling, approximate nearest neighbor search, and clustering to realize their “similar stories / read more” recommendation feature. You can learn more by reading this blog post. To conduct document clustering, you will implement the classic iterative K-means clustering algorithm in Spark with different distance functions, and compare it with the implementation in Spark MLlib.
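
As an illustration, here is a minimal sketch of what iterative K-means on a Spark RDD might look like, using Euclidean distance. The input path `data.txt`, the values of `k` and `max_iter`, and seeding the centroids by random sampling are all illustrative assumptions, not the assignment’s required setup.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="KMeansSketch")

def closest_centroid(point, centroids):
    # Index of the centroid nearest to `point` under squared Euclidean distance.
    return int(np.argmin([np.sum((point - c) ** 2) for c in centroids]))

# Assumption: each line of the (hypothetical) input file is one space-separated vector.
points = (sc.textFile("data.txt")
            .map(lambda line: np.array([float(x) for x in line.split()]))
            .cache())

k, max_iter = 10, 20
centroids = points.takeSample(False, k, seed=42)

for _ in range(max_iter):
    # Assignment step (transformations): map each point to its nearest centroid.
    assigned = points.map(lambda p: (closest_centroid(p, centroids), (p, 1)))
    # Update step (action): recompute each centroid as the mean of its assigned points.
    totals = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    centroids = [vec_sum / count for _, (vec_sum, count) in sorted(totals.collect())]
```

Swapping the distance function in `closest_centroid` (e.g., for cosine distance) is how the different distance measurements from goal (2) would be compared.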

In the second question, you will load data into a Spark DataFrame and perform binary classification with Spark MLlib. We will use a logistic regression model as our classifier; logistic regression is one of the foundational and most widely used tools for classification. For example, Facebook uses logistic regression as one of the components in its online advertising system. You can read more in a publication here.
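
A minimal sketch of such an ML Pipeline is shown below. The file name `train.csv` and the assumption that every non-`label` column is a numeric feature are illustrative, not part of the assignment.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("LogRegSketch").getOrCreate()

# Assumption: a CSV with numeric feature columns and a 0/1 "label" column.
df = spark.read.csv("train.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Assemble the raw columns into a single feature vector, then fit the classifier.
assembler = VectorAssembler(inputCols=[c for c in df.columns if c != "label"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=100)

model = Pipeline(stages=[assembler, lr]).fit(train)
predictions = model.transform(test)

# Area under the ROC curve on the held-out split.
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")
```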

Homework 2: “People You Might Know” Social Network

Write a Spark program that implements a simple “People You Might Know” social network friendship recommendation algorithm. The key idea is that if two people have a lot of mutual friends, then the system should recommend that they connect with each other.

Input:

The input file contains the adjacency list and has multiple lines in the following format: <User><TAB><Friends>. Here, <User> is a unique integer ID corresponding to a unique user and <Friends> is a comma-separated list of unique IDs corresponding to the friends of the user with the unique ID <User>. Note that the friendships are mutual (i.e., edges are undirected): if A is a friend of B, then B is also a friend of A. The data provided is consistent with that rule, as there is an explicit entry for each side of each edge.

Algorithm:

Let us use a simple algorithm such that, for each user U, the algorithm recommends N = 10 users who are not already friends with U but have the largest number of mutual friends in common with U.

Output:

The output should contain one line per user in the following format: <User><TAB><Recommendations>. Here, <User> is a unique ID corresponding to a user and <Recommendations> is a comma-separated list of unique IDs corresponding to the algorithm’s recommendations of people that <User> might know, ordered by decreasing number of mutual friends. If a user has fewer than 10 second-degree friends, output all of them in decreasing order of the number of mutual friends. If a user has no friends, provide an empty list of recommendations. If there are recommended users with the same number of mutual friends, output those user IDs in numerically ascending order.
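
One way to sketch this in PySpark, assuming the input format described above (tab-separated user and comma-separated friend list) and a hypothetical file name `friends.txt`:

```python
from pyspark import SparkContext

sc = SparkContext(appName="PeopleYouMightKnowSketch")

def parse(line):
    # "<User>\t<Friend1>,<Friend2>,..." -> (user_id, [friend ids])
    user, _, friends = line.partition("\t")
    friend_ids = [int(f) for f in friends.split(",")] if friends else []
    return int(user), friend_ids

def candidate_pairs(record):
    user, friends = record
    pairs = []
    for f in friends:
        # Sentinel 0 marks an existing friendship so it can be excluded later.
        pairs.append(((user, f), 0))
    for i in range(len(friends)):
        for j in range(i + 1, len(friends)):
            # friends[i] and friends[j] share `user` as a mutual friend.
            pairs.append(((friends[i], friends[j]), 1))
            pairs.append(((friends[j], friends[i]), 1))
    return pairs

adjacency = sc.textFile("friends.txt").map(parse)

# Count mutual friends per (user, candidate) pair; the sentinel 0 is absorbing
# in the reduce, so pairs that are already friends end up with count 0.
mutual_counts = (adjacency.flatMap(candidate_pairs)
                          .reduceByKey(lambda a, b: 0 if a == 0 or b == 0 else a + b)
                          .filter(lambda kv: kv[1] > 0)
                          .map(lambda kv: (kv[0][0], (kv[1], kv[0][1]))))

# For each user, keep up to 10 candidates, sorted by mutual-friend count
# (descending), breaking ties by ascending user ID.
recommendations = mutual_counts.groupByKey().mapValues(
    lambda vals: [uid for _, uid in sorted(vals, key=lambda v: (-v[0], v[1]))[:10]])
```

Users with no friends produce no candidate pairs, so an empty recommendation list for them would still need to be emitted when formatting the final output.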

Run Connected Components and PageRank with GraphFrames. You can refer to the GraphFrames documentation.

PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.
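
A minimal sketch of both operations with GraphFrames is below. The toy vertex and edge data, the checkpoint directory, and the package coordinates in the launch command are illustrative; in the homework the graph would be built from the friendship data, and the GraphFrames version must match your Spark version.

```python
# Launch with the GraphFrames package available, e.g. (version string illustrative):
#   spark-submit --packages graphframes:graphframes:0.8.2-spark3.0-s_2.12 graph_sketch.py
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphSketch").getOrCreate()

# Toy graph: three users, two undirected friendships represented as directed edges.
vertices = spark.createDataFrame([("1",), ("2",), ("3",)], ["id"])
edges = spark.createDataFrame([("1", "2"), ("2", "3")], ["src", "dst"])
g = GraphFrame(vertices, edges)

# Connected components requires a checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
components = g.connectedComponents()
components.show()

# PageRank: vertices endorsed by many others receive higher scores.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy("pagerank", ascending=False).show()
```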

Homework 3: Twitter Data Analysis with Spark Streaming

In this assignment, you will implement a streaming analysis process. The architecture is as follows: a socket requests data from the Twitter API and sends it to the Spark Streaming process. Spark reads the real-time data, performs the analysis, and saves temporary streaming results to Google Cloud Storage. After the streaming process terminates, the final data is read from Google Cloud Storage and saved to BigQuery, and the temporary data in Storage is then cleaned up.
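
A minimal sketch of the Spark Streaming side of this pipeline is shown below. The socket address (localhost:9001), the Cloud Storage output path, and the hashtag-count analysis are all illustrative assumptions, not the assignment’s required operations.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TwitterStreamingSketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Assumption: a local socket server forwards tweets from the Twitter API.
tweets = ssc.socketTextStream("localhost", 9001)

# Example analysis: count hashtags in each batch.
hashtag_counts = (tweets.flatMap(lambda tweet: tweet.split())
                        .filter(lambda word: word.startswith("#"))
                        .map(lambda tag: (tag, 1))
                        .reduceByKey(lambda a, b: a + b))

# Write temporary results to a (hypothetical) Cloud Storage bucket.
hashtag_counts.saveAsTextFiles("gs://my-bucket/tmp/hashtag_counts")

ssc.start()
ssc.awaitTerminationOrTimeout(600)  # run for about ten minutes, then stop
ssc.stop(stopSparkContext=True, stopGraceFully=True)
```

After the streaming job stops, a separate batch step would read the files back from Cloud Storage, load them into BigQuery, and delete the temporary objects.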

The streaming operation should be:

Homework 4: Data Visualization

Use d3.js to do the visualization.

Bar chart

Django and D3
