- Introduction
- Open Source NLP Libraries Demo
- EDA & OLAP
- Topic Modeling
- Recommender System
- Future Potential Projects
- Appendix
Welcome to my data science project portfolio. In this repo, you can find practical solutions to real-world business problems built with state-of-the-art machine learning and deep learning algorithms. Most of my projects are demoed in Jupyter notebook form. Jupyter notebooks are an excellent way to share my work with the world: they combine markdown with an interactive Python environment, and they are portable to other platforms such as Databricks and Google Colaboratory.
My project collection covers several trending machine learning applications, such as Natural Language Processing, large-scale machine learning with Spark, and recommender systems. More are on the way. Potential future projects include Text Summarization, Stock Price Forecast, Trading Strategy with Reinforcement Learning, and Computer Vision.
Natural language processing (NLP) is a fast-growing field concerned with programming machines to process and analyze large amounts of natural language data and extract meaningful information from it.
I believe we are still at an early stage of NLP development. Even so, NLP at its current stage can already perform many tasks. The following is a list of the most commonly researched tasks in natural language processing. Note that some of these tasks have direct real-world applications.
Syntax Challenges
- Sentence breaking
- Word segmentation
- Morphological segmentation
- Stemming and Lemmatization
- Part-of-speech tagging
- Terminology extraction
Semantics Challenges
- Named entity recognition (NER)
- Relationship extraction
- Topic segmentation and recognition
- Sentiment analysis
- Machine translation
- Natural language generation
- Question answering
- Natural language understanding
There are many tools and libraries designed to solve NLP problems. The most commonly used ones include the Natural Language Toolkit (NLTK), spaCy, scikit-learn's NLP toolkit, gensim, Pattern, and polyglot, among others. I have selected four of them to demo.
NLTK (DEMO)
NLTK (Natural Language Toolkit) is used for tasks such as tokenization, lemmatization, stemming, parsing, and POS tagging. This library has tools for almost all NLP tasks; a minimal usage sketch follows the pros and cons below.
Pros:
- One of the earliest Python NLP libraries and the most well-known full-featured NLP library
- Many third-party extensions
- Supports the largest number of languages
Cons:
- Complicated to learn
- Slow
- Doesn't provide neural network models
- No integrated word vectors
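For illustration, here is a minimal sketch of common NLTK tasks; it assumes the relevant NLTK data packages (e.g. punkt, averaged_perceptron_tagger, wordnet) have been downloaded:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The quick brown foxes are jumping over the lazy dogs."
tokens = word_tokenize(text)                        # sentence -> word tokens
pos_tags = nltk.pos_tag(tokens)                     # [('The', 'DT'), ...]
stems = [PorterStemmer().stem(t) for t in tokens]   # 'jumping' -> 'jump'
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # 'foxes' -> 'fox'
print(pos_tags)
```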
Scikit-Learn (DEMO)
Scikit-learn is a large, general-purpose machine learning library that also provides tools for text preprocessing; a minimal text classification sketch follows the pros and cons below.
Pros:
- Many functions for using the bag-of-words method to create features for text classification tasks
- Provides a wide variety of algorithms to build ML models
- Good documentation
Cons:
- Doesn't have sophisticated preprocessing capabilities such as POS tagging, parsing, and NER
- Doesn't use neural network models
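As a quick illustration, here is a minimal sketch of scikit-learn's bag-of-words/TF-IDF workflow feeding a simple classifier (the toy documents and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["spark makes big data processing easy",
        "keras makes deep learning easy"]
labels = [0, 1]  # toy labels for illustration

# vectorize text into TF-IDF features, then fit a classifier
clf = make_pipeline(TfidfVectorizer(stop_words='english'),
                    LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["big data with spark"]))
```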
Gensim (DEMO)
Gensim is a package for topic modeling, vector space modeling, and document similarity analysis; a minimal topic modeling sketch follows the pros and cons below.
Pros:
- Works with large datasets and processes data streams
- Provides TF-IDF vectorization, word2vec, doc2vec, Latent Semantic Analysis, and Latent Dirichlet Allocation
- Supports deep learning
Cons:
- Designed primarily for unsupervised text modeling
- Doesn't provide enough tools for a full NLP pipeline, so it should be used together with another library (spaCy or NLTK)
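For illustration, a minimal sketch of gensim's dictionary → bag-of-words → LDA workflow (the toy corpus is made up):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["machine", "learning", "spark"],
         ["deep", "learning", "keras"],
         ["spark", "big", "data"]]

dictionary = Dictionary(texts)                    # token -> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=99)
print(lda.print_topics())
```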
spaCy (DEMO)
spaCy is the main competitor of NLTK, and the two libraries can be used for the same tasks. spaCy offers a full NLP pipeline (tokenizer, tagger, parser, and NER) through its container objects such as Doc, Token, Span, and Lexeme. Compared to NLTK, spaCy is more opinionated about the architecture of an NLP pipeline; a minimal pipeline sketch follows the pros and cons below.
Pros:
- The fastest NLP framework
- Easy to learn and use because it provides one highly optimized tool for each task
- Processes objects; object-oriented
- Uses neural networks for training some models
- Provides built-in word vectors
Cons:
- Lacks flexibility compared to NLTK
- Sentence segmentation is slower than NLTK
- Doesn't support many languages
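For illustration, a minimal sketch of spaCy's pipeline and container objects; it assumes the en_core_web_sm model has been installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # pipeline: tokenizer, tagger, parser, NER
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

for token in doc:                    # Token objects
    print(token.text, token.pos_, token.dep_)

for ent in doc.ents:                 # Span objects produced by NER
    print(ent.text, ent.label_)
```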
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. In a Python or Jupyter notebook environment, data scientists typically use pandas, numpy, matplotlib, seaborn, or even plotly to perform EDA.
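As a quick illustration, a minimal EDA sketch with pandas and seaborn (the file path and column name are placeholders, not from a specific project):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('./data/example.csv')
print(df.shape)
print(df.describe())        # summary statistics of numeric columns
print(df.isnull().sum())    # missing values per column

sns.histplot(df['some_numeric_column'])   # distribution of one feature
plt.show()
```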
Online analytical processing (OLAP) is an approach to answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing, and data mining. In the context of big data analytics (distributed computing), data scientists often perform OLAP with SQL queries on Apache software such as Hive, Spark, and Hadoop.
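For illustration, here is a minimal sketch of an OLAP-style query in Spark SQL, slicing incident counts by category and drilling down by year (the table and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('olap-demo').getOrCreate()
df = spark.read.csv('./data/incidents.csv', header=True, inferSchema=True)
df.createOrReplaceTempView('incidents')

# slice and dice: aggregate incidents by category, drill down by year
spark.sql("""
    SELECT category, year, COUNT(*) AS num_incidents
    FROM incidents
    GROUP BY category, year
    ORDER BY num_incidents DESC
""").show()
```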
The following are two projects that I have done. One analyzes San Francisco crime datasets; the other analyzes Medium blogpost text datasets.
San Francisco Crime Analysis in Apache Spark (DEMO)
- Perform analytical operations such as consolidation, drill-down, and slicing and dicing on a 15-year dataset of reported incidents from the SFPD
- Perform spatial and time series analysis to further understand crime patterns and distribution in SF
- Build a data processing pipeline based on Spark RDD, DataFrame, and Spark SQL for various OLAP tasks
- Train and fine-tune a time series model to forecast the number of theft incidents per month
Medium BlogPost Analysis in Pandas & Seaborn (DEMO)
- Develop statistical data visualizations with seaborn to obtain summary statistics such as the distribution of blogpost popularity, trends across blogpost topics, and the top-n most popular topics and authors
- Perform feature engineering to extract features from blogpost contents, titles, authors, and topics
- Apply various statistical charts to understand the correlation between a blogpost's popularity and its extracted features
Topic modeling is a type of statistical modeling for discovering the latent “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model; it is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
The following projects use topic models as a text mining tool to discover the latent "topics" in Medium blogposts, as well as the trends and popularity of those topics. With topic modeling, we can identify which latent "topics" are trendy and continue to be the most popular content.
The Medium blogpost datasets are scraped from Medium with the scrapy framework. Details of the scrapy implementation are in another of my data science projects, MediumBlog.
NLP and Topic Modeling on Medium BlogPost with Apache Spark (DEMO)
- Apply topic modeling to understand what drives a blog post’s popularity (as measured in claps) and the interaction between users’ preferences and blog posts’ contents
- Build a feature extraction pipeline in Spark that tokenizes raw text, removes stop words, applies stemming/lemmatization, and performs BOW/TF-IDF transformation (see the sketch after this list)
- Implement the unsupervised learning models K-means and LDA to discover latent topics embedded in blog posts and identify the key words of each topic for clustering and similarity queries
- Evaluate the model's clustering results with visual displays after dimensionality reduction using PCA and t-SNE
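Here is a minimal sketch of what such a Spark feature extraction and LDA pipeline can look like. It assumes a DataFrame `df` with a raw `text` column; the stage parameters are illustrative rather than the project's exact configuration, and since pyspark.ml has no built-in stemmer, that step is omitted here:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.clustering import LDA

tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
cv = CountVectorizer(inputCol="filtered", outputCol="tf", vocabSize=5000)
idf = IDF(inputCol="tf", outputCol="features")
lda = LDA(k=10, maxIter=20, featuresCol="features")

# chain tokenization -> stop-word removal -> BOW/TF-IDF -> LDA
pipeline = Pipeline(stages=[tokenizer, remover, cv, idf, lda])
model = pipeline.fit(df)   # fit all stages end to end
```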
NLP and Topic Modeling on Medium BlogPost with Sklearn (DEMO)
- Perform tasks similar to the above, but using sklearn rather than Spark
A recommender system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Recommender systems are utilized in a variety of areas including movies, music, news, social tags, and products in general. Recommender systems typically produce a list of recommendations in one of two ways – through collaborative filtering or through content-based filtering.
Collaborative filtering
This approach builds a model from a user's past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. The model is then used to predict items (or ratings for items) that the user may have an interest in.
Content-based filtering
This approach utilizes a series of discrete characteristics of an item in order to recommend additional items with similar properties.
Hybrid Recommender
This approach combines the previous two.
In my project, I focus on building a collaborative filtering engine. Collaborative filtering typically faces the following challenges:
- cold start
- data sparsity
- popularity bias (how to recommend products from the long tail of the product distribution)
- scalability (computation grows as number of users and items grow)
- poor relationships between like-minded yet sparse users
Solution
Use matrix factorization techniques to reduce dimensionality and sparsity, and to capture user information in user latent factors and item information in item latent factors
Implementations
I chose two different types of ML algorithms to build two separate movie recommendation engines and compare their performance and results. The following is the list of algorithms I implement for the movie recommendation engines:
- Alternating Least Square (ALS) Matrix Factorization
- Neural Collaborative Filtering Approach
  - Generalized Matrix Factorization (GMF)
  - Multi-Layer Perceptron (MLP)
  - Neural Matrix Factorization (NeuMF)
Datasets
I use the MovieLens Small dataset. This dataset (ml-latest-small) describes 5-star ratings and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100,004 ratings and 1,296 tag applications across 9,125 movies.
Model Performance Comparison on Test Datasets
MODEL | MEAN SQUARED ERROR | ROOT MEAN SQUARED ERROR |
---|---|---|
ALS | 0.8475 | 0.9206 |
GMF | 0.8532 | 0.9237 |
MLP | 0.8270 | 0.9094 |
NeuMF | 0.8206 | 0.9059 |
Movie Recommendation Engine Development in Apache Spark (DEMO)
In the context of distributed computing and large-scale machine learning, Alternating Least Squares (ALS) in Spark ML is definitely one of the first go-to models for collaborative filtering in recommender systems. The ALS algorithm has proven very effective on both explicit and implicit feedback datasets.
In addition, Alternating Least Squares with Weighted-λ-Regularization (ALS-WR) is a parallel algorithm designed for a large-scale collaborative filtering challenge, the Netflix Prize. The method addresses the scalability and sparseness of the user profiles; it is simple and scales well to very large datasets.
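For reference, a sketch of the ALS-WR objective from Zhou et al.'s Netflix Prize paper, where p_u and q_i are the user and item latent vectors, K is the set of observed (user, item) ratings, and n_{p_u}, n_{q_i} count the ratings of user u and item i:

```latex
\min_{P, Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \sum_{u} n_{p_u} \lVert p_u \rVert^2
                 + \sum_{i} n_{q_i} \lVert q_i \rVert^2 \right)
```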
Advantages of collaborative filtering over content-based methods:
- No need to know about item content
- "Item cold-start" problem is avoided
- User interest may change over time
- Explainability
My implementation of training the best ALS model via cross-validation and hyperparameter tuning:
```python
from src.spark_recommender_system import Dataset, train_ALS
from pyspark.ml.evaluation import RegressionEvaluator

# config
SEED = 99
MAX_ITER = 10
SPLIT_RATIO = [6, 2, 2]
DATAPATH = './data/movie/ratings.csv'

# construct movie ratings dataset object
# (assumes an active SparkSession named `spark` is already in scope)
rating_data = Dataset(spark, DATAPATH)

# get rating data as a Spark RDD
rating_rdd = rating_data.RDD

# get train, validation, and test data
train_data, validation_data, test_data = rating_data.split_data(rating_rdd, SPLIT_RATIO, SEED)

# create a hyperparam tuning grid
regParams = [0.05, 0.1, 0.2, 0.4, 0.8]
ranks = [6, 8, 10, 12, 14]

# train models and select the best model in hyperparam tuning
best_model = train_ALS(train_data, validation_data, MAX_ITER, regParams, ranks)

# test model
predictions = best_model.transform(test_data)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
```
Movie Recommendation Engine Development in Neural Networks with Keras (DEMO)
Neural Collaborative Filtering (NCF) is a paper published by researchers from the National University of Singapore, Columbia University, Shandong University, and Texas A&M University in 2017. It utilizes the flexibility, complexity, and non-linearity of neural networks to build a recommender system. It proves that Matrix Factorization, a traditional recommender technique, is a special case of Neural Collaborative Filtering. It also shows that NCF outperforms state-of-the-art models on two public datasets.
Before we get into the Keras implementation of Neural Collaborative Filtering (NCF), let's quickly review Matrix Factorization and how it is implemented in the context of Neural Networks.
Here is the illustration of the math:
Essentially, each user and each item is projected onto a latent space and represented by a latent vector. The more similar a user's latent vector is to an item's, the stronger the corresponding user's preference for that item. Since we factorize the user-item matrix into the same latent space, we can measure the similarity of any two latent vectors with cosine similarity or the dot product.
In a neural network, we implement this with an embedding layer. We map a user's one-hot encoded vector to a user embedded vector, and an item's one-hot encoded vector to an item embedded vector. Then we perform an element-wise multiplication of the user latent vector and the item latent vector, which yields an element-wise user-item latent vector.
In traditional matrix factorization, we would simply sum up this vector, which equals the dot product of the user latent vector and the item latent vector. We then minimize the loss between this dot product and the true ratings in the user-item association matrix.
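As a toy numpy sketch of this idea (the dimensions and names are illustrative only):

```python
import numpy as np

n_users, n_items, latent_dim = 4, 5, 3
P = np.random.rand(n_users, latent_dim)   # user latent factors
Q = np.random.rand(n_items, latent_dim)   # item latent factors

# predicted rating of user u for item i = dot product of their latent vectors
u, i = 0, 2
pred = P[u].dot(Q[i])

# equivalently, the full predicted user-item matrix
R_hat = P.dot(Q.T)
```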
However, in the world of neural networks, we can generalize matrix factorization by feeding the element-wise user-item latent vector into fully connected (FC) layers. The neural CF layers can be any kind of neuron connections. With the complicated connections and non-linearity in the neural CF layers, this model is capable of properly estimating the complex interactions between user and item in the latent space. The objective function is then to minimize the loss between the predictions and the ratings. This is exactly how Generalized Matrix Factorization (GMF) is implemented. Below is the graph of the network architecture:
To further generalize matrix factorization in a neural network, we need to increase the complexity of the network's hypothesis space and remove the fixed calculation rules from the neural topology. This means removing the element-wise multiplication layer and adding more neural CF layers; for example, a multi-layer perceptron (MLP) can be placed after the concatenation of the user and item embedding layers. This is the Multi-Layer Perceptron (MLP) model. Below is the graph of the network architecture:
Now that we understand how generalized matrix factorization works in the world of neural networks, the next question is how we can improve the model. One simple trick often used in machine learning competitions is "stacking". In neural networks, "stacking" means we concatenate the outputs of the GMF and MLP networks and connect them to the output layer (sigmoid-activated in the original paper). This is Neural Matrix Factorization (NeuMF). Below is the graph of the network architecture:
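To make this concrete in code, here is a minimal Keras sketch of the NeuMF idea. The layer sizes are illustrative, this is not the project's exact get_NeuMF_model implementation, and for rating regression the linear output below stands in for the paper's sigmoid:

```python
from tensorflow.keras.layers import (Input, Embedding, Flatten, Dense,
                                     Multiply, Concatenate)
from tensorflow.keras.models import Model

def build_neumf(num_users, num_items, latent_dim=8, mlp_dims=(32, 16, 8)):
    user_in = Input(shape=(1,))
    item_in = Input(shape=(1,))

    # GMF branch: element-wise product of user and item embeddings
    gmf_u = Flatten()(Embedding(num_users, latent_dim)(user_in))
    gmf_i = Flatten()(Embedding(num_items, latent_dim)(item_in))
    gmf_out = Multiply()([gmf_u, gmf_i])

    # MLP branch: dense layers over separately concatenated embeddings
    mlp_u = Flatten()(Embedding(num_users, latent_dim)(user_in))
    mlp_i = Flatten()(Embedding(num_items, latent_dim)(item_in))
    x = Concatenate()([mlp_u, mlp_i])
    for dim in mlp_dims:
        x = Dense(dim, activation='relu')(x)

    # NeuMF: "stack" (concatenate) both branches before the output layer
    out = Dense(1)(Concatenate()([gmf_out, x]))
    return Model([user_in, item_in], out)
```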
My implementation of building and training Neural Matrix Factorization (NeuMF):
```python
import numpy as np
import pandas as pd
from src.neural_recommender_system import (get_GMF_model,
                                           get_MLP_model,
                                           get_NeuMF_model,
                                           train_model,
                                           load_trained_model,
                                           train_test_split)
# (train_test_split is assumed to come from the same src module,
#  with the signature train_test_split(df, test_size, seed))

# data config
DATAPATH = './data/movie/ratings.csv'
MODELPATH = './data/movie/tmp/model.hdf5'
SEED = 99
TEST_SIZE = 0.2

# model config
EMBEDDED_DIM = 10
L2_REG = 0
MLP_HIDDEN_LAYERS = [64, 32, 16, 8]

# trainer config
OPTIMIZER = 'adam'
BATCH_SIZE = 64
EPOCHS = 30
VAL_SPLIT = 0.25

# load ratings
df_ratings = pd.read_csv(
    DATAPATH,
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

# get total number of users and items
num_users = len(df_ratings.userId.unique())
num_items = len(df_ratings.movieId.unique())

# train/test split
df_train, df_test = train_test_split(df_ratings, TEST_SIZE, SEED)

# build Generalized Matrix Factorization (GMF)
GMF_model = get_GMF_model(num_users, num_items, EMBEDDED_DIM, L2_REG, L2_REG)

# build Multi-Layer Perceptron (MLP)
MLP_model = get_MLP_model(num_users, num_items,
                          MLP_HIDDEN_LAYERS, [L2_REG for i in range(4)])

# build Neural Matrix Factorization (NeuMF)
NeuMF_model = get_NeuMF_model(num_users, num_items, EMBEDDED_DIM,
                              (L2_REG, L2_REG), MLP_HIDDEN_LAYERS,
                              [L2_REG for i in range(4)])

# let's just train Neural Matrix Factorization (NeuMF)
train_model(NeuMF_model, OPTIMIZER, BATCH_SIZE, EPOCHS, VAL_SPLIT,
            inputs=[df_train.userId.values, df_train.movieId.values],
            outputs=df_train.rating.values,
            filepath=MODELPATH)

# load the best trained model: rebuild the architecture first
NeuMF_model = get_NeuMF_model(num_users, num_items, EMBEDDED_DIM,
                              (L2_REG, L2_REG), MLP_HIDDEN_LAYERS,
                              [L2_REG for i in range(4)])

# then load the saved weights
NeuMF_model = load_trained_model(NeuMF_model, MODELPATH)

# define metric - RMSE over true vs. predicted ratings
rmse = lambda true, pred: np.sqrt(
    np.mean(np.square(np.squeeze(pred) - np.squeeze(true))))

# test model
predictions = NeuMF_model.predict([df_test.userId.values, df_test.movieId.values])
error = rmse(df_test.rating.values, predictions)
print('The out-of-sample RMSE of rating predictions is', round(error, 4))
```