Pull request for recommender system framework over GraphChi for review #2

Status: Open. Wants to merge 70 commits into base: master.

Changes from all commits

Commits (70)
75726ce
First Commit - Some refactoring of setup code and SVDPP implementatio…
MohtaMayank Sep 20, 2013
f33479c
First Commit - Some refactoring of setup code and SVDPP implementatio…
MohtaMayank Sep 20, 2013
558482c
SVDPP fixes - still not giving RMSE as cpp version for test set
MohtaMayank Sep 22, 2013
15a5a46
First version of Bayesian Probabilistic Matrix Factorization - not te…
MohtaMayank Sep 22, 2013
bbaad36
Wrong PMF - Need to implement dynamic edge values?
MohtaMayank Oct 2, 2013
e71eeee
Working PMF and SVDPP - RMSE not as low as CPP version
MohtaMayank Oct 2, 2013
d784996
Working versions of SVDPP and PMF - only evaluated using training RMSE
MohtaMayank Oct 3, 2013
bb3565b
Adding comment about difference with C++ implementation
MohtaMayank Oct 3, 2013
1f25ecc
Cleaning imports
MohtaMayank Oct 4, 2013
8eb9ad9
Incomplete implementation of LibFM_MCMC
MohtaMayank Oct 7, 2013
d0ec864
Merge branch 'metrics' into first-branch
MohtaMayank Oct 8, 2013
9cf62fe
Refactoring SVDPP - Use HugeDoubleMatrix instead of individual objects
MohtaMayank Oct 8, 2013
13f626c
Refactoring code
MohtaMayank Oct 13, 2013
ef5a105
Refactoring ALS and PMF to use efficient data structure for params an…
MohtaMayank Oct 13, 2013
b08b520
BiasSgd framework
sam9595 Oct 15, 2013
6ced3f3
Incorrect implementation of LibFM, useful things in there to correct …
MohtaMayank Oct 19, 2013
afa50d3
Lot of random changes, LibFM SGD implementation
MohtaMayank Oct 22, 2013
e117013
Add BiasSgd
sam9595 Oct 22, 2013
db8219f
Merge branch 'first-branch' of https://github.com/MohtaMayank/graphch…
sam9595 Oct 22, 2013
08e3b46
temp commit
sam9595 Oct 22, 2013
fe9fb8e
commit trial
sam9595 Oct 22, 2013
8fe5271
First commit for generic Data API for rec systems
MohtaMayank Oct 24, 2013
252ed2d
Standard framework for recommender systems - Needs some improvement a…
MohtaMayank Oct 26, 2013
552723c
More refactoring and comments - LibFM needs debugging
MohtaMayank Oct 27, 2013
e9e1919
Merge branch 'first-branch' of https://github.com/MohtaMayank/graphch…
sam9595 Oct 27, 2013
d19b350
modify biasSgd
sam9595 Oct 28, 2013
7388ac1
Implementing code for validation
MohtaMayank Oct 30, 2013
a3121d9
Some minor cleanup
MohtaMayank Oct 30, 2013
d520eb6
Merging recommender system framework
MohtaMayank Oct 30, 2013
2d1c58f
Fixing SVDPP bug
MohtaMayank Oct 30, 2013
1efc5bf
modify some ALS lines based on Aapo's suggestions
sam9595 Nov 2, 2013
0df95f5
modify some ALS lines based on Aapo's suggestions
sam9595 Nov 2, 2013
0090c72
lastFM converter and biasSgd setting
sam9595 Nov 4, 2013
03915ab
Merge branch 'first-branch'
MohtaMayank Nov 5, 2013
b959e33
Naive Yarn Scheduler
MohtaMayank Nov 12, 2013
525d317
Improvements and implementation of RecommederScheduler and Recommende…
MohtaMayank Nov 14, 2013
d8604b0
Merge branch 'master' into rec-yarn
MohtaMayank Nov 14, 2013
2feb4e9
Working on single node YARN with HDFS. Still some problems with multi…
MohtaMayank Nov 18, 2013
65c34e6
Cosmetic changes and comments. Other minor changes
MohtaMayank Nov 19, 2013
0473413
Fixing indentation
MohtaMayank Nov 19, 2013
f7cabb2
Merge https://github.com/GraphChi/graphchi-java
MohtaMayank Nov 19, 2013
0278e5b
Automatic deployment and running of YARN on AWS
MohtaMayank Nov 20, 2013
df28228
Adding finishComputation to GraphChiContext. Other improvements in pe…
MohtaMayank Nov 22, 2013
5834f69
serialization predictTest
sam9595 Nov 22, 2013
6fa3d14
delete unnecessary comments
sam9595 Nov 22, 2013
27a8fd3
Merging the changes related to serialization of model and prediction
MohtaMayank Nov 22, 2013
f00f26f
Ported PMF to new model as well as some changes in RecommenderPool / …
MohtaMayank Nov 24, 2013
4fb1070
Adding code for estimating memory usage by graphchi engine
MohtaMayank Nov 26, 2013
60d2071
Committing before trying to install new OS
MohtaMayank Nov 27, 2013
71deb4d
Fixed parsing parameters, serializtion for all the 5 recommenders.
MohtaMayank Nov 27, 2013
2f74ea3
Adding code to serialize model into HDFS and Code to read raw data fr…
MohtaMayank Nov 27, 2013
b49343a
Adding missed file to read data from URL / S3
MohtaMayank Nov 27, 2013
98157b8
generic_prediction
sam9595 Nov 27, 2013
c3a0fde
merge rec-yarn and rec-yarn-serialize
sam9595 Nov 28, 2013
f35863c
Broken logic for sccheduling
MohtaMayank Nov 28, 2013
adafd27
Automatic setup of YARN cluster on AWS
MohtaMayank Nov 28, 2013
3a245c7
Merge branch 'rec-yarn' into rec-yarn-serialize
sam9595 Nov 28, 2013
b43d968
Using custom method for building paths instead of Java.nio.Paths
MohtaMayank Nov 28, 2013
4128c65
Add Error Measurement Interface
sam9595 Nov 28, 2013
c57605a
Better scheduling logic for YARN
MohtaMayank Dec 1, 2013
10ae999
Add yahoo data description and demo model json files
sam9595 Dec 1, 2013
d111685
Merge branch 'rec-yarn' into rec-yarn-serialize
sam9595 Dec 1, 2013
0fb5016
Add functionality to serialize in the middle
sam9595 Dec 1, 2013
687aa07
Adding bias and factor reg for bias sgd, max iterations for all recom…
MohtaMayank Dec 2, 2013
e24f682
New Testing class which uses data reader API
MohtaMayank Dec 2, 2013
2c781dd
Predicting PMF output with all the samples
MohtaMayank Dec 3, 2013
1fcf2bb
README, sample data and some minur improvements
MohtaMayank Dec 6, 2013
a0bf4c5
README
MohtaMayank Dec 6, 2013
102b41b
YARN README
MohtaMayank Dec 7, 2013
61ef118
Add Javadoc comments
sam9595 Dec 9, 2013
78 changes: 48 additions & 30 deletions pom.xml
@@ -14,15 +14,18 @@
<name>Sonatype Nexus Snapshots</name>
<url>https://oss.sonatype.org/content/repositories/snapshots/</url>
<snapshots><enabled>true</enabled></snapshots>

</repository>
<repository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
<repository>
<id>ApacheReleases</id>
<name>Apache release repository</name>
<url>https://repository.apache.org/content/repositories/releases</url>
</repository>
</repositories>

<dependencies>
<dependency>
<groupId>com.yammer.metrics</groupId>
@@ -32,35 +35,35 @@

<!-- Scala version is very important. Luckily the plugin warns you if you don't specify:
[WARNING] you don't define org.scala-lang:scala-library as a dependency of the project -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.9.0-1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.6</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<type>jar</type>
<scope>test</scope>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<!-- replaces hadoop-core 0.20.2 -->
<artifactId>hadoop-client</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.pig</groupId>
Author comment: Need to fix the indentation. It is pretty screwed up here

<artifactId>pig</artifactId>
<scope>compile</scope>
<version>0.12.0</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-math</artifactId>
@@ -76,6 +79,21 @@
<artifactId>commons-cli</artifactId>
<version>1.2</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-math3</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>gov.sandia.foundry</groupId>
<artifactId>gov-sandia-cognition-learning-core</artifactId>
<version>3.3.3</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk</artifactId>
<version>1.6.7</version>
</dependency>
</dependencies>

<build>
3 changes: 3 additions & 0 deletions sample_data/Movielens/conversion_scripts.sh
@@ -0,0 +1,3 @@
python movielens_userfeatures.py /media/Data1/Capstone/Movielens/ml-100k/working_dir/u.user;
python movielens_item_features.py /media/Data1/Capstone/Movielens/ml-100k/working_dir/u.item;
python convert_to_mm.py -g /media/Data1/Capstone/Movielens/ml-100k/working_dir/u.data -e 100000 -u '{"file_name":"/media/Data1/Capstone/Movielens/ml-100k/working_dir/u.user.processed", "num":943}' -i '{"file_name":"/media/Data1/Capstone/Movielens/ml-100k/working_dir/u.item.processed", "num":1682}';
145 changes: 145 additions & 0 deletions sample_data/Movielens/ml-100k/README
@@ -0,0 +1,145 @@
SUMMARY & USAGE LICENSE
=============================================

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:

* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.

* The user must acknowledge the use of the data set in
publications resulting from the use of the data set, and must
send us an electronic or paper copy of those publications.

* The user may not redistribute the data without separate
permission.

* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.

If you have any further questions or comments, please contact Jon Herlocker
<[email protected]>.

ACKNOWLEDGEMENTS
==============================================

Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================

Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================

The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is most well
known for its world wide trial of an automated collaborative filtering
system for Usenet news in 1996. The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:

http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on
collaborative filtering:

http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES
==============================================

Here are brief descriptions of the data.

ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this:
gunzip ml-data.tar.gz
tar xvf ml-data.tar
mku.sh

u.data -- The full u data set, 100000 ratings by 943 users on 1682 items.
Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of
user id | item id | rating | timestamp.
The time stamps are unix seconds since 1/1/1970 UTC
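The tab-separated record format described above can be parsed with a few lines of Python (a sketch; the helper name is illustrative, not part of the data set's scripts):

```python
from collections import namedtuple

# One u.data record: user id \t item id \t rating \t unix timestamp
Rating = namedtuple("Rating", ["user", "item", "rating", "timestamp"])

def read_ratings(lines):
    """Parse an iterable of u.data lines into Rating tuples."""
    for line in lines:
        u, i, r, t = line.rstrip("\n").split("\t")
        yield Rating(int(u), int(i), int(r), int(t))
```

Typical usage: `with open("u.data") as f: ratings = list(read_ratings(f))`.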

u.info -- The number of users, items, and ratings in the u data set.

u.item -- Information about the items (movies); this is a tab separated
list of
movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie
is of that genre, a 0 indicates it is not; movies can be in
several genres at once.
The movie ids are the ones used in the u.data data set.
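The 19 genre flags described above can be decoded with a short Python sketch. Note one assumption: distributed copies of ml-100k ship u.item as a '|'-delimited file, so the separator is parameterized here.

```python
GENRES = ["unknown", "Action", "Adventure", "Animation", "Children's",
          "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
          "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller",
          "War", "Western"]

def movie_genres(item_line, sep="|"):
    """Return (movie_id, title, [genre names]) from one u.item record."""
    fields = item_line.rstrip("\n").split(sep)
    movie_id, title = int(fields[0]), fields[1]
    flags = fields[-19:]  # the last 19 fields are the 0/1 genre indicators
    genres = [name for name, flag in zip(GENRES, flags) if flag == "1"]
    return movie_id, title, genres
```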

u.genre -- A list of the genres.

u.user -- Demographic information about the users; this is a tab
separated list of
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test are 80%/20% splits of the u data into training and test data.
u2.base Each of u1, ..., u5 have disjoint test sets; this if for
u2.test 5 fold cross validation (where you repeat your experiment
u3.base with each training and test set and average the results).
u3.test These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test
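An 80%/20% split like u1.base/u1.test can be reproduced approximately in Python (a sketch; the originals were generated by mku.sh, so the exact membership will differ):

```python
import random

def split_ratings(ratings, test_frac=0.2, seed=0):
    """Shuffle ratings and return (base, test) lists with a test_frac holdout."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]
```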

ua.base -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test split the u data into a training set and a test set with
ub.base exactly 10 ratings per user in the test set. The sets
ub.test ua.test and ub.test are disjoint. These data sets can
be generated from u.data by mku.sh.

allbut.pl -- The script that generates training and test sets where
all but n of a user's ratings are in the training data.

mku.sh -- A shell script to generate all the u data sets from u.data.
52 changes: 52 additions & 0 deletions sample_data/Movielens/ml-100k/all_tests.json
@@ -0,0 +1,52 @@
[
{
"serializedFile":"/tmp/BIAS_SGD_003_1_10_001_100",
"outputFile":"/tmp/out.txt",
"errorMeasure": "RMSE"
},
{
"serializedFile":"/tmp/BIAS_SGD_003_1_10_001_100",
"outputFile":"/tmp/out.txt",
"errorMeasure": "MAE"
},
{
"serializedFile":"/tmp/SVDPP_10_01_1_007_005_50",
"outputFile":"/tmp/out.txt",
"errorMeasure": "RMSE"
},
{
"serializedFile":"/tmp/SVDPP_10_01_1_007_005_50",
"outputFile":"/tmp/out.txt",
"errorMeasure": "MAE"
},
{
"serializedFile":"/tmp/ALS_065",
"outputFile":"/tmp/out.txt",
"errorMeasure": "RMSE"
},
{
"serializedFile":"/tmp/ALS_065",
"outputFile":"/tmp/out.txt",
"errorMeasure": "MAE"
},
{
"serializedFile": "/tmp/LibFM_1__20_002_50",
"outputFile":"/tmp/out.txt",
"errorMeasure": "RMSE"
},
{
"serializedFile": "/tmp/LibFM_1__20_002_50",
"outputFile":"/tmp/out.txt",
"errorMeasure": "MAE"
},
{
"serializedFile":"/tmp/PMF_065_10_5b",
"outputFile":"/tmp/out.txt",
"errorMeasure": "RMSE"
},
{
"serializedFile":"/tmp/PMF_065_10_5b",
"outputFile":"/tmp/out.txt",
"errorMeasure": "MAE"
}
]
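Each entry above pairs a serialized model with an error measure. RMSE and MAE carry their standard definitions; a minimal Python sketch (function names are illustrative, not taken from this PR's code):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over paired prediction/rating lists."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def mae(predicted, actual):
    """Mean absolute error over paired prediction/rating lists."""
    n = len(predicted)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n
```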
34 changes: 34 additions & 0 deletions sample_data/Movielens/ml-100k/allbut.pl
@@ -0,0 +1,34 @@
#!/usr/local/bin/perl

# get args
if (@ARGV < 4) {
print STDERR "Usage: $0 base_name start stop max_test [ratings ...]\n";
exit 1;
}
$basename = shift;
$start = shift;
$stop = shift;
$maxtest = shift;

# open files
open( TESTFILE, ">$basename.test" ) or die "Cannot open $basename.test for writing\n";
open( BASEFILE, ">$basename.base" ) or die "Cannot open $basename.base for writing\n";

# init variables
$testcnt = 0;

while (<>) {
($user) = split;
if (! defined $ratingcnt{$user}) {
$ratingcnt{$user} = 0;
}
++$ratingcnt{$user};
if (($testcnt < $maxtest || $maxtest <= 0)
&& $ratingcnt{$user} >= $start && $ratingcnt{$user} <= $stop) {
++$testcnt;
print TESTFILE;
}
else {
print BASEFILE;
}
}
15 changes: 15 additions & 0 deletions sample_data/Movielens/ml-100k/file_ml-100k_desc.json
@@ -0,0 +1,15 @@
{
"ratingsUrl": "file://./sample_data/Movielens/ml-100k/working_dir2/u.data_tr1.mm",
"userFeaturesUrl": "file://./sample_data/Movielens/ml-100k/working_dir2/u.user.processed.converted",
"itemFeaturesUrl": "file://./sample_data/Movielens/ml-100k/working_dir2/u.item.processed.converted",
"validationUrl": "file://./sample_data/Movielens/ml-100k/working_dir2/u.data_val1.mm",
"numUsers": "943",
"numItems": "1682",
"numRatings": "80000",
"meanRatings": "0",
"numUserFeatures": "38",
"numItemFeatures": "20",
"numRatingFeatures": "0",
"minval": "1",
"maxval": "5"
}
14 changes: 14 additions & 0 deletions sample_data/Movielens/ml-100k/file_test_ml-100k_desc.json
@@ -0,0 +1,14 @@
{
"ratingsUrl": "file://./sample_data/Movielens/ml-100k/working_dir2/u.data_val1.mm",
"userFeaturesUrl": "file://./sample_data/Movielens/ml-100k/working_dir2/u.user.processed.converted",
"itemFeaturesUrl": "file://./sample_data/Movielens/ml-100k/working_dir2/u.item.processed.converted",
"numUsers": "943",
"numItems": "1682",
"numRatings": "20000",
"meanRatings": "0",
"numUserFeatures": "38",
"numItemFeatures": "20",
"numRatingFeatures": "0",
"minval": "1",
"maxval": "5"
}
15 changes: 15 additions & 0 deletions sample_data/Movielens/ml-100k/hdfs_ml-100k_desc.json
@@ -0,0 +1,15 @@
{
"ratingsUrl": "hdfs://localhost:9000/user/hdfs/Movielens/ml-100k/working_dir2/u.data_tr1.mm",
"userFeaturesUrl": "hdfs://localhost:9000/user/hdfs/Movielens/ml-100k/working_dir2/u.user.processed.converted",
"itemFeaturesUrl": "hdfs://localhost:9000/user/hdfs/Movielens/ml-100k/working_dir2/u.item.processed.converted",
"validationUrl": "hdfs://localhost:9000/user/hdfs/Movielens/ml-100k/working_dir2/u.data_val1.mm",
"numUsers": "943",
"numItems": "1682",
"numRatings": "80000",
"meanRatings": "0",
"numUserFeatures": "38",
"numItemFeatures": "20",
"numRatingFeatures": "0",
"minval": "1",
"maxval": "5"
}