# Exercises

In the code block below, import the StackOverflow dataset SFrame that you saved during earlier exercises. Note that this data is shared courtesy of StackExchange and is under the Creative Commons Attribution-ShareAlike 3.0 Unported License. This particular version of the data set was used in a recent Kaggle competition.
import os
import graphlab

# Load the locally cached copy if it exists; otherwise download it and cache it.
if os.path.exists('stack_overflow'):
    sf = graphlab.SFrame('stack_overflow')
else:
    sf = graphlab.SFrame('https://static.turi.com/datasets/stack_overflow')
    sf.save('stack_overflow')
Question 1: Visually explore the above data using GraphLab Canvas.
sf.show()
In this section we will make a model that can be used to recommend new tags to users.
Question 2: Create a new column called `Tags` where each element is a list of all the tags used for that question. (Hint: Check out `sf.pack_columns`.)
sf = sf.pack_columns(column_prefix='Tag', new_column_name='Tags')
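To see what packing does without loading the full dataset, here is a minimal pure-Python sketch. The row layout (`Tag1` through `Tag3`) and the helper `pack_prefixed` are assumptions made for illustration; they mimic the idea only and are not GraphLab Create's implementation.

```python
def pack_prefixed(row, prefix, new_name):
    """Collapse every key starting with `prefix` into one list under `new_name`.

    Hypothetical helper for illustration; not part of GraphLab Create.
    """
    # Collect the prefixed values in key order (Tag1, Tag2, ...).
    packed = [row[key] for key in sorted(row) if key.startswith(prefix)]
    # Keep all other columns unchanged.
    rest = {key: value for key, value in row.items() if not key.startswith(prefix)}
    rest[new_name] = packed
    return rest


# Illustrative row: the column names and values are assumed, not taken from the dataset.
row = {'OwnerUserId': 7, 'Tag1': 'python', 'Tag2': 'pandas', 'Tag3': ''}
packed_row = pack_prefixed(row, prefix='Tag', new_name='Tags')
print(packed_row)
```

Note that the empty string survives packing, which is why a later step filters it out.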
Question 3: Make your SFrame only contain the `OwnerUserId` column and the `Tags` column you created in the previous step.
sf = sf[['OwnerUserId', 'Tags']]
Question 4: Use the following Python function to modify the `Tags` column so that its lists no longer contain any empty strings.
def remove_empty(tags):
    # Keep only the non-empty tag strings.
    return [tag for tag in tags if tag != '']

sf['Tags'] = sf['Tags'].apply(remove_empty)
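As a quick sanity check, the same function can be tried on a single hand-written list before applying it to the whole column. The sample values below are invented for illustration.

```python
def remove_empty(tags):
    # Keep only the non-empty tag strings.
    return [tag for tag in tags if tag != '']


# Made-up example list with padding left over from unused tag slots.
sample = ['python', '', 'pandas', '']
cleaned = remove_empty(sample)
print(cleaned)
```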
Question 5: Create a new SFrame called `user_tag` that has a row for every (user, tag) pair. (Hint: See `sf.stack`.)
user_tag = sf.stack(column_name='Tags', new_column_name='Tag')
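Conceptually, stacking turns one row holding a list into one row per list element. The sketch below shows that idea in plain Python on made-up data; `stack_tags` is a hypothetical helper, and how `SFrame.stack` treats empty lists depends on its options, so check its documentation rather than relying on this sketch.

```python
def stack_tags(rows):
    """Yield one (user, tag) pair per element of each row's tag list.

    Hypothetical helper for illustration; rows with an empty list
    simply produce no pairs here.
    """
    for user, tags in rows:
        for tag in tags:
            yield (user, tag)


# Made-up input: (OwnerUserId, Tags) rows.
rows = [(7, ['python', 'pandas']), (9, ['javascript']), (11, [])]
pairs = list(stack_tags(rows))
print(pairs)
```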
Question 6: Create a new SFrame called `user_tag_count` that has three columns:
- `OwnerUserId`
- `Tag`
- `Count`

where `Count` contains the number of times the given `Tag` was used by that particular `OwnerUserId`. (Hint: See `groupby`.)
# Name the output column explicitly so it matches the `Count` column requested above.
user_tag_count = user_tag.groupby(['OwnerUserId', 'Tag'],
                                  {'Count': graphlab.aggregate.COUNT()})
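The group-and-count step is equivalent to tallying how often each (user, tag) pair occurs. A minimal sketch with the standard library, using made-up pairs, looks like this; it illustrates the aggregation only and says nothing about how GraphLab Create computes it internally.

```python
from collections import Counter

# Made-up (OwnerUserId, Tag) pairs; user 7 used 'python' twice.
pairs = [(7, 'python'), (7, 'pandas'), (7, 'python'), (9, 'javascript')]

# Tally identical pairs, then lay the result out as three-column rows.
counts = Counter(pairs)
user_tag_rows = [
    {'OwnerUserId': user, 'Tag': tag, 'Count': n}
    for (user, tag), n in sorted(counts.items())
]
for row in user_tag_rows:
    print(row)
```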
Question 7: Visually explore this summarized version of your data set with GraphLab Canvas.
user_tag_count.show()
Question 8: Use `graphlab.recommender.create()` to create a model that can be used to recommend tags to each user.
m = graphlab.recommender.create(user_tag_count, user_id='OwnerUserId', item_id='Tag')
Question 9: Print a summary of the model by simply entering the name of the object.
m
Question 10: Get all unique users from the first 10000 observations and save them as a variable called `users`.
users = user_tag_count.head(10000)['OwnerUserId'].unique()
Question 11: Get 20 recommendations for each user in your list of users. Save these as a new SFrame called `recs`.
recs = m.recommend(users, k=20)
When people use recommendation systems for online commerce, it's often useful to be able to recommend products from a single category of items, e.g. recommending shoes to somebody who typically buys shirts.
To illustrate how this can be done with GraphLab Create, suppose we have a JavaScript user who is trying to learn Python. Below we will take just the JavaScript users and see which Python tags to recommend to them.
Question 12: Create a variable called `javascript_users` that contains all unique users who have used the `javascript` tag.
javascript_users = user_tag_count['OwnerUserId'][user_tag_count['Tag'] == 'javascript'].unique()
Question 13: Use the model you created above to find the 20 most similar items to the tag "python". Create a variable called `python_items` containing just these similar items.
python_items = m.get_similar_items(['python'], k=20)
python_items = python_items['similar']
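The exercise does not say which similarity measure the model uses. One common choice for item-based recommenders is Jaccard similarity over the sets of users who used each item. The sketch below shows that measure on made-up data; treat it as an assumption about how "similar items" might be scored, not as a description of what `m` does. The names `jaccard`, `most_similar`, and `users_by_tag` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0


# Made-up data: the set of user ids that used each tag.
users_by_tag = {
    'python': {1, 2, 3},
    'numpy': {2, 3},
    'django': {1},
    'javascript': {4, 5},
}


def most_similar(tag, k):
    """Return the k tags most similar to `tag`, best first."""
    scores = [
        (other, jaccard(users_by_tag[tag], users))
        for other, users in users_by_tag.items()
        if other != tag
    ]
    # Sort by descending score, breaking ties alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


similar_to_python = most_similar('python', k=2)
print(similar_to_python)
```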
Question 14: For each user in `javascript_users`, make 5 recommendations among the items in `python_items`.
python_recs = m.recommend(users=javascript_users, items=python_items, k=5)
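Restricting recommendations to a candidate set amounts to filtering before ranking. A minimal sketch of that idea, with made-up scores and a hypothetical helper `recommend_from`, might look like the following. It assumes items the user has already used are excluded, which is a common convention but not something the exercise states.

```python
def recommend_from(scores, seen, candidates, k):
    """Return the top-k candidate items by score, skipping items already seen.

    Hypothetical helper for illustration; not GraphLab Create's implementation.
    """
    allowed = [
        (item, score)
        for item, score in scores.items()
        if item in candidates and item not in seen
    ]
    # Highest score first; ties broken alphabetically for determinism.
    allowed.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in allowed[:k]]


# Made-up predicted scores for one user.
scores = {'numpy': 0.9, 'jquery': 0.8, 'django': 0.7, 'flask': 0.4}
seen = {'jquery'}                           # tags the user already uses
candidates = {'numpy', 'django', 'flask'}   # the restricted item set
top = recommend_from(scores, seen, candidates, k=2)
print(top)
```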
Question 15: Use GraphLab Canvas to find out the 10 most often recommended items.
python_recs.show() # Then click on the Summary tab and look at the histogram in the second column.
Question 16: Save your model to a file.
m.save('my_model')
Question 17: Create a train/test split of the `user_tag_count` data from the section above. (Hint: Use `random_split_by_user`.)
train, test = graphlab.recommender.util.random_split_by_user(user_tag_count,
                                                             user_id='OwnerUserId',
                                                             item_id='Tag')
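The point of splitting by user, rather than splitting rows at random, is that each test user still has some observations in the training set. The sketch below illustrates one way to do that in plain Python. The helper `split_by_user`, its `test_fraction` and `seed` parameters, and the data are all assumptions; the exact sampling scheme of `random_split_by_user` may differ.

```python
import random


def split_by_user(rows, test_fraction, seed):
    """Hold out a fraction of each user's rows for testing.

    Hypothetical helper; rows are (user, item) tuples.
    """
    rng = random.Random(seed)
    by_user = {}
    for row in rows:
        by_user.setdefault(row[0], []).append(row)

    train, test = [], []
    for user in sorted(by_user):
        items = by_user[user][:]
        rng.shuffle(items)
        # Number of this user's rows that go to the test set.
        n_test = int(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Made-up data: two users with five tags each.
rows = [(user, 'tag%d' % i) for user in (1, 2) for i in range(5)]
train, test = split_by_user(rows, test_fraction=0.2, seed=0)
print(len(train), len(test))
```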
Question 18: Create a recommender model like you did above that only uses the training set.
m1 = graphlab.recommender.create(train, user_id='OwnerUserId', item_id='Tag')
Question 19:
Create a matrix factorization model that is better at ranking by setting
unobserved_rating_regularization
argument to 1.
m2 = graphlab.ranking_factorization_recommender.create(train,
                                                       user_id='OwnerUserId',
                                                       item_id='Tag',
                                                       target='Count',
                                                       ranking_regularization=1)
Question 20: Retrieve the coefficients for each user that were learned by this algorithm.
m2['coefficients']['OwnerUserId']
Question 21: Compare the predictive performance of the two models. Given the ability to make 10 recommendations, which model predicted the highest proportion of items in the test set (on average)?
results = graphlab.recommender.util.compare_models(test, [m1, m2],
                                                   metric='precision_recall')
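To interpret the output, it helps to recall what precision and recall at a cutoff `k` mean for a single user. The proportion of a user's test-set items that appear among the recommendations is the recall. The following sketch computes both on made-up data; `precision_recall_at_k` is a hypothetical helper using the usual textbook definitions, not GraphLab Create's code.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations.

    precision = hits / k, recall = hits / number of relevant items.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = float(hits) / k
    recall = float(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up ranked recommendations and held-out test items for one user.
recommended = ['numpy', 'django', 'flask', 'pandas']
relevant = {'numpy', 'pandas', 'scipy'}
p, r = precision_recall_at_k(recommended, relevant, k=2)
print(p, r)
```

Averaging these per-user values over all test users gives a single number per model at each cutoff.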