Repository for reading and downloading the Yelp Dataset Challenge round 6 data in Pandas pickle format. This repository makes it easy for anyone who wants to mess around with Yelp data using Python. I provide `yelp_util`, a Python package with read and download functions.
The following is the structure of the S3 bucket:

```
science-of-science-bucket
└── yelp_academic_dataset
    ├── yelp_academic_dataset_business.pickle (61k rows)
    ├── yelp_academic_dataset_review.pickle (1.5M rows)
    ├── yelp_academic_dataset_user.pickle (366k rows)
    ├── yelp_academic_dataset_checkin.pickle (45k rows)
    └── yelp_academic_dataset_tip.pickle (495k rows)
```
You can download the data directly from the AWS S3 bucket as follows:
```python
import yelp_util

yelp_util.download(file_list=["yelp_academic_dataset_business.pickle",
                              "yelp_academic_dataset_review.pickle",
                              "yelp_academic_dataset_user.pickle",
                              "yelp_academic_dataset_checkin.pickle",
                              "yelp_academic_dataset_tip.pickle"])
```
The files will be downloaded to the `data` folder. After the download finishes, you can simply read a pickle as follows:
```python
import pandas as pd

review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
review.head()
```
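From here it is an ordinary Pandas data frame. For example, to see how the ratings are distributed (using the `stars` column listed in the review table below):

```python
# Count reviews per star rating (1-5).
print(review.stars.value_counts().sort_index())
```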
**User**: table of user information (366k rows)

| average_stars | compliments | elite | fans | friends | name | review_count | type | user_id | votes | yelping_since |
|---|---|---|---|---|---|---|---|---|---|---|
**Business**: table of businesses with their locations and cities (61k rows)

| attributes | business_id | categories | city | full_address | hours | latitude | longitude | name | neighborhoods | open | review_count | stars | state | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
**Review**: reviews written by users (1.5M rows)

| business_id | date | review_id | stars | text | type | user_id | votes_cool | votes_funny | votes_useful |
|---|---|---|---|---|---|---|---|---|---|
**Checkin**: check-in table (45k rows)

| business_id | checkin_info | type |
|---|---|---|
**Tip**: tip table (495k rows)

| business_id | date | likes | text | type | user_id |
|---|---|---|---|---|---|
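Since the review, check-in, and tip tables all carry a `business_id`, they can be joined back to the business table with a standard Pandas merge. A small sketch using only the columns listed above:

```python
business = pd.read_pickle('data/yelp_academic_dataset_business.pickle')

# Attach the business name and city to each review via business_id.
review_with_business = review.merge(business[['business_id', 'name', 'city']],
                                    on='business_id')
review_with_business[['name', 'city', 'stars', 'text']].head()
```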
Read the business data:

```python
business = pd.read_pickle('data/yelp_academic_dataset_business.pickle')
tags = business.categories.tolist()
```

then transform the tag lists into a count matrix:

```python
tag_countmatrix = yelp_util.taglist_to_matrix(tags)
```
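To get a feel for what such a count matrix looks like, an equivalent matrix can be built with scikit-learn's `CountVectorizer`. This is only an illustrative sketch, not the actual `taglist_to_matrix` implementation, and it assumes each entry of `tags` is a list of category strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A callable analyzer tells CountVectorizer to treat each list of
# categories (e.g. ['Thai', 'Restaurants']) as already-tokenized input.
vectorizer = CountVectorizer(analyzer=lambda category_list: category_list)
tag_countmatrix_sketch = vectorizer.fit_transform(tags)  # sparse, n_businesses x n_categories
```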
This can be used to cluster businesses:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3)
km.fit(tag_countmatrix)
business['cluster'] = km.predict(tag_countmatrix)
```
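A quick, hypothetical way to sanity-check the result is to look at how many businesses land in each cluster:

```python
# Number of businesses assigned to each of the three clusters.
print(business['cluster'].value_counts())
```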
You can also train a word2vec model on a sample of the review text:

```python
review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
yelp_review_sample = list(review.text.iloc[10000:20000])
model = yelp_util.create_word2vec_model(yelp_review_sample)  # word2vec model
```
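Assuming `create_word2vec_model` returns a gensim `Word2Vec` object, the trained model can be queried for similar words (the query word 'pizza' is just an illustration; in older gensim versions the call is `model.most_similar(...)` instead of `model.wv.most_similar(...)`):

```python
# Words whose vectors are closest to 'pizza' in the sampled reviews.
print(model.wv.most_similar('pizza', topn=5))
```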
All the Django project code is in the `random_reviews` folder. Get started by running `python manage.py migrate`. Then, on a local machine (mainly to customize the CSS files), run the Django project with `python manage.py runserver`.
Dependencies:

- pandas
- scikit-learn
- nltk with `punkt` (`nltk.download('punkt')`)
- gensim
- unidecode
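The only dependency that needs an extra step is nltk's `punkt` tokenizer, which can be fetched from Python:

```python
import nltk

# Download the punkt sentence tokenizer used for processing review text.
nltk.download('punkt')
```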