This repository has been archived by the owner on Jul 12, 2022. It is now read-only.

titipata/yelp_dataset_challenge

Yelp Dataset Challenge for Python

Repository for reading and downloading the Yelp Dataset Challenge round 6 data in Pandas pickle format. It makes it easy for anyone who wants to explore the Yelp data using Python. The yelp_util Python package provides read and download functions.

Datasets repository

The S3 bucket is structured as follows:

science-of-science-bucket
└─yelp_academic_dataset
  ├───yelp_academic_dataset_business.pickle (61k rows)
  ├───yelp_academic_dataset_review.pickle (1.5M rows)
  ├───yelp_academic_dataset_user.pickle (366k rows)
  ├───yelp_academic_dataset_checkin.pickle (45k rows)
  └───yelp_academic_dataset_tip.pickle (495k rows)

You can download the data directly from the AWS S3 bucket as follows:

import yelp_util
yelp_util.download(file_list=["yelp_academic_dataset_business.pickle",
                              "yelp_academic_dataset_review.pickle",
                              "yelp_academic_dataset_user.pickle",
                              "yelp_academic_dataset_checkin.pickle",
                              "yelp_academic_dataset_tip.pickle"])

The files will be downloaded to the data folder. Once the download finishes, you can read a pickle as follows:

import pandas as pd
review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
review.head()

Structure of Datasets

User: table of user information (366k rows)

average_stars compliments elite fans friends name review_count type user_id votes yelping_since

Business: table of businesses with their location and city (61k rows)

attributes business_id categories city full_address hours latitude longitude name neighborhoods open review_count stars state type

Review: reviews made by users (1.5M rows)

business_id date review_id stars text type user_id votes_cool votes_funny votes_useful

Checkin: check-in table (45k rows)

business_id checkin_info type

Tip: tip table (495k rows)

business_id date likes text type user_id
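
Since the Review and Business tables both carry a business_id column, they can be joined on it. The sketch below uses tiny made-up tables that borrow only the column names listed above; the values are illustrative, not taken from the dataset. Selecting just a few business columns before merging avoids clashes on names that appear in both tables, such as stars and type.

```python
import pandas as pd

# Stand-in tables: column names follow the lists above, values are made up.
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Cafe A", "Diner B"],
    "city": ["Phoenix", "Las Vegas"],
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 3, 4],
})

# Attach the business name and city to each review via the shared key.
review_with_business = review.merge(
    business[["business_id", "name", "city"]],
    on="business_id",
    how="left",
)

# Average review rating per business.
mean_stars = review_with_business.groupby("name")["stars"].mean()
print(mean_stars)
```

With the real pickles, the same merge call should apply to the review and business DataFrames loaded with pd.read_pickle.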

Cluster businesses according to how they are tagged

Read the business data

import pandas as pd
import yelp_util

business = pd.read_pickle('data/yelp_academic_dataset_business.pickle')
tags = business.categories.tolist()  # category tags for each business

Then transform the tags into a count matrix:

tag_countmatrix = yelp_util.taglist_to_matrix(tags)
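
The internals of yelp_util.taglist_to_matrix are not shown here, so the following is only a sketch of what a tag count matrix could look like: one row per business, one column per distinct tag. The function name tags_to_count_matrix and the example tags are hypothetical, and the real function may return a different (for example, sparse) representation.

```python
def tags_to_count_matrix(tag_lists):
    """Return (vocabulary, matrix) where matrix[i][j] counts tag j in row i.

    Hypothetical helper for illustration; not part of yelp_util.
    """
    # Sorted vocabulary gives each tag a stable column index.
    vocabulary = sorted({tag for tags in tag_lists for tag in tags})
    column = {tag: j for j, tag in enumerate(vocabulary)}

    matrix = []
    for tags in tag_lists:
        row = [0] * len(vocabulary)
        for tag in tags:
            row[column[tag]] += 1
        matrix.append(row)
    return vocabulary, matrix


# Made-up tag lists standing in for business.categories.
example_tags = [["Food", "Coffee & Tea"], ["Restaurants", "Food"], ["Bars"]]
vocabulary, matrix = tags_to_count_matrix(example_tags)
print(vocabulary)
print(matrix)
```

For the full 61k-row business table, a dense list of lists like this would be wasteful; a sparse matrix would likely be a better fit.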

The tag count matrix can then be used to cluster businesses:

from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(tag_countmatrix)
business['cluster'] = km.predict(tag_countmatrix)
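
To get a feel for what each cluster contains, one option is to count the most frequent tags among the businesses assigned to it. The helper below is a hypothetical sketch, not part of yelp_util, and runs on made-up tags and labels.

```python
from collections import Counter, defaultdict


def top_tags_per_cluster(tag_lists, labels, n=2):
    """Map each cluster label to its n most frequent tags.

    Hypothetical helper for illustration; not part of yelp_util.
    """
    counts = defaultdict(Counter)
    for tags, label in zip(tag_lists, labels):
        counts[label].update(tags)
    return {
        label: [tag for tag, _ in counter.most_common(n)]
        for label, counter in counts.items()
    }


# Made-up tags and cluster labels standing in for the real data.
example_tags = [["Food", "Coffee & Tea"], ["Food", "Bakeries"],
                ["Bars", "Nightlife"], ["Bars"]]
example_labels = [0, 0, 1, 1]
summary = top_tags_per_cluster(example_tags, example_labels, n=1)
print(summary)
```

With the real data, the call would presumably be top_tags_per_cluster(tags, business['cluster']).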

Train word2vec model

review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
yelp_review_sample = list(review.text.iloc[10000:20000])
model = yelp_util.create_word2vec_model(yelp_review_sample) # word2vec model
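
Word2vec trainers generally work on tokenized sentences rather than raw strings. Whether yelp_util.create_word2vec_model tokenizes internally is not stated here, so the following is only a sketch of one simple way to tokenize review text. The tokenize_reviews function and its regular expression are assumptions for illustration.

```python
import re

# Lowercase word tokens; keeps digits and apostrophes (e.g. "wouldn't").
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize_reviews(texts):
    """Turn a list of review strings into a list of token lists.

    Hypothetical helper for illustration; not part of yelp_util.
    """
    return [TOKEN_PATTERN.findall(text.lower()) for text in texts]


# Made-up review texts standing in for review.text.
sample = ["Great coffee, friendly staff!", "Wouldn't go back."]
tokens = tokenize_reviews(sample)
print(tokens)
```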

Django runserver

The Django project lives in the random_reviews folder. Get started by running python manage.py migrate. Then, to run it on a local machine (mainly useful for customizing the CSS files), start the project with python manage.py runserver.

Dependencies

Members
