This repository contains Scrapy spiders for crawling products and scraping all user-submitted reviews from the Steam game store. A few scripts for more easily managing and deploying the spiders are included as well.
This repository contains code accompanying the Scraping the Steam Game Store article published on the Scrapinghub blog and the Intoli blog.
After cloning the repository with
git clone git@github.com:prncc/steam-scraper.git
start and activate a Python 3.6+ virtualenv with
cd steam-scraper
virtualenv -p python3.6 env
. env/bin/activate
Install Python requirements via:
pip install -r requirements.txt
By the way, on macOS you can install Python 3.6 via Homebrew:
brew install python3
On Ubuntu you can use instructions posted on askubuntu.com.
The purpose of ProductSpider is to discover product pages on the Steam product listing and extract useful metadata from them.
A neat feature of this spider is that it automatically navigates through Steam's age verification checkpoints.
You can initiate the multi-hour crawl with
scrapy crawl products -o output/products_all.jl --logfile=output/products_all.log --loglevel=INFO -s JOBDIR=output/products_all_job -s HTTPCACHE_ENABLED=False
When it completes, you should have metadata for all games on Steam in output/products_all.jl.
Here's some example output:
{
  "url": "https://store.steampowered.com/app/271590/Grand_Theft_Auto_V/",
  "reviews_url": "http://steamcommunity.com/app/271590/reviews/?browsefilter=mostrecent&p=1",
  "id": "271590",
  "title": "Grand Theft Auto V",
  "genres": ["Action", "Adventure"],
  "developer": "Rockstar North",
  "publisher": "Rockstar Games",
  "release_date": "Apr 14, 2015",
  "app_name": "Grand Theft Auto V",
  "specs": ["Single-player", "Steam Achievements", "Full controller support", "Remote Play on TV"],
  "tags": ["Open World", "Action", "Multiplayer", "Automobile Sim", "Third Person", "Crime", "First-Person", "Shooter", "Adventure", "Singleplayer", "Third-Person Shooter", "Mature", "Racing", "Atmospheric", "Co-op", "Sandbox", "Funny", "Great Soundtrack", "Comedy", "Masterpiece"],
  "price": 9.99,
  "sentiment": "Mostly Positive",
  "n_reviews": 724459,
  "p_reviews": 562524,
  "m_reviews": 161935,
  "platform": ["win"],
  "metascore": 96,
  "early_access": false
}
The output above has been cleaned up slightly to remove duplicate values in some fields.
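Since the crawl writes one JSON object per line, the result can be inspected with the standard library alone. The sketch below is only an illustration: the helper names are hypothetical, and reading p_reviews as the number of positive reviews is an assumption based on the sample record, where p_reviews + m_reviews equals n_reviews.

```python
import json


def load_products(lines):
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def positive_share(product):
    """Return p_reviews / n_reviews, or None if the product has no reviews.

    Assumes p_reviews counts positive reviews (inferred from the sample).
    """
    total = product.get("n_reviews")
    if not total:
        return None
    return product.get("p_reviews", 0) / total


# A trimmed copy of the sample record stands in for the real file here.
sample_lines = [
    '{"id": "271590", "title": "Grand Theft Auto V", '
    '"n_reviews": 724459, "p_reviews": 562524, "m_reviews": 161935}',
    "",
]
products = load_products(sample_lines)
for product in products:
    print(product["title"], round(positive_share(product), 3))
```

To run it against the real crawl output, pass an open file instead of the sample list, for example `with open("output/products_all.jl") as f: products = load_products(f)`.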
The purpose of ReviewSpider is to scrape all user-submitted reviews of a particular product from the Steam community portal.
By default, it starts from the URLs listed in its test_urls attribute:
class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    test_urls = [
        "http://steamcommunity.com/app/316790/reviews/?browsefilter=mostrecent&p=1",  # Grim Fandango
        "http://steamcommunity.com/app/207610/reviews/?browsefilter=mostrecent&p=1",  # The Walking Dead
        "http://steamcommunity.com/app/414700/reviews/?browsefilter=mostrecent&p=1",  # Outlast 2
    ]
It can alternatively ingest a text file containing URLs such as
http://steamcommunity.com/app/316790/reviews/?browsefilter=mostrecent&p=1
http://steamcommunity.com/app/207610/reviews/?browsefilter=mostrecent&p=1
http://steamcommunity.com/app/414700/reviews/?browsefilter=mostrecent&p=1
via the url_file command line argument:
scrapy crawl reviews -o reviews.jl -a url_file=url_file.txt -s JOBDIR=output/reviews
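Scrapy passes each -a name=value pair to the spider's constructor as a keyword argument, so the spider receives the path given as url_file and can read its start URLs from it. A minimal sketch of that reading step follows; the function name read_start_urls is hypothetical, and the repository's own implementation may differ.

```python
import os
import tempfile


def read_start_urls(path):
    """Return the non-empty, stripped lines of a text file as a list of URLs."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demonstrate with a temporary file that mimics url_file.txt.
content = (
    "http://steamcommunity.com/app/316790/reviews/?browsefilter=mostrecent&p=1\n"
    "\n"
    "http://steamcommunity.com/app/207610/reviews/?browsefilter=mostrecent&p=1\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(content)

urls = read_start_urls(tmp.name)
os.remove(tmp.name)
print(len(urls))
```

Blank lines are skipped so that a trailing newline in the file does not produce an empty request URL.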
An output sample:
{
  'date': '2017-06-04',
  'early_access': False,
  'found_funny': 5,
  'found_helpful': 0,
  'found_unhelpful': 1,
  'hours': 9.8,
  'page': 3,
  'page_order': 7,
  'product_id': '414700',
  'products': 179,
  'recommended': True,
  'text': '3 spooky 5 me',
  'user_id': '76561198116659822',
  'username': 'Fowler'
}
If you want to get all the reviews for all products, split_review_urls.py will remove duplicate entries from products_all.jl and shuffle the reviews_url values into several text files.
This provides a convenient way to split up your crawl into manageable pieces.
The whole job takes a few days with Steam's generous rate limits.
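The idea behind splitting the crawl can be sketched in a few lines. This is not the code of split_review_urls.py: the function below is a hypothetical stand-in that assumes duplicates can be detected by the reviews_url field shown in the product sample, and it returns lists instead of writing text files.

```python
import json
import random


def split_review_urls(lines, n_parts, seed=0):
    """Deduplicate review URLs, shuffle them, and split them into n_parts lists."""
    seen = set()
    urls = []
    for line in lines:
        if not line.strip():
            continue
        url = json.loads(line).get("reviews_url")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    # A seeded shuffle spreads large and small products across the parts
    # while keeping the split reproducible.
    random.Random(seed).shuffle(urls)
    # Round-robin slicing keeps part sizes within one URL of each other.
    return [urls[i::n_parts] for i in range(n_parts)]


# Eight made-up product records, only five of which are distinct.
demo_lines = [
    json.dumps({
        "id": str(i % 5),
        "reviews_url": f"http://steamcommunity.com/app/{i % 5}/reviews/",
    })
    for i in range(8)
]
parts = split_review_urls(demo_lines, n_parts=2)
print([len(part) for part in parts])
```

Each resulting list could then be written to its own text file and handed to a separate crawl through the url_file argument.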
This section briefly explains how to run the crawl on one or more t1.micro AWS instances.
First, create an Ubuntu 16.04 t1.micro instance and name it scrapy-runner-01 in your ~/.ssh/config file:
Host scrapy-runner-01
    User ubuntu
    HostName <server's IP>
    IdentityFile ~/.ssh/id_rsa
A hostname of this form is expected by the scrapydee.sh helper script included in this repository.
Make sure you can connect with ssh scrapy-runner-01.
The tool that will actually run the crawl is scrapyd running on the remote server. To set things up, first install Python 3.6:
sudo add-apt-repository ppa:jonathonf/python-3.6
sudo apt update
sudo apt install python3.6 python3.6-dev virtualenv python-pip
Then, install scrapyd and the remaining requirements in a dedicated run directory on the remote server:
mkdir run && cd run
virtualenv -p python3.6 env
. env/bin/activate
pip install scrapy scrapyd botocore smart_getenv
You can run scrapyd from the virtual environment with
scrapyd --logfile /home/ubuntu/run/scrapyd.log &
You may wish to use something like screen to keep the process alive if you disconnect from the server.
You can issue commands to the scrapyd process running on the remote machine using a simple HTTP JSON API. First, create an egg for this project:
python setup.py bdist_egg
Copy the egg and your review URL file to scrapy-runner-01 via
scp output/review_urls_01.txt scrapy-runner-01:/home/ubuntu/run/
scp dist/steam_scraper-1.0-py3.6.egg scrapy-runner-01:/home/ubuntu/run
and add it to scrapyd's job directory via
ssh -f scrapy-runner-01 'cd /home/ubuntu/run && curl http://localhost:6800/addversion.json -F project=steam -F egg=@steam_scraper-1.0-py3.6.egg'
Opening port 6800 to TCP traffic coming from your home IP would allow you to issue this command without going through SSH.
If this command doesn't work, you may need to edit scrapyd.conf to contain
bind_address = 0.0.0.0
in the [scrapyd] section.
This is a good time to mention that there exists a scrapyd-client project for deploying eggs to scrapyd-equipped servers.
I chose not to use it because it doesn't know about servers already set up in ~/.ssh/config and so requires repetitive configuration.
Finally, start the job with something like
ssh scrapy-runner-01 'curl http://localhost:6800/schedule.json -d project=steam -d spider=reviews -d url_file="/home/ubuntu/run/review_urls_01.txt" -d jobid=part_01 -d setting=FEED_URI="s3://'$STEAM_S3_BUCKET'/%(name)s/part_01/%(time)s.jl" -d setting=AWS_ACCESS_KEY_ID='$AWS_ACCESS_KEY_ID' -d setting=AWS_SECRET_ACCESS_KEY='$AWS_SECRET_ACCESS_KEY' -d setting=LOG_LEVEL=INFO'
This command assumes you have set up an S3 bucket and exported the STEAM_S3_BUCKET, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY environment variables.
It should be pretty easy to customize it for non-S3 output, however.
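For readers who prefer Python over curl, the same schedule.json call can be assembled with the standard library. The sketch below only builds the request and does not send it. The helper name build_schedule_request and the bucket name example-bucket are placeholders, and the AWS credential settings are left out for brevity. In scrapyd's API, fields other than the reserved ones (project, spider, jobid, setting) are passed to the spider as arguments, and each Scrapy setting is sent as a repeated setting=NAME=value field.

```python
from urllib.parse import parse_qsl, urlencode
from urllib.request import Request


def build_schedule_request(base_url, project, spider, jobid,
                           spider_args=None, settings=None):
    """Build (but do not send) a POST request for scrapyd's schedule.json."""
    fields = [("project", project), ("spider", spider), ("jobid", jobid)]
    # Non-reserved fields are forwarded to the spider as arguments.
    fields += list((spider_args or {}).items())
    # Scrapy settings are sent as repeated "setting=NAME=value" fields.
    fields += [
        ("setting", f"{name}={value}")
        for name, value in (settings or {}).items()
    ]
    data = urlencode(fields).encode("ascii")
    url = f"{base_url.rstrip('/')}/schedule.json"
    return Request(url, data=data, method="POST")


request = build_schedule_request(
    "http://localhost:6800",
    project="steam",
    spider="reviews",
    jobid="part_01",
    spider_args={"url_file": "/home/ubuntu/run/review_urls_01.txt"},
    settings={
        # "example-bucket" is a placeholder, not a real bucket.
        "FEED_URI": "s3://example-bucket/%(name)s/part_01/%(time)s.jl",
        "LOG_LEVEL": "INFO",
    },
)
print(request.full_url)
print(parse_qsl(request.data.decode("ascii")))
```

Passing the request to urllib.request.urlopen from a machine that can reach port 6800 would schedule the job. Scrapy fills in %(name)s with the spider name and %(time)s with a timestamp when it creates the feed.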
The scrapydee.sh helper script included in the scripts directory of this repository has some shortcuts for issuing commands to scrapyd-equipped servers with hostnames of the form scrapy-runner-01.
For example, the command
./scripts/scrapydee.sh status 1
# Executing status()...
# On server(s): 1.
will run the status() function defined in scrapydee.sh on scrapy-runner-01.
See that file for more command examples.
You can also run each of the included commands on multiple servers.
First, change the all() function within scrapydee.sh to match the number of servers you have configured.
Then, issue a command such as
./scripts/scrapydee.sh status all
The output is a bit messy, but it's a quick and easy way to run this job.
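The naming convention the helper relies on is easy to reproduce if you want to script against the same servers from Python. The function below is a hypothetical illustration of how a selection such as 1 or all could map to SSH host aliases of the form scrapy-runner-01. It does not reflect how scrapydee.sh is implemented, and the n_servers parameter is an assumption standing in for the server count configured in all().

```python
def runner_hosts(selection, n_servers):
    """Map a selection such as "1" or "all" to SSH host aliases."""
    if selection == "all":
        numbers = range(1, n_servers + 1)
    else:
        numbers = [int(selection)]
    # Zero-pad to two digits to match aliases like scrapy-runner-01.
    return [f"scrapy-runner-{n:02d}" for n in numbers]


print(runner_hosts("1", n_servers=3))
print(runner_hosts("all", n_servers=3))
```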