Setting up a dev environment
To gather data for Open Recipes, we are building spiders based on Scrapy, a web scraping framework written in Python. We are using Scrapy v0.16 at the moment. To contribute spiders for sites, you should have basic familiarity with:
- Python
- Git
- HTML and/or XML
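If you haven't used Scrapy before, a spider is just a Python class with a name, a list of start URLs, and a `parse` callback that extracts data from each response. Here is a minimal sketch in the Scrapy 0.16 style; the domain, XPath, and item field are invented for illustration, so look at the spiders in the repo for the real thing:

```python
# A minimal, hypothetical Scrapy 0.16-style spider, for orientation only.
# The domain, URLs, XPath, and item field are invented for this example;
# the project's actual spiders ship in the repo.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field


class RecipeItem(Item):
    name = Field()  # a single illustrative field


class ExampleSpider(BaseSpider):
    name = "example.feed"                 # run with: scrapy crawl example.feed
    allowed_domains = ["example.com"]     # keeps the crawl from leaving the site
    start_urls = ["http://example.com/recipes/"]

    def parse(self, response):
        # Scrapy 0.16 uses HtmlXPathSelector and .select(); newer releases
        # replaced these with response.xpath() and response.css().
        hxs = HtmlXPathSelector(response)
        for title in hxs.select("//h2[@class='recipe']/text()").extract():
            item = RecipeItem()
            item["name"] = title
            yield item
```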
Note: this is strongly biased towards OS X. Feel free to contribute instructions for other operating systems.
To get things going, you will need the following tools:
- Python 2.7 (including headers)
- Git
- pip
- virtualenv
You will probably already have the first two, although you may need to install Python headers on Linux with something like `apt-get install python-dev`.
If you don't have `pip`, follow the installation instructions in the pip docs. Then you can install `virtualenv` with `pip install virtualenv`.
Once you have `pip` and `virtualenv`, you can clone our repo and install requirements with the following steps:
1. Open a terminal and `cd` to the directory that will contain your repo clone. For these instructions, we'll assume you `cd ~/src`.
2. `git clone https://github.com/fictivekin/openrecipes.git` to clone the repo. This will make a `~/src/openrecipes` directory that contains your local repo.
3. `cd ./openrecipes` to move into the newly-cloned repo.
4. `virtualenv --no-site-packages venv` to create a Python virtual environment inside `~/src/openrecipes/venv`.
5. `source venv/bin/activate` to activate your new Python virtual environment.
6. `pip install -r requirements.txt` to install the required Python libraries, including Scrapy.
7. `scrapy -h` to confirm that the `scrapy` command was installed. You should get a dump of the help docs.
8. `cd scrapy_proj/openrecipes` to move into the Scrapy project directory.
9. `cp settings.py.default settings.py` to set up a working settings module for the project (a sketch of what a settings module contains appears after these steps).
10. `scrapy crawl thepioneerwoman.feed` to test the feed spider written for thepioneerwoman.com. You should get output like the following:

    ```
    2013-03-30 14:35:37-0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: openrecipes)
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled item pipelines: MakestringsPipeline, DuplicaterecipePipeline
    2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Spider opened
    2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) (referer: None)
    2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) (referer: http://feeds.feedburner.com/pwcooks)
    ...
    ```
If you do, baby you got a stew going!
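For reference, the `settings.py` you copied in step 9 is a plain Scrapy settings module. Here is a hedged sketch of what such a module typically contains for Scrapy 0.16; the `BOT_NAME` and pipeline class names are taken from the crawl log above, but the module paths are guesses, and the authoritative values are whatever `settings.py.default` ships with:

```python
# Hypothetical sketch of a Scrapy 0.16 settings module. The real values come
# from the settings.py.default you copied; BOT_NAME and the pipeline class
# names below appear in the crawl log, but the module paths are assumptions.
BOT_NAME = "openrecipes"

SPIDER_MODULES = ["openrecipes.spiders"]     # where Scrapy looks for spiders
NEWSPIDER_MODULE = "openrecipes.spiders"     # where `scrapy genspider` writes

# In Scrapy 0.16, ITEM_PIPELINES is a plain list of dotted class paths
# (later releases changed it to a dict that maps paths to order numbers).
ITEM_PIPELINES = [
    "openrecipes.pipelines.MakestringsPipeline",
    "openrecipes.pipelines.DuplicaterecipePipeline",
]
```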
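The `Enabled item pipelines` line in that log names two pipeline classes that every scraped item passes through. As a rough illustration of how a Scrapy 0.16 item pipeline works (a generic, hypothetical example, not the project's actual `DuplicaterecipePipeline`):

```python
# A generic, hypothetical duplicate-dropping pipeline in the Scrapy 0.16 style.
# The "url" field is an assumption made for this example.
from scrapy.exceptions import DropItem


class ExampleDuplicatesPipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Scrapy calls process_item() once per scraped item. Raising DropItem
        # discards the item; returning it hands it to the next pipeline.
        if item["url"] in self.seen_urls:
            raise DropItem("Duplicate recipe: %s" % item["url"])
        self.seen_urls.add(item["url"])
        return item
```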