
Setting up a dev environment


To gather data for Open Recipes, we are building spiders based on Scrapy, a web scraping framework written in Python. We are using Scrapy v0.16 at the moment. To contribute spiders for sites, you should have basic familiarity with:

  • Python
  • Git
  • HTML and/or XML

Note: this is strongly biased towards OS X. Feel free to contribute instructions for other operating systems.

To get things going, you will need the following tools:

  1. Python 2.7 (including headers)
  2. Git
  3. pip
  4. virtualenv

You will probably already have the first two, although you may need to install Python headers on Linux with something like apt-get install python-dev.
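
If you're not sure what you already have, you can check from a terminal. This is just a quick sanity check, not part of the setup itself:

    python --version   # should report Python 2.7.x
    git --version
    pip --version

If any of these commands is missing, install that tool before continuing.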

If you don't have pip, follow the installation instructions in the pip docs. Then you can install virtualenv using pip.
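
For example, once pip is available, installing virtualenv is a single command (you may need sudo depending on how your Python is installed):

    pip install virtualenv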

Once you have pip and virtualenv, you can clone our repo and install the requirements with the following steps (the whole command sequence is also collected into a single block after step 10):

  1. Open a terminal and cd to the directory that will contain your repo clone. For these instructions, we'll assume you run cd ~/src.

  2. git clone https://github.com/fictivekin/openrecipes.git to clone the repo. This will make a ~/src/openrecipes directory that contains your local repo.

  3. cd ./openrecipes to move into the newly-cloned repo.

  4. virtualenv --no-site-packages venv to create a Python virtual environment inside ~/src/openrecipes/venv.

  5. source venv/bin/activate to activate your new Python virtual environment.

  6. pip install -r requirements.txt to install the required Python libraries, including Scrapy.

  7. scrapy -h to confirm that the scrapy command was installed. You should get a dump of the help docs.

  8. cd scrapy_proj/openrecipes to move into the Scrapy project directory.

  9. cp settings.py.default settings.py to set up a working settings module for the project.

  10. scrapy crawl thepioneerwoman.feed to test the feed spider written for thepioneerwoman.com. You should get output like the following:

    2013-03-30 14:35:37-0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: openrecipes)
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled item pipelines: MakestringsPipeline, DuplicaterecipePipeline
    2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Spider opened
    2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-03-30 14:35:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) <GET http://feeds.feedburner.com/pwcooks> (referer: None)
    2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) <GET ...> (referer: http://feeds.feedburner.com/pwcooks)
    ...
    

    If you do, baby you got a stew going!
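
For reference, here is the whole sequence from steps 1-10 as one block you can adapt, assuming ~/src as your working directory per step 1:

    cd ~/src
    git clone https://github.com/fictivekin/openrecipes.git
    cd ./openrecipes
    virtualenv --no-site-packages venv    # create the virtual environment
    source venv/bin/activate              # activate it
    pip install -r requirements.txt      # install Scrapy and other dependencies
    scrapy -h                            # confirm the scrapy command works
    cd scrapy_proj/openrecipes
    cp settings.py.default settings.py   # create a working settings module
    scrapy crawl thepioneerwoman.feed    # run the test crawl

When you come back to the project in a new terminal session, only source venv/bin/activate needs repeating; run deactivate when you want to leave the virtual environment.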