Testing your spiders
Automated tests run existing spiders against HTML documents to ensure that recipes are extracted correctly.
To run the test suite, do the following:
cd scrapy_proj/tests
nosetests
You should get output like the following:
/Users/coj/Dropbox/Sites/openrecipes/scrapy_proj/openrecipes/pipelines.py:7: ScrapyDeprecationWarning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead
from scrapy.conf import settings
...........................................................................................
----------------------------------------------------------------------
Ran 91 tests in 10.213s
OK
The testing scripts will automatically run tests against HTML files placed in directories named for the corresponding spider. For example, if you have a spider class file at scrapy_proj/openrecipes/spiders/foobar_spider.py, the testing scripts will look for a directory scrapy_proj/tests/html_data/foobar. If that directory is found, any files with a .html extension in it will be run through the spider's parse_item method. The results are tested against assertions in the do_test_scraped_item() method defined in scraper_tests.py.
Note: the assertions in do_test_scraped_item() are applied to every spider's output. We haven't yet defined a way to create spider-specific tests. Input is welcome.
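The discovery convention described above (spider file foobar_spider.py maps to test-data directory html_data/foobar) can be sketched roughly as follows. The helper names here are hypothetical illustrations, not the actual code in scraper_tests.py:

```python
import os

def spider_name(spider_filename):
    """Map a spider module filename to its test-data directory name.

    e.g. 'foobar_spider.py' -> 'foobar'
    """
    base = os.path.basename(spider_filename)
    root, _ = os.path.splitext(base)  # 'foobar_spider'
    if root.endswith('_spider'):
        return root[:-len('_spider')]
    return root

def html_fixtures(tests_dir, spider_filename):
    """Return the .html fixture files to feed to this spider's parse_item()."""
    data_dir = os.path.join(tests_dir, 'html_data', spider_name(spider_filename))
    if not os.path.isdir(data_dir):
        return []  # no fixtures collected for this spider yet
    return sorted(
        os.path.join(data_dir, f)
        for f in os.listdir(data_dir)
        if f.endswith('.html')
    )
```

The real test harness then loads each fixture, builds a response object from it, and runs the spider's parse_item over it.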
You can grab HTML data to test against for a given spider by using the openrecipes/scrapy_proj/grab_html.py utility script. You run it like so:
python grab_html.py foobar http://www.foobar.com/2013/04/10-minute-thai-shrimp-cucumber-avocado-salad-recipe/
The script will download the HTML document and write it to scrapy_proj/tests/html_data/foobar/item_<document_title>.html. The next time you run the automated tests, this file will be used to test the foobar_spider.parse_item() method.
We recommend creating at least 3 HTML files for each spider.
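To see roughly what grab_html.py has to do, here is a minimal sketch of the two interesting pieces: pulling the document title and turning it into the item_<document_title>.html filename. The normalization details below are an assumption for illustration, not necessarily how the real script does it:

```python
import re

def extract_title(html):
    """Pull the contents of the first <title> tag, or 'untitled' if none."""
    m = re.search(r'<title[^>]*>(.*?)</title>', html, re.I | re.S)
    return m.group(1).strip() if m else 'untitled'

def output_filename(title):
    """Build an item_<document_title>.html filename from a page title.

    Assumption: collapse non-alphanumeric runs to underscores; the real
    grab_html.py may normalize differently.
    """
    slug = re.sub(r'[^A-Za-z0-9]+', '_', title).strip('_')
    return 'item_%s.html' % slug
```

The actual script also performs the download and writes the file into the spider's html_data directory.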
You can use the scrapy shell and Python's reloading capabilities to quickly test your spiders. This example will use elanaspantry.com.
To test a spider:
- cd into scrapy_proj.
- Open the scrapy shell with scrapy shell.
- Fetch a recipe with fetch('http://www.elanaspantry.com/ratio-rally-quick-breads/').
- Import the spider with from openrecipes.spiders import elanaspantry_spider.
- Test your spider with elanaspantry_spider.ElanaspantryMixin().parse_item(response).
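The same call can also be scripted outside the shell. With Scrapy installed, you would build an HtmlResponse from a saved HTML file and pass it to parse_item(); the sketch below uses a stand-in response object and a toy parser so it runs without a network connection. Both FakeResponse and toy_parse_item are illustrative only, not the real openrecipes code (real spiders use XPath selectors, not regex):

```python
import re

class FakeResponse(object):
    """Minimal stand-in for the response object the scrapy shell provides."""
    def __init__(self, url, body):
        self.url = url
        self.body = body

def toy_parse_item(response):
    """Illustrative parser: grabs the first <h1> as the recipe name."""
    m = re.search(r'<h1[^>]*>(.*?)</h1>', response.body, re.I | re.S)
    return [{
        'name': m.group(1).strip() if m else None,
        'url': response.url,
    }]

response = FakeResponse(
    'http://www.elanaspantry.com/ratio-rally-quick-breads/',
    '<html><body><h1>Almond Flour Muffins</h1></body></html>',
)
items = toy_parse_item(response)
```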
This should return something like this:
[{'datePublished': u'April 4, 2011', 'description': [u'This gluten free muffin recipe is made with almond flour and is part of the quick bread ratio rally and my attempt to make a basic template for a muffin recipe.'], 'image': [u'http://www.elanaspantry.com/blog/wp-content/uploads/2011/04/gluten-free-almond-flour-quick-bread-muffins-ratio-rally-recipe.jpg'], 'ingredients': [u'4 ounces blanched almond flour (about 1 cup)', u'4 ounces eggs (about 2 large eggs)', u'1 ounce agave nectar or honey (around 1 tablespoon)', u'\xbc teaspoon baking soda', u'\xbd teaspoon apple cider vinegar'], 'name': [u'Almond Flour Muffins'], 'recipeYield': u'Makes 4 muffins', 'source': 'elanaspantry', 'url': 'http://www.elanaspantry.com/ratio-rally-quick-breads/'}]
After making changes to your spider, you'll need to:
- Reload the spider with reload(elanaspantry_spider).
- Test it again with elanaspantry_spider.ElanaspantryMixin().parse_item(response).
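Note that the builtin reload() used above is Python 2. If you are working in a Python 3 shell, the equivalent lives in importlib:

```python
import importlib
import json  # any already-imported module works as a demo

# Re-execute the module's source and rebind its names in place,
# so subsequent calls pick up your edits.
reloaded = importlib.reload(json)
```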