=======  =============================
SEP      16
Title    Leg Spider
Author   Insophia Team
Created  2010-06-03
Status   Superseded by :doc:`sep-018`
=======  =============================

===================
SEP-016: Leg Spider
===================

This SEP introduces a new kind of spider, called ``LegSpider``, which provides
modular functionality that can be plugged into different spiders.

Rationale
=========

The purpose of Leg Spiders is to define an architecture for building spiders
out of smaller, well-tested components (aka Legs) that can be combined to
achieve the desired functionality. These reusable components will benefit all
Scrapy users, as a repository of well-tested Legs can be shared among
different spiders and projects. Some of them will come bundled with Scrapy.

Legs can themselves be combined with sub-legs, in a hierarchical fashion.
Legs are also spiders in their own right, hence the name "Leg Spider".

``LegSpider`` API
=================

A ``LegSpider`` is a ``BaseSpider`` subclass that adds the following
attributes and methods:

- ``legs``
  - the Legs composing this spider
- ``process_response(response)``
  - processes a (downloaded) response and returns a list of requests and items
- ``process_request(request)``
  - processes a request after it has been extracted and before it is returned
    from the spider
- ``process_item(item)``
  - processes an item after it has been extracted and before it is returned
    from the spider
- ``set_spider(spider)``
  - defines the main spider associated with this Leg Spider, which is often
    used to configure the Leg Spider's behavior
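
For illustration only, a minimal Leg exercising these hooks might look as
follows (the item-counting behavior is invented for this sketch):

::

    #!python
    class ItemCounterLeg(LegSpider):
        """Hypothetical Leg that counts the items its main spider extracts."""

        def __init__(self, *a, **kw):
            super(ItemCounterLeg, self).__init__(*a, **kw)
            self.items_seen = 0

        def process_item(self, item):
            # the main spider is available as self.spider (set by set_spider())
            self.items_seen += 1
            return item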

How Leg Spiders work
====================

1. Each Leg Spider has zero or more Leg Spiders associated with it. When a
   response arrives, the Leg Spider processes it with its own
   ``process_response`` method and also with the ``process_response`` method
   of each of its sub Leg Spiders. Finally, the outputs of all of them are
   combined to produce the final aggregated output.
2. Each element of the aggregated output of ``process_response`` is then
   processed with either ``process_item`` or ``process_request`` before being
   returned from the spider. As with ``process_response``, each item/request
   is processed with the ``process_{request,item}`` methods of all the Leg
   Spiders composing the spider, including those of the spider itself.
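
For instance, assuming the proof-of-concept implementation at the end of this
document, the two steps above play out as follows (``StamperLeg`` and
``ExampleItem`` are hypothetical names, used only for this sketch):

::

    #!python
    class StamperLeg(LegSpider):
        """Hypothetical Leg that stamps every item passing through it."""

        def process_item(self, item):
            item['stamped'] = True
            return item

    class ExampleSpider(LegSpider):

        legs = [StamperLeg()]

        def process_response(self, response):
            # step 1: this output is aggregated with every leg's output
            return [ExampleItem(url=response.url)]

    # step 2: the item returned above passes through both
    # ExampleSpider.process_item() and StamperLeg.process_item()
    # before being yielded from ExampleSpider.parse()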

Leg Spider examples
===================

Regex (HTML) Link Extractor
---------------------------

A typical application of Leg Spiders is to build link extractors. For
example:

::

    #!python
    class RegexHtmlLinkExtractor(LegSpider):

        def process_response(self, response):
            if isinstance(response, HtmlResponse):
                allowed_regexes = self.spider.url_regexes_to_follow
                # extract urls to follow using allowed_regexes
                return [Request(x) for x in urls_to_follow]
            return []

    class MySpider(LegSpider):

        legs = [RegexHtmlLinkExtractor()]
        url_regexes_to_follow = ['/product.php?.*']

        def process_response(self, response):
            # parse response and extract items
            return items
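
Note how the Leg keeps no per-spider configuration of its own: it reads
``url_regexes_to_follow`` from the main spider through ``self.spider`` (set
by ``set_spider()``), which is what makes it reusable across spiders.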

RSS2 link extractor
-------------------

This is a Leg Spider that can be used for following links from RSS2 feeds:

::

    #!python
    class Rss2LinkExtractor(LegSpider):

        def process_response(self, response):
            if response.headers.get('Content-type') == 'application/rss+xml':
                xs = XmlXPathSelector(response)
                urls = xs.select("//item/link/text()").extract()
                return [Request(x) for x in urls]
            return []

Callback dispatcher based on rules
----------------------------------

Another example could be to build a callback dispatcher based on rules:

::

    #!python
    class CallbackRules(LegSpider):

        def set_spider(self, spider):
            # rules can only be compiled once the main spider is known
            super(CallbackRules, self).set_spider(spider)
            self._rules = {}
            for regex, method_name in self.spider.callback_rules.items():
                m = getattr(self.spider, method_name, None)
                if m:
                    self._rules[re.compile(regex)] = m

        def process_response(self, response):
            for regex, method in self._rules.items():
                if regex.search(response.url):
                    return method(response)
            return []

    class MySpider(LegSpider):

        legs = [CallbackRules()]
        callback_rules = {
            '/product.php.*': 'parse_product',
            '/category.php.*': 'parse_category',
        }

        def parse_product(self, response):
            # parse response and populate item
            return item

URL Canonicalizers
------------------

Another example could be for building URL canonicalizers:

::

    #!python
    class CanonicalizeUrl(LegSpider):

        def process_request(self, request):
            curl = canonicalize_url(request.url,
                rules=self.spider.canonicalization_rules)
            return request.replace(url=curl)

    class MySpider(LegSpider):

        legs = [CanonicalizeUrl()]
        canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

        # ...

Setting item identifier
-----------------------

Another example could be for setting a unique identifier to items, based on
certain fields:

::

    #!python
    class ItemIdSetter(LegSpider):

        def process_item(self, item):
            id_field = self.spider.id_field
            id_fields_to_hash = self.spider.id_fields_to_hash
            item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash)
            return item

    class MySpider(LegSpider):

        legs = [ItemIdSetter()]
        id_field = 'guid'
        id_fields_to_hash = ['supplier_name', 'supplier_id']

        def process_response(self, response):
            # extract item from response
            return item

Combining multiple Leg Spiders
------------------------------

Here's an example that combines functionality from multiple Leg Spiders:

::

    #!python
    class MySpider(LegSpider):

        legs = [RegexHtmlLinkExtractor(), CallbackRules(), CanonicalizeUrl(),
                ItemIdSetter()]

        url_regexes_to_follow = ['/product.php?.*']

        callback_rules = {
            '/product.php.*': 'parse_product',
            '/category.php.*': 'parse_category',
        }

        canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

        id_field = 'guid'
        id_fields_to_hash = ['supplier_name', 'supplier_id']

        def parse_product(self, response):
            # parse response and populate item
            return item

        def parse_category(self, response):
            # parse response and populate item
            return item

Leg Spiders vs Spider middlewares
=================================

A common question is when to use Leg Spiders and when to use Spider
middlewares. Leg Spiders are meant to implement spider-specific functionality,
like link extraction with custom rules per spider. Spider middlewares, on the
other hand, are meant to implement global functionality that applies to all
spiders.

When not to use Leg Spiders
===========================

Leg Spiders are not a silver bullet for implementing all kinds of spiders, so
it's important to keep their scope and limitations in mind, such as:

- Leg Spiders can't filter duplicate requests, since they don't have access to
  all requests at the same time. This functionality should be implemented in a
  spider or scheduler middleware (see the sketch below).
- Leg Spiders are meant for spiders whose behavior (requests and items to
  extract) depends only on the current page and not on previously crawled
  pages (aka "context-free spiders"). If your spider has custom logic with
  chained downloads (for example, multi-page items), then Leg Spiders may not
  be a good fit.
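
As a rough sketch of the first point, duplicate filtering fits naturally in a
spider middleware, which does see every request coming out of a spider. The
middleware below is illustrative only and not part of this proposal:

::

    #!python
    from scrapy.http import Request

    class DuplicateRequestFilterMiddleware(object):
        """Hypothetical spider middleware that drops already-seen URLs."""

        def __init__(self):
            self.seen_urls = set()

        def process_spider_output(self, response, result, spider):
            for entry in result:
                if isinstance(entry, Request):
                    if entry.url in self.seen_urls:
                        continue  # drop duplicate request
                    self.seen_urls.add(entry.url)
                yield entry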

``LegSpider`` proof-of-concept implementation
=============================================

Here's a proof-of-concept implementation of ``LegSpider``:

::

    #!python
    from scrapy.http import Request
    from scrapy.item import BaseItem
    from scrapy.spider import BaseSpider
    from scrapy.utils.spider import iterate_spider_output


    class LegSpider(BaseSpider):
        """A spider made of legs"""

        legs = []

        def __init__(self, *args, **kwargs):
            super(LegSpider, self).__init__(*args, **kwargs)
            # the spider itself takes part in the processing chain
            self._legs = [self] + self.legs[:]
            for leg in self._legs:
                leg.set_spider(self)

        def parse(self, response):
            res = self._process_response(response)
            for r in res:
                if isinstance(r, BaseItem):
                    yield self._process_item(r)
                else:
                    yield self._process_request(r)

        def process_response(self, response):
            return []

        def process_request(self, request):
            return request

        def process_item(self, item):
            return item

        def set_spider(self, spider):
            self.spider = spider

        def _process_response(self, response):
            # aggregate the outputs of all legs' process_response() methods
            res = []
            for leg in self._legs:
                res.extend(iterate_spider_output(leg.process_response(response)))
            return res

        def _process_request(self, request):
            # pass the request through every leg's process_request()
            for leg in self._legs:
                request = leg.process_request(request)
            return request

        def _process_item(self, item):
            # pass the item through every leg's process_item()
            for leg in self._legs:
                item = leg.process_item(item)
            return item