=======  =============================
SEP      16
Title    Leg Spider
Author   Insophia Team
Created  2010-06-03
Status   Superseded by :doc:`sep-018`
=======  =============================

===================
SEP-016: Leg Spider
===================

This SEP introduces a new kind of spider, called ``LegSpider``, which provides
modular functionality that can be plugged into different spiders.

Rationale
=========

The purpose of Leg Spiders is to define an architecture for building spiders
out of smaller, well-tested components (aka Legs) that can be combined to
achieve the desired functionality. These reusable components will benefit all
Scrapy users, as a repository of well-tested Legs can be shared among
different spiders and projects. Some of them will come bundled with Scrapy.

Legs can themselves be combined with sub-legs, in a hierarchical fashion.
Legs are also spiders in their own right, hence the name "Leg Spider".

``LegSpider`` API
=================

A ``LegSpider`` is a ``BaseSpider`` subclass that adds the following
attributes and methods:

- ``legs``
  - the Legs composing this spider
- ``process_response(response)``
  - processes a (downloaded) response and returns a list of requests and items
- ``process_request(request)``
  - processes a request after it has been extracted and before it is returned
    from the spider
- ``process_item(item)``
  - processes an item after it has been extracted and before it is returned
    from the spider
- ``set_spider(spider)``
  - defines the main spider associated with this Leg Spider, which is often
    used to configure the Leg Spider's behavior
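
For illustration only, a minimal Leg exercising these hooks might look as
follows (the item-counting behavior is invented for this sketch):

::

    #!python
    class ItemCounterLeg(LegSpider):
        """Hypothetical Leg that counts the items its main spider extracts."""

        def __init__(self, *a, **kw):
            super(ItemCounterLeg, self).__init__(*a, **kw)
            self.items_seen = 0

        def process_item(self, item):
            # the main spider is available as self.spider (set by set_spider())
            self.items_seen += 1
            return item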

How Leg Spiders work
====================

1. Each Leg Spider has zero or more Leg Spiders associated with it. When a
   response arrives, the Leg Spider processes it with its own
   ``process_response`` method and also with the ``process_response`` method
   of each of its sub Leg Spiders. Finally, the outputs of all of them are
   combined to produce the final aggregated output.
2. Each element of the aggregated output of ``process_response`` is then
   processed with either ``process_item`` or ``process_request`` before being
   returned from the spider. As with ``process_response``, each item/request
   is processed with the ``process_{request,item}`` methods of all the Leg
   Spiders composing the spider, including those of the spider itself.
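
For instance, assuming the proof-of-concept implementation at the end of this
document, the two steps above play out as follows (``StamperLeg`` and
``ExampleItem`` are hypothetical names, used only for this sketch):

::

    #!python
    class StamperLeg(LegSpider):
        """Hypothetical Leg that stamps every item passing through it."""

        def process_item(self, item):
            item['stamped'] = True
            return item

    class ExampleSpider(LegSpider):

        legs = [StamperLeg()]

        def process_response(self, response):
            # step 1: this output is aggregated with every leg's output
            return [ExampleItem(url=response.url)]

    # step 2: the item returned above passes through both
    # ExampleSpider.process_item() and StamperLeg.process_item()
    # before being yielded from ExampleSpider.parse()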

Leg Spider examples
===================

Regex (HTML) Link Extractor
---------------------------

A typical application of Leg Spiders is to build link extractors. For
example:

::

    #!python
    class RegexHtmlLinkExtractor(LegSpider):

        def process_response(self, response):
            if isinstance(response, HtmlResponse):
                allowed_regexes = self.spider.url_regexes_to_follow
                # extract urls to follow using allowed_regexes
                return [Request(x) for x in urls_to_follow]
            return []

    class MySpider(LegSpider):

        legs = [RegexHtmlLinkExtractor()]
        url_regexes_to_follow = ['/product.php?.*']

        def process_response(self, response):
            # parse response and extract items
            return items
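
Note how the Leg keeps no per-spider configuration of its own: it reads
``url_regexes_to_follow`` from the main spider through ``self.spider`` (set
by ``set_spider()``), which is what makes it reusable across spiders.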

RSS2 link extractor
-------------------

This is a Leg Spider that can be used for following links from RSS2 feeds:

::

    #!python
    class Rss2LinkExtractor(LegSpider):

        def process_response(self, response):
            if response.headers.get('Content-type') == 'application/rss+xml':
                xs = XmlXPathSelector(response)
                urls = xs.select("//item/link/text()").extract()
                return [Request(x) for x in urls]
            return []

Callback dispatcher based on rules
----------------------------------

Another example could be to build a callback dispatcher based on rules:

::

    #!python
    class CallbackRules(LegSpider):

        def set_spider(self, spider):
            # rules can only be compiled once the main spider is known
            super(CallbackRules, self).set_spider(spider)
            self._rules = {}
            for regex, method_name in self.spider.callback_rules.items():
                m = getattr(self.spider, method_name, None)
                if m:
                    self._rules[re.compile(regex)] = m

        def process_response(self, response):
            for regex, method in self._rules.items():
                if regex.search(response.url):
                    return method(response)
            return []

    class MySpider(LegSpider):

        legs = [CallbackRules()]
        callback_rules = {
            '/product.php.*': 'parse_product',
            '/category.php.*': 'parse_category',
        }

        def parse_product(self, response):
            # parse response and populate item
            return item

URL Canonicalizers
------------------

Another example could be for building URL canonicalizers:

::

    #!python
    class CanonicalizeUrl(LegSpider):

        def process_request(self, request):
            curl = canonicalize_url(request.url,
                rules=self.spider.canonicalization_rules)
            return request.replace(url=curl)

    class MySpider(LegSpider):

        legs = [CanonicalizeUrl()]
        canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

        # ...

Setting item identifier
-----------------------

Another example could be for setting a unique identifier to items, based on
certain fields:

::

    #!python
    class ItemIdSetter(LegSpider):

        def process_item(self, item):
            id_field = self.spider.id_field
            id_fields_to_hash = self.spider.id_fields_to_hash
            item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash)
            return item

    class MySpider(LegSpider):

        legs = [ItemIdSetter()]
        id_field = 'guid'
        id_fields_to_hash = ['supplier_name', 'supplier_id']

        def process_response(self, response):
            # extract item from response
            return item

Combining multiple Leg Spiders
------------------------------

Here's an example that combines functionality from multiple Leg Spiders:

::

    #!python
    class MySpider(LegSpider):

        legs = [RegexHtmlLinkExtractor(), CallbackRules(), CanonicalizeUrl(),
                ItemIdSetter()]

        url_regexes_to_follow = ['/product.php?.*']

        callback_rules = {
            '/product.php.*': 'parse_product',
            '/category.php.*': 'parse_category',
        }

        canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

        id_field = 'guid'
        id_fields_to_hash = ['supplier_name', 'supplier_id']

        def parse_product(self, response):
            # parse response and populate item
            return item

        def parse_category(self, response):
            # parse response and populate item
            return item

Leg Spiders vs Spider middlewares
=================================

A common question is when to use Leg Spiders and when to use Spider
middlewares. Leg Spiders are meant to implement spider-specific functionality,
like link extraction with custom rules per spider. Spider middlewares, on the
other hand, are meant to implement global functionality that applies to all
spiders.

When not to use Leg Spiders
===========================

Leg Spiders are not a silver bullet for implementing all kinds of spiders, so
it's important to keep their scope and limitations in mind, such as:

- Leg Spiders can't filter duplicate requests, since they don't have access to
  all requests at the same time. This functionality should be implemented in a
  spider or scheduler middleware (see the sketch below).
- Leg Spiders are meant for spiders whose behavior (requests and items to
  extract) depends only on the current page and not on previously crawled
  pages (aka "context-free spiders"). If your spider has custom logic with
  chained downloads (for example, multi-page items), then Leg Spiders may not
  be a good fit.
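
As a rough sketch of the first point, duplicate filtering fits naturally in a
spider middleware, which does see every request coming out of a spider. The
middleware below is illustrative only and not part of this proposal:

::

    #!python
    from scrapy.http import Request

    class DuplicateRequestFilterMiddleware(object):
        """Hypothetical spider middleware that drops already-seen URLs."""

        def __init__(self):
            self.seen_urls = set()

        def process_spider_output(self, response, result, spider):
            for entry in result:
                if isinstance(entry, Request):
                    if entry.url in self.seen_urls:
                        continue  # drop duplicate request
                    self.seen_urls.add(entry.url)
                yield entry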

``LegSpider`` proof-of-concept implementation
=============================================

Here's a proof-of-concept implementation of ``LegSpider``:

::

    #!python
    from scrapy.http import Request
    from scrapy.item import BaseItem
    from scrapy.spider import BaseSpider
    from scrapy.utils.spider import iterate_spider_output


    class LegSpider(BaseSpider):
        """A spider made of legs"""

        legs = []

        def __init__(self, *args, **kwargs):
            super(LegSpider, self).__init__(*args, **kwargs)
            # the spider itself takes part in the processing chain
            self._legs = [self] + self.legs[:]
            for leg in self._legs:
                leg.set_spider(self)

        def parse(self, response):
            res = self._process_response(response)
            for r in res:
                if isinstance(r, BaseItem):
                    yield self._process_item(r)
                else:
                    yield self._process_request(r)

        def process_response(self, response):
            return []

        def process_request(self, request):
            return request

        def process_item(self, item):
            return item

        def set_spider(self, spider):
            self.spider = spider

        def _process_response(self, response):
            # aggregate the outputs of all legs' process_response() methods
            res = []
            for leg in self._legs:
                res.extend(iterate_spider_output(leg.process_response(response)))
            return res

        def _process_request(self, request):
            # pass the request through every leg's process_request()
            for leg in self._legs:
                request = leg.process_request(request)
            return request

        def _process_item(self, item):
            # pass the item through every leg's process_item()
            for leg in self._legs:
                item = leg.process_item(item)
            return item