=======  ==============================
SEP      12
Title    Spider name
Author   Ismael Carnales, Pablo Hoffman
Created  2009-12-01
Updated  2010-03-23
Status   Final
=======  ==============================

====================
SEP-012: Spider name
====================

Spiders are currently referenced by their ``domain_name`` attribute. This SEP
proposes adding a ``name`` attribute to spiders and using it as their
identifier.

Current limitations and flaws
=============================

1. You can't create two spiders that scrape the same domain, without using
   workarounds like assigning an arbitrary ``domain_name`` and putting the
   real domains in the ``extra_domain_names`` attribute (see the sketch
   below).
2. For spiders with multiple domains, you have to specify them in two
   different places: ``domain_name`` and ``extra_domain_names``.

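For illustration, a minimal sketch of the workaround from point 1, assuming
the pre-change API (the spider classes are hypothetical):

::

    from scrapy.spider import BaseSpider

    # Two spiders that both scrape google.com: since ``domain_name`` must be
    # unique, the second spider gets an arbitrary identifier and the real
    # domain is relegated to ``extra_domain_names``.
    class GoogleSearchSpider(BaseSpider):
        domain_name = 'google.com'

    class GoogleImagesSpider(BaseSpider):
        domain_name = 'google-images'        # arbitrary, not a real domain
        extra_domain_names = ['google.com']  # the domain actually scraped
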
Proposed changes
================

1. Add a ``name`` attribute to spiders and use it as their unique identifier.
2. Merge the ``domain_name`` and ``extra_domain_names`` attributes into a
   single list, ``allowed_domains`` (see the sketch below).

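Under the proposed API, the two spiders from the previous sketch could
coexist without workarounds (again, the classes are hypothetical):

::

    from scrapy.spider import BaseSpider

    # ``name`` is now the unique identifier, so both spiders can target the
    # same domain, and all domains live in a single ``allowed_domains`` list.
    class GoogleSearchSpider(BaseSpider):
        name = 'google_search'
        allowed_domains = ['google.com']

    class GoogleImagesSpider(BaseSpider):
        name = 'google_images'
        allowed_domains = ['google.com']
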
Implications of the changes
===========================

General
-------

In general, all references to ``spider.domain_name`` will be replaced by
``spider.name``.

OffsiteMiddleware
-----------------

``OffsiteMiddleware`` will use ``spider.allowed_domains`` to determine the
domain names of a spider.

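A rough sketch of how the middleware might derive its offsite filter from
``allowed_domains`` (illustrative only, not the actual implementation):

::

    import re

    def get_host_regex(spider):
        """Build a regex matching the hosts allowed for the given spider."""
        allowed = getattr(spider, 'allowed_domains', None)
        if not allowed:
            return re.compile('')  # empty pattern: allow every host
        domains = '|'.join(re.escape(d) for d in allowed if d)
        return re.compile(r'^(.*\.)?(%s)$' % domains)

    def is_offsite(host, spider):
        """Return True if requests to this host should be filtered out."""
        return not get_host_regex(spider).search(host)
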
scrapy-ctl.py
-------------

crawl
~~~~~

The new syntax for the ``crawl`` command will be:

::

    crawl [options] <spider|url> ...

If you provide a URL, the command will try to find the spider that processes
it. If no spider is found, or more than one is found, it will raise an error;
in those cases you must specify the spider to use with the ``--spider``
option. A sketch of this resolution rule follows.

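A minimal sketch of the URL-to-spider resolution described above (the helper
name and error handling are hypothetical):

::

    from urllib.parse import urlparse

    def resolve_spider(url, spiders):
        """Return the single spider whose allowed domains match the URL."""
        host = urlparse(url).hostname or ''
        matches = [s for s in spiders
                   if any(host == d or host.endswith('.' + d)
                          for d in getattr(s, 'allowed_domains', []))]
        if len(matches) != 1:
            raise ValueError('%d spiders found for %s, use --spider'
                             % (len(matches), url))
        return matches[0]
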
genspider
~~~~~~~~~

The new signature for the ``genspider`` command will be:

::

    genspider [options] <name> <domain>

Example:

::

    $ scrapy-ctl genspider google google.com

    $ ls project/spiders/
    project/spiders/google.py

    $ cat project/spiders/google.py

::

    class GooglecomSpider(BaseSpider):
        name = 'google'
        allowed_domains = ['google.com']

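For illustration, one way ``genspider`` might derive a class name like
``GooglecomSpider`` from a domain (the helper below is hypothetical):

::

    def class_name_from_domain(domain):
        """Turn 'google.com' into 'GooglecomSpider' (illustrative guess)."""
        # Strip a leading 'www.' and drop non-alphanumeric characters.
        stripped = domain[4:] if domain.startswith('www.') else domain
        base = ''.join(c for c in stripped if c.isalnum())
        return base.capitalize() + 'Spider'

    print(class_name_from_domain('google.com'))  # GooglecomSpider
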
.. note:: ``allowed_domains`` becomes optional, as only ``OffsiteMiddleware``
   uses it.