------------------------------------------------------------------------------------------------------------------------------------
How do I stop a spider?
Raise a CloseSpider exception from a callback. See CloseSpider for more details.
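A minimal sketch of that pattern; the 403 check is just an illustrative stop condition, not anything the docs prescribe:
from scrapy.exceptions import CloseSpider

def parse(self, response):
    # Stop the whole crawl from inside a callback.
    if response.status == 403:
        raise CloseSpider('received 403, stopping the crawl')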
------------------------------------------------------------------------------------------------------------------------------------
http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/broad-crawls.html
Increase concurrency
Concurrency is the number of requests processed in parallel. There is a global limit and a local (per-website) limit.
Scrapy's default global concurrency limit is not suitable for crawling many different websites at once, so you will want to increase it. How much depends on how much CPU your crawler can use. A good starting point is 100, but the best approach is to run some tests and see how CPU usage relates to concurrency for your Scrapy process. For optimal performance, pick a concurrency that keeps CPU usage at 80%-90%.
To increase the global concurrency:
CONCURRENT_REQUESTS = 100
Disable cookies
Disable cookies unless you really need them. Cookies are often not needed in broad crawls (search-engine-style crawlers ignore them), and disabling them reduces CPU usage and the per-request footprint Scrapy keeps in memory, improving performance.
To disable cookies:
COOKIES_ENABLED = False
Disable retries
Retrying failed HTTP requests slows the crawl down, especially when sites respond slowly (or fail), since hitting such sites causes timeouts followed by multiple retries. This is unnecessary and also takes crawling capacity away from other sites.
To disable retries:
RETRY_ENABLED = False
------------------------------------------------------------------------------------------------------------------------------------
Request serialization
Requests are serialized with pickle, so you need to make sure your requests are picklable. The most common issue here is using a lambda as a request callback, which cannot be pickled.
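A sketch of the lambda problem and a picklable alternative; the URL and the 'category' key are made up for illustration, and the issue only bites when requests are persisted (e.g. with JOBDIR):
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'

    def parse(self, response):
        # NOT picklable: a lambda callback breaks request serialization.
        # yield scrapy.Request(url, callback=lambda r: self.parse_detail(r, 'books'))

        # Picklable: use a bound method and carry the extra data in meta instead.
        yield scrapy.Request('http://www.example.com/some_page.html',
                             callback=self.parse_detail,
                             meta={'category': 'books'})

    def parse_detail(self, response):
        category = response.meta['category']
        yield {'url': response.url, 'category': category}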
------------------------------------------------------------------------------------------------------------------------------------
Does Scrapy manage cookies automatically?
Yes. Scrapy receives and keeps track of the cookies sent by servers, and sends them back on subsequent requests, just like a regular web browser does.
------------------------------------------------------------------------------------------------------------------------------------
DjangoItem caveats
DjangoItem provides a convenient way to integrate Django models into a Scrapy project. Keep in mind, however, that the Django ORM does not scale well if you scrape a large number of items (i.e. millions) with Scrapy. This is because a relational backend is often not a good choice for a write-intensive application such as a web crawler, especially one with a heavily indexed database.
Next you need to add /home/projects/mysite to the PYTHONPATH and set mysite.settings as the DJANGO_SETTINGS_MODULE environment variable. That can be done in your Scrapy settings file by adding the lines below:
import sys
sys.path.append('/home/projects/mysite')
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'
Note that we modify sys.path here instead of the PYTHONPATH environment variable because we are already inside a Python runtime. If everything is set up correctly, you should be able to run scrapy shell and import your Person model (i.e. from myapp.models import Person).
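With that in place, a minimal DjangoItem definition would look roughly like this; it assumes the Person model above and the scrapy-djangoitem package (older Scrapy versions shipped the same class as scrapy.contrib.djangoitem):
from scrapy_djangoitem import DjangoItem
from myapp.models import Person

class PersonItem(DjangoItem):
    # Item fields are generated from the Person model;
    # extra Scrapy-only fields can still be declared here if needed.
    django_model = Person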
------------------------------------------------------------------------------------------------------------------------------------
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
num = int(follower_num) if follower_num else 0
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
There is a MongoDB example among the item pipeline examples that shows how to use from_crawler: Write items to MongoDB
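A sketch along the lines of that docs example; MONGO_URI and MONGO_DATABASE are assumed to be custom settings defined in settings.py:
import pymongo

class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # from_crawler reads the connection settings and builds the pipeline.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item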
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
ImagesPipeline
http://scrapy-chs.readthedocs.org/zh_CN/1.0/topics/images.html?highlight=%E5%9B%BE%E7%89%87
Downloading item images
import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
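To actually use the pipeline it also has to be enabled and given a storage location in settings.py, roughly like this (the module path and directory are placeholders):
ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
IMAGES_STORE = '/path/to/valid/dir'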
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
Stats Collector??
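Presumably this note refers to Scrapy's Stats Collector. A minimal sketch of bumping a custom stat from inside a spider; the stat name and the CSS check are purely illustrative:
import scrapy

class StatsDemoSpider(scrapy.Spider):
    name = 'stats_demo'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        if response.css('.price'):
            # The stats collector is exposed through the crawler object.
            self.crawler.stats.inc_value('pages_with_price')
        yield {'url': response.url}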
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
set_crawler in CrawlSpider: http://stackoverflow.com/questions/29762151/what-is-the-function-of-set-crawler-and-from-crawler-in-crawl-py-in-scrapy
Usually you don't need to override these functions, but it depends on what you want to do.
The from_crawler method (with the @classmethod decorator) is a factory method that will be used by Scrapy to instantiate the object (spider, extension, middleware, etc) where you added it.
It's often used for getting a reference to the crawler object (that holds references to objects like settings, stats, etc) and then either pass it as arguments to the object being created or set attributes to it.
In the particular example you've pasted, it's being used to read the value from a CRAWLSPIDER_FOLLOW_LINKS setting and set it to a _follow_links attribute in the spider.
You can see another simple example of the from_crawler method in this extension, which uses the crawler object to read the value of a setting, pass it as a parameter to the extension, and connect some signals to some methods.
The set_crawler method has been deprecated in the latest Scrapy versions and should be avoided.
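A sketch of the pattern described above: an extension whose from_crawler reads a setting and connects a signal handler. MYEXT_ITEMCOUNT_INTERVAL is an assumed custom setting, not a built-in one:
from scrapy import signals

class ItemCountExtension(object):

    def __init__(self, interval, stats):
        self.interval = interval
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from the crawler's settings...
        interval = crawler.settings.getint('MYEXT_ITEMCOUNT_INTERVAL', 100)
        # ...pass crawler-held objects (here: stats) into the extension,
        # and connect a signal to one of its methods.
        ext = cls(interval, crawler.stats)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, spider):
        self.stats.inc_value('myext/items_scraped')
        count = self.stats.get_value('myext/items_scraped')
        if count % self.interval == 0:
            spider.logger.info("myext: %d items scraped", count)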
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
deferToThread (from Twisted) is used inside process_item in scrapy-redis's pipelines.
What is Twisted? http://blog.sina.com.cn/s/blog_704b6af70100py9n.html
Twisted is an event-driven networking framework written in Python. It supports many protocols, including UDP, TCP, TLS, and application-layer protocols such as HTTP, SMTP, NNTP, IRC and XMPP/Jabber. A very nice point is that Twisted ships implementations of many application-layer protocols, so developers can use them directly. In fact, modifying Twisted's SSH server implementation is quite simple; in most cases developers only need to implement a Protocol class.
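A sketch of the deferToThread pattern used in scrapy-redis's pipeline: the blocking Redis write runs in a thread pool so it does not block Twisted's reactor. Names and the serialization follow scrapy-redis only loosely:
from twisted.internet.threads import deferToThread

class RedisLikePipeline(object):

    def __init__(self, server):
        self.server = server  # a redis.Redis connection

    def process_item(self, item, spider):
        # Returns a Deferred; Scrapy waits for it before the item continues.
        return deferToThread(self._process_item, item, spider)

    def _process_item(self, item, spider):
        key = '%s:items' % spider.name
        self.server.rpush(key, str(dict(item)))  # serialization simplified
        return item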
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
Understanding scrapy-redis (project: https://github.com/rolando/scrapy-redis):
/example-project (see also https://github.com/younghz/sr-chn):
/scrapy-redis
/pipelines.py  stores processed items in a redis list keyed spider.name:items, to enable post-processing
  rewrite process_item from the example project to process items in a distributed way
/scheduler.py  the scheduler
/queue.py  implements the spider queue strategies
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
# Don't clear the redis queue; allows pausing and resuming the crawl
SCHEDULER_PERSIST = True
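For context, a sketch of the other settings scrapy-redis is typically enabled with, alongside SCHEDULER_PERSIST above; the class paths follow the scrapy-redis README and the REDIS_URL value is a placeholder:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'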
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
How to access settings
Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler, which is passed to the from_crawler method of extensions and middlewares:
class MyExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        if settings['LOG_ENABLED']:
            print("log is enabled!")
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
Using a bloomfilter for deduplication
from pybloom import BloomFilter
from scrapy.utils.job import job_dir
from scrapy.dupefilter import BaseDupeFilter

class BLOOMDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None):
        self.file = None
        self.fingerprints = BloomFilter(2000000, 0.00001)

    @classmethod
    def from_settings(cls, settings):
        return cls(job_dir(settings))

    def request_seen(self, request):
        fp = request.url
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)

    def close(self, reason):
        self.fingerprints = None

# Somewhere in settings.py
DUPEFILTER_CLASS = "project_name.bloom_filter.BLOOMDupeFilter"
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````
Questions about distributed crawling and scrapy-redis: what is the dedup strategy, and how do you crawl an entire website?
Crawler questions:
Crawl the whole website's pages layer by layer, following the links from the homepage.
When does it stop? (What is the condition for deciding the entire website has been crawled?)
scrapy-redis questions:
Deduplication:
Source from dupefilter.py:
def request_seen(self, request):
    fp = request_fingerprint(request)
    added = self.server.sadd(self.key, fp)
    return not added
So dedup is implemented by storing each request's fingerprint in redis, right? Doesn't that consume a lot of memory on large-scale crawls, since every unseen link gets stored?
Has anyone tried combining a bloomfilter with scrapy-redis for dedup? Is it worth doing?
---------------------------------------------------------------------------------------------
Questions:
How to handle tag.select('div[class="..."]') when the class value must not contain spaces?
Should \r\n be stripped in the pipeline?
If a list [] is passed into the database, what does it get stored as?