A bare-minimum Cookiecutter template for a Scrapy project, ready to deploy to Scrapy Cloud.
What is included:

- Default stack is Scrapy 2.0.1 and Python 3.8.
- Standard project files generated by `scrapy startproject`.
- A default `requirements.txt` managed by `pip-compile` (a usage sketch follows this list).
- A default `scrapinghub.yml`, the Scrapy Cloud configuration file.
- A default `Dockerfile` to build a Docker image that replicates the Scrapy Cloud running container.
- Useful scripts to archive job items and dump items from a Collection.
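The requirements workflow assumed here is the usual `pip-tools` one: top-level dependencies live in an input file (typically `requirements.in`) and `pip-compile` pins the full dependency tree into `requirements.txt`. A minimal sketch, assuming the template keeps such an input file:

```
# Sketch only: assumes top-level dependencies (e.g. scrapy==2.0.1) are listed in requirements.in.
$ pip-compile requirements.in --output-file requirements.txt
# requirements.txt now contains the fully pinned dependency tree; install it as usual.
$ pip install -r requirements.txt
```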
- Install `cookiecutter` if you do not have it already: `pip install cookiecutter`
- Generate a new project: `cookiecutter gh:krectra/cookiecutter-scrapycloud`
We will show how to create and deploy a Scrapy project to Scrapy Cloud. You will need an account on Scrapy Cloud and a Scrapy Cloud project already created. In this example, we use OS X as the development machine.
First we bootstrap our project and create a spider:
```
$ cookiecutter gh:rolando/cookiecutter-scrapycloud
project_name [Project Name]: myproject
project_slug [myproject]:
project_module [myproject]:
scrapycloud_id [Scrapy Cloud Project ID]: 12345
$ cd myproject
$ pip install -r requirements.txt -r dev-requirements.txt
$ cat > myproject/spiders/myspider.py <<EOF
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re('.*/category/.*'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
```
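While developing, a single page can be fed through a specific callback with Scrapy's built-in `parse` command, which is handy for checking `parse_titles` in isolation. A quick sketch (the URL is just an example category page):

```
$ scrapy parse --spider=blogspider --callback=parse_titles 'https://blog.scrapinghub.com/category/scrapy'
```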
Now we can run the spider on our host:

```
$ scrapy crawl blogspider
```
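To keep the scraped items rather than just see them in the log, Scrapy's feed exports can write them to a file, for example:

```
$ scrapy crawl blogspider -o items.json
```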
We can also build a Docker image to replicate the Scrapy Cloud container:

```
$ docker build -t myproject .
$ docker run -it myproject scrapy crawl blogspider
```
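By default nothing is written back to the host; one way to keep the items from a containerized run is to mount a host directory and point the feed export at it (the `output/` path is just an illustration):

```
$ mkdir -p output
$ docker run -it -v "$PWD/output:/output" myproject scrapy crawl blogspider -o /output/items.json
```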
Finally, we can deploy to our Scrapy Cloud project and schedule the spider:
```
$ shub deploy
$ shub schedule blogspider
```
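`shub` can also pass spider arguments and settings when scheduling, and fetch a finished job's items from the command line. A quick sketch (the argument name and job ID are illustrative):

```
# Pass a spider argument and a Scrapy setting to the scheduled job.
$ shub schedule blogspider -a tag=demo -s CLOSESPIDER_ITEMCOUNT=100
# Fetch the items of a finished job.
$ shub items 12345/1/1
```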
- `bin/archive-items.py`: A full-featured script to export all jobs' output to a given collection. It can be set up as a Periodic Job on Scrapy Cloud.
- `bin/dump-collection.py`: A simple script to dump a collection's items (a sketch of the underlying client API follows this list).
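The bundled scripts are not reproduced here, but they sit on top of the `scrapinghub` Python client. A minimal sketch of dumping a collection with that client, assuming an API key in an environment variable and a hypothetical collection name (this is not the actual interface of `bin/dump-collection.py`):

```python
import json
import os

from scrapinghub import ScrapinghubClient  # pip install scrapinghub

# Sketch only: the real bin/dump-collection.py may take different arguments.
API_KEY = os.environ["SH_APIKEY"]   # Scrapy Cloud API key (assumed env var)
PROJECT_ID = 12345                  # the scrapycloud_id used above
COLLECTION = "blog_titles"          # hypothetical collection name

client = ScrapinghubClient(API_KEY)
project = client.get_project(PROJECT_ID)
store = project.collections.get_store(COLLECTION)

# Iterate over every item in the collection and print it as a JSON line.
for item in store.iter():
    print(json.dumps(item))
```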
This project is licensed under the terms of the MIT License.