A bare-minimum Cookiecutter template for a Scrapy project, ready to deploy to Scrapy Cloud.
What is included:

- Default stack is Scrapy 2.0.1 and Python 3.8.
- Standard project files generated by `scrapy startproject`.
- A default `requirements.txt` managed by `pip-compile` (a usage sketch follows this list).
- A default `scrapinghub.yml`, the Scrapy Cloud configuration file.
- A default `Dockerfile` to build a Docker image that replicates the Scrapy Cloud running container.
- Useful scripts to archive job items and dump items from a Collection.
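The requirements workflow assumed here is the usual `pip-tools` one: top-level dependencies live in an input file (typically `requirements.in`) and `pip-compile` pins the full dependency tree into `requirements.txt`. A minimal sketch, assuming the template keeps such an input file:

```
# Sketch only: assumes top-level dependencies (e.g. scrapy==2.0.1) are listed in requirements.in.
$ pip-compile requirements.in --output-file requirements.txt
# requirements.txt now contains the fully pinned dependency tree; install it as usual.
$ pip install -r requirements.txt
```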
- Install `cookiecutter` if you do not have it already: `pip install cookiecutter`
- Generate a new project: `cookiecutter gh:krectra/cookiecutter-scrapycloud`
We will show how to create and deploy a Scrapy project to Scrapy Cloud. You will need an account on Scrapy Cloud and a Scrapy Cloud project already created. In this example, we use OS X as the development machine.
First we bootstrap our project and create a spider:
```
$ cookiecutter gh:rolando/cookiecutter-scrapycloud
project_name [Project Name]: myproject
project_slug [myproject]:
project_module [myproject]:
scrapycloud_id [Scrapy Cloud Project ID]: 12345
$ cd myproject
$ pip install -r requirements.txt -r dev-requirements.txt
$ cat > myproject/spiders/myspider.py <<EOF
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re('.*/category/.*'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
```
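While developing, a single page can be fed through a specific callback with Scrapy's built-in `parse` command, which is handy for checking `parse_titles` in isolation. A quick sketch (the URL is just an example category page):

```
$ scrapy parse --spider=blogspider --callback=parse_titles 'https://blog.scrapinghub.com/category/scrapy'
```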
Now we can run the spider on our host:

```
$ scrapy crawl blogspider
```
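To keep the scraped items rather than just see them in the log, Scrapy's feed exports can write them to a file, for example:

```
$ scrapy crawl blogspider -o items.json
```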
We can also build a Docker image to replicate the Scrapy Cloud container:

```
$ docker build -t myproject .
$ docker run -it myproject scrapy crawl blogspider
```
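By default nothing is written back to the host; one way to keep the items from a containerized run is to mount a host directory and point the feed export at it (the `output/` path is just an illustration):

```
$ mkdir -p output
$ docker run -it -v "$PWD/output:/output" myproject scrapy crawl blogspider -o /output/items.json
```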
Finally, we can deploy to our Scrapy Cloud project and schedule the spider:
```
$ shub deploy
$ shub schedule blogspider
```
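`shub` can also pass spider arguments and settings when scheduling, and fetch a finished job's items from the command line. A quick sketch (the argument name and job ID are illustrative):

```
# Pass a spider argument and a Scrapy setting to the scheduled job.
$ shub schedule blogspider -a tag=demo -s CLOSESPIDER_ITEMCOUNT=100
# Fetch the items of a finished job.
$ shub items 12345/1/1
```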
- `bin/archive-items.py`: A full-featured script to export all jobs' output to a given collection. It can be set up as a Periodic Job on Scrapy Cloud.
- `bin/dump-collection.py`: A simple script to dump a collection's items (a sketch of the underlying client API follows this list).
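The bundled scripts are not reproduced here, but they sit on top of the `scrapinghub` Python client. A minimal sketch of dumping a collection with that client, assuming an API key in an environment variable and a hypothetical collection name (this is not the actual interface of `bin/dump-collection.py`):

```python
import json
import os

from scrapinghub import ScrapinghubClient  # pip install scrapinghub

# Sketch only: the real bin/dump-collection.py may take different arguments.
API_KEY = os.environ["SH_APIKEY"]   # Scrapy Cloud API key (assumed env var)
PROJECT_ID = 12345                  # the scrapycloud_id used above
COLLECTION = "blog_titles"          # hypothetical collection name

client = ScrapinghubClient(API_KEY)
project = client.get_project(PROJECT_ID)
store = project.collections.get_store(COLLECTION)

# Iterate over every item in the collection and print it as a JSON line.
for item in store.iter():
    print(json.dumps(item))
```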
This project is licensed under the terms of the MIT License.