import { CodeSurfer, CodeSurferColumns, Step } from "code-surfer";
import { github } from "@code-surfer/themes";

export const theme = { ...github, aspectRatio: 16 / 9, };

Web Scraping with Python + Scrapy

Alessandro Martini

  • Full Stack Developer @ Loadsmart
My name is Alessandro, and I work as a Full Stack Developer at Loadsmart.

What is Web Scraping?

Web scraping is the technique of collecting data that is available on the Internet.

What is Scrapy?

  • Batteries included 🔋

  • Scrapy Cloud ☁️

  • CLI

Scrapy is an extensible framework for collecting data from websites. Talk about Scrapy Cloud: explain what it is for and how to use it.

What is it for?

Oversight of public administration, as in Serenata de Amor; collecting data to train AI models; market research; or automating manual tasks.

How do we do it?

Demo time!

Our target

<iframe src="" width="100%" height="100%" />
Take a quick tour of the site, showing the `next` button and the page of a single book.

What we are going to collect

  1. Title
  2. Price
  3. Cover image
  4. UPC code
  5. Quantity in stock
  6. Rating
  7. Category
These are the fields we want to collect. Show that some of them could be extracted from the listing page, but the UPC and the availability could not.
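The fields listed above can be captured in a small record type. This is a minimal sketch, not code from the talk: the `Book` dataclass and its all-strings typing are assumptions, and `url` is added only because the spider also yields it. The sample values are the ones shown in the deck's example output.

```python
from dataclasses import dataclass, asdict


@dataclass
class Book:
    """Hypothetical record for one scraped book; values stay as raw strings."""
    title: str
    price: str
    cover: str
    upc: str
    in_stock: str
    rating: str
    category: str
    url: str


book = Book(
    title="Emma",
    price="£32.93",
    cover="",
    upc="2e69730561ed70ad",
    in_stock="In stock (1 available)",
    rating="two",
    category="classics",
    url="",
)
print(asdict(book))
```

Keeping every field as a string mirrors what the CSS and XPath selectors return; type conversion can happen later, in a pipeline.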
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        pass
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield scrapy.Request(next_page_url, callback=self.parse)
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield scrapy.Request(next_page_url, callback=self.parse)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield scrapy.Request(book_url, callback=self.parse_books)
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield scrapy.Request(next_page_url, callback=self.parse)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield scrapy.Request(book_url, callback=self.parse_books)

    def parse_books(self, book):
        title = book.css("h1::text").get()
        price = book.css(".price_color::text").get()
        cover = book.urljoin(book.css("img::attr(src)").get())
        upc = book.css("tr:first-of-type > td::text").get()
        in_stock = book.xpath("//*[starts-with(text(),'In stock')]/text()").get()
        rating_class = book.css(".star-rating").xpath("@class").get().lower()
        rating = rating_class.split()[-1]
        category = book.css(".breadcrumb > li:nth-child(3) > a::text").get().lower()
        yield {
            "title": title,
            "price": price,
            "cover": cover,
            "upc": upc,
            "in_stock": in_stock,
            "rating": rating,
            "category": category,
            "url": book.url,
        }
{
    "title": "Emma",
    "price": "£32.93",
    "cover": "",
    "upc": "2e69730561ed70ad",
    "in_stock": "In stock (1 available)",
    "rating": "two",
    "category": "classics",
    "url": ""
}
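The item above still carries raw strings such as `"£32.93"` and `"In stock (1 available)"`. One possible way to turn them into typed values is sketched below; `clean_book` and `RATING_WORDS` are hypothetical helpers, not part of the talk or of Scrapy.

```python
import re
from decimal import Decimal

# Hypothetical mapping from the star-rating class name to a number.
RATING_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def clean_book(raw):
    """Return a copy of a scraped item with typed price, stock and rating."""
    item = dict(raw)
    # "£32.93" -> Decimal("32.93"): keep only digits and the decimal point.
    item["price"] = Decimal(re.sub(r"[^\d.]", "", raw["price"]))
    # "In stock (1 available)" -> 1; fall back to 0 when no count is present.
    match = re.search(r"\((\d+) available\)", raw["in_stock"])
    item["in_stock"] = int(match.group(1)) if match else 0
    # "two" -> 2; None for an unexpected class name.
    item["rating"] = RATING_WORDS.get(raw["rating"])
    return item


cleaned = clean_book({
    "title": "Emma",
    "price": "£32.93",
    "in_stock": "In stock (1 available)",
    "rating": "two",
})
print(cleaned)
```

A function like this would fit naturally inside an item pipeline's `process_item`, so the spider itself stays focused on extraction.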
class DemoPipeline:
    def process_item(self, item, spider):
        return item
class DemoPipeline:
    session = None

    def open_spider(self, spider):
        self.session = get_session()

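    def close_spider(self, spider):
        # Assumed reconstruction: the original body was lost. commit(),
        # rollback() and close() presume a SQLAlchemy-style session.
        try:
            self.session.commit()
        except Exception:
            self.session.rollback()
            raise
        finally:
            self.session.close()
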
    def process_item(self, item, spider):
        return item
class DemoPipeline:
    session = None

    def open_spider(self, spider):
        self.session = get_session()

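    def close_spider(self, spider):
        # Assumed reconstruction: the original body was lost. commit(),
        # rollback() and close() presume a SQLAlchemy-style session.
        try:
            self.session.commit()
        except Exception:
            self.session.rollback()
            raise
        finally:
            self.session.close()

    def process_item(self, item, spider):
        book = Book(**item)
        # Assumption: the model instance is queued on the session here.
        self.session.add(book)
        return item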
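The pipeline steps above depend on `get_session` and `Book`, which the slides do not define. As a self-contained illustration of the same `open_spider` / `process_item` / `close_spider` lifecycle, here is a sketch that uses only the standard library's `sqlite3`; the class name, table layout and default path are assumptions made for this example.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline that stores books in a SQLite table."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database.
        self.connection = sqlite3.connect(self.path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS books "
            "(upc TEXT PRIMARY KEY, title TEXT, price TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: persist and release.
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.connection.execute(
            "INSERT OR REPLACE INTO books (upc, title, price) VALUES (?, ?, ?)",
            (item["upc"], item["title"], item["price"]),
        )
        return item


pipeline = SQLitePipeline()
pipeline.open_spider(None)
pipeline.process_item(
    {"upc": "2e69730561ed70ad", "title": "Emma", "price": "£32.93"}, None
)
rows = pipeline.connection.execute("SELECT upc, title FROM books").fetchall()
pipeline.close_spider(None)
print(rows)
```

In a real project the class would be enabled through the `ITEM_PIPELINES` setting rather than driven by hand as it is here.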

What about JavaScript? :thinking_face:

Scrapy only makes plain HTTP requests; it has no engine for rendering JavaScript. That is a problem if you want to collect data from a site built with JS frameworks (Vue or React, for example), or if you want to interact with the page.


  • Headless
  • Lightweight


  • Full-fledged

  • No graphical interface

  • Selenium

They have all the features of a browser and can be controlled through code. They are usually slower, but easier to interact with.


  • Fewer features

  • Supports JavaScript

  • Splash

They do not have all the features of a regular browser, but they are faster and lighter, which makes them ideal for scraping projects.


Splash works as a service with an endpoint that returns the rendered site, and it can be run in Docker. It lets you execute scripts in JavaScript and Lua.
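To make the "service with an endpoint" idea concrete, the sketch below builds the URL for Splash's `render.html` endpoint, which takes the target page in a `url` query argument and an optional `wait` in seconds. It assumes a Splash instance running locally on port 8050; `SPLASH_URL` and `splash_render_url` are names invented for this example, and nothing is actually fetched.

```python
from urllib.parse import urlencode

# Assumption: a local Splash instance listening on port 8050.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, wait=1.0, splash_url=SPLASH_URL):
    """Build the render.html URL asking Splash for the rendered HTML of page_url."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


print(splash_render_url("https://example.com/"))
```

Requesting that URL with any HTTP client returns the page after JavaScript has run, which is what `SplashRequest` arranges on the spider's behalf.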
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield scrapy.Request(next_page_url, callback=self.parse)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield scrapy.Request(book_url, callback=self.parse_books)
import scrapy
from scrapy_splash import SplashRequest

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield scrapy.Request(next_page_url, callback=self.parse)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield scrapy.Request(book_url, callback=self.parse_books)
import scrapy
from scrapy_splash import SplashRequest

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']
    splash_kwargs = {
        "args": {"wait": 1, "html": 1, "png": 1},
        "endpoint": "render.html",
    }

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield scrapy.Request(next_page_url, callback=self.parse)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield scrapy.Request(book_url, callback=self.parse_books)
import scrapy
from scrapy_splash import SplashRequest

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']
    splash_kwargs = {
        "args": {"wait": 1, "html": 1, "png": 1},
        "endpoint": "render.html",
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, **self.splash_kwargs)

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield scrapy.Request(next_page_url, callback=self.parse)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield scrapy.Request(book_url, callback=self.parse_books)
import scrapy
from scrapy_splash import SplashRequest

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']
    splash_kwargs = {
        "args": {"wait": 1, "html": 1, "png": 1},
        "endpoint": "render.html",
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, **self.splash_kwargs)

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield SplashRequest(next_page_url, self.parse, **self.splash_kwargs)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield SplashRequest(book_url, self.parse_books, **self.splash_kwargs)
import base64

import scrapy
from scrapy_splash import SplashRequest

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['']
    start_urls = ['']
    splash_kwargs = {
        "args": {"wait": 1, "html": 1, "png": 1},
        "endpoint": "render.html",
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, **self.splash_kwargs)

    def parse(self, response):
        for next_href in response.xpath("//a[.='next']/@href").getall():
            next_page_url = response.urljoin(next_href)
            yield SplashRequest(next_page_url, self.parse, **self.splash_kwargs)

        for book in response.css("article.product_pod"):
            book_href = book.css("a::attr(href)").get()
            book_url = response.urljoin(book_href)
            yield SplashRequest(book_url, self.parse_books, **self.splash_kwargs)

        html = response.body
        # response.data holds Splash's decoded JSON reply; the "png" key is
        # only available when the request uses the render.json endpoint.
        png_bytes = base64.b64decode(response.data['png'])
        # ...

# Obey robots.txt rules

User-agent: *
Crawl-delay: 4

User-agent: Googlebot
Disallow: /private

User-agent: annoyng-bot
Disallow: /
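The rules above can be checked programmatically before crawling. The sketch below feeds the same robots.txt text to the standard library's `urllib.robotparser`; the `example.com` URLs are placeholders, and `Meu-Bot` is the user agent name used in the settings later in this deck.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 4

User-agent: Googlebot
Disallow: /private

User-agent: annoyng-bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A bot with no dedicated section falls back to the "*" rules.
print(parser.can_fetch("Meu-Bot", "https://example.com/private"))
print(parser.crawl_delay("Meu-Bot"))

# Bots with their own section get only that section's rules.
print(parser.can_fetch("Googlebot", "https://example.com/private"))
print(parser.can_fetch("annoyng-bot", "https://example.com/"))
```

Scrapy performs an equivalent check automatically when `ROBOTSTXT_OBEY` is enabled, so a manual check like this is mostly useful for debugging.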

# Crawl responsibly by identifying yourself
# (and your website) on the user-agent
USER_AGENT = "Meu-Bot ([email protected])"

# Use a random delay between 0.5 * DOWNLOAD_DELAY
# and 1.5 * DOWNLOAD_DELAY, in seconds

# Slow the crawler down based on
# the load of the crawler and of the site being scraped.
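The comments above describe Scrapy settings whose assignments are not shown on the slide. A possible `settings.py` fragment is sketched below. The setting names are Scrapy's own, but every value is an assumption chosen for illustration: the 4-second delay simply mirrors the `Crawl-delay` in the robots.txt example, and the contact text is a placeholder.

```python
# settings.py (sketch; all values below are illustrative assumptions)

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Identify the bot and give site owners a way to reach you.
USER_AGENT = "Meu-Bot (contact address goes here)"

# Base delay between requests to the same site, in seconds.
DOWNLOAD_DELAY = 4

# Wait a random time between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Adapt the crawl speed to the load of the crawler and of the target site.
AUTOTHROTTLE_ENABLED = True

# Resulting delay window with the values above.
delay_range = (0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY)
print(delay_range)
```

The `delay_range` line is not a Scrapy setting; it only shows the window that the randomized delay would fall into for the chosen base value.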

Use in moderation

  • Respect robots.txt

  • Keep an eye on the site's performance

  • Provide contact information

You may also be interested in

Source code

All the source code is available at


