Skip to content

Latest commit

 

History

History
159 lines (129 loc) · 6.49 KB

README.md

File metadata and controls

159 lines (129 loc) · 6.49 KB

scrapy-puppeteer-service

This is special service that runs puppeteer instances. It is a part of scrapy-pupeteer middleware that helps to handle javascript pages in scrapy using puppeteer. This allows to scrape sites that require JS to function properly and to make the scraper more similar to humans.

⚠️ This repository is under development.

This project is under development. Use it at your own risk.

Usage

On your host machine you should enable user namespace cloning.

sudo sysctl -w kernel.unprivileged_userns_clone=1

To start service run the docker container. Since the Dockerfile adds a pptr user as a non-privileged user, it may not have all the necessary privileges. So you should use docker run --cap-add=SYS_ADMIN option.

$ docker run -d -p 3000:3000 --name scrapy-puppeter-service --cap-add SYS_ADMIN isprascrawlers/scrapy-puppeteer-service 

To run example which shows how to deploy several instances of service with load balancer use this command.

$ docker-compose up -d

API

Here is the list of implemented methods that could be used to connect to puppeteer. For All requests puppeteer browser creates new incognito browser context and new page in it. If your want to reuse your browser context simple send context_id in your query. All request return their context ids in response. Also you could reuse your browser page and more actions with it. In order to do so you should send in your request pageId that is returned in your previous request, that would make service reuse current page and return again its pageId. If you want to close the page you are working with you should send in query param "closePage" with non-empty value. If you want your requests on page make through proxy, just add to normal request "proxy" param. Proxy username and password params are optional. Also you can add extra http headers to each request that is made on page.

{
  //request params  
  "proxy": "{protocol}://{username}:{password}@{proxy_ip}:{proxy_port}",
  "headers": {
    "My-Special-Header": "It's value."
  }
}

/goto

This method allow to goto a page with a specific url in puppeteer.

Params:

url - the url which puppeteer should navigate to.
navigationOptions - possible options to use for request.
waitOptions - wait for selector or timeout after navigation completes, same as in click or scroll.

/back and /forward

This methods helps to navigate back and forward to see previously seen pages.

/click

This method allow to click on first element that is matched by selector and return page result.

Example request body:

{
    "selector": "", //<string> A selector to search for element to click. If there are multiple elements satisfying the selector, the first will be clicked.
    "clickOptions": {
        "button":  "left", //<",left"|"right"|"middle"> Defaults to left.
        "clickCount": 1, //<number> defaults to 1.
        "delay": 0 //<number> Time to wait between mousedown and mouseup in milliseconds. Defaults to 0.
    },
    "waitOptions": {
        // if selectorOrTimeout is a string, then the first argument is treated as a selector or xpath, depending on whether or not it starts with '//', and the method is a shortcut for page.waitForSelector or page.waitForXPath
        // if selectorOrTimeout is a number, then the first argument is treated as a timeout in milliseconds and the method returns a promise which resolves after the timeout
        "selectorOrTimeout": 5, //default timeout is 1000ms
    },
    "navigationOptions": { // use if click triggers navigation to other page; same as in goXXX methods
        "waitUntil": "domcontentloaded",    
    } 
}

/scroll

This method allow to scroll page to the first element that is matched by selector and returns page result.

Example request body:

{
    "selector": "", //<string> A selector to search for element to click. If there are multiple elements satisfying the selector, the first will be clicked.
    "waitOptions": {
        // if selectorOrTimeout is a string, then the first argument is treated as a selector or xpath, depending on whether or not it starts with '//', and the method is a shortcut for page.waitForSelector or page.waitForXPath
        // if selectorOrTimeout is a number, then the first argument is treated as a timeout in milliseconds and the method returns a promise which resolves after the timeout
        "selectorOrTimeout": 5, //default timeout is 1000ms
    }
 }

/action

Body of this request should be a js code that declares function action with at least page parameter. The content type of request should be:

 Content-Type: application/javascript

Simple example request body of goto:

async function action(page, request) {
    await page.goto(request.query.uri);
    let response = { //return response that you want to see as result
        context_id: page.browserContext()._id,
        page_id: await page._target._targetId,
        html: await page.content(),
        cookies: await page.cookies()
    };
    await page.close();
    return response;
 }

/screenshot

This method returns screenshots of current page more.
Description of options you can see on puppeteer github. The path options is omitted in options. Also the only possibly encoding is base64.

Example request body:

{
    "options": {
        "type": "png",
        "quality": 100,
        "fullPage" : true 
     }
 }

/close_context

This method close browser context and all its pages. Be sure you finished all you requests to this context.

Notes on memory usage

You need to explicitly close the browser tab once you don't need it (e.g. at the end of the parse method).

TODO

  • skeleton that could handle goto, click, scroll, and actions.
  • proxy support for puppeteer
  • support of extra headers
  • error handling for requests
  • har support
  • scaling to several docker containers