@open-automaton/automaton

A web scraping/RPA solution built for ease of use, maintenance and (soon™) deployment. It uses an XML-based DSL which defines both the scraping process and the structure of the returned data. It compares favorably to UiPath, Blue Prism ALM, Kapow (now Kofax RPA) and Apify. These solutions make the work of building and maintaining scrapers far easier than directly using a primary scraping solution (like Playwright, Puppeteer, jsdom, cheerio, Selenium, Windmill, BeautifulSoup or others).



Usage

Here we're going to do a simple scrape of unprotected data on Craigslist (you should use their available RSS feed instead, but it serves as an excellent example of how to harvest results, and it works in all the engines):

<go url="https://sfbay.craigslist.org/search/apa">
    <set xpath="//li[@class='result-row']" variable="matches">
        <set
            xpath="//time[@class='result-date']/text()"
            variable="time"
        ></set>
        <set
            xpath="//span[@class='result-price']/text()"
            variable="price"
        ></set>
        <set
            xpath="//span[@class='housing']/text()"
            variable="housing"
        ></set>
        <set
            xpath="string(//img/@src)"
            variable="link"
        ></set>
    </set>
    <emit variables="matches"></emit>
</go>

Automaton definitions can be used in whatever context they are needed: from the command line, from your own code or from a GUI (Soon™).

In Code

First, import automaton

const Automaton = require('@open-automaton/automaton');

Then import the mining engine you want to use

  • Cheerio
    const MiningEngine = require(
        '@open-automaton/cheerio-mining-engine'
    );
    let myEngine = new MiningEngine();
  • Puppeteer
    const MiningEngine = require(
        '@open-automaton/puppeteer-mining-engine'
    );
    let myEngine = new MiningEngine();
  • Playwright: Chromium
    const MiningEngine = require(
        '@open-automaton/playwright-mining-engine'
    );
    let myEngine = new MiningEngine({type:'chromium'});
  • Playwright: Firefox
    const MiningEngine = require(
        '@open-automaton/playwright-mining-engine'
    );
    let myEngine = new MiningEngine({type:'firefox'});
  • Playwright: Webkit
    const MiningEngine = require(
        '@open-automaton/playwright-mining-engine'
    );
    let myEngine = new MiningEngine({type:'webkit'});
  • JSDom
    const MiningEngine = require(
        '@open-automaton/jsdom-mining-engine'
    );
    let myEngine = new MiningEngine();

Last, you need to run the scrape (inside an `async` function)

let results = await Automaton.scrape(
    'definition.xml',
    myEngine
);

That's all it takes. If you need a different usage pattern, that is supported as well.
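
Putting those pieces together, a minimal end-to-end sketch (assuming the cheerio engine and the Craigslist definition above saved as definition.xml) looks like this:

// minimal sketch: run the Craigslist definition above with the cheerio engine
const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/cheerio-mining-engine');

(async ()=>{
    let myEngine = new MiningEngine();
    let results = await Automaton.scrape('definition.xml', myEngine);
    console.log(results); // the emitted 'matches' data
})();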

CLI

    npm install -g automaton-cli
    auto --help

GUI

[TBD]

Scraper Actions

The automaton DSL is centered around three actions, which navigate pages and populate the returned dataset. Many attributes are common to all elements, and most common use cases are covered.
go

A progression from page to page, either by loading a URL, submitting a form or clicking a UI element. Requires either the url or the form attribute.

type accepts json, application/json or form

Some engines that use the browser will only submit using the form configuration on the page and ignore the method and type options.

<go
    url="https://domain.com/path/"
    form="form-name"
    method="post"
    type="application/json"
></go>

set

Either use a variable to set a target input on a form, or set a variable using an xpath or regex. Lists are extracted by putting sets inside another set.

<set
    variable="variable-name"
    xpath="//xpath/expression"
    regex="[regex]+.(expression)"
    form="form-name"
    target="input-element-name"
></set>

emit

Emit a value to the returned dataset and optionally post that value to a remote URL.

<emit
    variables="some,variables"
    remote="https://domain.com/path/"
></emit>

Maintaining Scrapers

Here's a basic process for scraping data that sits behind a simple form.

First you'll want to understand XPath (and probably the DOM, regexes and CSS selectors) before proceeding, as most of the selectors in a good definition are XPath expressions kept as general as possible.

Once you're done with that, the auto command (which you get by installing the CLI) has a few operations we'll be using.

1) Save the form URL

You want to scrape the *state* of the DOM once the page is loaded, but if you use a tool like `curl` you'll only get the *transfer state* of the page, which is probably not useful. `auto fetch` pulls the state of the DOM out of a running browser and displays that HTML.

auto fetch https://domain.com/path/ > page.html

2) Target the form

The first thing you might do against the HTML you've captured is pull all the forms out of the page, like this:

auto xpath "//form" page.html

3) Target the inputs

Assuming you've identified the form name you are targeting as my-form-name, you then want to get all the inputs out of it with something like:

auto xpath-form-inputs "//form[@name='my-form-name']" page.html

Then you need to write selectors for the inputs that need to be set (all of them in the case of cheerio; with the browser-backed engines, the abstraction usually handles those that are prefilled):

<set
    form="<form-selector>"
    target="<input-name>"
    variable="<incoming-value-name>"
></set>

4) Submit the filled form

You just need to target the form element with:

<go form="<form-selector>">
    <!-- extraction logic to go here -->
</go>

5) Save the form results URL

Here you'll need to manually use your browser: go to the submitted page and save the HTML by opening the inspector, copying the HTML from the root element, then pasting it into a file.

6) Save result set groups

Now we need to look for rows with something like:

auto xpath "//ul|//ol|//tbody" page.html

Once you settle on a selector for the correct element, add a selector to the definition:

<set xpath="<xpath-selector>" variable="matches">
    <!--more selected fields here -->
</set>

7) Save result set fields

Last, we need to look for individual fields using something like:

auto xpath "//li|//tr" page_fragment.html

Once you settle on a selector for the correct element, add a selector to the definition:

<set xpath="<xpath-selector>" variable="matches">
    <set
        xpath="<xpath-selector>"
        variable="<field-name>"
    ></set>
    <!--more selected fields here -->
</set>

To target the output, emit the variables you want; otherwise it will dump everything in the environment.

From this you should be able to construct a primitive scrape definition (see the examples below for more concrete instruction). Once you have this definition you can do sample scrapes with:

auto scrape my-definition.auto.xml --data '{"JSON":"data"}'
#TODO on CLI, but already working in the API call options
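
The --data flag above is the part still marked TODO; per that note, the same input already works through the API call options. The exact option shape isn't documented here, so this is only a hypothetical sketch: the third argument and its data key are assumptions, not a confirmed signature.

// hypothetical sketch only: the options argument and its data key are assumed
const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/jsdom-mining-engine');

(async ()=>{
    let results = await Automaton.scrape(
        'my-definition.auto.xml',
        new MiningEngine(),
        { data: { JSON: 'data' } } // assumed shape; check the package for the real options
    );
    console.log(results);
})();
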
8 - ∞) Wait, it's suddenly broken!!

The most frustrating thing about scrapers is that, because they are tied to the structural representation of the presentation, which is designed to change, they will inevitably break. While this is frustrating, using the provided tools on fresh fetches of the pages in question will quickly highlight what's failing. Usually:

  1. The URL has changed, requiring an update to the definition,
  2. The page structure has changed, requiring one or more selectors to be rewritten,
  3. The site has changed its delivery architecture, requiring you to use a more expensive engine (computationally: cheerio < jsdom < puppeteer, playwright); see the sketch below.
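
For the third case the definition itself usually doesn't need to change; you only swap the engine, as in this sketch (assuming the same definition.xml used earlier):

// same definition, heavier engine: swap cheerio for puppeteer
const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/puppeteer-mining-engine');

(async ()=>{
    let results = await Automaton.scrape('definition.xml', new MiningEngine());
    console.log(results);
})();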

Examples of building scrapers:

Deploying a Scraper

[TBD]

Publishing a Definition (Soon™)

First, create a directory whose name describes the site we're fetching and the work we're doing, and which ends with .auto; let's call this one some-site-register.auto. Once created, go into that directory.

Automaton definitions are published as usable Node.js npm modules, though making and maintaining them does not require any JavaScript. You'll need your own npm credentials to publish.

1) Create the repository

Once in the directory, let's run

auto init ../some/path/some-site-register.auto

If a definition is not provided, a blank one will be initialized.

2) Configure the repository

You'll need to install the engine you want to use by default:

# we are choosing to default to JSDOM
npm install @open-automaton/jsdom-mining-engine

Then add an entry to package.json for the default engine:

{
    "defaultAutomatonEngine" : "@open-automaton/jsdom-mining-engine"
}

3) Publish the repository

Publishing is the standard:

npm publish

Before publishing, please consider updating the README to describe your incoming data requirements.

Once it's all set up, you have a bunch of features out of the box.

Testing

npm run test

Scraping

You can run your definition with

npm run scrape '{"JSON":"data"}'

Definition Path

You can reference the definition directly (in parent projects) with:

let xmlPath = require('some-site-register.auto').xml;

which is short for:

path.join(
    path.dirname(require.resolve('some-site-register.auto')),
    'src',
    'some-site-register.auto.xml'
)
// ./node_modules/some-site-register.auto/src/some-site-register.auto.xml

The top level Automaton.scrape() function knows how to transform some-site-register.auto into that, so you can just use the shorthand there.
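
In other words, once published, a sketch like this should work with the top-level API (the engine choice is up to you; jsdom is used here only as an example):

// sketch: pass the published module name instead of an explicit xml path
const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/jsdom-mining-engine');

(async ()=>{
    let results = await Automaton.scrape(
        'some-site-register.auto', // resolved to the bundled definition
        new MiningEngine()
    );
    console.log(results);
})();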

You can include your scraper (once published) with:

let MyScraper = require('some-site-register.auto');
MyScraper.scrape(automatonEngine);
// or MyScraper.scrape(); to use the default engine

About Automaton

View the development roadmap.

Read a little about where this came from.

Testing

You can run the mocha test suite with:

    npm run test

Enjoy,

-Abbey Hawk Sparrow
