A web scraping/RPA solution for ease of use, maintenance and (soon™) deployment. It uses an XML-based DSL which defines both the scraping process and the structure of the returned data. It compares favorably to UiPath, Blue Prism ALM, Kapow (now Kofax RPA) and Apify. These solutions make the work of building and maintaining scrapers dramatically easier than directly using a primary scraping library (like Playwright, Puppeteer, jsdom, Cheerio, Selenium, Windmill, BeautifulSoup or others).
Here we're going to do a simple scrape of unprotected data on Craigslist (you should use their available RSS feed instead, but it serves as an excellent example of how to harvest results and works in all the engines):
<go url="https://sfbay.craigslist.org/search/apa">
<set xpath="//li[@class='result-row']" variable="matches">
<set
xpath="//time[@class='result-date']/text()"
variable="time"
></set>
<set
xpath="//span[@class='result-price']/text()"
variable="price"
></set>
<set
xpath="//span[@class='housing']/text()"
variable="housing"
></set>
<set
xpath="string(//img/@src)"
variable="link"
></set>
</set>
<emit variables="matches"></emit>
</go>
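If you saved this definition as, say, craigslist-apartments.auto.xml (a hypothetical filename), you could run it with the CLI installed below:

auto scrape craigslist-apartments.auto.xml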
In Code
(see "Definition Path" below for using definitions in code)

CLI
npm install -g automaton-cli
auto --help

GUI
[TBD]
The automaton DSL is centered around three actions (go, emit and set) which navigate the target site and populate the returned dataset. Many attributes are common to all elements, and most common use cases are covered.
go
A progression from page to page, either by loading a url, submitting a form or clicking a UI element, depending on which attributes are provided. Some engines that use the browser will only submit using the form configuration on the page and ignore the method and type attributes.

<go
    url="https://domain.com/path/"
    form="form-name"
    method="post"
    type="application/json"
></go>
emit
Emit a value to the returned dataset and optionally post that value to a remote url.

<emit
    variables="some,variables"
    remote="https://domain.com/path/"
></emit>
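The third action, set, appears throughout the examples here: it selects a value with an xpath (or targets a form input) and stores it under a variable name, and nested sets extract fields within each match. A minimal sketch, reusing the Craigslist selectors from above:

<set xpath="//li[@class='result-row']" variable="matches">
    <set
        xpath="//span[@class='result-price']/text()"
        variable="price"
    ></set>
</set>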
Here's a basic process for scraping data behind a simple form.
First you'll want to understand xpath (and probably the DOM, regex and css selectors) before we proceed, as most of the selectors in a good definition are xpath that is as general as possible. Once you're done with that, the steps below will be much easier to follow.
1) Save the form url
You want to scrape the *state* of the DOM once the page is loaded, but if you use a tool like curl you'll only get the *transfer state* of the page, which is probably not useful. auto fetch pulls the state of the DOM out of a running browser and displays that HTML.

auto fetch https://domain.com/path/ > page.html
2) Target the form
The first thing you might do against the HTML you've captured is pull all the forms out of the page, like this:

auto xpath "//form" page.html
3) Target the inputs
Assuming you've identified the name of the form you are targeting, list its inputs with:

auto xpath-form-inputs "//form[@name='my-form-name']" page.html

Then you need to write selectors for the inputs that need to be set (all of them in the case of cheerio, but otherwise the browser abstraction usually handles those that are prefilled):

<set
    form="<form-selector>"
    target="<input-name>"
    variable="<incoming-value-name>"
></set>
4) Submit the filled form
You just need to target the form element with:

<go form="<form-selector>">
    <!-- extraction logic to go here -->
</go>
5) Save the form results url
Here you'll need to manually go to the submitted page in your browser and save the HTML: open the inspector, copy the HTML from the root element, then paste it into a file.
6) Save result set groups
Now we need to look for rows with something like:

auto xpath "//ul|//ol|//tbody" page.html

Once you settle on a selector for the correct element, add a selector in the definition:

<set xpath="<xpath-selector>" variable="matches">
    <!-- more selected fields here -->
</set>
7) Save result set fields
Last, we need to look for individual fields using something like:

auto xpath "//li|//tr" page_fragment.html

Once you settle on a selector for the correct element, add a selector in the definition:

<set xpath="<xpath-selector>" variable="matches">
    <set
        xpath="<xpath-selector>"
        variable="<field-name>"
    ></set>
    <!-- more selected fields here -->
</set>

To target the output, emit the variables you want; otherwise it will dump everything in the environment.
From this you should be able to construct a primitive scrape definition (see the examples below for more concrete instruction). Once you have this definition you can do sample scrapes with:

auto scrape my-definition.auto.xml --data '{"JSON":"data"}'
# TODO on CLI, but already working in the API call options
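Putting steps 3 through 7 together, a primitive definition body might look like the sketch below. The selectors and names are placeholders, and you will still need a go that loads the page containing the form in the first place, as in the Craigslist example above:

<set
    form="<form-selector>"
    target="<input-name>"
    variable="<incoming-value-name>"
></set>
<go form="<form-selector>">
    <set xpath="<row-xpath-selector>" variable="matches">
        <set
            xpath="<field-xpath-selector>"
            variable="<field-name>"
        ></set>
    </set>
    <emit variables="matches"></emit>
</go>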
8 - ∞) Wait, it's suddenly broken!!
The most frustrating thing about scrapers is that, because they are tied to the structural representation of the presentation, which is designed to change, they will inevitably break. While this is frustrating, using the provided tools on fresh fetches of the pages in question will quickly highlight what's failing.
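For example, re-fetching the page and re-running a selector from your definition (here, the Craigslist ones from above) will show whether it still matches anything:

auto fetch https://sfbay.craigslist.org/search/apa > fresh.html
auto xpath "//li[@class='result-row']" fresh.html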
Examples of building scrapers:
[TBD]
First, create a directory that describes the site we're fetching and the work we're doing, and ends with .auto; let's call this one some-site-register.auto. Once created, let's go into that directory.
Automaton definitions are published as usable Node.js npm modules, though making and maintaining them does not require any JavaScript. You'll need your own npm credentials to publish.
1) Create the repository
Once in the directory, let's run:

auto init ../some/path/some-site-register.auto

If a definition is not provided, a blank one will be initialized.
2) Configure the repository
You'll need to install the engine you want to use by default:

# we are choosing to default to JSDOM
npm install @open-automaton/jsdom-mining-engine

then add an entry to the repository's configuration:

{
    "defaultAutomatonEngine" : "@open-automaton/jsdom-mining-engine"
}
3) Publish the repository
Publishing is standard:

npm publish

Before publishing, please consider updating the README to describe your incoming data requirements.
Once it's all set up, you have a bunch of features out of the box. You can run the tests with:

npm run test

and you can run your definition with:

npm run scrape '{"JSON":"data"}'
Definition Path
You can reference the definition directly (in parent projects) at:

let xmlPath = require('some-site-register.auto').xml;

which is short for:

path.join(
    path.dirname(require.resolve('some-site-register.auto')),
    'src',
    'some-site-register.auto.xml'
)
// ./node_modules/some-site-register.auto/src/some-site-register.auto.xml
The top level Automaton.scrape() function knows how to transform some-site-register.auto into that, so you can just use the shorthand there.
You can include your scraper (once published) with:
let MyScraper = require('some-site-register.auto');
MyScraper.scrape(automatonEngine);
// or MyScraper.scrape(); to use the default engine
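For example, wiring in the JSDOM engine installed in step 2 might look like the following sketch; treating the engine module's export as something you can hand straight to scrape() is an assumption here, so check the engine's own README for its exact calling convention:

let JSDOMEngine = require('@open-automaton/jsdom-mining-engine');
let MyScraper = require('some-site-register.auto');

// assumption: the engine export is passed through as the automatonEngine argument
MyScraper.scrape(JSDOMEngine);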
View the development roadmap.
Read a little about where this came from.
You can run the mocha test suite with:
npm run test
Enjoy,
-Abbey Hawk Sparrow