WebDataFetcher

Simple yet flexible web data fetcher that allows you to run scraping jobs periodically to extract data from endpoints and send it to various destinations (database, file, print to console, etc.).

Features

  • Scraping jobs defined in YAML
  • Reshaping of fetched data with a transformation pipeline
  • Comprehensive logging
  • Scrape not only HTML, but also CSS, JSON and plain text data

Anatomy of a scraping job

The work of a scraping job is to fetch some data from a specified endpoint, transform it (optional) to shape it the way you need, and send it to one or more destinations (store it to a database, print it to console, etc.).

A scraping job defines:

  • A schedule: when to run the job
  • An input: what endpoint to scrape
  • One or more outputs: where to send the data

It also lets you specify input and output transformations to reshape your data the way you want before it is sent to the output destinations.
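
Put together, a minimal job skeleton could look like the sketch below; the job name, schedule and URL are placeholders, and myConsole refers to a console destination configured as shown in the next section:

jobs:
  myJob: # Arbitrary job name

    # When to run the job
    schedule:
      cron: "0 * * * *"

    # What endpoint to scrape
    input:
      url: https://example.com/data

    # Where to send the data
    outputs:
      - to: myConsole # Name of a configured destination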

Destinations

A destination is a place to send the data to, for instance a database. The following destination types are currently available:

  • console : Print to console
  • influxdb : InfluxDB v2 time series database
  • victoriaMetrics : Victoria Metrics database

To use a destination, it must first be configured:

# myConfig.yaml

destinations:

  # Console
  myConsole: # Arbitrary destination name
    type: console

  # InfluxDB
  myInfluxdb:
    type: influxdb
    options:
      url: http://influxdb:8086
      token: <your influxdb token>
      organisation: myOrganisation
      bucket: <your destination bucket>

  # Victoria Metrics
  myVictoriaMetrics:
    type: victoriaMetrics
    options:
      url: http://victoriametrics:8428

Once defined, destinations can be used as outputs in scraping jobs. Let's create a job:

jobs:
  # Let's track the price of an SSD drive
  ssdPrice: # Arbitrary job name

    # When to run the job
    schedule:
      cron: "0 0 * * *"
    
    # What to fetch
    input:
      url: https://www.newegg.ca/samsung-970-evo-plus-1tb/p/N82E16820147743

      # Transform the data
      transformations:
        
        # Extract the price from the html content
        - name: htmlScrape
          options:
            template:
              price: .price-current # css selector

        # Print the data to console
        - name: print   # We should get : { price: "$159.89" }
        
        # Remove the $ sign
        - name: replace
          target: price # from the "price" value
          options:
            search: $
            replaceWith: ""

        # Print the data to console
        - name: print   # We should get : { price: "159.89" }
        
        # Convert the price to a number
        - name: typecast
          target: price
          options:
            targetType: number

        # Print the data to console
        - name: print   # We should get: { price: 159.89 }

    # Where to send the data
    outputs:
      # Write data to Influxdb v2
      - to: myInfluxdb  # The destination name
        options:
          # InfluxDB specific options (create a point)
          measurement: price
          tags:
            product: "SSD"
            model: "Samsung 1Tb"
          fields:
            price: ${price} # Use json path to specify the target

      # Write data to victoria metrics
      - to: myVictoriaMetrics
        options:
          metrics:
            - name: price
              labels:
                product: "SSD"
                model: "Samsung 1Tb"
              value: ${price}

Transformations

Transformations allow you to manipulate the data before it is sent to the output. They can be applied to the input or to any of the outputs. Input transformations are applied before the data is sent to the outputs, whereas output transformations are applied just before sending the data to the corresponding output destination. This gives you the flexibility to specify different transformations for each output.

[Diagram: anatomy of a scraping job]
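
For example, here is a sketch of a job where an input transformation is shared by all outputs and one output applies extra transformations of its own. This assumes an output entry accepts a transformations list alongside to and options, mirroring the input syntax; the destination names are the ones configured above:

jobs:
  priceExample: # Arbitrary job name
    schedule:
      cron: "0 0 * * *"

    input:
      url: https://example.com/product
      # Input transformations: applied once, before the data goes to any output
      transformations:
        - name: htmlScrape
          options:
            template:
              price: .price-current

    outputs:
      # Receives the data exactly as shaped by the input transformations
      - to: myConsole

      # Applies its own transformations just before writing to InfluxDB
      - to: myInfluxdb
        transformations:
          - name: replace
            target: price
            options:
              search: $
              replaceWith: ""
          - name: typecast
            target: price
            options:
              targetType: number
        options:
          measurement: price
          fields:
            price: ${price}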

Refer to transformations.ts for implementation details.

regexReplace

transformations:
  - name: regexReplace

    # If the input data is in JSON, you need to specify
    # the json path of the value you want to affect
    target: <json path>
 
    options:

      # The regex pattern to match
      pattern: <regex pattern>

      # The output, where you can specify the group matches
      # in the regex pattern by using $1, $2, $3, ...
      output: <output pattern>

# Example
transformations:
  - name: regexReplace
    options:
      pattern: "(\w) (\w)"
      output: "Hello Mr. $2, or should I call you $1 ?"

# Input value:       "James Bond"
# Output value:      "Hello Mr. Bond, or should I call you James ?"

This transformation is equivalent to calling outputValue = inputValue.replace(new RegExp(pattern), output) in JavaScript.
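
As a sketch of the optional target on a JSON input (the fullName field name is hypothetical, used only for illustration):

transformations:
  - name: regexReplace
    target: fullName
    options:
      pattern: '(\w+) (\w+)'
      output: "$2, $1"

# Input value:       { "fullName": "James Bond" }
# Output value:      { "fullName": "Bond, James" }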

replace

transformations:
  - name: replace
  
    # If the input data is in JSON, you need to specify
    # the json path of the value you want to affect
    target: <json path>
    
    # The specific transformation options
    options:

      # The substring to search
      search: <search value>

      # The value to replace the substring with
      replaceWith: <replace with this value>


# Example
transformations:
  - name: replace
    options:
      search: "abc"
      replaceWith: "ABC"

# Input value:       "abcdef abcdef"
# Output value:      "ABCdef ABCdef"

This transformation is equivalent to calling outputValue = inputValue.replaceAll(search, replaceWith) in JavaScript.

textToJson

Converts a text input value to JSON. This is equivalent to calling outputValue = JSON.parse(inputValue) in JavaScript.

transformations:
  - name: textToJson
  
    # You may optionally specify a target if you want to
    # transform a nested string field into json
    target: <json path>

# Example
transformations:
  - name: textToJson

# Input value:       '{ "hello": "world" }'      (text string)
# Output value:       { "hello": "world" }       (json)

restructure

This transformation allows you to extract data from a JSON input value and restructure it, changing field names.

transformations:
  - name: restructure
    
    # You may optionally specify a target if you want to
    # restructure a nested object instead of the whole input
    target: <json path>

    options:
      
      # The template represents the new object to create
      # by specifying the outputValue field names and the
      # json path to the value
      template:
        <field name>: <json path>

        # You can also build nested objects
        <field name>:
          <field name>: value

# Example
transformations:
  - name: restructure
    options:
      template:
        contactInfo:
          phone: $.phone
          email: $.email
        name:
          first: $.firstName
          last: $.lastName


# Input value:
# {
#    "firstName": "James",
#    "lastName": "Bond",
#    "address": "123 A Street"
#    "phone": "111-111-1111"
#    "email": "[email protected]"
# }
#
# Output value:
# {
#     "contactInfo": {
#       "phone": "111-111-111"
#       "email": "[email protected]"
#     },
#     "name": {
#       "first": "James",
#       "last": "Bond"
#     }
# }
#

count

This transformation allows you to count the number of elements in an array or the number of characters in a string.

transformations:
  - name: count
    
    # You may optionally specify a target if you want to
    # count a nested field instead of the whole input
    target: <json path>

# Example
transformations:
  - name: count
    target: members

# Input value:
# {
#    "members": ["John", "Max", "David", "Helen", "Jo"]
# }
#
# Output value:
# {
#     "members": 5
# }
#
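
Counting the characters of a string works the same way; a sketch with a hypothetical name field:

transformations:
  - name: count
    target: name

# Input value:
# {
#    "name": "James"
# }
#
# Output value:
# {
#     "name": 5
# }
#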

Other ideas for transforms

  • CsvToJson: Convert the given CSV to a JSON object
  • RenameField: Rename a field
  • SimpleEval: Resolve a math operation on one or multiple numeric fields ([fieldName] + [fieldName])
  • Scrape: Scrape the value as HTML, output the JSON result
  • sed: Equivalent of the sed command line tool

TODO

  • Capture exceptions during jobs (e.g. if the website changes, some data may be null and some transforms may throw exceptions)
  • Add support for intFields, booleanField (influxdb destination)?
  • Implement json path to specify target
  • Support composing new values from multiple source fields using JsonPath expressions in the restructure transform (e.g. output: "${$.price} - ${$.model}")
  • Use brackets with json path for target in transformations for more consistency?
  • Add a disable option for destinations to disable a specific destination
  • Add a disable option for outputs to disable a specific output (in jobs)?
