ski

ski is a tool written in Go for extracting structured data from websites using extensible YAML syntax rules.

Description

ski uses YAML to define data-extracting Executors, which are executed sequentially like a pipeline.
Here's a simple example that extracts the title and author of selected books from an HTML document.

$gq.elements: .books .select
$each:
  $map:
    title:
      $gq: .title
    author:
      $gq: .author

Output:

[{"title":"Book 1","author":"Author 1"},{"title":"Book 2","author":"Author 2"}]

Executors

Built-in

fetch

$fetch fetches a resource from the network; the default method is GET.

$fetch: https://example.com

kind

$kind converts the argument to the specified type.

$raw: 123
$kind: int

list.of

$list.of returns a list of Executor results.

$list.of:
  - 123
  - 456

str.join

$str.join joins strings with the specified separator.

$list.of:
  - 123
  - 456
$str.join: ~

str.split

$str.split splits a string with the specified separator.

$raw: 123~456
$str.split: ~

map

$map returns a map of Executor results. Arguments are given as alternating keys and values: [k1, v1, k2, v2, ...]

$map:
  - 123
  - 456
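The alternating key/value layout can be sketched in Go as follows. pairsToMap is a hypothetical helper written for this illustration; stringifying the keys and rejecting an odd number of arguments are assumptions, since the behaviour of ski in those cases is not stated here.

```go
package main

import "fmt"

// pairsToMap builds a map from a flat [k1, v1, k2, v2, ...] list.
// Keys are stringified with fmt.Sprint (an assumption of this sketch).
func pairsToMap(pairs []any) (map[string]any, error) {
	if len(pairs)%2 != 0 {
		return nil, fmt.Errorf("odd number of arguments: %d", len(pairs))
	}
	m := make(map[string]any, len(pairs)/2)
	for i := 0; i < len(pairs); i += 2 {
		m[fmt.Sprint(pairs[i])] = pairs[i+1]
	}
	return m, nil
}

func main() {
	m, err := pairsToMap([]any{123, 456})
	if err != nil {
		panic(err)
	}
	fmt.Println(m) // map[123:456]
}
```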

each

$each loops over the slice argument and executes the Executor on each element.

$list.of:
  - 123
  - 456
$each:
  $kind: int

or

$or executes a slice of Executors and returns the first result that is not nil.

$or:
  - $raw:
  - 456
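A minimal Go sketch of this fallback behaviour, assuming the alternatives are tried in order and evaluation stops at the first non-nil result. firstNonNil is a hypothetical helper, not part of ski.

```go
package main

import "fmt"

// firstNonNil calls each step in order and returns the first non-nil result.
// It returns nil when every step yields nil.
func firstNonNil(steps ...func() any) any {
	for _, step := range steps {
		if v := step(); v != nil {
			return v
		}
	}
	return nil
}

func main() {
	empty := func() any { return nil }    // like `$raw:` with no value
	fallback := func() any { return 456 } // like the literal 456
	fmt.Println(firstNonNil(empty, fallback)) // 456
}
```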

Control flow

Keep only the strings that contain "2" and convert them to int. Output: [123, 234]

$list.of:
  - 123
  - 234
  - 345
$each:
  $pipe:
    $if.contains: 2
    $kind: int
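The same filter-then-convert logic, written out in Go with the standard library only. filterAndConvert is a hypothetical helper for this sketch; it mirrors what the rule above produces rather than how ski evaluates it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// filterAndConvert keeps the items containing substr and converts them to int.
func filterAndConvert(items []string, substr string) ([]int, error) {
	var out []int
	for _, s := range items {
		if !strings.Contains(s, substr) {
			continue // dropped, like a failed $if.contains
		}
		n, err := strconv.Atoi(s)
		if err != nil {
			return nil, err
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	got, err := filterAndConvert([]string{"123", "234", "345"}, "2")
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // [123 234]
}
```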

Keep only the strings that match "bar". Output: {"bar": "some value"}

$list.of:
  - foo
  - bar
  - baz
$map:
  $if.contains: bar
  $raw: some value

Expression

  • gq: similar to jQuery expressions.
  • jq: JSONPath expressions.
  • js: JavaScript expressions.
  • regex: regular expressions.
  • xpath: XPath expressions.

gq

gq syntax consists of a selector followed by functions, separated by ->.
$gq returns the text of the elements matched by the selector. If exactly one node matches, it returns that single result.

$gq: .books .title -> text

$gq.element returns the first element of the selector.

$gq.element: .books .select

$gq.elements returns all elements of the selector.

$gq.elements: .books

jq

$jq returns the value of the JSONPath expression.

$jq: $.books[0].author
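To make the path concrete, here is what $.books[0].author selects, written with encoding/json instead of a JSONPath engine. The sample document and the firstAuthor helper are assumptions made up for this sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// library mirrors only the part of the document that the path touches.
type library struct {
	Books []struct {
		Author string `json:"author"`
	} `json:"books"`
}

// firstAuthor returns the value that $.books[0].author points at.
func firstAuthor(doc []byte) (string, error) {
	var lib library
	if err := json.Unmarshal(doc, &lib); err != nil {
		return "", err
	}
	if len(lib.Books) == 0 {
		return "", errors.New("no books in document")
	}
	return lib.Books[0].Author, nil
}

func main() {
	// Hypothetical input document.
	doc := []byte(`{"books":[{"author":"Author 1"},{"author":"Author 2"}]}`)
	author, err := firstAuthor(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(author) // Author 1
}
```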

js

$js returns the value of the JavaScript expression.

$js: export default (ctx) => ctx.get('content')

regex

Available flags:

  • i Ignore case
  • m Multiple line
  • n Explicit capture
  • c Compiled
  • s Single line
  • x Ignore pattern whitespace
  • r Right to left
  • d Debug
  • e ECMAScript
  • u Unicode

$regex.replace takes /expr/replace/flags{start,count} and replaces matches of the pattern in the string.

$regex.replace: /[^\d]/

$regex.match takes /expr/flags{start,count} and returns the matches of the pattern in the string.

$regex.match: /\\//1

$regex.assert takes /expr/message/flags and asserts that the string matches the pattern.

$regex.assert: /\d+/number not found/
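The replace and assert patterns above can be tried out with Go's standard regexp package, as in the sketch below. This is an assumption for illustration only: the standard package has no equivalent for several of the listed flags (for example right-to-left matching), so it is not presented as the engine ski uses. digitsOnly and assertMatch are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

var (
	nonDigit = regexp.MustCompile(`[^\d]`) // pattern from the $regex.replace example
	digits   = regexp.MustCompile(`\d+`)   // pattern from the $regex.assert example
)

// digitsOnly removes every non-digit character, i.e. replaces it with "".
func digitsOnly(s string) string {
	return nonDigit.ReplaceAllString(s, "")
}

// assertMatch returns an error carrying msg when s does not match re.
func assertMatch(re *regexp.Regexp, s, msg string) error {
	if !re.MatchString(s) {
		return errors.New(msg)
	}
	return nil
}

func main() {
	fmt.Println(digitsOnly("ISBN 978-1"))                               // 9781
	fmt.Println(assertMatch(digits, "no digits", "number not found")) // number not found
}
```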

xpath

$xpath returns the text of the elements matched by the XPath expression. If exactly one node matches, it returns that single result.

$xpath: //div//p

$xpath.element returns the first element of the XPath expression.

$xpath.element: //div//p

$xpath.elements returns all elements of the XPath expression.

$xpath.elements: //div//p

Usage

package main

import (
	"context"
	"fmt"

	"github.com/shiroyk/ski"
)

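// content is the document to extract data from.
const content = `...`

// source is the YAML rule that defines the Executors.
const source = ``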

func main() {
	executor, err := ski.Compile(source)
	if err != nil {
		panic(err)
	}

	result, err := executor.Exec(context.Background(), content)
	if err != nil {
		panic(err)
	}
	fmt.Println(result)
}

License

ski is distributed under the MIT license.
