Selector Schema codegen

Introduction

ssc-gen - based-python DSL language for writing html parsers in dataclass style for converting to targeting language.

Project solving next problems:

boilerplate code
create types (type annotations) and documentation
simplify code support
portability to other languages

Support converters

Current support converters

Language	Library (html parser backend)	XPath Support	CSS Support	Generated types	Code formatter
Python (3.8+)	bs4	N	Y	TypedDict*, list, dict	ruff
...	parsel	Y	Y	...	-
...	selectolax (modest)	N	Y	...	-
...	scrapy (possibly use parsel - pass Response.selector object)	Y	Y	...	-
Dart (3)	universal_html	N	Y	record, List, Map	dart format
js (ES6)	pure (firefox/chrome)	Y	Y	Array, Map**	-
go (1.10+)	goquery	N	Y	struct(json anchors include), array, map	gofmt

*this annotation type was deliberately chosen as a compromise reasons. Python has many ways of serialization: dataclass, namedtuple, attrs, pydantic
- TypedDict is like a build-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.
**js not exists build-in serialization methods

Limitations

For maximum portability of the configuration to the target language:

Use CSS selectors: they are guaranteed to be converted to XPATH
Unlike javascript, most html parse libs implement CSS3 selectors standard
- basic selectors: (tag, .class, #id)
- combined: (div p, ul > li, h2 +p[1])
- attribute: (a[href], input[type='text'])[2]
- pseudo classes: (:nth-child(n), :first-child, :last-child)[3]
- often, not support more complex, dynamic styles: (:has(), :nth-of-type(), :where(), :is())

Several libs not support + operations (eg: selectolax(modest), dart.universal_html)
Often, web scraping libs not supports attribute operations like *=, ~=, |=, ^= and $=
Several libs not support pseudo classes (eg: standard dart.html lib miss this feature). This project will not implement converters with such a cons

Getting started

ssc_gen required python 3.10 version or higher

Install

pip:

pip install ssc_codegen

uv:

uv pip install ssc_codegen

as cli converter tool:

package manager	command
pipx	`pipx install ssc_codegen`
uv	`uv tool install ssc_codegen`

Example

Create a file `schema.py` with:

from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')

try it in cli

Note

this tools developed for testing purposes, not for web-scraping

from file

Warning

DO NOT PASS CONFIGS FROM UNKNOWN SOURCES:

PYTHON CODE FROM CONFIGS COMPILE IN RUNTIME WOUT SECURITY CHECKS!!!

Download any html file and pass as argument:

ssc-gen parse-from-file index.html -t schema.py:HelloWorld

Short options descriptions:

-t --target - config schema file and class from where to start the parser

from url

ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld

from Chromium browser (CDP protocol)

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld

Note

if script cannot found chrome executable - provide it manually:

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium

Convert to code

Convert to code for use in projects:

![note] for example, used js: it can be fast test in developer console

ssc-gen js schema.py -o .

Code output looks like this (code formatted by IDE):

// autogenerated by ssc-gen DO NOT_EDIT
/**
 *
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }
 */
class HelloWorld {
    constructor(doc) {
        if (typeof doc === 'string') {
            this._doc = new DOMParser().parseFromString(doc, 'text/html');
        } else if (doc instanceof Document || doc instanceof Element) {
            this._doc = doc;
        } else {
            throw new Error("Invalid input: Expected a Document, Element, or string");
        }
    }

    _parseTitle(value) {
        let value1 = value.querySelector('title');
        return typeof value1.textContent === "undefined" ? value1.documentElement.textContent : value1.textContent;
    }

    _parseAHrefs(value) {
        let value1 = Array.from(value.querySelectorAll('a'));
        return value1.map(e => e.getAttribute('href'));
    }

    parse() {
        return {title: this._parseTitle(this._doc), a_hrefs: this._parseAHrefs(this._doc)};
    }
}

copy code output and past to developer console:

Print output:

alert(JSON.stringify((new HelloWorld(document).parse())))

You can use any html source:

read from html file
get from http request
get from browser (playwright, selenium, chrome-cdp)
paste code to developer console (js)
or call curl in shell and parse stdin

Name		Name	Last commit message	Last commit date
Latest commit History 553 Commits
docs		docs
examples		examples
scripts		scripts
ssc_codegen		ssc_codegen
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Selector Schema codegen

Introduction

Support converters

Limitations

Getting started

Install

Example

Create a file `schema.py` with:

try it in cli

from file

from url

from Chromium browser (CDP protocol)

Convert to code

copy code output and past to developer console:

See also

About

Languages

License

vypivshiy/selector_schema_codegen

Folders and files

Latest commit

History

Repository files navigation

Selector Schema codegen

Introduction

Support converters

Limitations

Getting started

Install

Example

Create a file schema.py with:

try it in cli

from file

from url

from Chromium browser (CDP protocol)

Convert to code

copy code output and past to developer console:

See also

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

Create a file `schema.py` with: