ssc-gen is a Python-based DSL for writing HTML parsers in a dataclass style and converting them to a target language.
The project addresses the following problems:
- boilerplate code
- generating types (type annotations) and documentation
- simplifying code maintenance
- portability to other languages
Currently supported converters:
Language | Library (html parser backend) | XPath Support | CSS Support | Generated types | Code formatter |
---|---|---|---|---|---|
Python (3.8+) | bs4 | N | Y | TypedDict*, list, dict | ruff |
... | parsel | Y | Y | ... | - |
... | selectolax (modest) | N | Y | ... | - |
... | scrapy (possibly use parsel - pass Response.selector object) | Y | Y | ... | - |
Dart (3) | universal_html | N | Y | record, List, Map | dart format |
js (ES6) | pure (firefox/chrome) | Y | Y | Array, Map** | - |
go (1.10+) | goquery | N | Y | struct(json anchors include), array, map | gofmt |
- *This annotation type was deliberately chosen as a compromise.
  Python has many ways of serializing data: dataclass, namedtuple, attrs, pydantic.
  - TypedDict is like the built-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.
- **JavaScript has no built-in serialization methods.
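The adapter mentioned above can be sketched as follows. This is a minimal illustration, not code produced by ssc-gen: `HelloWorldDict`, `Page` and `to_page` are hypothetical names, and the TypedDict shape is assumed to match a schema with a `title` string and a list of `a_hrefs`.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class HelloWorldDict(TypedDict):
    """Hypothetical shape of a parser result (assumed, not generated)."""
    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class Page:
    """Application-side structure that the TypedDict is adapted to."""
    title: str
    links: List[str]


def to_page(data: HelloWorldDict) -> Page:
    # A TypedDict is a plain dict at runtime, so ordinary key access works.
    return Page(title=data["title"], links=list(data["a_hrefs"]))


result: HelloWorldDict = {
    "title": "Example Domain",
    "a_hrefs": ["https://www.iana.org/domains/example"],
}
page = to_page(result)
print(page)
```

The same pattern applies to namedtuple, attrs or pydantic: the parser result stays a plain dict, and the conversion lives in one small function.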
For maximum portability of the configuration to the target language:
- Use CSS selectors: they are guaranteed to be converted to XPath.
- Unlike JavaScript, most HTML parsing libraries implement the CSS3 selectors standard, and not always in full:
  - basic selectors: tag, .class, #id
  - combined: div p, ul > li, h2 + p [1]
  - attribute: a[href], input[type='text'] [2]
  - pseudo-classes: :nth-child(n), :first-child, :last-child [3]
  - more complex, dynamic selectors are often not supported: :has(), :nth-of-type(), :where(), :is()

[1] Several libraries do not support the + combinator (e.g. selectolax (modest), dart.universal_html).
[2] Web scraping libraries often do not support attribute operators such as *=, ~=, |=, ^= and $=.
[3] Several libraries do not support pseudo-classes (e.g. the standard dart.html library lacks this feature). This project will not implement converters for libraries with such limitations.
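The portability notes above can be turned into a quick check. The sketch below is not part of ssc-gen: `portability_warnings` is a hypothetical helper, and its regular expressions only approximate CSS syntax. It flags the features that the notes describe as poorly supported.

```python
import re
from typing import List

# Hypothetical patterns for selector features that some backends lack.
_COMPLEX_PSEUDO = re.compile(r":(?:has|nth-of-type|where|is)\(")
_ATTR_OPERATOR = re.compile(r"[*~|^$]=")
_QUOTED = re.compile(r"'[^']*'|\"[^\"]*\"")
_PAREN_ARGS = re.compile(r"\([^()]*\)")


def portability_warnings(selector: str) -> List[str]:
    """Return the names of selector features that may not be portable."""
    # Drop quoted strings so characters inside attribute values
    # (for example a '+' in a URL) are not mistaken for selector syntax.
    text = _QUOTED.sub("", selector)
    warnings = []
    if _COMPLEX_PSEUDO.search(text):
        warnings.append("complex pseudo-class")
    if _ATTR_OPERATOR.search(text):
        warnings.append("attribute operator")
    # Drop parenthesized arguments so the '+' in :nth-child(2n+1)
    # is not reported as a sibling combinator.
    if "+" in _PAREN_ARGS.sub("", text):
        warnings.append("adjacent sibling combinator")
    return warnings


for css in ["ul > li", "h2 + p", "a[href^='https']", "li:nth-child(2n+1)"]:
    print(css, "->", portability_warnings(css))
```

An empty list means the selector uses only the basic, combined and simple attribute forms listed above.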
ssc-gen requires Python 3.10 or higher.
pip:
pip install ssc_codegen
uv:
uv pip install ssc_codegen
As a CLI converter tool:
package manager | command |
---|---|
pipx | pipx install ssc_codegen |
uv | uv tool install ssc_codegen |
from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
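To make clear what this schema describes, here is a hand-written sketch that extracts the same two fields with only the Python standard library. It is not the code ssc-gen generates: `HelloWorldSketch` and `parse_hello_world` are hypothetical names, and recording None for an `a` tag without `href` is a choice made for this sketch.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class HelloWorldSketch(HTMLParser):
    """Hand-written stand-in for the two fields of the HelloWorld schema."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._title_seen = False
        self.title = ""
        self.a_hrefs: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._title_seen:
            # Only the first <title> is used, like css('title').
            self._in_title = True
        elif tag == "a":
            # One entry per <a> tag, like css_all('a').attr('href').
            self.a_hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_seen = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_hello_world(html: str) -> Dict[str, object]:
    parser = HelloWorldSketch()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "a_hrefs": parser.a_hrefs}


sample = "<html><head><title>Hi</title></head><body><a href='/a'>x</a></body></html>"
print(parse_hello_world(sample))
```

The schema expresses the same extraction in two declarative lines, which is the boilerplate the project aims to remove.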
Note
These tools are intended for testing purposes, not for web scraping.
Warning
DO NOT PASS CONFIGS FROM UNKNOWN SOURCES:
PYTHON CODE FROM CONFIGS IS COMPILED AT RUNTIME WITHOUT SECURITY CHECKS!
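The reason for this warning can be shown with a small, harmless demonstration. It assumes only what the warning states, that config source is compiled and executed as Python; the actual ssc-gen loader is not reproduced here, and `untrusted_config` and `side_effects` are made-up names.

```python
# Illustration only: a "config" that does more than define a schema class.
untrusted_config = '''
side_effects.append("code ran at load time")

class HelloWorld:
    pass
'''

side_effects = []
namespace = {"side_effects": side_effects}

# compile() + exec() run every top-level statement in the source,
# not just the class definitions the loader is interested in.
exec(compile(untrusted_config, "schema.py", "exec"), namespace)

print(side_effects)  # prints ['code ran at load time']
```

A real malicious config could delete files or send data over the network in the place of that harmless `append` call.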
Download any HTML file and pass it as an argument:
ssc-gen parse-from-file index.html -t schema.py:HelloWorld
Short option descriptions:
- -t, --target: the config schema file and the class from which the parser starts
ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld
Note
If the script cannot find the Chrome executable, provide its path manually:
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium
Convert to code for use in projects:
Note
This example uses js because the result can be tested quickly in the browser developer console.
ssc-gen js schema.py -o .
Code output looks like this (code formatted by IDE):
// autogenerated by ssc-gen DO NOT_EDIT
/**
 *
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }
 */
class HelloWorld {
    constructor(doc) {
        if (typeof doc === 'string') {
            this._doc = new DOMParser().parseFromString(doc, 'text/html');
        } else if (doc instanceof Document || doc instanceof Element) {
            this._doc = doc;
        } else {
            throw new Error("Invalid input: Expected a Document, Element, or string");
        }
    }

    _parseTitle(value) {
        let value1 = value.querySelector('title');
        return typeof value1.textContent === "undefined" ? value1.documentElement.textContent : value1.textContent;
    }

    _parseAHrefs(value) {
        let value1 = Array.from(value.querySelectorAll('a'));
        return value1.map(e => e.getAttribute('href'));
    }

    parse() {
        return {title: this._parseTitle(this._doc), a_hrefs: this._parseAHrefs(this._doc)};
    }
}
Print output:
alert(JSON.stringify((new HelloWorld(document).parse())))
You can use any html source:
- read from html file
- get from http request
- get from browser (playwright, selenium, chrome-cdp)
- paste code to developer console (js)
- or call curl in shell and parse stdin
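Several of the sources listed above can be covered by one small loader. The sketch below uses only the Python standard library; `read_html` is a hypothetical helper, not part of ssc-gen, and the convention of passing "-" for stdin is an assumption made for this example.

```python
import sys
from pathlib import Path
from urllib.request import urlopen


def read_html(source: str) -> str:
    """Return HTML text from "-" (stdin), an http(s) URL, or a file path."""
    if source == "-":
        # For example: curl -s https://example.com | python script.py -
        return sys.stdin.read()
    if source.startswith(("http://", "https://")):
        with urlopen(source) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    return Path(source).read_text(encoding="utf-8")
```

The returned string can then be handed to whichever parser class was generated for the target language.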