html-parse

html-parse is an efficient, reasonably robust HTML tokenizer based on the HTML5 tokenization specification. The parser is written using the fast attoparsec parsing library and exposes both a native attoparsec Parser and convenience functions for lazily parsing token streams out of strict and lazy Text values.

For instance,

>>> parseTokens "<div><h1>Hello World</h1><br/><p class=widget>Example!</p></div>"
[TagOpen "div" [],TagOpen "h1" [],ContentText "Hello World",TagClose "h1",TagSelfClose "br" [],TagOpen "p" [Attr "class" "widget"],ContentText "Example!",TagClose "p",TagClose "div"]
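
Once you have a token list, ordinary list processing is usually enough to pull information out of it. The following is a minimal sketch of that idea, not part of the library: to stay self-contained it defines a stand-in `Token` type that models only the constructors visible in the output above and uses `String` where the real library works with Text values. The helper names `textContent` and `attrValues` are hypothetical.

```haskell
-- Stand-in types (assumption): only the constructors that appear in the
-- example output are modelled, and String replaces Text so that the
-- sketch needs nothing beyond base.
data Attr = Attr String String
  deriving (Show, Eq)

data Token
  = TagOpen String [Attr]
  | TagSelfClose String [Attr]
  | TagClose String
  | ContentText String
  deriving (Show, Eq)

-- The token list from the example above, written out by hand.
exampleTokens :: [Token]
exampleTokens =
  [ TagOpen "div" []
  , TagOpen "h1" []
  , ContentText "Hello World"
  , TagClose "h1"
  , TagSelfClose "br" []
  , TagOpen "p" [Attr "class" "widget"]
  , ContentText "Example!"
  , TagClose "p"
  , TagClose "div"
  ]

-- Collect every piece of text content, in document order.
textContent :: [Token] -> [String]
textContent ts = [t | ContentText t <- ts]

-- Collect the values of the named attribute from opening and
-- self-closing tags, in document order.
attrValues :: String -> [Token] -> [String]
attrValues name = concatMap valuesOf
  where
    valuesOf (TagOpen _ as)      = pick as
    valuesOf (TagSelfClose _ as) = pick as
    valuesOf _                   = []
    pick as = [v | Attr n v <- as, n == name]
```

Loaded into GHCi, the helpers behave like this:

>>> textContent exampleTokens
["Hello World","Example!"]
>>> attrValues "class" exampleTokens
["widget"]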

Performance

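Here are some typical performance numbers taken from parsing a fairly long Wikipedia article. In this run, the mean time for html-parse (about 21 ms) is roughly eight times lower than the mean times for tagsoup (about 172 ms and 181 ms):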

benchmarking Forced/tagsoup fast Text
time                 171.2 ms   (166.4 ms .. 177.3 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 171.9 ms   (169.4 ms .. 173.2 ms)
std dev              2.516 ms   (1.104 ms .. 3.558 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Forced/tagsoup normal Text
time                 176.9 ms   (167.3 ms .. 188.5 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 180.7 ms   (177.5 ms .. 183.7 ms)
std dev              4.246 ms   (2.316 ms .. 5.803 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/html-parser
time                 20.88 ms   (20.60 ms .. 21.25 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 20.99 ms   (20.81 ms .. 21.20 ms)
std dev              446.1 μs   (336.4 μs .. 596.2 μs)
