html-parse is an efficient, reasonably robust HTML tokenizer based on the HTML5 tokenization specification.
The parser is written using the fast attoparsec parsing library and exposes both a native attoparsec Parser and convenience functions for lazily parsing token streams out of strict and lazy Text values.
For instance,
>>> parseTokens "<div><h1>Hello World</h1><br/><p class=widget>Example!</p></div>"
[TagOpen "div" [],TagOpen "h1" [],ContentText "Hello World",TagClose "h1",TagSelfClose "br" [],TagOpen "p" [Attr "class" "widget"],ContentText "Example!",TagClose "p",TagClose "div"]
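
Because the result is a flat list of tokens, ordinary list functions are enough to pull information out of it. The following is a minimal sketch of that idea, not part of the library: `Token` and `Attr` here are local stand-ins that mirror only the constructors visible in the output above (using `String` fields is an assumption made so the sketch needs nothing beyond `base`), and `exampleTokens`, `textContent` and `attrValues` are hypothetical helper names.

```haskell
-- Local stand-ins for the token constructors shown in the output above.
-- Assumption: String fields are used here so the sketch depends on base only;
-- with the real library you would import its own Token and Attr types instead.
data Attr = Attr String String
  deriving (Show, Eq)

data Token
  = TagOpen String [Attr]
  | TagSelfClose String [Attr]
  | TagClose String
  | ContentText String
  deriving (Show, Eq)

-- The token list from the example above, written out by hand.
exampleTokens :: [Token]
exampleTokens =
  [ TagOpen "div" []
  , TagOpen "h1" []
  , ContentText "Hello World"
  , TagClose "h1"
  , TagSelfClose "br" []
  , TagOpen "p" [Attr "class" "widget"]
  , ContentText "Example!"
  , TagClose "p"
  , TagClose "div"
  ]

-- Collect the text content, ignoring all tags.
textContent :: [Token] -> [String]
textContent toks = [t | ContentText t <- toks]

-- Collect the values of a named attribute from opening and self-closing tags.
attrValues :: String -> [Token] -> [String]
attrValues name toks =
  [v | tok <- toks, Attr k v <- attrsOf tok, k == name]
  where
    attrsOf (TagOpen _ as)      = as
    attrsOf (TagSelfClose _ as) = as
    attrsOf _                   = []

main :: IO ()
main = do
  print (textContent exampleTokens)        -- ["Hello World","Example!"]
  print (attrValues "class" exampleTokens) -- ["widget"]
```

In real use the hand-written list would be replaced by the result of `parseTokens`, and the same comprehensions would apply to the library's own token type.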
Here are some typical performance numbers, taken from parsing a fairly long Wikipedia article with tagsoup and with html-parse:

benchmarking Forced/tagsoup fast Text
time                 171.2 ms   (166.4 ms .. 177.3 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 171.9 ms   (169.4 ms .. 173.2 ms)
std dev              2.516 ms   (1.104 ms .. 3.558 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Forced/tagsoup normal Text
time                 176.9 ms   (167.3 ms .. 188.5 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 180.7 ms   (177.5 ms .. 183.7 ms)
std dev              4.246 ms   (2.316 ms .. 5.803 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/html-parser
time                 20.88 ms   (20.60 ms .. 21.25 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 20.99 ms   (20.81 ms .. 21.20 ms)
std dev              446.1 μs   (336.4 μs .. 596.2 μs)

In this run html-parse takes about 21 ms against roughly 171 to 177 ms for tagsoup, which makes it around eight times faster on this input.