Regular expression engine sizes

In my opinion, one-more-re-nightmare is a pretty small library. First, I should admit it does not do everything that most "big" regular expression libraries do: we don't really get word or line boundaries at the very least.

But we have a few tricks for reducing code size. Generating fast code relies on the Lisp compiler to optimize well, freeing us of doing any post-processing on the generated DFA. Rewrite rules for regular expressions are written using pattern matching, and used to create simplifying constructors for expressions behind the scenes. We use a (subjectively) simpler derivative technique instead of generating a non-deterministic machine and then pulling out a deterministic machine.

What do these tricks get us? Here is a list of some regular expression engines I am aware of.

one-more-re-nightmare (1.5×10³ LOC, 1.0× size) You're looking right at it.
CLEX2, Gilbert Baumann's lexer (3.2×10³ LOC, 2.05× size) Currently unpublished, but was the first implementation of the tagged-DFA approach we use. Includes RE vectors for lexing as well as boundaries, its own dataflow analysis, clever handling of character sets, and uses the Lisp compiler as a backend.
Redgrep (3.7×10³ LOC, 2.44× size) Uses plain Brzozowski derivatives and LLVM as a backend.
CL-PPCRE (5.8×10³ LOC, 3.72× size) A Perl regex compiler (it's in the name). Compiles a NFA to a chain of Lisp closures, which used to be faster than a VM, but is beaten by "real" compilation.
RE2 (30×10³ LOC, 19.2× size) Uses bytecode for either a deterministic or non-deterministic machine (usually the former, but can fall back to the latter). Designed to handle untrusted input, so algorithms look wonky after being relieved of using recursive techniques.
rust-lang/regex (65×10³ LOC, 41.7× size) Picks between some virtual machines, including a DFA where the regular expression is simple enough. Can compile constant regular expressions to Rust code at compile-time, but otherwise interprets. Uses a NFA when submatches are involved.
PCRE2 (70×10³ LOC, 45.0× size) Has its own JIT compiler with several code generators (sljit). Can do DFAs, but obviously only for a subset of Perl "regular" expressions.
Hyperscan (198×10³ LOC, 127× size) Does some trickery with SIMD apparently, and can take large numbers of REs to match simultaneously, but it's not obvious or documented anywhere (that I can find it) how it runs REs. Oh well. It can't do submatches at the very least. If you inspect it on a non-Intel processor it looks 3× larger :)

I'm pretty happy with the size. If we add in more tricks, it will obviously get larger, but then we have more tricks too.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regular expression engine sizes

Clone this wiki locally