-
Notifications
You must be signed in to change notification settings - Fork 9
Regular expression engine sizes
In my opinion, one-more-re-nightmare is a pretty small library. First, I should admit it does not do everything that most "big" regular expression libraries do: we don't really get word or line boundaries at the very least.
But we have a few tricks for reducing code size. Generating fast code relies on the Lisp compiler to optimize well, freeing us of doing any post-processing on the generated DFA. Rewrite rules for regular expressions are written using pattern matching, and used to create simplifying constructors for expressions behind the scenes. We use a (subjectively) simpler derivative technique instead of generating a non-deterministic machine and then pulling out a deterministic machine.
What do these tricks get us? Here is a list of some regular expression engines I am aware of.
- one-more-re-nightmare (1.5×10³ LOC, 1.0× size) You're looking right at it.
- CLEX2, Gilbert Baumann's lexer (3.2×10³ LOC, 2.05× size) Currently unpublished, but was the first implementation of the tagged-DFA approach we use. Includes RE vectors for lexing as well as boundaries, its own dataflow analysis, clever handling of character sets, and uses the Lisp compiler as a backend.
- Redgrep (3.7×10³ LOC, 2.44× size) Uses plain Brzozowski derivatives and LLVM as a backend.
- CL-PPCRE (5.8×10³ LOC, 3.72× size) A Perl regex compiler (it's in the name). Compiles a NFA to a chain of Lisp closures, which used to be faster than a VM, but is beaten by "real" compilation.
- RE2 (30×10³ LOC, 19.2× size) Uses bytecode for either a deterministic or non-deterministic machine (usually the former, but can fall back to the latter). Designed to handle untrusted input, so algorithms look wonky after being relieved of using recursive techniques.
- rust-lang/regex (65×10³ LOC, 41.7× size) Picks between some virtual machines, including a DFA where the regular expression is simple enough. Can compile constant regular expressions to Rust code at compile-time, but otherwise interprets. Uses a NFA when submatches are involved.
- PCRE2 (70×10³ LOC, 45.0× size) Has its own JIT compiler with several code generators (sljit). Can do DFAs, but obviously only for a subset of Perl "regular" expressions.
- Hyperscan (198×10³ LOC, 127× size) Does some trickery with SIMD apparently, and can take large numbers of REs to match simultaneously, but it's not obvious or documented anywhere (that I can find it) how it runs REs. Oh well. It can't do submatches at the very least. If you inspect it on a non-Intel processor it looks 3× larger :)
I'm pretty happy with the size. If we add in more tricks, it will obviously get larger, but then we have more tricks too.