Skip to content

Representing character classes

Hayley Patton edited this page Mar 25, 2023 · 4 revisions

isums

isums (integer sums, in DFA-construction/sets.lisp) are of the forms $[s_1, e_1) \cup \ldots \cup [s_n, e_n)$ or $\Sigma \setminus ([s_1, e_1) \cup \ldots \cup [s_n, e_n))$ where all $s$ and $e$ are character codes. Both allow having no ranges; the former produces the empty set $\emptyset$, the latter produces the universal set $\Sigma$.

Distilling POSIX character classes

Here are the POSIX character classes, and some definitions:

alnum: alpha ∪ digit
alpha: alpha
blank: {Space, Tab} *
cntrl: [0, 32) ∪ {127}
digit: digit
graph: Sigma \ (cntrl ∪ {Space}) *
lower: lower
print: graph ∪ {Space} *
punct: graph \ (alpha ∪ digit) * **
space: {Space, Return, Newline, Tab} *
upper: upper
xdigit: digit ∪ [A-Fa-f]

So the classes I can't represent as nice isums are alpha, digit, lower, and upper.

* These come from https://github.com/micromatch/posix-character-classes which is obviously not normative of POSIX.
** Is 💩 punctuation?

csums

csums (character sums) are of the form $(c_1 \cap [0, e_1)) \cup \ldots \cup (c_n \cap [s_n, \text{char-code-limit}))$, with all $c$ being elements of $\mathbb{P}(\mathbb{P}(\text{classes}))$, i.e. sets of accepted character classes. There are $2^{2^4} = 65536$ such sets, which will most likely fit in a fixnum.

There are no explicit complements like with isums, but twiddling the class set allows for complements. For example, $\{\} \cap [0, \text{char-code-limit})$ is the empty set, and $\mathbb{P}(\{\text{a},\text{d},\text{l},\text{u}\}) \cap [0, \text{char-code-limit})$ the universal set. All set operations only require doing the related bit operations on class sets.

To interpret the class part of a csum, a vector mapping character codes to their bit-set of classes may be used, and then (logbitp bit-set c) may be used to test membership in $c$. The code generator should not generate a test for empty or universal sets of classes.

Clone this wiki locally