Support absent repeaters (#13)
slevithan committed Jan 21, 2025
1 parent bbeb2ce commit e3bf3f0
Showing 7 changed files with 70 additions and 22 deletions.
21 changes: 15 additions & 6 deletions README.md
@@ -908,7 +908,7 @@ Notice that nearly every feature below has at least subtle differences from Java
</tr>

<tr valign="top">
<th align="left" rowspan="5">Other</th>
<th align="left" rowspan="6">Other</th>
<td>Comment group</td>
<td><code>(?#…)</code></td>
<td align="middle">✅</td>
@@ -928,6 +928,15 @@ Notice that nearly every feature below has at least subtle differences from Java
✔ Same as JS<br>
</td>
</tr>
<tr valign="top">
<td>Absent repeater</td>
<td><code>(?~…)</code></td>
<td align="middle">✅</td>
<td align="middle">✅</td>
<td>
✔ Supported<sup>[6]</sup><br>
</td>
</tr>
<tr valign="top">
<td>Keep</td>
<td><code>\K</code></td>
@@ -985,6 +994,7 @@ The table above doesn't include all aspects that Oniguruma-To-ES emulates (inclu
3. Target `ES2018` doesn't support nested *negated* character classes.
4. It's not an error for *numbered* backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because ① most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), ② erroring matches the behavior of named backreferences, and ③ the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable. Note that it's not a backreference in the first place if using `\10` or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
5. Oniguruma's recursion depth limit is `20`. Oniguruma-To-ES uses the same limit by default but allows customizing it via the `rules.recursionLimit` option. Two rare uses of recursion aren't yet supported: overlapping recursions, and use of backreferences when a recursed subpattern contains captures. Patterns that would trigger an infinite recursion error in Oniguruma might find a match in Oniguruma-To-ES (since recursion is bounded), but future versions will detect this and error at transpilation time.
6. Exotic (and extremely rare) forms of absent functions that start with `(?~|` (absent expressions, stoppers, and clearers) aren't yet supported.

## ❌ Unsupported features

@@ -996,19 +1006,18 @@ The following throw errors since they aren't yet supported. They're all extremel
- Grapheme boundaries: `\y`, `\Y`.
- Flags `P` (POSIX is ASCII) and `y{g}`/`y{w}` (grapheme boundary modes).
- Whole-pattern modifier: Don't capture group `(?C)`.
- Callout: `(*FAIL)`.
- Named callout: `(*FAIL)`.
- Supportable for some uses:
- Absence functions: `(?~…)`, etc.
- Conditionals: `(?(…)…)`, etc.
- Whole-pattern modifiers: Ignore-case is ASCII `(?I)`, find longest `(?L)`.
- Callout pair: `(*SKIP)(*FAIL)`.
- Named callout pair: `(*SKIP)(*FAIL)`.
- Not supportable:
- Other callouts: `(?{…})`, `(*…)`, etc.

Note that Oniguruma-To-ES supports 99.9+% of real-world Oniguruma regexes, based on a sample of tens of thousands of regexes used in TextMate grammars. Of the features listed above, absence functions and conditionals were used in 2–3 regexes each. The rest weren't used at all.

See also the [supported features](#-supported-features) table (above) which describes some additional rarely-used sub-features that aren't currently supported.

Note that Oniguruma-To-ES supports 99.9+% of real-world Oniguruma regexes, based on a sample of tens of thousands of regexes used in TextMate grammars. Of the features listed above, conditionals were used in three regexes. The rest weren't used at all. Some Oniguruma features are so exotic that they're *used* zero times in all of public GitHub.

Contributions are welcome if you want to add support for currently unsupported features.

<a name="unicode"></a>
4 changes: 2 additions & 2 deletions src/index.js
@@ -76,8 +76,8 @@ function toDetails(pattern, options) {
const strategy = regexAst._strategy;
if (useEmulationGroups || strategy) {
result.options = {
...(strategy ? {strategy} : null),
...(useEmulationGroups ? {useEmulationGroups} : null),
...(strategy && {strategy}),
...(useEmulationGroups && {useEmulationGroups}),
};
}
return result;
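
Note on the spread change in this hunk: the sketch below is not taken from the repository. It illustrates why `...(cond && {key})` can stand in for `...(cond ? {key} : null)`: object spread ignores `false`, `null`, `undefined`, `0`, and `''`, so a falsy left operand contributes no properties. The function name `buildOptions` and the value `'some_strategy'` are hypothetical placeholders.

```javascript
// Hypothetical helper that mirrors the shape of the options object in the hunk above
function buildOptions(strategy, useEmulationGroups) {
  return {
    // Spreading a falsy primitive (false, null, undefined, '', 0) adds nothing
    ...(strategy && {strategy}),
    ...(useEmulationGroups && {useEmulationGroups}),
  };
}

console.log(buildOptions(undefined, false)); // {}
console.log(buildOptions('some_strategy', true));
```

The shorter form relies on the condition being either falsy or the guard for an object literal. It would not be a safe swap if the right operand could itself be a non-empty string, since spreading a string adds index keys.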
38 changes: 33 additions & 5 deletions src/parse.js
@@ -5,6 +5,7 @@ import {getOrCreate, r, throwIfNot} from './utils.js';
import {hasOnlyChild} from './utils-ast.js';

const AstTypes = {
AbsentFunction: 'AbsentFunction',
Alternative: 'Alternative',
Assertion: 'Assertion',
Backreference: 'Backreference',
@@ -26,6 +27,11 @@ const AstTypes = {
Recursion: 'Recursion',
};

const AstAbsentFunctionKinds = {
// See <github.com/slevithan/oniguruma-to-es/issues/13>
repeater: 'repeater',
};

const AstAssertionKinds = {
line_end: 'line_end',
line_start: 'line_start',
@@ -320,24 +326,30 @@ function parseGroupOpen(context, state) {
getOrCreate(namedGroupsByName, node.name, []).push(node);
}
}
if (node.type === AstTypes.AbsentFunction && state.isInAbsentFunction) {
// Doesn't throw in Onig but produces weird results and is described as unsupported in docs
throw new Error('Nested absent function not supported by Oniguruma');
}
let nextToken = throwIfUnclosedGroup(tokens[context.current]);
while (nextToken.type !== TokenTypes.GroupClose) {
if (nextToken.type === TokenTypes.Alternator) {
node.alternatives.push(createAlternative());
// Skip the alternator
context.current++;
} else {
const alt = node.alternatives.at(-1);
const isAbsentFunction = node.type === AstTypes.AbsentFunction;
const isLookbehind = node.kind === AstAssertionKinds.lookbehind;
const isNegLookbehind = isLookbehind && node.negate;
const alt = node.alternatives.at(-1);
const child = walk(alt, {
...state,
isInAbsentFunction: state.isInAbsentFunction || isAbsentFunction,
isInLookbehind: state.isInLookbehind || isLookbehind,
isInNegLookbehind: state.isInNegLookbehind || isNegLookbehind,
});
alt.elements.push(child);
// Centralized validation of lookbehind contents
if ((isLookbehind || state.isInLookbehind) && !skipLookbehindValidation) {
// JS supports all features within lookbehind, but Onig doesn't. Absence functions of form
// JS supports all features within lookbehind, but Onig doesn't. Absent functions of form
// `(?~|)` and `(?~|…)` are also invalid in lookbehind (the `(?~…)` and `(?~|…|…)` forms
// are allowed), but all forms with `(?~|` throw since they aren't yet supported
const msg = 'Lookbehind includes a pattern not allowed by Oniguruma';
@@ -355,6 +367,7 @@ function parseGroupOpen(context, state) {
}
}
}
alt.elements.push(child);
}
nextToken = throwIfUnclosedGroup(tokens[context.current]);
}
@@ -434,6 +447,17 @@ function parseSubroutine(context) {
return node;
}

function createAbsentFunction(kind) {
if (kind !== AstAbsentFunctionKinds.repeater) {
throw new Error(`Unexpected absent function kind "${kind}"`);
}
return {
type: AstTypes.AbsentFunction,
kind,
alternatives: [createAlternative()],
};
}

function createAlternative() {
return {
type: AstTypes.Alternative,
@@ -447,7 +471,7 @@ function createAssertion(kind, options) {
return {
type: AstTypes.Assertion,
kind,
...(kind === AstAssertionKinds.word_boundary ? {negate} : null),
...(kind === AstAssertionKinds.word_boundary && {negate}),
};
}

@@ -478,6 +502,8 @@ function createBackreference(ref, options) {

function createByGroupKind({flags, kind, name, negate, number}) {
switch (kind) {
case TokenGroupKinds.absent_repeater:
return createAbsentFunction(AstAbsentFunctionKinds.repeater);
case TokenGroupKinds.atomic:
return createGroup({atomic: true});
case TokenGroupKinds.capturing:
@@ -634,7 +660,7 @@ function createPattern() {
};
}

function createQuantifier(element, min, max, greedy, possessive) {
function createQuantifier(element, min, max, greedy = true, possessive = false) {
const node = {
type: AstTypes.Quantifier,
min,
@@ -766,11 +792,13 @@ function throwIfUnclosedGroup(token) {
}

export {
AstAbsentFunctionKinds,
AstAssertionKinds,
AstCharacterSetKinds,
AstDirectiveKinds,
AstTypes,
AstVariableLengthCharacterSetKinds,
createAbsentFunction,
createAlternative,
createAssertion,
createBackreference,
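
Note on the nesting check added to `parseGroupOpen`: the following is a simplified sketch, assuming a toy node shape with a `children` array rather than the parser's real `alternatives`/`elements` structure. It shows how passing an `isInAbsentFunction` flag down through the walk state lets a nested `(?~…)` be rejected while sibling absent functions remain allowed. The function name `assertNoNestedAbsentFunction` is hypothetical.

```javascript
// Simplified stand-in for the parser's state propagation (toy node shape, not the real AST)
function assertNoNestedAbsentFunction(node, state = {isInAbsentFunction: false}) {
  const isAbsentFunction = node.type === 'AbsentFunction';
  if (isAbsentFunction && state.isInAbsentFunction) {
    throw new Error('Nested absent function not supported by Oniguruma');
  }
  // Each branch gets its own copy of the state, so siblings don't affect each other
  const childState = {
    ...state,
    isInAbsentFunction: state.isInAbsentFunction || isAbsentFunction,
  };
  for (const child of node.children ?? []) {
    assertNoNestedAbsentFunction(child, childState);
  }
}

// Two sibling absent functions: accepted
assertNoNestedAbsentFunction({
  type: 'Group',
  children: [
    {type: 'AbsentFunction', children: []},
    {type: 'AbsentFunction', children: []},
  ],
});
```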
4 changes: 2 additions & 2 deletions src/subclass.js
@@ -72,8 +72,8 @@ class EmulatedRegExp extends RegExpSubclass {
pattern,
flags: flags ?? '',
options: {
...(opts.strategy ? {strategy: opts.strategy} : null),
...(opts.useEmulationGroups ? {useEmulationGroups: true} : null),
...(opts.strategy && {strategy: opts.strategy}),
...(opts.useEmulationGroups && {useEmulationGroups: true}),
},
};
}
8 changes: 4 additions & 4 deletions src/tokenize.js
@@ -40,7 +40,7 @@ const TokenDirectiveKinds = {
};

const TokenGroupKinds = {
absence: 'absence',
absent_repeater: 'absent_repeater',
atomic: 'atomic',
capturing: 'capturing',
group: 'group',
@@ -358,15 +358,15 @@ function getTokenWithDetails(context, pattern, m, lastIndex) {
}
return {
token,
}
};
}
if (m2 === '~') {
if (m === '(?~|') {
throw new Error(`Unsupported absence function type "${m}"`);
throw new Error(`Unsupported absent function kind "${m}"`);
}
return {
token: createToken(TokenTypes.GroupOpen, m, {
kind: TokenGroupKinds.absence,
kind: TokenGroupKinds.absent_repeater,
}),
};
}
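
Note on the tokenizer change: a minimal sketch, under the assumption that the group opener arrives as a string such as `(?~` or `(?~|`. It separates the supported absent repeater from the still-unsupported `(?~|` forms. The function name `classifyAbsentOpener` and the returned object shape are hypothetical and do not reproduce the tokenizer's real `createToken` output.

```javascript
// Hypothetical classifier; the real tokenizer builds tokens via createToken
function classifyAbsentOpener(opener) {
  if (opener === '(?~|') {
    // Absent expressions, stoppers, and clearers all start with `(?~|`
    throw new Error(`Unsupported absent function kind "${opener}"`);
  }
  if (opener === '(?~') {
    return {type: 'GroupOpen', kind: 'absent_repeater', raw: opener};
  }
  // Not an absent function opener
  return null;
}

console.log(classifyAbsentOpener('(?~').kind); // absent_repeater
```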
16 changes: 13 additions & 3 deletions src/transform.js
@@ -1,5 +1,5 @@
import {Accuracy, Target} from './options.js';
import {AstAssertionKinds, AstCharacterSetKinds, AstDirectiveKinds, AstTypes, AstVariableLengthCharacterSetKinds, createAlternative, createAssertion, createBackreference, createCapturingGroup, createCharacterSet, createGroup, createLookaround, createUnicodeProperty, parse} from './parse.js';
import {AstAssertionKinds, AstCharacterSetKinds, AstDirectiveKinds, AstTypes, AstVariableLengthCharacterSetKinds, createAlternative, createAssertion, createBackreference, createCapturingGroup, createCharacterSet, createGroup, createLookaround, createQuantifier, createUnicodeProperty, parse} from './parse.js';
import {tokenize} from './tokenize.js';
import {traverse} from './traverse.js';
import {JsUnicodeProperties, PosixClassesMap} from './unicode.js';
@@ -100,6 +100,17 @@ function transform(ast, options) {
}

const FirstPassVisitor = {
AbsentFunction({node, replaceWith}) {
// Convert absent repeater `(?~…)` to `(?:(?:(?!…)\p{Any})*)`
const group = prepContainer(createGroup(), [
adoptAndSwapKids(createLookaround({negate: true}), node.alternatives),
createUnicodeProperty('Any'),
]);
const quantifier = createQuantifier(group, 0, Infinity);
group.parent = quantifier;
replaceWith(prepContainer(createGroup(), [quantifier]));
},

Alternative: {
enter({node, parent, key}, {flagDirectivesByAlt}) {
// Look for own-level flag directives when entering an alternative because after traversing
@@ -587,7 +598,7 @@ const ThirdPassVisitor = {
if (!participants.length) {
// If no participating capture, convert backref to `(?!)`; backrefs to nonparticipating
// groups can't match in Onig but match the empty string in JS
replaceWith(createLookaround({negate: true}));
replaceWith(prepContainer(createLookaround({negate: true})));
} else if (participants.length > 1) {
// Multiplex
const alts = participants.map(reffed => adoptAndSwapKids(
@@ -910,6 +921,5 @@ function traverseReplacement(replacement, {parent, key, container}, state, visit
}

export {
adoptAndSwapKids,
transform,
};
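
Note on the `AbsentFunction` visitor: its comment says the absent repeater `(?~…)` is converted to `(?:(?:(?!…)\p{Any})*)`. The sketch below writes that shape by hand as a plain JavaScript regex for the Oniguruma pattern `/\*(?~\*/)\*/` (a C-style comment). It is an illustration of the technique, not the library's actual generated output, and the variable names are hypothetical.

```javascript
// Hand-written equivalent of Oniguruma `/\*(?~\*/)\*/` using the shape from the visitor comment.
// Flag `u` is needed for `\p{Any}`, which matches any code point including newlines.
const emulatedComment = /\/\*(?:(?:(?!\*\/)\p{Any})*)\*\//u;

// For comparison: a greedy "match anything" runs to the last `*/` in the input
const greedyComment = /\/\*[\s\S]*\*\//;

const input = '/* a */ b /* c */';
console.log(emulatedComment.exec(input)[0]); // "/* a */"
console.log(greedyComment.exec(input)[0]); // "/* a */ b /* c */"
```

The negative lookahead is checked before each consumed code point, so the repeated group can never step across the start of `*/`. That is what keeps the overall match from containing the absent pattern.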
1 change: 1 addition & 0 deletions src/traverse.js
@@ -75,6 +75,7 @@ function traverse(path, state, visitor) {
case AstTypes.Subroutine:
case AstTypes.VariableLengthCharacterSet:
break;
case AstTypes.AbsentFunction:
case AstTypes.CapturingGroup:
case AstTypes.Group:
case AstTypes.Pattern:
