Reference Manual
Name | Output | Output when `(unicode)` |
---|---|---|
`alpha` | `[a-zA-Z]` | `\p{Alphabetic}` |
`upper` | `[A-Z]` | `\p{Uppercase}` |
`lower` | `[a-z]` | `\p{Lowercase}` |
`alnum` | `[a-zA-Z0-9]` | `\p{Alphanumeric}` |
`linechar` | `[\r\n\x0B\x0C]` (`\n` when `(-word)`) | `[\r\n\x0B\x0C\x85\u2028\u2029]` (`\n` when `(-word)`) |
`padchar` | `[ \t]` | |
`space` | `[ ]` | |
`tab` | `\t` | |
`digit` | `\d` | |
`whitechar` | `\s` | |
`wordchar` | `\w` | |
`backslash` | `\\` | |
- `WOB` is special: it is not a character-class, but you can apply the `non-` operator to it. `non-WOB` compiles to `\B`.
- `wordchar` is special: it can be redefined (other built-ins cannot be redefined). See: Redefining Wordchar.
- The oprex equivalent of regex's `.` with DOTALL turned OFF is `non-linechar` (see `linechar` in the Built-in Character Classes table).
The general syntax of oprex is in the form of:
>>> oprex('''
... (flags) -- comments
... expression
... def_x = expression
... ''')
Syntax:
- `(flagname)` to turn it on.
- `(-flagname)` to turn it off.
- `(flag1 flag2 -flag3)` for multiple flags.
- `(flag) expression` to apply only to that `expression` (inline/scoped flag).
- `(flag)` at the very beginning (before the main expression, on its own line) to apply globally.
Global flags:
- Can only be applied globally.
- Can only be turned on, never off.
- Supported global flags are `ascii`, `bestmatch`, `enhancedmatch`, `locale`, `reverse`, `unicode`, `version0`, and `version1`.
Scoped flags:
- Can be applied both globally and inline/scoped.
- Can be turned on and off.
- Supported scoped flags are `dotall`, `fullcase`, `ignorecase`, `multiline`, `verbose`, and `word`.
(unicode ignorecase) -- multiple global flags, scoped flag can be applied globally but not vice versa
/password/
password = (-ignorecase) 'correctHorseBatteryStaple' -- scoped flag, turn-off
Compiles to (?V1wui)(?-i:correctHorseBatteryStaple)
- `version1` and `word` are turned on by default (hence the `V1w` in the example's output).
- For what each flag does, refer to the regex module documentation.
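To see what the scoped turn-off in the compiled output does, here is a minimal sketch using Python's standard `re` module. Using `re` is an assumption made for portability: oprex targets the third-party `regex` module, and the `V1` and `w` flags are specific to that module, so this hand-written pattern keeps only the `i` flag.

```python
import re

# Hand-written stand-in for the compiled output above, reduced to flags the
# standard library understands: ignorecase is on globally, then switched off
# again for the password itself.
pattern = re.compile(r"(?i)(?-i:correctHorseBatteryStaple)")

exact = pattern.fullmatch("correctHorseBatteryStaple")
shouted = pattern.fullmatch("CORRECTHORSEBATTERYSTAPLE")

print(exact is not None)    # the exact-case password matches
print(shouted is not None)  # the upper-cased variant is rejected
```

The second check fails to match because the `(?-i:...)` scope overrides the global flag for everything inside it.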
A comment starts with `--` and runs to the end of the line. Block comments are not supported.
Expression can be one of the following:
- String literal
- String-of-digits range literal
- A variable name
- Negation
- Lookup chain
- Alternation block
- Lookaround block
- Quantification/repetition
Syntax:
- `'single quoted'` or `"double quoted"`
- Quotes can be escaped, e.g. `'Mother\'s Day'`
- The output will be properly regex-escaped, e.g. `"A+"` compiles to `A\\+`
- Backslash-escapes that are not regex escapes will NOT be escaped, e.g. `"\d\t"` compiles to `\\d\t` (the `\d` got escaped but the `\t` didn't)
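The regex-escaping of a string literal is comparable to what `re.escape` does in Python's standard library. The sketch below uses `re.escape` only as a stand-in to show the effect; it is not a claim about how oprex implements escaping internally.

```python
import re

literal = "A+"
escaped = re.escape(literal)  # the + is escaped so it is matched literally
print(escaped)

# The escaped pattern matches the text "A+" itself, not "one or more A".
matches_literal = re.fullmatch(escaped, "A+") is not None
matches_repeat = re.fullmatch(escaped, "AAA") is not None
print(matches_literal, matches_repeat)
```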
Word boundary (oprex's `WOB`, regex's `\b`) and non-boundary (oprex's `non-WOB`, regex's `\B`) can be easily appended and/or prepended to a string literal using the following syntax:
- `.` for word boundary, e.g. `'cat'.` compiles to `cat\b`, `.'cat'` compiles to `\bcat`.
- `_` for non-boundary, e.g. `'cat'_` compiles to `cat\B`, `_'cat'_` compiles to `\Bcat\B`.
Since a word boundary means a position where one side is a wordchar while the other side is a non-wordchar, redefining wordchar changes word boundaries too.
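The compiled forms listed above can be tried directly with Python's standard `re` module. This is a small sketch with a made-up word list, checking which words each boundary variant accepts.

```python
import re

words = ["cat", "concat", "catalog", "bobcats"]  # sample data for illustration


def matching(pattern):
    """Return the sample words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]


ends_word = matching(r"cat\b")      # 'cat'.   boundary after
starts_word = matching(r"\bcat")    # .'cat'   boundary before
inside_word = matching(r"\Bcat\B")  # _'cat'_  non-boundary on both sides

print(ends_word, starts_word, inside_word)
```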
- Syntax: `"min".."max"` or `'min'..'max'`.
- `min` can have a leading `0` (to require) or `o` (to allow) leading zero(es).
- Syntax: `"min"..` or `'min'..`.
- Optional-leading-zero can be used with maxless range, e.g. `'o1'..`.
- Mandatory-leading-zero does NOT work with maxless range, e.g. `'01'..` will raise an error.
Syntax:
- `varname = expression` to define a subexpression
- `varname: c h a r s` to define a character-class
- `[varname]` to define a named capture group, e.g. `[varname] = expression` or `[varname]: c h a r s`
The following example demonstrates variable scoping in oprex:
/first/second/third/last/ -- can access direct children
    first = '1st'
    second = '2nd'
    third = /first/x/ -- can access older siblings
        x = /second/a/b/ -- can access parent's older-siblings
            a = /third?/x?/ -- can access parent, grandparent, great-grandparent, and so on (recursion)
            b = /B1/B2/b?/ -- can access self (recursion)
                B1 = first -- can access great-grandparent's (and so on's) older siblings
*)              B2 = second
            -- B1 is no longer defined beyond this point
        -- a and b are no longer defined beyond this point
    -- x is no longer defined beyond this point
    last = x -- last's x is different from third's x
        x = B2 -- B2 is global, so it's accessible here (normally it isn't)
A variable can be made global by prefixing its definition with `*)`. See `B2` in the example above. Once defined, global variables are available in all following scopes.
- Can't access grandchildren (and great-grandchildren, and so on).
/x/y/ -- can't access grandchildren
    x = y
        y = 'yadda'
- Can't access younger siblings.
/x/y/
    x = y -- can't access younger siblings
    y = 'yadda'
- Can't access sibling's children.
/x/yadda/
    x = y
        y = 'yadda'
    yadda = y -- can't access sibling's children
- All variables must be referenced by their immediate parent.
/x/y/
    x = 'pow'
    y = 'wow'
    z = 'how' -- ERROR: defined but not used

/x/y/
    pow = 'pow' -- ERROR: not referenced by parent
    x = pow
    y = pow

/x/y/
    x = pow
        pow = 'pow'
*)      wow = 'wow' -- ERROR: not referenced by parent
    y = wow
- Global variables still need to be defined-before-used.
/x/y/
    x = yadda -- can't access global variable before its definition
    y = yadda
*)      yadda = 'yadda'
A variable can refer to itself and/or its parent expressions (immediate parent, grandparent, great-grandparent, and so on). This will generate regex containing recursion. For examples, see Comma-Separated Values and Balanced Parentheses in the Examples section and Palindromes in the Usage section.
In oprex, the colon symbol `:` is used to start a character class:
- When defining a variable, e.g. `upvowel: A I U E O`
- When quantifying, using `of:`
- When negating, using `not:`
After the colon, list the character class's members, separated by spaces.
Example:
<<|
|1 of: a i u e o
|not: b..d f..h j..n p..t v..z
|upvowel
upvowel: A I U E O
In a character class, single-characters are interpreted literally, e.g.
vowel: a i u e o A I U E O
arith: + - * /
colon: :
Can be unicode too:
basic_math_constant: π e i
danger: ⚠ ☣ ☢ ☠
Character escapes work as expected:
newline: \r \n
Octal, hex, and unicode escapes are also supported
xyz: \170 \171 \172
xyz: \x78 \x79 \x7A
xyz: \u0078 \u0079 \u007A
xyz: \U00000078 \U00000079 \U0000007A
You can also use the character's unicode-name. For example, instead of:
dash: - – —
Which is not very clear, you can spell the names out for clarity:
dash: \N{HYPHEN-MINUS} \N{EN DASH} \N{EM DASH}
And to make it even clearer, oprex has sugar for that:
dash: :HYPHEN-MINUS :EN_DASH :EM_DASH
Character-classes can include other character-classes:
upnum: upper digit
base64: alnum + / =
To build a character class based on unicode character properties, use the following syntax:
money_char: /Number /Currency_Symbol . ,
nonalpha: not: /IsAlphabetic
nonalpha: not: /Alphabetic
nonalpha: /Alphabetic=No
nonalpha: /Alphabetic:No
japanese_char: /Script=Hiragana /Script=Katakana
japanese_char: /Script:Hiragana /Script:Katakana
japanese_char: /InHiragana /InKatakana
japanese_char: /IsHiragana /IsKatakana
japanese_char: /Hiragana /Katakana
Syntax: `from..to`, e.g.
hex: 0..9 a..f A..F
grade_char: A..F
nonzero: 1..9
Names and escape codes can be used with range too:
nonzero: \N{DIGIT ONE}..\N{DIGIT NINE}
nonzero: :DIGIT_ONE..:DIGIT_NINE
nonzero: \u0031..\u0039
Operator | Operation | Placement |
---|---|---|
`and` | intersection | infix (`x and y`) |
`not` (without colon) | subtraction | infix (`x not y`) |
`not:` (with colon) | negation | prefix (`not: x`) |
Examples:
arabic_number: /Number and /IsArabic
greek_alphabet: /Alphabetic and /Script:Greek
japanese_number: /Number and /Hiragana /Katakana
japanese_number: /Hiragana /Katakana and /Number
nonzero: digit not 0
upnum: alnum not lower
nonlatin_alpha: /Alphabetic not /InBasicLatin
gaijin_alpha: /Alphabetic not /Hiragana /Katakana
consonant: alpha not a i u e o A I U E O
non_quote: not: ' "
inside_paren: not: ( )
csv_data: not: ,
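To check what a subtraction such as `nonzero` or `consonant` above should match, the operation can be emulated in Python's standard `re` module with a negative lookahead in front of the base class. This is a hand-written sketch of the set semantics only; it is an assumption for illustration and not the regex that oprex emits.

```python
import re

# nonzero: digit not 0
nonzero = re.compile(r"(?!0)\d")
# consonant: alpha not a i u e o A I U E O
consonant = re.compile(r"(?![aiueoAIUEO])[a-zA-Z]")

nonzero_hits = nonzero.findall("1020")
consonant_hits = consonant.findall("Oprex 2")

print(nonzero_hits)
print(consonant_hits)
```

Intersection can be emulated the same way with a positive lookahead, i.e. "the next character is in x" followed by a match against y.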
Syntax:
- `non-` followed by a character-class variable name, e.g. `non-digit`.
- `not:` followed by character-class member(s), e.g. `not: a i u e o`.
The first form is for easy chaining, e.g. if you need "non-digit followed by non-alphabet" use `/non-digit/non-alpha/`.
Only character classes can be negated. But for the `non-` operator, the built-in variable `WOB` is an exception: `WOB` is not a character-class, but `non-WOB` compiles to `\B`.
Syntax: `/lookup1/lookup2/lookup3/etc/`
Each lookup can be one of the following:
- A variable name
- Negation using `non-`
- Backreference
- Match-until operator (the double underscore `__`)
To enhance readability, several sugars are available to use with lookup-chain syntax (you might want to first read about `BOS`, `BOL`, `EOS`, and `EOL` in the Built-in Variables section):
Syntactic Sugar | Meaning | Example | Equivalent To | Compiles To |
---|---|---|---|---|
`./` at the beginning | `BOS` | `./digit/` | `/BOS/digit/` | `\A\d` |
`//` at the beginning | `BOL` | `//digit/` | `/BOL/digit/` | `(?m:^)\d` |
`/.` at the end | `EOS` | `/digit/.` | `/digit/EOS/` | `\d\Z` |
`//` at the end | `EOL` | `/digit//` | `/digit/EOL/` | `\d(?m:$)` |
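The difference between the string anchors and the line anchors in the table shows up on multi-line input. The sketch below feeds the four compiled patterns from the table to Python's standard `re` module (an assumption made for portability; oprex itself targets the `regex` module) using made-up sample text.

```python
import re

starts = "1a\n2b\n3c"  # a digit at the start of every line
ends = "a1\nb2\nc3"    # a digit at the end of every line

bos = re.findall(r"\A\d", starts)      # ./digit/  start of string only
bol = re.findall(r"(?m:^)\d", starts)  # //digit/  start of every line
eos = re.findall(r"\d\Z", ends)        # /digit/.  end of string only
eol = re.findall(r"\d(?m:$)", ends)    # /digit//  end of every line

print(bos, bol, eos, eol)
```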
Syntax: `=varname`, where `varname` must be a named capture group. Example:
/number/=number/=number/
    [number] = digit
matches the same digit three times in a row, e.g. `777` or `000`. For more examples, see Date and Quoted String in the Examples section.
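As a rough illustration of the backreference example, here is a hand-written pattern in Python's standard `re` syntax with a named group and two backreferences to it. The exact regex oprex produces may differ; this sketch only mirrors the behaviour described.

```python
import re

# Hand-written counterpart of /number/=number/=number/ with [number] = digit:
# capture one digit, then require the same digit twice more.
triple = re.compile(r"(?P<number>\d)(?P=number)(?P=number)")

same = triple.fullmatch("777") is not None
zeros = triple.fullmatch("000") is not None
mixed = triple.fullmatch("778") is not None

print(same, zeros, mixed)
```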
The Match-Until operator `__` matches one or more characters until what follows it in the lookup chain. For example:
/open/__/close/
open: (
close: )
The `__` will match one or more characters until a closing parenthesis is encountered. The example will fail on the string `()` because `__` eats one-or-more. To make it zero-or-more, append the optionalize operator `?`:
/open/__?/close/
open: (
close: )
The above example is akin to regex's "lazy dotstar" idiom `.*?`. The difference is that oprex's `__` will try to make some optimizations in some cases:
- If it is followed by a character-class, example:
/__?/stop/
stop: . ;
Compiles to `[^.;]*+[.;]`. The `__?` uses what follows (the character-class `[.;]`) so it compiles to `[^.;]*+`, which is the way to optimize lazy-dotstar for such a case.
- If it is followed by a string literal, example:
/__?/stop/
stop = 'END'
Compiles to `(?:[^E]++|E(?!ND))*+END`. Again, the `__?` uses what follows (the literal `END`) so it compiles to `(?:[^E]++|E(?!ND))*+`, which is the way to optimize lazy-dotstar for a case like this.
In any case, the oprex stays super-readable while giving optimized, high-performance regex output.
If the `__` is followed by neither a character-class nor a string literal, it will compile to the usual lazy-dotplus `.+?` (or lazy-dotstar `.*?` in the case of `__?`). Example:
/__/any// -- match all characters except the last one
Compiles to .+?(?s:.)(?m:$)
If a Match-Until usage falls into an unoptimized case, it will compile to the regular lazy-dotstar/dotplus. In such a case, the dot in the lazy-dotstar/dotplus will adhere to the DOTALL flag setting (it will not match newline characters without DOTALL). So if you want the dot to really match anything, including newline characters, you'll need to turn on the `dotall` flag:
(dotall) /__/any/. -- match all characters (including newlines) except the last one
Compiles to (?s:.+?.\Z)
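The one-or-more versus zero-or-more behaviour, and the idea behind the character-class optimization, can be sketched with Python's standard `re` module. The patterns below are hand-written using the plain lazy idiom, not oprex's optimized output, and the negated-class variant is written without the possessive `*+` so that it also runs on Python versions before 3.11.

```python
import re

one_or_more = re.compile(r"\(.+?\)")   # in the spirit of /open/__/close/
zero_or_more = re.compile(r"\(.*?\)")  # in the spirit of /open/__?/close/

empty_with_plus = one_or_more.search("()")           # needs at least one character inside
empty_with_star = zero_or_more.search("()").group()  # accepts the empty pair
first_group = one_or_more.search("f(x) + g(y)").group()

# Lazy dotstar and a negated character class stop at the same place.
lazy = re.search(r".*?[.;]", "abc;def.").group()
negated = re.search(r"[^.;]*[.;]", "abc;def.").group()

print(empty_with_plus, empty_with_star, first_group, lazy, negated)
```

The negated class reaches the same result without trying the stop character at every position, which is the reasoning behind the optimized output described above.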
An alternation block starts with either `<<|` or `@|` and ends with an empty line. `<<|` starts a backtrackable alternation block, while `@|` starts an atomic one.
<<|
|alt1
|alt2
|etc
Each alt in an alternation block can be:
- An expression, with restrictions:
  - It cannot be a lookaround block.
  - It cannot be another alternation block.
  - (These limitations force you to refactor sub-alternation/lookaround into a variable. This ensures readability.)
- A conditional expression.
- `FAIL!`
- empty (it will try to match the empty string, which always succeeds).
The syntax is similar to the backtrackable one, with one minor difference: atomic alternation starts with `@|` rather than `<<|`. Most of the time they are interchangeable, with the atomic version having better performance. But most is not all, and some cases that require backtracking will not work with the atomic version. For example, the Palindrome example shown in the Usage section will not work if its alternation block is changed to atomic.
Syntax:
<<| -- can use @| too
|[cond1] ? alt1
|[cond2] ? alt2
|alt_else
- `cond1` and `cond2` must be names of capture groups.
- Specs of alts can be seen in the Backtrackable Alternation subsection, with one additional restriction: alts cannot be conditional expressions.
- The last branch of an alternation must NOT be a conditional expression, because (see the following example):
<<|
|[x] ? alpha
|[y] ? digit
If neither x nor y is defined, what should it match? Should it just succeed? Or should it fail? It's not clear. That's why the last branch cannot be a conditional.
- If you want it to "just succeed", add an empty branch.
- If you want it to fail, add `FAIL!`.
- If you want it to match some (non-conditional) expression, put it as the last branch.
Example:
<<|
|[x] ? alpha
|[y] ? digit
| -- or FAIL!, or (non-alternation non-lookaround) expression
For more examples, see the Stuff-Inside-Brackets example in the Tutorial section.
`FAIL!` compiles to `(?!)`, which will always fail.
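The always-failing behaviour of `(?!)` is easy to confirm with Python's standard `re` module: an empty negative lookahead asserts that the empty string cannot match here, which is never true. A minimal sketch:

```python
import re

fail = re.compile(r"(?!)")

fails_on_text = fail.search("anything") is None
fails_on_empty = fail.search("") is None

# Inside an alternation, a failing branch simply hands over to the next one.
falls_through = re.fullmatch(r"(?:(?!)|ok)", "ok") is not None

print(fails_on_text, fails_on_empty, falls_through)
```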
A lookaround block starts with `<@>` and ends with an empty line.
<@>
|lookahead> -- positive lookahead
|!lookahead> -- negative lookahead
<lookbehind| -- positive lookbehind
<!lookbehind| -- negative lookbehind
|match_things_normally|
|lookahead>
|!lookahead>
<lookbehind|
<!lookbehind|
Each of the `lookahead`, `lookbehind`, and `match_things_normally` in the above example can be one of:
- A variable name
- Negation using `non-`
- Backreference
- Lookup Chain
Description | Example |
---|---|
Lookahead a variable | |
Lookbehind a negation | `<non-digit\|` |
Negative-lookahead a negation | |
Negative-lookahead a backreference | |
Negative-lookahead a lookup-chain | |
Negative-lookbehind a lookup-chain | `<!/alpha/digit/\|` |
Lookaheads and lookbehinds match characters but then give up the match; they do not consume characters in the string. The "match things normally" part does exactly that: it matches normally and consumes the characters, rather than just looking around and giving up the match. Examples:
<@>
<backslash|
|quote|
<@>
|!dash>
|dashes_and_alnums|
<!dash|
Everywhere else in oprex, the at-sign `@` means atomic:
- In alternation: `@|` starts an atomic alternation block.
- In quantification: quantifiers that start with `@` (e.g. `@1..`) are possessive quantifiers, which work atomically.
So how about `<@>`? The `@` there reminds you that lookarounds are atomic. With `<` in a lookaround-block meaning lookbehind and `>` meaning lookahead, `<@>` (which looks like an eye) is the perfect symbol to begin a lookaround-block.
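The two lookaround examples earlier in this section (a quote preceded by a backslash, and a run of dashes and alphanumerics that neither starts nor ends with a dash) can be approximated in Python's standard `re` syntax as follows. These are hand-written patterns, and the definition used for `dashes_and_alnums` is an assumption, since the source does not show it.

```python
import re

# lookbehind backslash, then match a quote
escaped_quote = re.compile(r'(?<=\\)"')
positions = [m.start() for m in escaped_quote.finditer(r'say \"hi" now')]

# negative lookahead dash, match, negative lookbehind dash
# (assumes dashes_and_alnums means one or more of [-a-zA-Z0-9])
label = re.compile(r"(?!-)[-a-zA-Z0-9]+(?<!-)")
ok = label.fullmatch("my-host") is not None
leading = label.fullmatch("-host") is not None
trailing = label.fullmatch("host-") is not None

print(positions, ok, leading, trailing)
```

Only the quote at index 5 is reported, because the lookbehind checks the backslash without consuming it.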
`of` is oprex's keyword for doing quantification/repetition:
quantifier of expression
quantifier of: c h a r s
For quantifier description, see below. For the target, see expression and character-class.
Example (result in comments):
zipcode = 5 of digit -- \d{5}
byte = 8 of: 0 1 -- [01]{8}
wont_listen = @3.. of "la" -- (?:la){3,}+
hex_number = @1.. of: 0..9 A..F -- [0-9A-F]++
deck = @0..52 of: 2 3 4 5 6 7 8 9 X J Q K A
batman_fight = @7..11 of <<|
|'bam'
|'pow'
|'kapow'
In oprex, `@` means atomic, `<<` means allow backtrack. So, on quantifiers:
- `@` = atomic = possessive
- `<<+` = backtrack to add more = lazy
- `<<-` = backtrack to lessen = greedy
The exact syntax is:
Type | Syntax (max is optional) |
---|---|
possessive | @min..max |
greedy | min..max <<- |
lazy | min <<+..max |
Example | Meaning | Compiles To |
---|---|---|
`@1..` | atomic one or more | `++` |
`1.. <<-` | match one or more, allow backtrack to reduce the number of matches | `+` |
`1 <<+..` | match one, allow backtrack to match more, no max | `+?` |
`@0..10` | atomic zero to ten | `{,10}+` |
`0..10 <<-` | match zero to ten, allow backtrack to lessen | `{,10}` |
`0 <<+..10` | match zero, may backtrack to match more, maximum ten | `{,10}?` |
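The greedy and lazy rows of the table can be compared with Python's standard `re` module. This sketch leaves out the possessive forms because the standard library only accepts them from Python 3.11 onward; the sample strings are made up for illustration.

```python
import re

tag_text = "<b>bold</b>"

greedy = re.match(r"<.+>", tag_text).group()  # 1.. <<-  takes as much as it can
lazy = re.match(r"<.+?>", tag_text).group()   # 1 <<+..  takes as little as it can

digits = "12345"
greedy_bounded = re.match(r"\d{,3}", digits).group()  # 0..3 <<-
lazy_bounded = re.match(r"\d{,3}?", digits).group()   # 0 <<+..3

print(greedy, lazy, greedy_bounded, repr(lazy_bounded))
```

The lazy bounded form returns an empty match here because nothing after it forces it to take more.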
- Can be used as a quantifier, i.e. `? of expression`, `? of: c h a r s`
- Can be applied directly to:
  - A variable lookup, e.g. `digit?`
  - Backreference, e.g. `=captur?`
  - Match-Until operator, i.e. `__?`
Example: `? of /digit?/=captur?/__?/`
If `?` is applied to an expression while the expression is already quantified by `1..`, it will change the quantifier into `0..` while keeping its greediness/laziness/possessivity.
Example:
digits?
digits = @1.. of digit
Compiles to `\d*+`.
If we change the quantifier from `@1..` to `1.. <<-`, the output becomes `\d*`, and if we change it to `1 <<+..`, the output becomes `\d*?`.
This improves readability. Consider the following two examples:
/quote/contents/quote/ -- contents is not optional
quote: "
contents = @0.. of not: quote -- but it allows zero match
/quote/contents?/quote/ -- contents is optional
quote: "
contents = @1.. of not: quote -- minimum 1 match
Both compile to the same regex: `"[^"]*+"`. But the second example is better, because it shows more clearly that there may be no content between the quotes.
Unlike other built-in variables, `wordchar` can be redefined. This is useful because regex's `\w` means letters, numbers, and underscore, while e.g. in English, what are considered word characters are letters and maybe the hyphen. So, to redefine `wordchar`:
- Just define a variable named `wordchar`.
- It must be the first definition.
- It must be a character-class.
- It must be global.
- NOTE: it will also affect `WOB` and `non-WOB` (including the prefix/suffix `.` and `_` used with string literals).
Example:
_'cat'.
*) wordchar: alpha -
Compiles to (?>(?<=[a-zA-Z\-])(?=[a-zA-Z\-])|(?<![a-zA-Z\-])(?![a-zA-Z\-]))cat(?>(?<=[a-zA-Z\-])(?![a-zA-Z\-])|(?<![a-zA-Z\-])(?=[a-zA-Z\-]))
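To see how the redefined boundary behaves differently from the default one, the lookaround construction in the compiled output can be rebuilt in Python's standard `re` module. This sketch is an adaptation, not the output itself: it uses plain non-capturing groups `(?:...)` instead of the atomic groups `(?>...)`, so that it also runs on Python versions before 3.11, and the sample strings are made up.

```python
import re

W = r"[a-zA-Z\-]"  # the redefined wordchar: alpha plus hyphen

# A boundary is a position with a wordchar on exactly one side;
# a non-boundary has wordchars on both sides or on neither.
boundary = rf"(?:(?<={W})(?!{W})|(?<!{W})(?={W}))"
non_boundary = rf"(?:(?<={W})(?={W})|(?<!{W})(?!{W}))"

custom = re.compile(non_boundary + "cat" + boundary)  # _'cat'. with redefined wordchar
default = re.compile(r"\Bcat\b")                      # _'cat'. with the default \w

custom_space = custom.search("tomcat nap") is not None
custom_hyphen = custom.search("tomcat-nap") is not None
default_hyphen = default.search("tomcat-nap") is not None

print(custom_space, custom_hyphen, default_hyphen)
```

With the hyphen counted as a word character, `cat` in `tomcat-nap` no longer ends at a word boundary, so only the default `\b` version matches there.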
- IPv4 Address
- BEGIN-something-END
- Date
- Time
- Blood Type
- Quoted String
- Comma-Separated Values
- Password Checks
- Balanced Parentheses
- Number-string Range Literal
- Backreference
- Flags
- Match-anything-until
- Recursion
- Global Variables
- Anchors
- Built-in Character Classes
- Built-in Expressions
- Special Built-ins: `WOB`, `wordchar`, `non-linechar`