Regexes can be as simple as substring patterns:
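A minimal sketch of such a match; the sample string in $name is an assumption chosen for illustration:

```perl
use strict;
use warnings;

# Sample string (an assumption for illustration)
my $name = 'Chatfield';

# True if the substring "hat" appears anywhere in $name
my $found = $name =~ /hat/ ? 1 : 0;
print "Found a hat!\n" if $found;
```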
The match operator (m//, abbreviated //) identifies a regular expression--in this example, hat. This pattern is not a word. Instead it means "the h character, followed by the a character, followed by the t character." Each character in the pattern is an indivisible element, or atom. It matches or it doesn't.
The regex binding operator (=~) is an infix operator (fixity) which applies the regex of its second operand to a string provided by its first operand. When evaluated in scalar context, a match evaluates to a true value if it succeeds. The negated form of the binding operator (!~) evaluates to a true value unless the match succeeds.
The substitution operator, s///, is in one sense a circumfix operator (fixity) with two operands. Its first operand is a regular expression to match when used with the regex binding operator. The second operand is a substring used to replace the matched portion of the string supplied as the first operand of the regex binding operator. For example, to cure pesky summer allergies:
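A short sketch, assuming a hypothetical $status string:

```perl
use strict;
use warnings;

# Sample data (an assumption for illustration)
my $status = 'I feel ill.';

# Replace the first occurrence of "ill" with "well", modifying $status in place
$status =~ s/ill/well/;
print "$status\n";
```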
The qr// operator creates first-class regexes. Interpolate them into the match operator to use them:
... or combine multiple regex objects into complex patterns:
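A minimal sketch of both uses; the sample string and the variable names are assumptions for illustration:

```perl
use strict;
use warnings;

my $name  = 'Chatfield';    # sample data (assumption)
my $hat   = qr/hat/;        # first-class regex objects
my $field = qr/field/;

# Interpolate a single regex object into the match operator
my $found_hat = $name =~ /$hat/ ? 1 : 0;

# Combine two regex objects: "hat" immediately followed by "field"
my $found_both = $name =~ /$hat$field/ ? 1 : 0;

print "Found a hat!\n"            if $found_hat;
print "Found a hat in a field!\n" if $found_both;
```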
Regular expressions get more powerful through the use of regex quantifiers, which allow you to specify how often a regex component may appear in a matching string. The simplest quantifier is the zero or one quantifier, or ?:
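For example, a sketch that tries the pattern against a few sample words (the word list is an assumption):

```perl
use strict;
use warnings;

my $cat_or_ct = qr/ca?t/;

# Record which sample words match
my %matched;
for my $word (qw( cat ct caat )) {
    $matched{$word} = $word =~ $cat_or_ct ? 1 : 0;
    print "$word: ", ( $matched{$word} ? 'match' : 'no match' ), "\n";
}
```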
Any atom in a regular expression followed by the ? character means "match zero or one of this atom." This regular expression matches if zero or one a characters immediately follow a c character and immediately precede a t character, either the literal substring cat or ct.
The one or more quantifier, or +, matches only if there is at least one of the quantified atom:
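A sketch along the same lines, with an assumed word list:

```perl
use strict;
use warnings;

my $some_a = qr/ca+t/;

# "ct" has no a between c and t, so it cannot match
my %matched;
for my $word (qw( cat caat ct )) {
    $matched{$word} = $word =~ $some_a ? 1 : 0;
    print "$word: ", ( $matched{$word} ? 'match' : 'no match' ), "\n";
}
```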
There is no theoretical limit to the maximum number of quantified atoms which can match.
The zero or more quantifier, *, matches zero or more instances of the quantified atom:
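Again as a sketch with assumed sample words:

```perl
use strict;
use warnings;

my $any_a = qr/ca*t/;

# Any number of a characters, including none, may sit between c and t
my %matched;
for my $word (qw( ct cat caaat cot )) {
    $matched{$word} = $word =~ $any_a ? 1 : 0;
    print "$word: ", ( $matched{$word} ? 'match' : 'no match' ), "\n";
}
```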
As silly as this seems, it allows you to specify optional components of a regex. Use it sparingly, though: it's a blunt and expensive tool. Most regular expressions benefit from using the ? and + quantifiers far more than *. Precision of intent often improves clarity.
Numeric quantifiers express specific numbers of times an atom may match. {n} means that a match must occur exactly n times.
{n,} matches an atom at least n times:
{n,m} means that a match must occur at least n times and cannot occur more than m times:
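The three numeric forms can be compared side by side. This is a sketch; the variable names and sample words are assumptions:

```perl
use strict;
use warnings;

my $exactly_two = qr/ca{2}t/;      # exactly two a characters
my $two_or_more = qr/ca{2,}t/;     # at least two a characters
my $two_to_four = qr/ca{2,4}t/;    # between two and four a characters

for my $word (qw( cat caat caaaat caaaaat )) {
    printf "%-8s {2}: %d  {2,}: %d  {2,4}: %d\n", $word,
        ( $word =~ $exactly_two ? 1 : 0 ),
        ( $word =~ $two_or_more ? 1 : 0 ),
        ( $word =~ $two_to_four ? 1 : 0 );
}
```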
You may express the symbolic quantifiers in terms of the numeric quantifiers, but most programs use the former far more often than the latter.
The + and * quantifiers are greedy, as they try to match as much of the input string as possible. This is particularly pernicious. Consider a naïve use of the "zero or more non-newline characters" pattern of .*:
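A sketch of the problem, using assumed sample strings and a capture group to show exactly what matched:

```perl
use strict;
use warnings;

# A poor regex: .* happily spans unrelated words
my $hot_meal = qr/(hot.*meal)/;

# In list context a match returns its captures
my ($expected)   = 'I have a hot meal'         =~ $hot_meal;
my ($unexpected) = 'one-shot, piecemeal work!' =~ $hot_meal;

print "Matched: '$expected'\n";
print "Matched: '$unexpected'\n";
```

The second string contains no hot meal at all, yet the pattern matches from the hot in one-shot through the meal in piecemeal.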
Greedy quantifiers start by matching everything at first, and back off a character at a time only when it's obvious that the match will not succeed.
The ? quantifier modifier turns a greedy quantifier parsimonious:
When given a non-greedy quantifier, the regular expression engine will prefer the shortest possible potential match and will increase the number of characters identified by the .*? token combination only if the current number fails to match. Because * matches zero or more times, the minimal potential match for this token combination is zero characters:
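A sketch contrasting the two behaviors; the sample strings are assumptions:

```perl
use strict;
use warnings;

my $text = 'a hot meal, then another hot meal';

# Greedy: runs to the last "meal" in the string
my ($greedy) = $text =~ /(hot.*meal)/;

# Non-greedy: stops at the first "meal" it can reach
my ($non_greedy) = $text =~ /(hot.*?meal)/;

# .*? may match zero characters, so "hotmeal" qualifies
my $zero_width_ok = 'ilikeahotmeal' =~ /hot.*?meal/ ? 1 : 0;

print "greedy:     '$greedy'\n";
print "non-greedy: '$non_greedy'\n";
```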
Use +? to match one or more items non-greedily:
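For instance (a sketch with assumed sample strings):

```perl
use strict;
use warnings;

# At least one character must separate "hot" and "meal"
my $at_least_one = qr/hot.+?meal/;

my $run_together = 'ilikeahotmeal'     =~ $at_least_one ? 1 : 0;
my $with_spaces  = 'i like a hot meal' =~ $at_least_one ? 1 : 0;

print "run together: $run_together, with spaces: $with_spaces\n";
```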
The ? quantifier modifier also applies to the ? (zero or one matches) quantifier as well as the range quantifiers. In every case, it causes the regex to match as little of the input as possible.
The greedy patterns .+ and .* are tempting but dangerous. A cruciverbalist (a crossword puzzle aficionado) who needs to fill in four boxes of 7 Down ("Rich soil") will find too many invalid candidates with the pattern:
She'll have to discard Alabama, Belgium, and Bethlehem long before the program suggests loam. Not only are those words too long, but the matches start in the middle of the words. A working understanding of greediness helps, but there is no substitute for copious testing with real, working data.
Regex anchors force the regex engine to start or end a match at an absolute position. The start of string anchor (\A) dictates that any match must start at the beginning of the string:
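A sketch for the crossword clue; the pattern and the sample words are assumptions:

```perl
use strict;
use warnings;

# l, any two characters, then m, starting at the very beginning of the string
my $seven_down = qr/\Al..m/;

my $loam     = 'loam'     =~ $seven_down ? 1 : 0;
my $gloaming = 'gloaming' =~ $seven_down ? 1 : 0;   # "loam" is not at the start

print "loam: $loam, gloaming: $gloaming\n";
```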
The end of line string anchor (\Z) requires that a match end at the end of the string, or just before a newline at the end of the string.
The word boundary anchor (\b) matches only at the boundary between a word character (\w) and a non-word character (\W). Use an anchored regex to find loam while prohibiting Belgium:
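One way to write such a regex, as a sketch with assumed candidate strings:

```perl
use strict;
use warnings;

# A four-character word: l, two characters, m, with a word boundary on each side
my $seven_down = qr/\bl..m\b/;

my %matched;
for my $candidate ( 'loam', 'rich loam soil', 'gloaming', 'Belgium' ) {
    $matched{$candidate} = $candidate =~ $seven_down ? 1 : 0;
    print "$candidate: ", ( $matched{$candidate} ? 'match' : 'no match' ), "\n";
}
```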
Perl interprets several characters in regular expressions as metacharacters, characters which represent something other than their literal interpretation. Metacharacters give regex wielders power far beyond mere substring matches. The regex engine treats all metacharacters as atoms.
The . metacharacter means "match any character except a newline". Remember that caveat; many novices forget it. A simple regex search--ignoring the obvious improvement of using anchors--for 7 Down might be /l..m/. Of course, there's always more than one way to get the right answer:
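A sketch of one alternative, which checks the length separately from the pattern; the contents of @words are assumed sample data:

```perl
use strict;
use warnings;

# Candidate words (assumed sample data)
my @words = qw( loam Belgium loom lamb gloaming );

my @possibilities;
for my $word (@words) {
    next unless length($word) == 4;    # the answer has four boxes
    next unless $word =~ /l..m/;       # l, two characters, m
    push @possibilities, $word;
    print "Possibility: $word\n";
}
```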
If the potential matches in @words are more than the simplest English words, you will get false positives. . also matches punctuation characters, whitespace, and numbers. Be specific! The \w metacharacter represents all alphanumeric characters (unicode) and the underscore:
The \d metacharacter matches digits (also in the Unicode sense):
Use the \s metacharacter to match whitespace, whether a literal space, a tab character, a carriage return, a form-feed, or a newline:
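The \w, \d, and \s metacharacters can be sketched together. The variable names and sample strings are assumptions, and the phone pattern is deliberately not a robust matcher:

```perl
use strict;
use warnings;

# \w: word characters only, so punctuation no longer sneaks in
my $four_letters = qr/l\w\wm/;

# \d: digits (not a robust phone number matcher)
my $phoneish = qr/\d{3}-\d{3}-\d{4}/;

# \s: a single whitespace character between two three-letter words
my $two_three_letter_words = qr/\w{3}\s\w{3}/;

print "loam matches\n"         if 'loam'         =~ $four_letters;
print "l!?m does not match\n"  if 'l!?m'         !~ $four_letters;
print "looks like a phone\n"   if '202-456-1111' =~ $phoneish;
print "two short words\n"      if 'the cat'      =~ $two_three_letter_words;
```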
When none of those metacharacters is specific enough, specify your own character class by enclosing the characters you want in square brackets:
The hyphen character (-) allows you to specify a contiguous range of characters in a class, such as this $ascii_letters_only regex:
To include the hyphen as a member of the class, move it to the start or end:
... or escape it:
Use the caret (^) as the first element of the character class to mean "anything except these characters":
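The character class forms described above can be sketched as follows. Apart from $ascii_letters_only, which the text names, the variable names are assumptions:

```perl
use strict;
use warnings;

# A custom class: any one of the listed characters
my $ascii_vowels = qr/[aeiou]/;

# Braces keep the variable name separate from the following "t"
my $maybe_cat = qr/c${ascii_vowels}t/;

# A hyphen between two characters denotes a range
my $ascii_letters_only = qr/[a-zA-Z]/;

# A literal hyphen: first in the class, or escaped
my $interesting_punctuation = qr/[-!?]/;
my $line_characters         = qr/[|=\-_]/;

# A leading caret negates the class
my $not_an_ascii_vowel = qr/[^aeiou]/;

print "cot might be a cat\n" if 'cot' =~ $maybe_cat;
print "7 is not a letter\n"  if '7'   !~ $ascii_letters_only;
```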
Regular expressions allow you to group and capture portions of the match for later use. To extract an American telephone number of the form (202) 456-1111 from a string:
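A sketch of such a regex built from smaller pieces. Only $area_code is named in the text; the other variable names and the sample string are assumptions:

```perl
use strict;
use warnings;

my $area_code    = qr/\(\d{3}\)/;      # literal parentheses around three digits
my $local_number = qr/\d{3}-?\d{4}/;   # the hyphen is optional
my $phone_number = qr/$area_code\s?$local_number/;

my $found = 'Call (202) 456-1111 today' =~ $phone_number ? 1 : 0;
print "Found a phone number\n" if $found;
```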
Note especially the escaping of the parentheses within $area_code. Parentheses are special in Perl 5 regular expressions. They group atoms into larger units and also capture portions of matching strings. To match literal parentheses, escape them with backslashes as seen in $area_code.
Perl 5.10 added named captures, which allow you to capture portions of matches from applying a regular expression and access them later, such as finding a phone number in a string of contact information:
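A sketch, assuming a $phone_number regex like the one described earlier and a made-up $contact_info string:

```perl
use strict;
use warnings;

my $phone_number = qr/\(\d{3}\)\s?\d{3}-?\d{4}/;
my $contact_info = 'Office: (202) 456-1111';    # sample data (assumption)

my $found;
if ( $contact_info =~ /(?<phone>$phone_number)/ ) {
    $found = $+{phone};    # the capture named "phone"
    print "Found a number $found\n";
}
```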
Regexes tend to look like punctuation soup until you can group various portions together as chunks. Named capture syntax has the form:
Parentheses enclose the capture. The ?< name > construct names this particular capture and must immediately follow the left parenthesis. The remainder of the capture is a regular expression.
When a match against the enclosing pattern succeeds, Perl stores the portion of the string which matches the enclosed pattern in the magic variable %+. In this hash, the key is the name of the capture and the value is the appropriate portion of the matched string.
Perl has supported numbered captures for ages:
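The same lookup with a numbered capture, again as a sketch with assumed sample data:

```perl
use strict;
use warnings;

my $phone_number = qr/\(\d{3}\)\s?\d{3}-?\d{4}/;
my $contact_info = 'Office: (202) 456-1111';    # sample data (assumption)

my $found;
if ( $contact_info =~ /($phone_number)/ ) {
    $found = $1;    # the first (and only) capture group
    print "Found a number $found\n";
}
```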
This form of capture provides no identifying name and stores nothing in %+. Instead, Perl stores the captured substring in a series of magic variables. The first matching capture that Perl finds goes into $1, the second into $2, and so on. Capture counts start at the opening parenthesis of the capture; thus the first left parenthesis begins the capture into $1, the second into $2, and so on.
While the syntax for named captures is longer than for numbered captures, it provides additional clarity. Counting left parentheses is tedious work, and combining regexes which each contain numbered captures is far too difficult. Named captures improve regex maintainability--though name collisions are possible, they're relatively infrequent. Minimize the risk by using named captures only in top-level regexes.
In list context, a regex match returns a list of captured substrings:
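For example (a sketch; the sample string and variable names are assumptions):

```perl
use strict;
use warnings;

my $contact_info = 'Office: (202) 456-1111';    # sample data (assumption)

# One capture group yields a one-element list
my ($number) = $contact_info =~ /(\(\d{3}\)\s?\d{3}-?\d{4})/;

# Three capture groups yield three elements, in left-parenthesis order
my ( $area, $exchange, $line ) =
    $contact_info =~ /\((\d{3})\)\s?(\d{3})-?(\d{4})/;

print "Found a number $number\n";
print "Area $area, exchange $exchange, line $line\n";
```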
Numbered captures are also useful in simple substitutions, where named captures may be more verbose:
Previous examples have all applied quantifiers to simple atoms. You may apply them to any regex element:
If you expand the regex manually, the results may surprise you:
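A sketch of the difference, assuming two hypothetical fragments $pork and $beans. An interpolated qr// object behaves as a single group, so a quantifier after it applies to the whole fragment; in the hand-expanded version the quantifier applies only to the preceding atom:

```perl
use strict;
use warnings;

my $pork  = qr/pork/;
my $beans = qr/beans/;

# The ? applies to all of $pork: "pork" is optional as a unit
my $grouped = qr/\A$pork?$beans/;

# Expanded by hand, the ? applies only to the k
my $manual = qr/\Apork?beans/;

for my $dish (qw( porkbeans beans porbeans )) {
    printf "%-10s grouped: %d  manual: %d\n", $dish,
        ( $dish =~ $grouped ? 1 : 0 ),
        ( $dish =~ $manual  ? 1 : 0 );
}
```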
Sometimes specificity helps pattern accuracy:
Some regexes need to match either one thing or another. The alternation metacharacter (|) expresses this intent:
The alternation metacharacter indicates that either preceding fragment may match. Keep in mind that alternation has a lower precedence (precedence) than even atoms:
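A sketch showing that each side of the | is a complete alternative; the sample strings are assumptions:

```perl
use strict;
use warnings;

my $rice_or_beans = qr/rice|beans/;

my %matched;
for my $food (qw( rice beans ricb )) {
    $matched{$food} = $food =~ $rice_or_beans ? 1 : 0;
    print "$food: ", ( $matched{$food} ? 'match' : 'no match' ), "\n";
}
```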
While it's easy to interpret rice|beans as meaning ric, followed by either e or b, followed by eans, alternations always include the entire fragment to the nearest regex delimiter, whether the start or end of the pattern, an enclosing parenthesis, another alternation character, or a square bracket.
To reduce confusion, use named fragments in variables ($rice|$beans) or group alternation candidates in non-capturing groups:
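A sketch with a hypothetical $starches fragment, showing that the group limits the reach of the alternation and adds no capture of its own:

```perl
use strict;
use warnings;

my $starches = qr/(?:pasta|potatoes|rice)/;

# Without grouping, "pasta" alone satisfies the first alternative
my $ungrouped = 'pasta' =~ /pasta|potatoes|rice and beans/ ? 1 : 0;

# With grouping, " and beans" must follow whichever starch matched
my $grouped = 'pasta' =~ /$starches and beans/ ? 1 : 0;

# Only the explicit parentheses capture; (?:...) does not
my @captures = 'rice and beans' =~ /($starches) and (beans)/;

print "ungrouped: $ungrouped, grouped: $grouped\n";
print "captures: @captures\n";
```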
The (?:) sequence groups a series of atoms without making a capture.
To match a literal instance of a metacharacter, escape it with a backslash (\). You've seen this before, where \( refers to a single left parenthesis and \] refers to a single right square bracket. \. refers to a literal period character instead of the "match anything but an explicit newline character" atom.
You will likely need to escape the alternation metacharacter (|) as well as the end of line metacharacter ($) and the quantifiers (+, ?, *).
The metacharacter disabling characters (\Q and \E) disable metacharacter interpretation within their boundaries. This is especially useful when taking match text from a source you don't control when writing the program:
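A sketch using a hypothetical helper named find_literal_text; only the $literal_text name comes from the surrounding text:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $literal_text occurs verbatim in $text
sub find_literal_text {
    my ( $text, $literal_text ) = @_;

    # \Q ... \E quotes every metacharacter in the interpolated value
    return $text =~ /\Q$literal_text\E/ ? 1 : 0;
}

my $found = find_literal_text( 'log: ** ALERT ** disk full', '** ALERT **' );
print "Found the alert\n" if $found;
```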
The $literal_text argument can contain anything--the string ** ALERT **, for example. Within the fragment bounded by \Q and \E, Perl will interpret the regex as \*\* ALERT \*\* and attempt to match literal asterisk characters, rather than greedy quantifiers.
Regex anchors such as \A, \b, \B, and \Z are a form of regex assertion, which requires that the string meet some condition. These assertions do not match individual characters within the string. No matter what the string contains, the regex qr/\A/ will always match.
Zero-width assertions match a pattern. Most importantly, they do not consume the portion of the pattern that they match. For example, to find a cat on its own, you might use a word boundary assertion:
... but if you want to find a non-disastrous feline, you might use a zero-width negative look-ahead assertion:
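Both ideas as a sketch; the variable names and sample strings are assumptions:

```perl
use strict;
use warnings;

# Word boundary: "cat" must end a word
my $just_a_cat = qr/cat\b/;

# Negative look-ahead: "cat" must not be followed by "astrophe"
my $safe_feline = qr/cat(?!astrophe)/;

for my $text ( 'cat', 'catalog', 'catastrophe' ) {
    printf "%-12s boundary: %d  look-ahead: %d\n", $text,
        ( $text =~ $just_a_cat  ? 1 : 0 ),
        ( $text =~ $safe_feline ? 1 : 0 );
}
```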
The construct (?!...) matches the phrase cat only if the phrase astrophe does not immediately follow.
The zero-width positive look-ahead assertion:
... matches the phrase cat only if the phrase astrophe immediately follows. While a normal regular expression can accomplish the same thing, consider a regex to find all non-catastrophic words in the dictionary which start with cat:
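A sketch of both look-ahead forms. The variable names and the tiny stand-in dictionary are assumptions:

```perl
use strict;
use warnings;

# Positive look-ahead: "cat" only when "astrophe" follows
my $disastrous_feline = qr/cat(?=astrophe)/;

# Whole words starting with "cat", unless "astrophe" comes next
my $non_catastrophic = qr/\A(cat(?!astrophe).*)\Z/;

# A stand-in for the dictionary (assumed sample data)
my @dictionary = qw( cat catalog catastrophe category concat );
my @safe_words = grep { $_ =~ $non_catastrophic } @dictionary;

print "Safe words: @safe_words\n";
```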
The zero-width assertion consumes none of the source string, leaving the anchored fragment .*\Z to match. Otherwise, the capture would only capture the cat portion of the source string.
To assert that your feline never occurs at the start of a line, you might use a zero-width negative look-behind assertion. These assertions must have fixed sizes; you may not use quantifiers:
The construct (?<!...) contains the fixed-width pattern. You could also express that the cat must always occur immediately after a space character with a zero-width positive look-behind assertion:
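Both look-behind forms as a sketch; the variable names and sample strings are assumptions:

```perl
use strict;
use warnings;

# Negative look-behind: "cat" may not sit at the start of the string
my $middle_cat = qr/(?<!^)cat/;

# Positive look-behind: a whitespace character must precede "cat"
my $space_cat = qr/(?<=\s)cat/;

for my $text ( 'cat nap', 'my cat', 'mycat' ) {
    printf "%-8s middle: %d  after space: %d\n", $text,
        ( $text =~ $middle_cat ? 1 : 0 ),
        ( $text =~ $space_cat  ? 1 : 0 );
}
```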
The construct (?<=...) contains the fixed-width pattern. This approach can be useful when combining a global regex match with the \G assertion, but it's an advanced feature you likely won't use often.
A newer feature of Perl 5 regexes is the keep assertion \K. This zero-width positive look-behind assertion can have a variable length:
\K is surprisingly useful for certain substitutions which remove the end of a pattern:
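A sketch of \K in two substitutions; the sample strings are assumptions. Everything matched before \K is kept, and only what follows it is replaced:

```perl
use strict;
use warnings;

# Variable-length "look-behind": any run of whitespace, kept as-is
my $spacey_cat = qr/\s+\Kcat/;

my $pets = 'my   cat has been to space';
$pets =~ s/$spacey_cat/dog/;    # the spaces survive; only "cat" is replaced

# Remove the end of a pattern: keep "cat", replace the rest of the word
my $exclamation = 'This is a catastrophe!';
$exclamation =~ s/cat\K\w+!/./;

print "$pets\n";
print "$exclamation\n";
```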
Several modifiers change the behavior of the regular expression operators. These modifiers appear at the end of the match, substitution, and qr// operators. For example, to enable case-insensitive matching:
The first like() will fail, because the strings contain different letters. The second like() will pass, because the /i modifier causes the regex to ignore case distinctions. M and m are equivalent in the second regex due to the modifier.
You may also embed regex modifiers within a pattern:
The (?i) syntax enables case-insensitive matching only for its enclosing group: in this case, the named capture. You may use multiple modifiers with this form. Disable specific modifiers by preceding them with the minus character (-):
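A sketch of both forms. The patterns, capture names, and sample strings are assumptions chosen to make the scope of each embedded modifier visible:

```perl
use strict;
use warnings;

# (?i) applies only inside the named capture; " nap" stays case-sensitive
my $find_a_cat = qr/(?<feline>(?i)cat) nap/;

# The regex as a whole uses /i, but (?-i) switches it off inside the capture
my $find_a_rational = qr/(?<number>(?-i)Rat)ional/i;

print "CAT nap matches\n"         if 'CAT nap'  =~ $find_a_cat;
print "CAT NAP does not match\n"  if 'CAT NAP'  !~ $find_a_cat;
print "RatIONAL matches\n"        if 'RatIONAL' =~ $find_a_rational;
print "ratIONAL does not match\n" if 'ratIONAL' !~ $find_a_rational;
```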
The multiline modifier, /m, allows the ^ and $ anchors to match at any newline embedded within the string. The \A and \Z anchors are unaffected.
The /s modifier treats the source string as a single line such that the . metacharacter matches the newline character. Damian Conway suggests the mnemonic that /m modifies the behavior of multiple regex metacharacters, while /s modifies the behavior of a single regex metacharacter.
The /r modifier causes a substitution operation to return the result of the substitution, leaving the original string as-is. If the substitution succeeds, the result is a modified copy of the original. If the substitution fails (because the pattern does not match), the result is an unmodified copy of the original:
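A sketch with assumed sample strings; note that /r needs Perl 5.14 or newer:

```perl
use 5.014;    # s///r was added in Perl 5.14
use strict;
use warnings;

my $status = 'I am hungry for pie.';

# The pattern matches: a modified copy comes back
my $newstatus = $status =~ s/pie/cake/r;

# The pattern does not match: an unmodified copy comes back
my $statuscopy = $status =~ s/liver and onions/bratwurst/r;

print "$status\n$newstatus\n$statuscopy\n";
```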
The /x modifier allows you to embed additional whitespace and comments within patterns. With this modifier in effect, the regex engine ignores whitespace and comments. The results are often much more readable:
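As a sketch, here is a phone number pattern of the kind discussed earlier, laid out with /x; the layout and comments are illustrative:

```perl
use strict;
use warnings;

my $phone_number = qr{
    \( \d{3} \)    # area code in literal parentheses
    \s?            # optional whitespace
    \d{3}          # exchange
    -?             # optional hyphen
    \d{4}          # line number
}x;

print "Found a phone number\n" if 'Call (202) 456-1111' =~ $phone_number;
```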
This regex isn't simple, but comments and whitespace improve its readability. Even if you compose regexes together from compiled fragments, the /x modifier can still improve your code.
The /g modifier matches a regex globally throughout a string. This makes sense when used with a substitution:
When used with a match--not a substitution--the \G metacharacter allows you to process a string within a loop one chunk at a time. \G matches at the position where the most recent match ended. To process a poorly-encoded file full of American telephone numbers in logical chunks, you might write:
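A sketch that stands in for the file with a made-up string of two ten-digit numbers run together:

```perl
use strict;
use warnings;

# Two ten-digit numbers with no separators (made-up sample data)
my $contents = '20245611112125550100';

my @numbers;

# Each iteration resumes exactly where the previous match ended
while ( $contents =~ /\G(\d{3})(\d{3})(\d{4})/g ) {
    push @numbers, "($1) $2-$3";
}

print "$_\n" for @numbers;
```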
Be aware that the \G anchor picks up at the point in the string where the previous iteration of the match ended. If the previous match ended with a greedy match such as .*, the next match will have less available string to match. Lookahead assertions can also help.
The /e modifier allows you to write arbitrary Perl 5 code on the right side of a substitution operation. If the match succeeds, the regex engine will run the code, using its return value as the substitution value. The earlier global substitution example could be simpler with code like:
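A sketch of a global substitution driven by code. The names and the sample sentence are made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample text: a full name and a bare first name
my $text = 'Ann Lee wrote to Ann about Ann Lee.';

# The replacement is Perl code; $1 is defined only when the surname matched
$text =~ s{Ann( Lee)?}{ 'Bea' . ( defined $1 ? ' Kim' : '' ) }ge;

print "$text\n";
```

The parentheses around the conditional matter: the . operator binds more tightly than ?:, so without them the concatenation would happen before the test.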
Each additional occurrence of the /e modifier will cause another evaluation of the result of the expression, though only Perl golfers use anything beyond /ee.