- Ruby (RegExp class, https://ruby-doc.org/core-2.7.1/Regexp.html)
foo = "faczynski ma malego kiutka"
foo =~ /iutka/
- Java
String str = "some string";
if (str.matches("^some.*")) { ... } // matches() must match the whole string, so "^string" alone would never match here
- JavaScript
/^some/.test("some string")
/^some/.exec("some string")
- Python
import re
re.search('^some', "some string")
- Rust
use regex::Regex;
Regex::new("^some").unwrap().is_match("some string")
ruby-doc.org/core-2.7.1/Regexp.html
Excerpt from the docs:
/hay/ =~ 'haystack' #=> 0
/y/.match('haystack') #=> #<MatchData "y">
If a string contains the pattern it is said to match. A literal string matches itself.
Here 'haystack' does not contain the pattern 'needle', so it doesn't match:
/needle/.match('haystack') #=> nil
Here 'haystack' contains the pattern 'hay', so it matches:
/hay/.match('haystack') #=> #<MatchData "hay">
- validation (e.g. check if a user input is well-formed or meets the defined criteria)
- parsing (e.g. to catch all URL parameters, capture text, etc.)
- data scraping (like in web scraping, find all pages that contain a certain set of keywords)
- string replacement (e.g. when coding - to rename a method or a variable)
- other transformations (e.g. to translate one form of text, like application output, to another)
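A minimal Python sketch of three of these use cases. The postal-code rule, the query string and the method names are made up for illustration only.

```python
import re

# validation: hypothetical rule, a 5-digit postal code
is_valid = re.fullmatch(r"\d{5}", "90210") is not None
print(is_valid)   # True

# parsing: collect key=value pairs from a made-up query string
params = dict(re.findall(r"([^&=?]+)=([^&=]*)", "?page=2&sort=asc"))
print(params)     # {'page': '2', 'sort': 'asc'}

# string replacement: rename a hypothetical method, whole words only
renamed = re.sub(r"\bold_name\b", "new_name", "x.old_name(); old_name_2()")
print(renamed)    # x.new_name(); old_name_2()
```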
. - matches any single character (by default, any character except a line break)
\d - matches a single character that is a digit
\w - matches a word character (alphanumeric character or underscore)
\s - matches a whitespace character (includes tabs and line breaks)
Negations:
\D - is the negation of \d
\W - is the negation of \w
\S - is the negation of \s
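These classes can be tried out with Python's re module. The sample string is invented for the example.

```python
import re

sample = "Room 42_b\tis free"  # hypothetical sample text

digits = re.findall(r"\d", sample)     # every single digit
non_word = re.findall(r"\W", sample)   # everything that is not a letter, digit or underscore
words = re.findall(r"\w+", sample)     # runs of word characters
chunks = re.findall(r"\S+", sample)    # runs of non-whitespace characters
dots = re.findall(r".", "a\nb")        # . skips the line break by default

print(digits)    # ['4', '2']
print(non_word)  # [' ', '\t', ' ']
print(words)     # ['Room', '42_b', 'is', 'free']
print(chunks)    # ['Room', '42_b', 'is', 'free']
print(dots)      # ['a', 'b']
```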
abc* - matches a string that has ab followed by zero or more c's
abc+ - matches a string that has ab followed by one or more c's
abc? - matches a string that has ab followed by zero or one c's
abc{2} - matches a string that has ab followed by 2 c's
abc{2,} - matches a string that has ab followed by 2 or more c's
abc{2,5} - matches a string that has ab followed by 2 up to 5 c's
a(bc)* - matches a string that has a followed by zero or more copies of the sequence bc
a(bc){2,5} - matches a string that has a followed by 2 up to 5 copies of the sequence bc
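A small Python check of the quantifiers above. The helper name `matches` is introduced here for illustration; it uses re.fullmatch so the whole text has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"abc*", "ab"))          # True: zero c's are fine
print(matches(r"abc+", "ab"))          # False: at least one c is required
print(matches(r"abc?", "abcc"))        # False: at most one c is allowed
print(matches(r"abc{2}", "abcc"))      # True
print(matches(r"abc{2,5}", "abccc"))   # True
print(matches(r"a(bc)*", "abcbc"))     # True: the group bc repeats as a unit
print(matches(r"a(bc){2,5}", "abc"))   # False: only one copy of bc
```

Note that without parentheses the quantifier applies only to the single preceding character (the c), not to the whole abc.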
^The - matches any string that starts with The
end$ - matches a string that ends with end
^The end$ - exact string match (starts and ends with The end)
roar - matches any string that has the text roar in it
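The anchors can be checked in Python with re.search, which looks for the pattern anywhere in the text unless an anchor pins it down. The test strings are made up.

```python
import re

starts = re.search(r"^The", "The end") is not None
ends = re.search(r"end$", "The end") is not None
exact = re.search(r"^The end$", "The end.") is not None   # trailing dot breaks the exact match
contains = re.search(r"roar", "uproarious") is not None   # no anchors: match anywhere

print(starts)    # True
print(ends)      # True
print(exact)     # False
print(contains)  # True
```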
a(b|c) - matches a string that has a followed by b or c (and captures b or c)
a[bc] - same as previous, but without capturing b or c
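The difference between the two forms shows up in the captured groups. A short Python sketch with an invented input:

```python
import re

grouped = re.search(r"a(b|c)", "xac")
print(grouped.group(0))     # ac
print(grouped.groups())     # ('c',)

bracketed = re.search(r"a[bc]", "xac")
print(bracketed.group(0))   # ac
print(bracketed.groups())   # ()
```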
Example usages:
[abc] - matches a string that has either an a or a b or a c -> is the same as a|b|c
[a-c] - same as previous, but with range operator -
[a-fA-F0-9] - a string that represents a single hexadecimal digit, case insensitively
[0-9]% - a string that has a character from 0 to 9 before a % sign
Negation operator:
[^a-zA-Z] - a string that has a character that is not a letter from a to z or from A to Z. In this case the ^ is used as negation of the expression
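A quick Python run of the bracket expressions above; the input strings are arbitrary examples.

```python
import re

any_of = re.findall(r"[abc]", "cabbage")
hex_digits = re.findall(r"[a-fA-F0-9]", "0xBeeZ")
percents = re.findall(r"[0-9]%", "5% of 100%")
non_letters = re.findall(r"[^a-zA-Z]", "a1 b!")

print(any_of)       # ['c', 'a', 'b', 'b', 'a']
print(hex_digits)   # ['0', 'B', 'e', 'e']
print(percents)     # ['5%', '0%']
print(non_letters)  # ['1', ' ', '!']
```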
Most popular:
g - global - does not return after the first match, restarting the subsequent searches from the end of the previous match
m - multi-line - when enabled, ^ and $ will match the start and end of a line, instead of the whole string
i - insensitive - makes the whole expression case-insensitive (for instance /aBc/i would match AbC)
Flags are language-specific, e.g. in PHP the s flag makes . match line breaks as well (dot-all mode), while Ruby uses m for that.
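As an illustration of how language-specific flags are, here is a sketch in Python, which has no g flag at all: you pick re.search for the first match or re.findall for all of them, and the other flags are passed as constants.

```python
import re

text = "first line\nSecond line"

# "global": search() stops at the first match, findall() returns every match
first_digit = re.search(r"\d", "a1b2").group()
all_digits = re.findall(r"\d", "a1b2")

# multi-line: without re.MULTILINE, ^ only matches at the start of the string
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# case-insensitive
case_insensitive = re.search(r"aBc", "xAbCx", re.IGNORECASE) is not None

print(first_digit, all_digits)   # 1 ['1', '2']
print(starts_default)            # ['first']
print(starts_multiline)          # ['first', 'Second']
print(case_insensitive)          # True
```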
a(bc) - parentheses create a capturing group with value bc
a(?:bc)* - using ?: we disable the capturing group, so here the match object will not contain bc
a(?<foo>bc) - using ?<foo> we put a name to the group
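The three kinds of groups, sketched in Python. One caveat: Python's re module spells named groups (?P<foo>...) rather than (?<foo>...).

```python
import re

capturing = re.search(r"a(bc)", "abc")
print(capturing.groups())       # ('bc',)

non_capturing = re.search(r"a(?:bc)*", "abcbc")
print(non_capturing.group(0))   # abcbc
print(non_capturing.groups())   # ()

named = re.search(r"a(?P<foo>bc)", "abc")   # Python syntax for a named group
print(named.group("foo"))       # bc
print(named.groupdict())        # {'foo': 'bc'}
```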
The quantifiers (*, +, {}) are greedy operators - they expand the match as far as they can through the provided text.
E.g. <.+> matches <div>simple div</div> in This is a <div>simple div</div> test.
In order to catch only the div tag we can use a ? to make it lazy:
<.+?> - matches any character one or more times between < and >, expanding only as far as needed
https://regex101.com/r/cO8lqs/24
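The same greedy versus lazy comparison, run in Python on the sentence from the example:

```python
import re

text = "This is a <div>simple div</div> test"

greedy = re.findall(r"<.+>", text)   # runs to the last > it can reach
lazy = re.findall(r"<.+?>", text)    # stops at the first > it meets

print(greedy)  # ['<div>simple div</div>']
print(lazy)    # ['<div>', '</div>']
```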
\babc\b - performs a "whole words only" search (here it won't match the abc inside aabcd in "abc aabcd")
\Babc\B - matches only if the pattern is fully surrounded by word characters (here it won't match the standalone abc in "abc aabcd", only the abc inside aabcd)
([abc])\1 - using \1 it matches the same text that was matched by the first capturing group
([abc])([de])\2\1 - we can use \2 (\3, \4, etc.) to identify the same text that was matched by the second (third, fourth, etc.) capturing group
(?<foo>[abc])\k<foo> - we add the name foo to the group and we reference it later (\k<foo>). The result is the same as in the first regex.
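Back-references in Python, as a rough sketch. Numbered references work as shown above; for named ones Python's re module uses (?P<foo>...) and (?P=foo) instead of (?<foo>...) and \k<foo>.

```python
import re

doubled = re.search(r"([abc])\1", "xbbx")
print(doubled.group(0))      # bb

mirrored = re.search(r"([abc])([de])\2\1", "adda")
print(mirrored.group(0))     # adda

named = re.search(r"(?P<foo>[abc])(?P=foo)", "xccx")   # Python spelling of \k<foo>
print(named.group(0))        # cc

no_repeat = re.search(r"([abc])\1", "abc")   # no character is repeated
print(no_repeat)             # None
```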
d(?=r) - matches a d only if it is followed by r, but r will not be part of the overall regex match
(?<=r)d - matches a d only if it is preceded by an r, but r will not be part of the overall regex match
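A Python sketch of both look-arounds; the words are arbitrary. The span shows that only the d itself is part of the match.

```python
import re

ahead = re.search(r"d(?=r)", "adrift")
print(ahead.group(0), ahead.span())     # d (1, 2)
ahead_miss = re.search(r"d(?=r)", "adapt")
print(ahead_miss)                       # None: d is followed by a, not r

behind = re.search(r"(?<=r)d", "card")
print(behind.group(0), behind.span())   # d (3, 4)
behind_miss = re.search(r"(?<=r)d", "cad")
print(behind_miss)                      # None: d is preceded by a, not r
```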
- it ain't easy if you apply RFC 5322 (the successor to RFC 822) strictly... https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
- ...but can be quite simple if you're not paranoid
https://regex101.com/r/70ARRh/3
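For illustration, one possible "not paranoid" check in Python. The pattern below is an assumption made for this sketch (something, an @, something, a dot, something, with no spaces) and is not necessarily the one behind the regex101 link; `looks_like_email` is a hypothetical helper name.

```python
import re

# Assumed, deliberately loose pattern: local part, @, domain with at least one dot
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(value):
    """Rough sanity check only; it does not prove the address exists."""
    return SIMPLE_EMAIL.match(value) is not None

print(looks_like_email("jane.doe@example.com"))   # True
print(looks_like_email("jane.doe@example"))       # False: no dot in the domain
print(looks_like_email("jane doe@example.com"))   # False: contains a space
print(looks_like_email("@example.com"))           # False: empty local part
```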
- Webapp: validation of bank details provided by PAX https://github.com/AirHelp/ah-webapp/blob/88dae720285026a9465f41a3f05bd91abe5ebaeb/app/services/validate_free_bank_transfer_details.rb regexps taken from private Ruby gem https://github.com/AirHelp/ah-payments-reference-data/blob/8c35f6133b55e2064ae2d20313e4aac3438bc6f3/lib/ah/payments/reference/data/fields.json
- Midass: validation of bank details provided by PAX https://github.com/AirHelp/ah-midass/blob/89b20280f782e68a2445d2fe966012fa0558365c/app/services/dlocal/validate_bank_transfer_details.rb
- Skynet: parsing Boarding Pass scan from mobile app https://github.com/AIrHelp/ah-skynet/blob/e8afa9087b19dce5a4596eb41c9d24c073400233/app/services/parse_boarding_pass.rb
Your IDE will highlight matches, so you can test whether your regex works as expected.
Here we use 2 sed commands: substitute (s) and delete (d); the g flag applies the substitution to all occurrences in a line
# pipe the output of one gsed operation into the next until the desired effect is reached
# (a trailing | already continues the command; a backslash followed by a comment would break it)
gsed 's/|/,/g' db-output.txt | # replace pipes with commas in db-output.txt
gsed 's/\s\+,/,/g' | # replace spaces followed by a comma with a comma
gsed 's/,\s\+/,/g' | # replace a comma followed by spaces with a comma
gsed 's/^\s\+//' | # delete (replace with nothing) spaces at the beginning of a line
gsed '/--/d' # delete lines containing '--'
# or use multiple gsed commands joined with semicolons in one invocation
gsed 's/|/,/g; s/\s\+,/,/g; s/,\s\+/,/g; s/^\s\+//; /--/d' db-output.txt
# the same as above but split to multiple lines for readability
gsed -e 's/|/,/g' \
-e 's/\s\+,/,/g' \
-e 's/,\s\+/,/g' \
-e 's/^\s\+//' \
-e '/--/d' \
db-output.txt
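For comparison, roughly the same cleanup can be sketched in Python. The function name `clean_db_output` and the sample table are made up for this example; it mirrors the gsed steps above rather than replacing them.

```python
import re

def clean_db_output(text):
    """Apply the same substitutions and deletion as the gsed commands, line by line."""
    cleaned = []
    for line in text.splitlines():
        line = line.replace("|", ",")        # s/|/,/g
        line = re.sub(r"\s+,", ",", line)    # s/\s\+,/,/g
        line = re.sub(r",\s+", ",", line)    # s/,\s\+/,/g
        line = re.sub(r"^\s+", "", line)     # s/^\s\+//
        if "--" in line:                     # /--/d
            continue
        cleaned.append(line)
    return "\n".join(cleaned)

sample = " id | name\n----+------\n  1 | Ann\n"   # hypothetical database output
result = clean_db_output(sample)
print(result)
# id,name
# 1,Ann
```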
- GNU sed & BSD/POSIX sed differ
Rule of thumb: use the modern one, i.e. GNU sed
brew install gsed # on OSX
- there are some syntactic differences between programming-language regexes and GNU sed regexes
- sed uses POSIX syntax (basic regular expressions) by default, so some escape sequences (e.g. \|, \+, \?) are not defined by POSIX and work only as GNU extensions; see "Regex syntax clashes" at http://www.gnu.org/software/sed/manual/sed.html
- some macros/character classes don't work in sed, e.g. \d (see https://stackoverflow.com/questions/14671293/why-doesnt-d-work-in-regular-expressions-in-sed)
- try to keep it simple (for better performance & understandability)
- sometimes it's easier to use the or operator (|) than to create a more general regular expression
- read the language documentation (especially when re-using a regexp written in one programming language in a different language)
- The Stack Overflow Regular Expressions FAQ https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075
- MDN Regular Expressions (JavaScript) https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
- Lots of examples with explanations (kudos to Jonny Fox, I used lots of them in this presentation) https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
- Examples in various programming languages