Notes on regular expressions

Flavors and features of regular expressions

Also see this document

BRE = POSIX Basic Regular Expressions, ERE = POSIX Extended Regular Expressions, PCRE = Perl Compatible Regular Expressions. "Escaped" means the operator must be preceded by a backslash.

Feature                           POSIX BRE             POSIX ERE   grep                  egrep       PCRE
* ^ $ [ ]                         Yes                   Yes         Yes                   Yes         Yes
? + |                             No                    Yes         Yes, escaped: \|      Yes         Yes
Matching/capture groups           Yes: \(...\)          Yes: (...)  Yes: \(...\)          Yes: (...)  Yes: (...)
{ }                               Yes, escaped: \{ \}   Yes         Yes, escaped: \{ \}   Yes         Yes
\b \B (word boundaries)           No                    No          No                    Yes         Yes
\w \W (alphanumeric, without _)   No                    No          Yes                   Yes         Yes (but _ is matched too)

Also, POSIX Regular Expressions and grep/egrep always return the longest match, while PCRE also allows shortest match (ungreedy/lazy quantifiers).
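
To make the longest-match versus shortest-match distinction concrete, here is a minimal sketch in Python. Python's re module is not PCRE itself, but it is assumed here as a stand-in because it accepts the same lazy-quantifier syntax (a ? after the quantifier). The sample string is made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.*>", text).group(0)

# Lazy: .*? takes as little as possible, so the match stops at the first '>'.
lazy = re.search(r"<.*?>", text).group(0)

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```

With a POSIX engine only the first behavior is available, so the usual workaround is a negated bracket expression such as <[^>]*> instead of a lazy quantifier.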

POSIX Regular Expressions and grep/egrep do not accept escape sequences for control characters, such as \n, \r, \t, \f and \v.

See here for the kinds of regular expressions that the GNU core utilities (grep, find, awk, etc.) accept.

Special characters

Characters  Meaning
.           Any single character except the newline (\n) (and the carriage return \r if you work on Windows/DOS).
            To match any single character, including line breaks, use [\s\S].
[ ]         A single character that is contained within the brackets.
[^ ]        Any single character that is not contained within the brackets.
^           Beginning position of the string.
$           Ending position of the string.
( )         Matching/capture group.
\n          Refer to the n-th matching/capture group, where n is a digit (\1, \2, ...).
*           Match the preceding element 0 or more times.
+           Match the preceding element 1 or more times.
?           Match the preceding element at most 1 time.
{m}         Match the preceding element exactly m times.
{m,n}       Match the preceding element at least m and no more than n times.
            n is optional: {m,} means at least m times.
|           Match either the expression before it or the expression after it (alternation).
\xf0        Match a character by its hexadecimal code (in this case f0).
\021        Match a character by its octal code (in this case 21).
\b          Match a "word" boundary.
\B          Match a non-"word" boundary.
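
A few of the entries above are easier to see in action. The following is a small sketch using Python's re module, chosen here only as a convenient Perl-style engine; the sample strings are invented, and other flavors may differ in details such as how the dot treats \r.

```python
import re

# The dot stops at a newline by default, while [\s\S] does not.
two_lines = "first\nsecond"
dot_match = re.match(r".*", two_lines).group(0)
any_match = re.match(r"[\s\S]*", two_lines).group(0)

# \1 refers back to whatever the first capture group matched.
doubled = re.search(r"\b(\w+) \1\b", "it is is a typo").group(1)

# {m,n} bounds the repetition count; {m,} leaves the upper bound open.
bounded = re.findall(r"a{2,3}", "a aa aaa aaaa")
open_ended = re.findall(r"a{2,}", "a aa aaa aaaa")

# \xf0 and \021 match a character by its hexadecimal or octal code.
hex_hit = re.search(r"\xf0", "\u00f0") is not None
octal_hit = re.search(r"\021", "\x11") is not None

print(repr(dot_match))   # 'first'
print(repr(any_match))   # 'first\nsecond'
print(doubled)           # is
print(bounded)           # ['aa', 'aaa', 'aaa']
print(open_ended)        # ['aa', 'aaa', 'aaaa']
```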

Character classes

POSIX        Perl  ASCII equivalent  Meaning
[[:alnum:]]        [A-Za-z0-9]       Alphanumeric characters
             \w    [A-Za-z0-9_]      Alphanumeric characters plus _
             \W    [^A-Za-z0-9_]     Non-word characters
[[:alpha:]]        [A-Za-z]          Letters
[[:digit:]]  \d    [0-9]             Digits
             \D    [^0-9]            Non-digits
[[:blank:]]        [ \t]             Space and tab
[[:space:]]  \s    [ \t\r\n\v\f]     Whitespace
             \S    [^ \t\r\n\v\f]    Non-whitespace
[[:upper:]]        [A-Z]             Uppercase letters
[[:lower:]]        [a-z]             Lowercase letters
[[:punct:]]                          Punctuation characters
[[:print:]]                          Printable characters
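
As a rough check of the Perl shorthands against their bracket equivalents, here is a sketch in Python. Two assumptions apply: Python's re module does not support the POSIX [[:name:]] classes, so only the Perl column is exercised, and the re.ASCII flag is passed because Python's shorthands otherwise also match non-ASCII letters, digits and whitespace. The sample string is made up.

```python
import re

sample = "user_42 logged in\tat 10:30\n"

# With re.ASCII, each shorthand should behave like its bracket expression.
pairs = [
    (r"\w", r"[A-Za-z0-9_]"),
    (r"\W", r"[^A-Za-z0-9_]"),
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\s", r"[ \t\r\n\v\f]"),
    (r"\S", r"[^ \t\r\n\v\f]"),
]
for shorthand, explicit in pairs:
    assert re.findall(shorthand, sample, re.ASCII) == re.findall(explicit, sample)

# \w includes the underscore, so the whole identifier is a single match.
words = re.findall(r"\w+", sample, re.ASCII)
print(words)  # ['user_42', 'logged', 'in', 'at', '10', '30']
```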
