clu2's notes: Notes on regular expressions

Flavors and features of regular expressions

Also see this document

	POSIX Basic Regular Expressions (BRE)	POSIX Extended Regular Expressions (ERE)	grep	egrep	Perl Compatible Regular Expressions (PCRE)
`* ^ $ [ ]`	Yes	Yes	Yes	Yes	Yes
`? + |`	No	Yes	Yes, but need to add `\`, e.g. `\|`	Yes	Yes
Matching/capture groups	Yes: `\(...\)`	Yes: `(...)`	Yes: `\(...\)`	Yes: `(...)`	Yes: `(...)`
`{ }`	Yes, but need to add `\`, e.g. `\{ \}`	Yes	Yes, but need to add `\`, e.g. `\{ \}`	Yes	Yes
`\b \B` (Word boundaries)	No	No	Yes (GNU grep)	Yes	Yes
`\w \W` (Word characters: alphanumeric characters plus `_`)	No	No	Yes	Yes	Yes

Also, POSIX Regular Expressions and grep/egrep always return the longest match, while PCRE also supports the shortest match through ungreedy (lazy) quantifiers such as `*?` and `+?`.
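
As a rough illustration of the longest versus shortest match, here is a small sketch using Python's `re` module. Python is only assumed here as a convenient Perl-style engine (it is not PCRE itself), and the sample string is made up.

```python
import re

# Hypothetical sample text with two tag pairs.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: `.*` takes as much as it can, so the match runs to the last `>`.
greedy = re.search(r"<.*>", text).group(0)

# Lazy: `.*?` takes as little as it can, so the match stops at the first `>`.
lazy = re.search(r"<.*?>", text).group(0)

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```

With POSIX tools only the first behaviour is available; a common workaround is a negated bracket expression such as `<[^>]*>`.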

POSIX Regular Expressions and grep/egrep do not recognize escape sequences such as `\n`, `\r`, `\t`, `\f`, `\v`, etc.

See here for the kinds of regular expressions GNU core utilities (grep, find, awk, etc) accept

Special characters

Characters	Meaning
`.`	Any single character except the newline (`\n`) (and the carriage return `\r` if you work on Windows/DOS). To match any single character including newlines in PCRE, use `[\s\S]`.
`[ ]`	A single character that is contained within the brackets.
`[^ ]`	Any single character that is not contained within the brackets.
`^`	Beginning position of the string.
`$`	Ending position of the string.
`( )`	Matching/capture group.
`\n`	Refer back to the `n`-th matching/capture group, where `n` is a digit (e.g. `\1`).
`*`	Match the preceding element 0 or more times.
`+`	Match the preceding element 1 or more times.
`?`	Match the preceding element at most 1 time.
`{m}`	Match the preceding element exactly `m` times.
`{m,n}`	Match the preceding element at least `m` and no more than `n` times. `n` is optional: `{m,}` means at least `m` times.
`|`	Match either the expression before it or the expression after it (alternation).
`\xf0`	Match hex character (in this case `f0` hexadecimal)
`\021`	Match octal character (in this case `21` octal)
`\b`	Match "word" boundary
`\B`	Match non-"word" boundary
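
A few of the entries above can be tried out with the sketch below. It uses Python's `re` module, which is an assumption made for illustration (a Perl-style engine, not POSIX and not PCRE itself); all sample strings are invented.

```python
import re

# `.` does not match a newline by default, while `[\s\S]` does.
dot_match = re.search(r"a.b", "a\nb")
any_match = re.search(r"a[\s\S]b", "a\nb")

# `( )` captures a group and `\1` refers back to it: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# `{m,n}` bounds repetition: keep strings of 2 to 3 digits, anchored by ^ and $.
candidates = ["7", "42", "123", "1234"]
bounded = [s for s in candidates if re.search(r"^[0-9]{2,3}$", s)]

# `|` is alternation; `\b` keeps "cat" from matching inside "concatenate".
animals = re.findall(r"\b(cat|dog)\b", "cat concatenate dog dogma")

# `\x41` is the character with hex code 41, i.e. "A".
hex_match = re.search(r"\x41", "ABC").group(0)

print(dot_match, any_match is not None)  # None True
print(doubled)    # is
print(bounded)    # ['42', '123']
print(animals)    # ['cat', 'dog']
print(hex_match)  # A
```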

Character classes

POSIX	Perl	Meaning
`[[:alnum:]]`		`[A-Za-z0-9]` Alphanumeric characters
	`\w`	`[A-Za-z0-9_]` Alphanumeric characters plus `_`
	`\W`	`[^A-Za-z0-9_]` Non-word characters
`[[:alpha:]]`		`[A-Za-z]` Letters
`[[:digit:]]`	`\d`	`[0-9]` Digits
	`\D`	`[^0-9]` Non-digits
`[[:blank:]]`		`[ \t]` Space and tab
`[[:space:]]`	`\s`	`[ \t\r\n\v\f]` Whitespace
	`\S`	`[^ \t\r\n\v\f]` Non-whitespace
`[[:upper:]]`		`[A-Z]` Upper case letters
`[[:lower:]]`		`[a-z]` Lower case letters
`[[:punct:]]`		Punctuation characters
`[[:print:]]`		Printable characters
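
The equivalences in the Perl column can be checked with a short sketch. It assumes Python's `re` module with the `re.ASCII` flag, so that the shorthands are limited to ASCII as in the table. Python's `re` does not implement the POSIX `[[:name:]]` classes, so only the Perl shorthands are exercised, and the sample string is made up.

```python
import re

# Hypothetical sample containing a tab, an underscore, digits and punctuation.
sample = "Tab\there, under_score, 42!"

# Each shorthand paired with the explicit bracket expression from the table.
pairs = [
    (r"\w", r"[A-Za-z0-9_]"),
    (r"\W", r"[^A-Za-z0-9_]"),
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\s", r"[ \t\r\n\v\f]"),
    (r"\S", r"[^ \t\r\n\v\f]"),
]

# True if every shorthand finds the same characters as its bracket expression.
all_equal = all(
    re.findall(shorthand, sample, re.ASCII) == re.findall(explicit, sample, re.ASCII)
    for shorthand, explicit in pairs
)

digits = re.findall(r"\d+", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)

print(all_equal)  # True
print(digits)     # ['42']
print(words)      # ['Tab', 'here', 'under_score', '42']
```

Without `re.ASCII`, Python's `\w`, `\d` and `\s` also match non-ASCII Unicode characters, so the bracket expressions would no longer be exact equivalents.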