Notes on awk

GNU awk manual

Converting awk scripts to Perl scripts with a2p

How awk works

awk is basically the UNIX cut utility on steroids.

awk reads in the input line by line (a line is called a "record" in awk parlance), and for each line, tokenizes it into "fields" (the default delimiter is whitespace, but this can be changed), and runs the script on these fields.
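
The record/field model can be seen with a small experiment. This is only a sketch: the two input lines are made up, and the variable name fields is just for illustration. NR and NF are awk's built-in record and field counters.

```shell
# Two records; awk splits each one into whitespace-separated fields.
fields=$(printf 'alice 30 london\nbob 25 paris\n' |
  awk '{ print "record " NR ": " NF " fields, first=" $1 ", last=" $NF }')
printf '%s\n' "$fields"
# record 1: 3 fields, first=alice, last=london
# record 2: 3 fields, first=bob, last=paris
```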

Invoking awk

awk 'script' [list of files]
Run the script on the listed files.

Note: quote the script with single quotes so that the shell does not expand $0, $1, etc. before awk sees them.

awk -Fdelim 'script' [list of files]
Same as above, except that delim is used as the delimiter when tokenizing each line.

awk -f scriptFile [list of files]
Run the script stored in scriptFile on the listed files.
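
As a rough illustration of the invocation forms, the sketch below runs the same one-line script inline and from a file. The colon-separated input and the temporary script file are assumptions made for the example.

```shell
# Inline script with a custom field delimiter (":").
inline=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: '{ print $1 }')

# The same script stored in a temporary file and run with -f.
script_file=$(mktemp)
printf '{ print $1 }\n' > "$script_file"
from_file=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: -f "$script_file")
rm -f "$script_file"

printf '%s\n' "$inline" "$from_file"
```

Both invocations should print the first colon-separated field of each line.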

Basic syntax

An awk script consists of a series of the following:
    pattern  { procedure }
Both pattern and { procedure } are optional.

If pattern is omitted, awk applies procedure to all lines.

If { procedure } is omitted, awk prints the lines which match pattern; that is, awk behaves as if { procedure } were

    { print $0 }
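
A minimal sketch of both omissions, using made-up input (the variable names are only for illustration):

```shell
input='apple 3
banana 0
cherry 7'

# Pattern only: lines whose second field is positive are printed unchanged.
pattern_only=$(printf '%s\n' "$input" | awk '$2 > 0')

# Procedure only: the procedure runs on every line.
procedure_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

printf '%s\n' "$pattern_only" "$procedure_only"
```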

pattern syntax

pattern can take the following formats:

begin,end
    Process lines within the range, which starts at the first line that matches begin and runs through the next line that matches end (both lines included). The range can match again later in the input.

    begin and end are awk patterns too; usually each is either an expression or a /regex/.

/regex/
    Process lines which match the regular expression regex.

    Strictly speaking, this format is just shorthand for the following awk expression:

        $0 ~ /regex/

expression
    Process lines for which expression evaluates to true (non-zero or non-empty).

BEGIN
    Run procedure before the first line is read.

END
    Run procedure after the last line is read.
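
The pattern formats can be tried side by side. The sketch below uses invented input with START and STOP marker lines; none of these names come from awk itself.

```shell
input='header
START
a 1
b 2
STOP
footer'

# Range pattern: from a line matching /START/ through the next line matching /STOP/.
range=$(printf '%s\n' "$input" | awk '/START/,/STOP/')

# Regex pattern and its explicit equivalent.
short=$(printf '%s\n' "$input" | awk '/^[ab] /')
long=$(printf '%s\n' "$input" | awk '$0 ~ /^[ab] /')

# Expression pattern combined with BEGIN and END.
summary=$(printf '%s\n' "$input" |
  awk 'BEGIN { print "begin" } NF == 2 { sum += $2 } END { print "sum=" sum }')

printf '%s\n' "$range" "$short" "$summary"
```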

procedure syntax

A procedure consists of statements, which have a C-like syntax.

For example, variables are assigned and referenced in the same way as in C; there is no need to use the dollar sign $ as in many scripting languages (except for $0, $1, ..., which have special meanings).

Regular expressions, when used in conjunction with the ~ or !~ operators or the sub/gsub/gensub/match functions, are delimited by slashes /, as in Perl.
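
A short sketch of the C-like statement syntax, with plain variables and slash-delimited regular expressions. The log-style input and the variable names errors and others are assumptions for the example; note that the no-match operator is written !~.

```shell
result=$(printf 'error disk full\ninfo ok\nerror net down\n' | awk '
{
    if ($1 ~ /^err/)
        errors++                # plain variable, no "$" prefix
    for (i = 1; i <= NF; i++)
        if ($i !~ /^e/)         # fields that do not start with "e"
            others++
}
END { print "errors=" errors, "others=" others }')

printf '%s\n' "$result"
# errors=2 others=6
```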

Operators

Most of the operators are similar to those in C. In addition, awk accepts

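~
!~
    Match / do not match a regular expression.

/
    Floating-point division.

(space)
    String concatenation.

    Concatenation has lower precedence than the arithmetic operators, so put parentheses around it to avoid surprises.

**
^
    Exponentiation. POSIX awk specifies only ^; ** is an extension in some implementations such as GNU awk.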
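
The awk-specific operators can be checked quickly in a BEGIN block, which needs no input. This is a sketch with arbitrary values; it uses ^ rather than ** because ^ is the form specified by POSIX, and it writes the no-match operator as !~.

```shell
ops=$(awk 'BEGIN {
    a = "foo"; b = "bar"
    print (a b)                 # concatenation: foobar
    print 7 / 2                 # floating-point division: 3.5
    print 2 ^ 10                # exponentiation: 1024
    print ("foobar" ~ /oba/)    # match: 1
    print ("foobar" !~ /oba/)   # no-match test: 0
    print 1 " " 2 + 3           # "+" binds tighter than concatenation: 1 5
}')

printf '%s\n' "$ops"
```

The last line shows why parentheses around concatenation are worth the extra typing: without them, the addition is performed first.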

Since awk is a weakly typed language, it is worth checking what a variable holds before doing arithmetic on it. To check whether a variable is a number, use

if ((v+0)==v) ...
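
Applied to input fields, the check behaves as follows. This is a sketch with made-up input values; the labels "number" and "not a number" are only for display.

```shell
check=$(printf '42\nabc\n3.5\n4x\n' |
  awk '{ print $1, (($1 + 0) == $1 ? "number" : "not a number") }')

printf '%s\n' "$check"
# 42 number
# abc not a number
# 3.5 number
# 4x not a number
```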

Special variables

$n
    n-th word ("field") of the current line.

    Note that n can be an awk variable, e.g.

        awk '{ i = 1; print $i }' foo

$0
    The entire current line.

NF
    Number of words ("fields") in the current line.

NR
    Current line number, counted across all input files.

FS
OFS
    Input/output field delimiter.

    The default for both is a single space; for FS this means that fields are separated by runs of whitespace.

RS
ORS
    Input/output line delimiter.

    The default is newline.

FILENAME
    The name of the current input file.
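
A small sketch of several special variables working together. The colon-separated input and the choice of "-" as the output delimiter are assumptions for the example.

```shell
vars=$(printf 'a:b:c\nd:e\n' | awk 'BEGIN { FS = ":"; OFS = "-" }
{
    n = NF                  # n is an ordinary variable holding the field count
    print NR, NF, $n, $0    # items are joined with OFS; $n is the last field
}')

printf '%s\n' "$vars"
# 1-3-c-a:b:c
# 2-2-e-d:e
```

$0 still shows the original colons because no field was assigned to, so awk did not rebuild the line with OFS.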

Built-in functions

print item1, item2, ...
    Display item1, item2, ... (separated by a space, or by whatever the special variable OFS specifies) followed by a newline.

printf format, item1, item2, ...
    Formatted print.

sprintf(format, item1, item2, ...)
    Return the formatted string.

atan2 cos exp
log sin sqrt
    Arithmetic functions.

and or xor
compl lshift rshift
    (GNU awk) Bitwise functions.

rand srand
    Random number generator.

int
    Truncate a number toward 0.

[g]sub(regex,replacement,target)
    Replace the first (sub) or all (gsub) occurrences of regex with replacement in the target string. target is modified in place; the return value is the number of replacements made.

    If target is omitted, it defaults to $0.

gensub(regex,replacement,n,target)
    (GNU awk) Replace the n-th occurrence of regex with replacement in the target string, or all occurrences if n is "g". Unlike sub and gsub, target is left unchanged and the result is returned.

    If target is omitted, it defaults to $0.

match(target,regex)
    Get the beginning position of the longest, leftmost substring which matches regex, or 0 if there is no match. Also sets the variables RSTART and RLENGTH.

index(haystack,needle)
    Get the position of needle in haystack, or 0 if it is not found.

length(string)
    Get the string length.

substr(string,m,n)
    Get the substring starting at position m (counting from 1) with length n.

    n is optional; if omitted, the substring runs to the end of string.

tolower toupper
    Convert to lower/upper case.

split(string,array,delim)
    Tokenize the string using the delimiter delim and put the result in array. Returns the size of array.

    If delim is omitted, the value of the special variable FS is used.

asort(src,dest)
    (GNU awk) Sort the values of the src array and put the result in the dest array. Returns the number of elements in src.

    dest is optional; if omitted, src is sorted in place.

    Set the special variable IGNORECASE to 1 to enable case-insensitive sorting.

system(command)
    Execute command and return its exit status.

systime()
    (GNU awk) Get the current UNIX timestamp.

mktime(datespec)
    (GNU awk) Convert datespec, a string of the form "YYYY MM DD HH MM SS", to a UNIX timestamp.

    See the GNU awk manual for details.

strftime(format,timestamp)
    (GNU awk) Format timestamp as a string according to format.

    See the GNU awk manual for details.
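
The string functions are easiest to understand by running them. The sketch below sticks to functions that do not require GNU awk; the sample strings and variable names are made up for the example.

```shell
funcs=$(awk 'BEGIN {
    s = "hello world"
    print length(s)                  # 11
    print index(s, "world")          # 7 (positions count from 1)
    print substr(s, 1, 5)            # hello
    print toupper(s)                 # HELLO WORLD

    pos = match(s, /o+/)             # also sets RSTART and RLENGTH
    print pos, RSTART, RLENGTH       # 5 5 1

    t = s
    n = gsub(/o/, "0", t)            # edits t in place, returns the count
    print n, t                       # 2 hell0 w0rld

    u = s
    sub(/l+/, "L", u)                # first match only
    print u                          # heLo world

    count = split("a,b,c", parts, ",")
    print count, parts[1], parts[3]  # 3 a c

    print sprintf("%05.1f|%-3s|", 3.14159, "x")   # 003.1|x  |
    print int(-3.7)                  # -3
}')

printf '%s\n' "$funcs"
```

Copying s into t and u before calling gsub and sub keeps the original string intact, since both functions modify their target.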