clu2's notes: Notes on awk

GNU awk manual

Converting awk scripts to Perl scripts with a2p

See here

How awk works

awk is basically the UNIX cut utility on steroids.

awk reads in the input line by line (a line is called a "record" in awk parlance), and for each line, tokenizes it into "fields" (the default delimiter is whitespace, but this can be changed), and runs the script on these fields.
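As a rough illustration of records and fields, the sketch below feeds two made-up lines to awk and prints the field count and the second field of each record. The `sample` function and its data are assumptions for illustration only.

```shell
# Hypothetical sample input: two records with whitespace-separated fields.
sample() { printf 'alice 30 london\nbob 25\n'; }

# For every record, print the number of fields (NF) and the second field.
sample | awk '{ print NF, $2 }'
# 3 30
# 2 25
```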

Invoking awk

awk 'script' [list of files]	Run the `script` on `list of files`. Quote the `script` in single quotes so that the shell does not expand `$` and other special characters.
awk -Fdelim 'script' [list of files]	Same as above, except that `delim` is used as the delimiter for tokenizing a line.
awk -f scriptFile [list of files]	Run the `scriptFile` on `list of files`
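A minimal sketch of the `-F` form, assuming colon-delimited input in the style of /etc/passwd. The `passwd_sample` function and its contents are made up for the example.

```shell
# Hypothetical colon-delimited input.
passwd_sample() { printf 'root:x:0:0\nclu2:x:1000:1000\n'; }

# -F: makes ":" the field delimiter; print the first and third fields.
passwd_sample | awk -F: '{ print $1, $3 }'
# root 0
# clu2 1000
```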

Basic syntax

An awk script consists of a series of the following

    pattern  { procedure }

Both pattern and { procedure } are optional.

If pattern is omitted, awk applies procedure to all lines.

If { procedure } is omitted, awk prints the lines that match pattern, as if { procedure } were

    { print $0 }
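The two defaults can be seen side by side in the sketch below. The `nums` helper is a hypothetical stand-in for any input.

```shell
# Hypothetical input: one number per line.
nums() { printf '1\n2\n3\n4\n'; }

# Pattern only: matching lines are printed unchanged.
nums | awk '$1 > 2'
# 3
# 4

# Procedure only: the procedure runs on every line.
nums | awk '{ print $1 * 10 }'
# 10
# 20
# 30
# 40
```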

`pattern` syntax

pattern can take the following formats:

Format	Meaning
`begin,end`	Process lines within the range, which starts at a line that matches `begin` and runs through the next line that matches `end`, both inclusive. `begin` and `end` are awk `pattern`s too, usually of the form `expression` or `/regex/`.
`/regex/`	Process lines which match the regular expression `regex`. Strictly speaking, this format is just a shorthand for the following awk `expression`: $0 ~ /regex/
`expression`	Process lines for which `expression` evaluates to true (a non-zero number or a non-empty string).
`BEGIN`	Run `procedure` before the first line is read.
`END`	Run `procedure` after the last line is read.
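The pattern formats in the table can be tried out with a sketch like the following. The `log` function and its markers START and STOP are invented for the example.

```shell
# Hypothetical input with a marked section in the middle.
log() { printf 'intro\nSTART\na 1\nb 2\nSTOP\noutro\n'; }

# Range pattern: from the line matching /START/ through the line matching /STOP/.
log | awk '/START/,/STOP/'
# START
# a 1
# b 2
# STOP

# BEGIN and END blocks around an expression pattern.
log | awk 'BEGIN { print "begin" } NF == 2 { sum += $2 } END { print "sum", sum }'
# begin
# sum 3
```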

`procedure` syntax

A procedure consists of statements, which have a C-like syntax.

For example, variables are assigned and referred to in the same way as in C; there is no need to use the dollar sign $ as in many scripting languages (except $0, $1, ... which have special meaning.)

Regular expressions, when used in conjunction with the ~ or !~ operators or the sub/gsub/gensub/match functions, are usually written between slashes /, as in Perl.
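A small sketch of the C-like syntax, with a plain variable, an if/else, and a regex match. The `words` helper and the variable names `count` and `other` are assumptions for the example.

```shell
# Hypothetical input: one word per line.
words() { printf 'apple\nbanana\navocado\ncherry\n'; }

words | awk '{
    if ($0 ~ /^a/) {            # regex between slashes, matched with ~
        count++                 # plain variable, no $ prefix, starts at 0
    } else {
        other = other " " $0    # uninitialized strings start out empty
    }
}
END { printf "%d start with a; others:%s\n", count, other }'
# 2 start with a; others: banana cherry
```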

Operators

Most of the operators are similar to those in C. In addition, awk accepts

`~` `!~`	Match/Don't match a regular expression
`/`	Floating-point division.
`(space)`	String concatenation. Parentheses should be used around concatenation to avoid operator-precedence surprises
`**` `^`	Exponentiation (`^` is the portable form; `**` is an extension accepted by GNU awk)
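The awk-specific operators can be checked quickly from a BEGIN block, which needs no input. The wrapper name `ops_demo` is hypothetical and is there only to make the snippet reusable.

```shell
ops_demo() {
    awk 'BEGIN {
        print 7 / 2             # floating-point division
        print 2 ^ 10            # exponentiation
        s = ("foo" "bar")       # concatenation by juxtaposition
        print s
        print ("abc" !~ /^b/)   # 1 means "abc" does not match /^b/
    }'
}

ops_demo
# 3.5
# 1024
# foobar
# 1
```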

Since awk is a weakly-typed language, it is important to check the type of a variable before doing any operation on it. To check if a variable is a number or not, use

    if ((v+0)==v) ...
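A sketch of this check applied to input fields, where awk treats numeric-looking text as a number. The `is_num` wrapper and the sample values are assumptions for the example.

```shell
# Classify the first field of each line using the (v+0)==v idiom.
is_num() {
    awk '{ if (($1 + 0) == $1) print $1, "number"; else print $1, "not a number" }'
}

printf '42\n3.14\nabc\n' | is_num
# 42 number
# 3.14 number
# abc not a number
```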

Special variables

`$n`	`n`-th word ("field") of the current line. Note that `n` can be an awk variable or expression, e.g. awk '{ i=1; print $i }' foo
`$0`	The entire current line.
`NF`	Number of words ("fields") in the current line.
`NR`	Current line number.
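`FS` `OFS`	Input/Output field delimiter. The default for both is a single space; for `FS` this means that fields are split on any run of whitespace.
`RS` `ORS`	Input/Output line ("record") delimiter. The default is newline.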
`FILENAME`	The input file name.
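A minimal sketch that exercises several of these variables at once. The `table` helper and the choice of "-" as the output delimiter are arbitrary.

```shell
# Hypothetical input with a varying number of fields per line.
table() { printf 'a b c\nd e\n'; }

# NR is the line number, NF the field count, $NF the last field.
# Setting OFS in BEGIN changes the delimiter that print puts between items.
table | awk 'BEGIN { OFS = "-" } { print NR, NF, $NF }'
# 1-3-c
# 2-2-e
```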

Built-in functions

`print item1, item2, ...`	Display `item1`, `item2` ... (separated by space or whatever that is specified by the special variable `OFS`) followed by a newline.
`printf format, item1, item2, ...`	Formatted `print`
`sprintf(format, item1, item2, ...)`	Get the formatted string
`atan2 cos exp log sin sqrt`	Arithmetic functions
`and or xor compl lshift rshift`	(GNU awk) Bitwise functions
`rand srand`	Random number generator
`int`	Truncate a number toward 0
`[g]sub(regex,replacement,target)`	Replace the first (`sub`) or all (`gsub`) occurrences of `regex` with `replacement` in the `target` string. If `target` is omitted, defaults to `$0`
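`gensub(regex,replacement,n,target)`	(GNU awk) Replace the `n`-th occurrence of `regex` (or all occurrences if `n` is "g") with `replacement` in the `target` string and return the result; `target` itself is not modified. If `target` is omitted, defaults to `$0`
`match(target,regex)`	Get the beginning position of the longest, leftmost substring which matches `regex`, or 0 if there is no match. Also sets the variables `RSTART` and `RLENGTH`.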
`index(haystack,needle)`	Get the position of `needle` in `haystack`
`length(string)`	Get the string length
`substr(string,m,n)`	Get the substring starting at `m` and for length `n`. `n` is optional.
`tolower toupper`	Convert to lower/upper case.
`split(string,array,delim)`	Tokenize the `string` using the delimiter `delim` and put the result in `array`. Returns the size of `array`. If `delim` is omitted, use the value of the special variable `FS`
`asort(src,dest)`	(GNU awk) Sort the `src` array and put the result in `dest` array. Returns the size of `src`. `dest` is optional. Set the special variable `IGNORECASE` to 1 to enable case-insensitive sorting.
`system(command)`	Execute `command`.
`systime()`	Get UNIX timestamp.
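`mktime(datespec)`	(GNU awk) Convert `datespec`, a string of the form "YYYY MM DD HH MM SS", to a UNIX timestamp. See here for details.
`strftime(format,timestamp)`	(GNU awk) Format the UNIX `timestamp` as a string according to `format`. `timestamp` is optional and defaults to the current time. See here for details.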
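A short tour of some of the string functions above, as a sketch that sticks to functions found in POSIX awk (the GNU-only ones are left out). The wrapper name `builtins_demo` and all sample strings are assumptions for the example.

```shell
builtins_demo() {
    awk 'BEGIN {
        s = "hello world"
        print length(s)                 # string length
        print index(s, "world")         # 1-based position of the needle
        print substr(s, 1, 5)           # first five characters
        print toupper(s)
        n = split("a:b:c", parts, ":")  # n is the number of pieces
        print n, parts[3]
        t = s
        gsub(/o/, "0", t)               # replace every "o" in t, in place
        print t
        pos = match(s, /wor/)           # also sets RSTART and RLENGTH
        print pos, RSTART, RLENGTH
        print sprintf("%05.1f", 3.14159)
    }'
}

builtins_demo
# 11
# 7
# hello
# HELLO WORLD
# 3 c
# hell0 w0rld
# 7 7 3
# 003.1
```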
