LJ 62: The awk Utility

A Word About Regular Expressions

Regular expressions, like expressions in natural language, are not always to be taken literally. An expression such as ``a leopard can't change its spots'' conveys its intended message only if it is not taken literally. Regular expressions provide a set of rules for describing a pattern, or sequence of characters, to utilities such as grep, ed, vi, awk and sed. The simplest regular expressions can be taken literally: the string ``ABC'' is a regular expression that means A, followed by B, followed by C. Regular expressions get complex (and ugly) when a sophisticated or repeating pattern needs to be expressed. In these cases, special characters called meta-characters are intermixed with the characters to be taken literally. For a simple example, the regular expression ^alone$ matches the word ``alone'' by itself on a line of text. The meta-character ``^'' anchors the match to the beginning of the line, and the meta-character ``$'' anchors it to the end of the line (just before the newline).
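The difference between a literal pattern and an anchored one is easy to see at the command line. The following is a minimal sketch, not a definitive recipe: the sample text and the variable names (sample, literal_matches, anchored_matches) are made up for illustration, and it assumes a POSIX shell with awk installed.

```shell
# Sample input, invented purely for this illustration.
sample='alone
home alone
ABCDEF
xABCx
alone again'

# Literal pattern: selects any line containing A, followed by B, followed by C.
literal_matches=$(printf '%s\n' "$sample" | awk '/ABC/')

# Anchored pattern: ^ and $ restrict the match to lines that are exactly "alone".
anchored_matches=$(printf '%s\n' "$sample" | awk '/^alone$/')

printf 'literal:\n%s\n' "$literal_matches"
printf 'anchored:\n%s\n' "$anchored_matches"
```

The literal pattern selects both lines containing ``ABC'', wherever it appears in the line. The anchored pattern selects only the first line; ``home alone'' and ``alone again'' are rejected because the anchors leave no room for other characters.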

This barely scratches the surface of regular-expression rules. Regular expressions matter to utilities such as awk and sed because they let you apply a processing rule only to the input lines that match a given pattern. Using the above example, you can easily write an awk script that counts the input lines consisting solely of the word ``alone''.
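One way such a counting script might look is sketched below. The function name count_alone is hypothetical, chosen only for this example, and the input is invented; the sketch assumes a POSIX shell and any standard awk.

```shell
# Hypothetical helper: count the lines that consist solely of the word "alone".
# Reads the files named as arguments, or standard input if none are given.
count_alone() {
    # n is incremented once per matching line; "n + 0" prints 0 rather than
    # an empty string when no line matched.
    awk '/^alone$/ { n++ } END { print n + 0 }' "$@"
}

# Example run on four lines of made-up input.
printf 'alone\nnot alone\nalone\nalone again\n' | count_alone
```

For the four-line input shown, the script prints 2: the second and fourth lines contain the word but are not made up of it alone, so the anchors exclude them.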

One of the better references available is the O'Reilly publication Mastering Regular Expressions by Jeffrey E. F. Friedl.