Appendix B. A Sed and Awk Micro-Primer

This is a very brief introduction to the sed and awk text processing utilities. We will deal with only a few basic commands here, but that will suffice for understanding simple sed and awk constructs within shell scripts.

sed: a non-interactive text file editor

awk: a field-oriented pattern processing language with a C-like syntax

For all their differences, the two utilities share a similar invocation syntax, both use regular expressions, both read input by default from stdin, and both output to stdout. These are well-behaved UNIX tools, and they work together well. The output from one can be piped into the other, and their combined capabilities give shell scripts some of the power of Perl.
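As a minimal sketch of the two tools cooperating in a pipe, the following feeds the output of sed into awk. The sample input and the variable name result are invented for illustration.

```shell
# Hypothetical sample input: two "name score" records.
# sed changes a trailing 0 to 5 on each line, then awk prints the second field.
result=$(printf 'alice 10\nbob 20\n' | sed 's/0$/5/' | awk '{ print $2 }')
echo "$result"
```

Both utilities read stdin and write stdout, so neither needs a temporary file.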

Note

One important difference between the utilities is that while shell scripts can easily pass arguments to sed, it is more complicated for awk (see Example 34-3 and Example 9-22).

B.1. Sed

Sed is a non-interactive line editor. It receives text input, whether from stdin or from a file, performs certain operations on specified lines of the input, one line at a time, then outputs the result to stdout or to a file. Within a shell script, sed is usually one of several tool components in a pipe.

Sed determines which lines of its input to operate on from the address range passed to it. [1] Specify this address range either by line number or by a pattern to match. For example, 3d tells sed to delete line 3 of the input, and /windows/d tells sed to delete every line of the input containing a match for "windows".
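A short sketch of the addressing forms just described, run against a made-up four-line input (the variable names are illustrative only):

```shell
# Hypothetical four-line sample input.
input=$(printf 'one\ntwo windows\nthree\nfour\n')

by_number=$(printf '%s\n' "$input" | sed '3d')           # Delete line 3 only.
by_range=$(printf '%s\n' "$input" | sed '2,3d')          # Delete lines 2 through 3.
by_pattern=$(printf '%s\n' "$input" | sed '/windows/d')  # Delete lines matching "windows".
no_address=$(printf '%s\n' "$input" | sed 'd')           # No address: applies to every line.

echo "$by_number"
echo "$by_pattern"
```

With no address at all, the d command applies to every line, so the last variable ends up empty.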

Of all the operations in the sed toolkit, we will focus primarily on the three most commonly used ones. These are printing (to stdout), deletion, and substitution.


Table B-1. Basic sed operators

Operator                              Name        Effect
[address-range]p                      print       Print [specified address range]
[address-range]d                      delete      Delete [specified address range]
s/pattern1/pattern2/                  substitute  Substitute pattern2 for first instance of pattern1 in a line
[address-range]s/pattern1/pattern2/   substitute  Substitute pattern2 for first instance of pattern1 in a line, over address-range
[address-range]y/pattern1/pattern2/   transform   Replace any character in pattern1 with the corresponding character in pattern2, over address-range (equivalent of tr)
g                                     global      Operate on every pattern match within each matched line of input
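The transform operator is the least familiar entry in the table, so here is a minimal sketch of it next to the equivalent tr call. The sample string is invented for illustration.

```shell
# y/abc/xyz/ maps a->x, b->y, c->z, one character at a time.
with_sed=$(echo 'a cab, back' | sed 'y/abc/xyz/')
with_tr=$(echo 'a cab, back' | tr 'abc' 'xyz')

echo "$with_sed"
echo "$with_tr"
```

Unlike s, the y operator takes two equal-length lists of characters, not regular expressions.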

Note

Unless the g (global) operator is appended to a substitute command, the substitution operates only on the first instance of a pattern match within each line.
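A minimal sketch of the difference, using an invented sample line:

```shell
line='Windows here, Windows there'

first_only=$(echo "$line" | sed 's/Windows/Linux/')    # Replaces the first match only.
every_match=$(echo "$line" | sed 's/Windows/Linux/g')  # Replaces every match in the line.

echo "$first_only"
echo "$every_match"
```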

From the command line and in a shell script, a sed operation may require quoting and certain options.

   sed -e '/^$/d' "$filename"
   # The -e option causes the next string to be interpreted as an editing instruction.
   #  (If passing only a single instruction to "sed", the "-e" is optional.)
   #  The "strong" quotes ('') protect the RE characters in the instruction
   #+ from reinterpretation as special characters by the shell.
   # (This reserves RE expansion of the instruction for sed.)
   #
   # Operates on the text contained in file $filename.

In certain cases, such as when the instruction contains a shell variable that must be expanded, a sed editing command will not work with single quotes.

   filename=file1.txt
   pattern=BEGIN

   sed "/^$pattern/d" "$filename"  # Works as specified.
   # sed '/^$pattern/d' "$filename"    has unexpected results.
   #        In this instance, with strong quoting (' ... '),
   #+      "$pattern" will not expand to "BEGIN".

Note

Sed uses the -e option to specify that the following string is an instruction or set of instructions. If there is only a single instruction contained in the string, then this option may be omitted.

   sed -n '/xzy/p' "$filename"
   # The -n option suppresses sed's default printing of every input line,
   #+ so only the lines matching the pattern (printed by "p") appear.
   # Otherwise all input lines would print, and matching lines would print twice.
   # The -e option is not necessary here since there is only a single editing instruction.
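To make the effect of -n concrete, here is a small sketch that runs the same print instruction with and without it. The three-line input is invented for illustration.

```shell
# Hypothetical three-line sample input; only the middle line contains "xzy".
input=$(printf 'abc\nxzy one\ndef\n')

without_n=$(printf '%s\n' "$input" | sed '/xzy/p')   # Default printing plus explicit "p".
with_n=$(printf '%s\n' "$input" | sed -n '/xzy/p')   # Explicit "p" only.

echo "$without_n"
echo "$with_n"
```

Without -n, every line is echoed once by default and the matching line is printed a second time by p.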


Table B-2. Examples of sed operators

Notation              Effect
8d                    Delete 8th line of input.
/^$/d                 Delete all blank lines.
1,/^$/d               Delete from beginning of input up to, and including first blank line.
/Jones/p              Print only lines containing "Jones" (with -n option).
s/Windows/Linux/      Substitute "Linux" for first instance of "Windows" found in each input line.
s/BSOD/stability/g    Substitute "stability" for every instance of "BSOD" found in each input line.
s/ *$//               Delete all spaces at the end of every line.
s/00*/0/g             Compress all consecutive sequences of zeroes into a single zero.
/GUI/d                Delete all lines containing "GUI".
s/GUI//g              Delete all instances of "GUI", leaving the remainder of each line intact.
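A few of the table entries can be tried directly from the command line. The following is a minimal sketch. The sample strings and variable names are made up for illustration.

```shell
# 1,/^$/d : drop a header block, up to and including the first blank line.
body=$(printf 'Subject: hi\nFrom: me\n\nline one\n\nline two\n' | sed '1,/^$/d')

# s/00*/0/g : compress each run of zeroes into a single zero.
zeros=$(echo '1000 and 20000' | sed 's/00*/0/g')

# s/ *$// : delete spaces at the end of the line.
trimmed=$(echo 'trailing spaces   ' | sed 's/ *$//')

echo "$body"
echo "$zeros"
echo "[$trimmed]"
```

Note that 1,/^$/d stops at the first blank line, so the later blank line between "line one" and "line two" survives.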

Substituting a zero-length string for another is equivalent to deleting that string within a line of input. This leaves the remainder of the line intact. Applying s/GUI// to the line
 The most important parts of any application are its GUI and sound effects
results in
 The most important parts of any application are its  and sound effects

A backslash at the end of a line within the replacement expression inserts a newline. In this special case, the replacement expression continues on the next line.
   s/^  */\
   /g
This substitution replaces line-beginning spaces with a newline. The net result is to replace paragraph indents with a blank line between paragraphs.
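Run from a shell script, the instruction has to be single-quoted so that the backslash and the literal newline reach sed untouched. A minimal sketch, with an invented two-paragraph input:

```shell
# Hypothetical input: each paragraph begins with a two-space indent.
text=$(printf '  First paragraph.\nStill first.\n  Second paragraph.\n')

# The instruction spans two lines; the backslash-newline is the replacement.
unindented=$(printf '%s\n' "$text" | sed 's/^  */\
/g')

echo "$unindented"
```

Because the first line of the input is also indented, the output begins with a blank line as well.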

An address range followed by one or more operations may require open and closed curly brackets, with appropriate newlines.
   /[0-9A-Za-z]/,/^$/{
   /^$/d
   }
This deletes only the first of each set of consecutive blank lines. That might be useful for single-spacing a text file while retaining the blank line(s) between paragraphs.
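The following sketch applies the bracketed instruction to an invented input in which the paragraphs are separated by two and then three blank lines:

```shell
# Hypothetical input: "para one", 2 blank lines, "para two", 3 blank lines, "para three".
text=$(printf 'para one\n\n\npara two\n\n\n\npara three\n')

# Each range runs from a line containing an alphanumeric to the next blank line;
# only that closing blank line is deleted.
squeezed=$(printf '%s\n' "$text" | sed '/[0-9A-Za-z]/,/^$/{
/^$/d
}')

echo "$squeezed"
```

One blank line is removed from each gap, leaving one and then two blank lines between the paragraphs.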

Tip

A quick way to double-space a text file is sed G filename.
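As a quick check of the tip, the sketch below double-spaces a made-up two-line input and counts the resulting lines. G works here because it appends a newline plus the contents of the (empty) hold space to each line.

```shell
# Two input lines become four output lines: each followed by an empty line.
doubled=$(printf 'a\nb\n' | sed G)
line_count=$(printf 'a\nb\n' | sed G | wc -l)

echo "$doubled"
echo "$line_count"
```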

For illustrative examples of sed within shell scripts, see:

  1. Example 34-1

  2. Example 34-2

  3. Example 12-2

  4. Example A-3

  5. Example 12-12

  6. Example 12-20

  7. Example A-13

  8. Example A-18

  9. Example 12-24

  10. Example 10-9

  11. Example 12-33

  12. Example A-2

  13. Example 12-10

  14. Example 12-8

  15. Example A-11

  16. Example 17-12

For a more extensive treatment of sed, check the appropriate references in the Bibliography.

Notes

[1]

If no address range is specified, the default is all lines.