Shell Scripting & CLI Tools: From the Command Line to Real Automation
Section 9 of 18

How to Use grep, sed, and awk for Text Processing

This is where shell scripting goes from useful to extraordinary. Unix was built around text — configuration files are text, log files are text, data files are text. The three tools we're going to cover — grep, sed, and awk — form a triumvirate of text-processing power that programmers and sysadmins have trusted for decades. Learn these three deeply, and you'll be able to transform almost any text into almost any other text.

The Unix Text-Processing Philosophy

The philosophy is beautifully simple: every tool reads text line by line, from stdin or from files, and writes text to stdout. That means you can plug any of them into any pipeline. They don't need to know about each other — they just need to agree that the medium is text.

grep finds lines matching a pattern. sed edits text streams by applying transformation rules. awk processes structured text with what is, honestly, a complete programming language. Together, they handle almost any text transformation task you can imagine. (The examples below assume the GNU versions of these tools, which are standard on Linux.)
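That composability is easy to see with nothing but printf and a pipe — a tiny made-up example where each tool does one job and hands its output to the next:

```shell
# Three independent tools cooperating purely through text:
# grep filters, sed rewrites, awk extracts a field.
printf 'one cat\ntwo dog\nthree cat\n' \
    | grep 'cat' \
    | sed 's/cat/feline/' \
    | awk '{print $2}'
# Output:
# feline
# feline
```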

grep: Finding What You're Looking For

grep (the name comes from the ed command g/re/p — globally search for a regular expression and print) searches for lines matching a pattern. It's probably the most-used command in the Unix world, and for good reason — searching for things is something you do constantly.

# Basic search
grep "error" logfile.txt

# Case-insensitive
grep -i "error" logfile.txt

# Show line numbers
grep -n "error" logfile.txt

# Invert match (lines that DON'T match)
grep -v "debug" logfile.txt

# Extended regex (enables +, ?, |, grouping without escaping)
grep -E "error|warning|critical" logfile.txt

# Count matching lines
grep -c "error" logfile.txt

# Show filename only
grep -l "error" *.log

# Recursive search in directories
grep -r "TODO" /path/to/project/

# Show surrounding context
grep -B 2 -A 3 "error" logfile.txt   # 2 lines before, 3 after

# Match whole words only
grep -w "fail" logfile.txt   # Won't match "failure"

# Perl-compatible regex (GNU grep only, but very powerful)
grep -P "^\d{4}-\d{2}-\d{2}" logfile.txt

# Print only the matching portion (not the whole line)
grep -E -o '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log   # Extract IP addresses

Regular expression basics: grep uses regex patterns, which are miniature languages for describing text patterns:

  • . — any single character
  • * — zero or more of the preceding element
  • + — one or more (with -E)
  • ? — zero or one (with -E)
  • ^ — start of line
  • $ — end of line
  • [abc] — any one of a, b, or c
  • [a-z] — any lowercase letter
  • [^abc] — any character except a, b, c
  • \d — any digit (with -P)
  • \w — any word character (letter, digit, underscore) (with -P)
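A quick way to watch these metacharacters work is to feed a few invented lines to grep and see which survive:

```shell
# ^ anchors the match to the start of the line, [0-9] is a
# character class, and + (with -E) means "one or more".
printf '2024-01-15 ok\nnote: no date\n0500 begins\n' \
    | grep -E '^[0-9]+-[0-9]+-[0-9]+'
# Output:
# 2024-01-15 ok
```

Only the first line matches: the second has no leading digits, and the third has digits but no hyphen after them.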

Real-world example — find all failed SSH login attempts in auth.log:

grep "Failed password" /var/log/auth.log | awk '{print $11}' | sort | uniq -c | sort -rn | head -20

That one-liner tells you which IP addresses are hammering your server with bad passwords, ranked by frequency (the IP's exact field number varies with the log format, so check a sample line before trusting $11). Security work doesn't always require fancy tools.

sed: Stream Editing for Transformations

sed (Stream EDitor) reads text line by line and applies editing commands to it. Its most common use — by far — is the substitution command:

sed 's/old/new/' file.txt           # Replace first occurrence per line
sed 's/old/new/g' file.txt          # Replace all occurrences per line
sed 's/old/new/gi' file.txt         # Case-insensitive, all occurrences
sed -i 's/old/new/g' file.txt       # Edit file in-place (GNU syntax; careful with this!)
sed -i.bak 's/old/new/g' file.txt   # Edit in-place, keep backup as file.txt.bak

The s/pattern/replacement/flags command is the workhorse. The pattern is a regex, and the replacement can reference captured groups with \1, \2, and so on:

# Reformat dates from YYYY-MM-DD to DD/MM/YYYY
echo "Date: 2024-01-15" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/'
# Output: Date: 15/01/2024

But sed has a lot more commands beyond substitution:

# Delete matching lines
sed '/^#/d' config.txt          # Delete comment lines

# Print specific lines
sed -n '10,20p' file.txt        # Print lines 10 through 20
sed -n '/START/,/END/p' file.txt # Print lines between START and END

# Append or insert
sed '5a\Added after line 5' file.txt
sed '5i\Added before line 5' file.txt

# Multiple commands
sed -e 's/foo/bar/' -e 's/baz/qux/' file.txt

A practical sed use case — remove all blank lines from a file:

sed '/^[[:space:]]*$/d' file.txt

awk: The Field-Processing Powerhouse

awk is, by itself, a complete programming language — designed specifically for processing structured text like CSV files, log files, and tabular data. [Its name comes from its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan (yes, the Kernighan who co-wrote The C Programming Language).](https://en.wikipedia.org/wiki/AWK) When the people who helped invent Unix needed a text-processing tool, this is what they built.

The basic model: awk processes each line of input, splits it into fields (by whitespace by default), and runs a program against each line. The general form is:

awk 'pattern { action }' file

Fields are accessed as $1, $2, etc. $0 is the entire line. NF is the number of fields. NR is the current line number.
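You can watch all three of those built-ins at once by running awk over a couple of invented lines:

```shell
# For each line, print its line number, its field count,
# and then the whole line itself.
printf 'alpha beta\ngamma delta epsilon\n' \
    | awk '{print NR, NF, $0}'
# Output:
# 1 2 alpha beta
# 2 3 gamma delta epsilon
```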

# Print the third field of each line
awk '{print $3}' file.txt

# Print lines where the second field is greater than 100
awk '$2 > 100' file.txt

# Print specific fields with a custom separator
awk '{print $1, $3}' file.txt     # space-separated
awk '{print $1 ":" $3}' file.txt  # colon-separated

# Use a custom field separator (for CSV or colon-separated files)
awk -F: '{print $1, $6}' /etc/passwd    # username and home dir

# Sum a column
awk '{sum += $2} END {print "Total:", sum}' data.txt

# Count occurrences
awk '{count[$1]++} END {for (item in count) print count[item], item}' data.txt

# Print lines matching a pattern
awk '/error/ {print NR, $0}' logfile.txt  # Line number + line for errors

# BEGIN and END blocks run before/after processing
awk 'BEGIN {print "Start"} {print $1} END {print "End"}' file.txt

A real-world example — parse an Apache access log and find the top 10 IP addresses:

awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

The combination of awk for field extraction, sort for ordering, and uniq -c for counting is so common in Unix work that it's practically a formula you memorize.
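Since the formula comes up so often, it can be worth wrapping in a small shell function (topfield is an invented name — adjust to taste):

```shell
# topfield N [count]: rank the most frequent values of field N
# on stdin, showing the top `count` (default 10) with counts.
topfield() {
    awk -v n="$1" '{print $n}' \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "${2:-10}"
}

# Usage: the access-log one-liner above becomes
#   topfield 1 10 < access.log
# A tiny demo on made-up data:
printf 'a x\nb y\na z\n' | topfield 1 2
```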

Combining the Trinity

The real power comes from combining these tools in pipelines. Here are some hands-on examples of the tools working together:

# Find all unique error codes in logs, with counts
grep "ERROR" app.log \
    | sed 's/.*ERROR \[code=\([0-9]*\)\].*/\1/' \
    | sort \
    | uniq -c \
    | sort -rn

# Extract all email addresses from a file
grep -E -o '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' contacts.txt | sort -u

# Process a CSV: get rows where column 3 > 1000, print columns 1 and 3
awk -F',' '$3 > 1000 {print $1, $3}' data.csv

# Find the 5 most common words in a file
tr -s '[:space:]' '\n' < essay.txt \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -5