Awk Notes

19 Jul 2019

The notes below are from my initial reading on Awk. They show how to use it (and a few other Unix tools) for simple text processing. Full disclaimer: I’m still learning this stuff 🙂

Example 1: Tabulating Data

Below are some fruits and their prices.

$ cat fruits.txt
Apple £2.00
Banana £1.50
Kumquat £2.00
Peach £1.50
Strawberry £2.00
Raspberry £2.00
Kiwi £1.00
Pear £1.00
Tomato £1.50

It’d be really useful if I could take that list and print it in a tabulated way:

$ awk '{print $2 "\t" $1}' fruits.txt
£2.00   Apple
£1.50   Banana
£2.00   Kumquat
£1.50   Peach
£2.00   Strawberry
£2.00   Raspberry
£1.00   Kiwi
£1.00   Pear
£1.50   Tomato

Might be nice to sort them by price too:

$ awk '{print $2 "\t" $1}' fruits.txt | sort
£1.00   Kiwi
£1.00   Pear
£1.50   Banana
£1.50   Peach
£1.50   Tomato
£2.00   Apple
£2.00   Kumquat
£2.00   Raspberry
£2.00   Strawberry

That looks great, but actually, seeing the duplicated price is making all this data look too noisy. If the price is the same as the previous fruit, let’s just not print it:

$ awk '{print $2 "\t" $1 }' fruits.txt |
sort |
awk '{
  price = $1
  name = $2
  if (price == previous_price) {
    print "\t" name
  } else {
    previous_price = price
    print price "\t" name
  }
}'
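One detail this relies on: previous_price is never initialised. In awk an unset variable behaves as an empty string (or zero in numeric contexts), so the comparison is false for the first record and the first price always gets printed. Here is a minimal sketch of just that second stage, with made-up inline input instead of fruits.txt; the function name dedupe_prices is mine, purely for illustration.

```shell
# Hypothetical helper wrapping the second awk stage.
# previous_price starts out unset, which awk treats as "" in this
# string comparison, so the first record always takes the else branch.
dedupe_prices() {
  awk '{
    if ($1 == previous_price) {
      print "\t" $2
    } else {
      previous_price = $1
      print $1 "\t" $2
    }
  }'
}

# Invented sample input, already sorted by price.
printf '£1.00\tKiwi\n£1.00\tPear\n£1.50\tBanana\n' | dedupe_prices
```

The price should appear once for Kiwi, be blanked for Pear, and appear again for Banana.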

You’ll notice that’s a bit of a mouthful to write all in one go. At this point we could move that to an executable file called fruit_formatting if we want:

$ ls -la fruit_formatting
-rwxr--r--  1 andy  staff  190  7 Jul 22:29 fruit_formatting
$ cat fruit_formatting
awk '{print $2 "\t" $1 }' fruits.txt |
sort |
awk '{
  price = $1
  name = $2
  if (price == previous_price) {
    print "\t" name
  } else {
    previous_price = price
    print price "\t" name
  }
}'

$ ./fruit_formatting
£1.00   Kiwi
        Pear
£1.50   Banana
        Peach
        Tomato
£2.00   Apple
        Kumquat
        Raspberry
        Strawberry

Nice, we did it!
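Because fruits.txt is hard-coded, the script only works for that one file. Here is a sketch of a more reusable variant, assuming it’s acceptable to pass the input file as the first argument. The function name format_prices and the throwaway sample file are made up for illustration.

```shell
# format_prices FILE: the same pipeline, but the input file is a
# parameter rather than hard-coded (my assumption, not the original script).
format_prices() {
  awk '{print $2 "\t" $1}' "$1" |
  sort |
  awk '{
    price = $1
    name = $2
    if (price == previous_price) {
      print "\t" name
    } else {
      previous_price = price
      print price "\t" name
    }
  }'
}

# Try it on a small temporary file with invented contents.
tmpfile=$(mktemp)
printf 'Apple £2.00\nKiwi £1.00\nPear £1.00\n' > "$tmpfile"
format_prices "$tmpfile"
rm -f "$tmpfile"
```

Saved as an executable script with #!/bin/sh as its first line, the body of the function would work the same way, with "$1" referring to the script’s first argument.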

Example 2: Filtering

OK, here’s a new problem. I’d like to find the 5 slowest response times on my site with a 200 status code.

Here’s what my log file looks like:

$ tail -20 logfile
Started GET "/balance_forecasts/97?date=18-12-2016" for ::1 at 2017-05-21 13:33:12 +0100
Processing by BalanceForecastsController#show as */*
  Parameters: {"date"=>"18-12-2016", "id"=>"97"}
  User Load (0.2ms)  SELECT  "users".* FROM "users" WHERE "users"."id" = ?  ORDER BY "users"."id" ASC LIMIT 1  [["id", 97]]
  Balance Load (0.2ms)  SELECT  "balances".* FROM "balances" WHERE "balances"."user_id" = ?  ORDER BY "balances"."on" DESC LIMIT 1  [["user_id", 97]]
  Transfer Load (0.1ms)  SELECT "transfers".* FROM "transfers" WHERE "transfers"."user_id" = ?  [["user_id", 97]]
  CACHE (0.0ms)  SELECT  "balances".* FROM "balances" WHERE "balances"."user_id" = ?  ORDER BY "balances"."on" DESC LIMIT 1  [["user_id", 97]]
  Rendered balance_forecasts/_blank_slate.html.erb (0.1ms)
Completed 200 OK in 22ms (Views: 14.2ms | ActiveRecord: 0.5ms)


Started GET "/balance_forecasts/97?date=26-6-2017" for ::1 at 2017-05-21 13:33:14 +0100
Processing by BalanceForecastsController#show as */*
  Parameters: {"date"=>"26-6-2017", "id"=>"97"}
  User Load (0.1ms)  SELECT  "users".* FROM "users" WHERE "users"."id" = ?  ORDER BY "users"."id" ASC LIMIT 1  [["id", 97]]
  Balance Load (0.3ms)  SELECT  "balances".* FROM "balances" WHERE "balances"."user_id" = ?  ORDER BY "balances"."on" DESC LIMIT 1  [["user_id", 97]]
  Transfer Load (0.1ms)  SELECT "transfers".* FROM "transfers" WHERE "transfers"."user_id" = ?  [["user_id", 97]]
  CACHE (0.0ms)  SELECT  "balances".* FROM "balances" WHERE "balances"."user_id" = ?  ORDER BY "balances"."on" DESC LIMIT 1  [["user_id", 97]]
  Rendered balance_forecasts/_show.html.erb (1.2ms)
Completed 200 OK in 261ms (Views: 12.4ms | ActiveRecord: 0.6ms)

First, let’s see if we can find all of the Completed 200 OK rows:

$ tail -50 logfile | awk '/Completed 200 OK/'
Completed 200 OK in 490ms (Views: 12.4ms | ActiveRecord: 0.6ms)
Completed 200 OK in 388ms (Views: 11.7ms | ActiveRecord: 0.4ms)
Completed 200 OK in 32ms (Views: 15.3ms | ActiveRecord: 0.5ms)
Completed 200 OK in 22ms (Views: 14.2ms | ActiveRecord: 0.5ms)
Completed 200 OK in 261ms (Views: 12.4ms | ActiveRecord: 0.6ms)

Next up is extracting the time in ms from those rows. Since the times are always in column 5 we can use $5 to get the values.

$ tail -50 logfile | awk '/Completed 200 OK/{print $5}'
490ms
388ms
32ms
22ms
261ms
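The values still carry their ms suffix. If plain numbers would be handier later on, one option is to add 0 to the field: awk converts a string to a number using its leading numeric part, so 490ms becomes 490. Here is a small sketch using two of the lines above as inline input (the function name times_in_ms is hypothetical):

```shell
# $5 + 0 forces a numeric conversion; awk keeps the leading digits
# and ignores the trailing "ms".
times_in_ms() {
  awk '/Completed 200 OK/ {print $5 + 0}'
}

printf '%s\n' \
  'Completed 200 OK in 490ms (Views: 12.4ms | ActiveRecord: 0.6ms)' \
  'Completed 200 OK in 32ms (Views: 15.3ms | ActiveRecord: 0.5ms)' |
times_in_ms
```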

Now let’s sort them in descending order to see the slowest values at the top:

$ tail -50 logfile | awk '/Completed 200 OK/{print $5}' | sort -nr
490ms
388ms
261ms
32ms
22ms

I used sort -nr there to sort by numerical value and reverse the order.
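To see why the -n matters, compare it with a plain reverse sort, which compares the values character by character. Here is a sketch with the five times from above. The LC_ALL=C on the first sort is my addition, to keep the text ordering predictable across machines.

```shell
times='490ms
388ms
32ms
22ms
261ms'

# Text comparison: "32ms" lands above "261ms" because "3" > "2".
printf '%s\n' "$times" | LC_ALL=C sort -r

echo '---'

# Numeric comparison: sort reads the leading number and ignores "ms".
printf '%s\n' "$times" | sort -nr
```

With the text sort, 32ms ends up above 261ms; with -nr the values come out in true descending order.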

That’s looking quite useful, but if I run that over a much larger log file I’d get a lot of text printed to my terminal. I’ll run this over the last 5000 lines of the log and then use the head command to show only the 5 slowest times.

$ tail -5000 logfile | awk '/Completed 200 OK/{print $5}' | sort -nr | head -5
8140ms
6257ms
5409ms
5382ms
5118ms
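A list of bare times doesn’t say which requests were slow. One way to keep that context is to remember the request path and print it next to the time. The sketch assumes every Completed line belongs to the most recent Started line, which holds for the sample above but isn’t guaranteed if requests are interleaved in the log. The function name slowest_with_path is made up.

```shell
slowest_with_path() {
  awk '
    /^Started/         { path = $3 }          # $3 is the quoted request path
    /Completed 200 OK/ { print $5, path }     # time first, so sort -n can read it
  ' | sort -nr | head -5
}

# Two requests taken from the log excerpt above.
printf '%s\n' \
  'Started GET "/balance_forecasts/97?date=18-12-2016" for ::1 at 2017-05-21 13:33:12 +0100' \
  'Completed 200 OK in 22ms (Views: 14.2ms | ActiveRecord: 0.5ms)' \
  'Started GET "/balance_forecasts/97?date=26-6-2017" for ::1 at 2017-05-21 13:33:14 +0100' \
  'Completed 200 OK in 261ms (Views: 12.4ms | ActiveRecord: 0.6ms)' |
slowest_with_path
```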

Example 3: More tabulation

Back to food again (naturally). I have a new price list:

$ cat foods.txt
Item name, Price
Rice pudding, £1.20
Jam sandwich, £1.75
Coffee, £1.00
Crisps, £1.00
Custard tart, £1.75
Red grapes, £1.20
Green grapes, £1.20

I’d like to format it in the same way as the fruits in example 1.

£1.00   Kiwi
        Pear
£1.50   Banana
        Peach
        Tomato
£2.00   Apple
        Kumquat
        Raspberry
        Strawberry

There’s a slight problem here though. The top row of this file contains the headings Item name and Price. We don’t want those in our report, so we’ll need a way to remove them. There’s also a second, less obvious issue. In our original fruit_formatting script we said:

awk '{print $2 "\t" $1}'

This printed the price followed by the name of the fruit. Unfortunately, this won’t work here because some of our food names span two words, for example Jam sandwich. For that row, the script would print sandwich, (comma included) followed by a tab, followed by Jam. Let’s run it to find out.

$ awk '{print $2 "\t" $1}' foods.txt
name,   Item
pudding,        Rice
sandwich,       Jam
£1.00   Coffee,
£1.00   Crisps,
tart,   Custard
grapes, Red
grapes, Green

First, let’s look at ignoring the top line of the file:

$ awk 'NR != 1 {print $0}' foods.txt
Rice pudding, £1.20
Jam sandwich, £1.75
Coffee, £1.00
Crisps, £1.00
Custard tart, £1.75
Red grapes, £1.20
Green grapes, £1.20

NR here tells us the number of the current record. In our script we are saying “if the current record is not the first record in the file, then print the full record”.
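To make NR visible, the sketch below prints it in front of each record that passes the test. It uses NR > 1, which behaves the same as NR != 1 here, and inline sample data rather than foods.txt. The function name skip_header is hypothetical.

```shell
# Print every record except the first, prefixed with its record number.
skip_header() {
  awk 'NR > 1 {print NR ": " $0}'
}

printf 'Item name, Price\nCoffee, £1.00\nCrisps, £1.00\n' | skip_header
```

The header is record 1, so the first line printed is numbered 2.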

Our next issue, printing the price and the full food name correctly, can be solved by choosing a new field separator. Instead of the default (any run of spaces or tabs), let’s split on a comma followed by a space.

$ awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt
£1.20   Rice pudding
£1.75   Jam sandwich
£1.00   Coffee
£1.00   Crisps
£1.75   Custard tart
£1.20   Red grapes
£1.20   Green grapes
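A separator longer than one character is treated by awk as a regular expression, which helps if the spacing after the comma is inconsistent. Here is a sketch, assuming some rows might have no space or several spaces after the comma. The sample rows and the function name split_on_comma are invented.

```shell
split_on_comma() {
  # ", *" matches a comma followed by zero or more spaces.
  awk -F ', *' '{print $2 "\t" $1}'
}

printf 'Rice pudding, £1.20\nCoffee,£1.00\nJam sandwich,   £1.75\n' | split_on_comma
```

All three rows should come out as price, tab, full name, regardless of the spacing in the input.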

Putting all this together with our original fruit_formatting script, we get the following:

$ cat food_formatting
awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt |
sort |
awk '{
  price = $1
  name = $2
  if (price == previous_price) {
    print "\t" name
  } else {
    previous_price = price
    print price "\t" name
  }
}'

$ ./food_formatting
£1.00   Coffee
        Crisps
£1.20   Green
        Red
        Rice
£1.75   Custard
        Jam

Almost there, but it still looks like we’re cutting off the sandwich part of Jam sandwich. This is because our awk script

awk '{
  price = $1
  name = $2
  if (price == previous_price) {
    print "\t" name
  } else {
    previous_price = price
    print price "\t" name
  }
}'

is given this input:

$ awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt | sort
£1.00   Coffee
£1.00   Crisps
£1.20   Green grapes
£1.20   Red grapes
£1.20   Rice pudding
£1.75   Custard tart
£1.75   Jam sandwich

and because we set name to $2, the print "\t" name part only sees the second whitespace-separated field (Jam) and drops the rest of the line.
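The second awk is back to the default separator, which splits on any run of spaces or tabs, so the tab we inserted and the space inside the name both count as field boundaries. Here is a small sketch comparing the default with a tab-only separator on one line of that input:

```shell
line=$(printf '£1.75\tJam sandwich')

# Default separator: three fields, and $2 is only "Jam".
printf '%s\n' "$line" | awk '{print NF ": " $2}'

# Tab as the separator: two fields, and $2 is the whole name.
printf '%s\n' "$line" | awk -F '\t' '{print NF ": " $2}'
```

NF is the number of fields in the current record, so the two commands report 3 and 2 fields respectively.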

Let’s set $1 (the price) to the empty string and then use $0, which gives us the full record, to print the full food name:

$ cat food_formatting
awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt |
sort |
awk '{
  price = $1
  name = $2
  if (price == current_price) {
    $1 = ""
    print "\t" $0
  } else {
    current_price = price
    $1 = ""
    print price "\t" $0
  }
}'

$ ./food_formatting
£1.00    Coffee
         Crisps
£1.20    Green grapes
         Red grapes
         Rice pudding
£1.75    Custard tart
         Jam sandwich
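The extra space in front of each name comes from the $1 = "" assignment. Changing a field makes awk rebuild $0 by joining the fields with the output separator (a space by default), so the now-empty first field still leaves its separator behind. An alternative sketch, not the approach used above, is to tell the second awk that its input is tab-separated, so that $2 is already the full name. The function name format_foods and the shortened sample data are made up.

```shell
format_foods() {
  awk -F ', ' 'NR != 1 {print $2 "\t" $1}' |
  sort |
  awk -F '\t' '{
    # With a tab separator, $1 is the price and $2 the complete name.
    if ($1 == current_price) {
      print "\t" $2
    } else {
      current_price = $1
      print $1 "\t" $2
    }
  }'
}

printf '%s\n' 'Item name, Price' 'Jam sandwich, £1.75' 'Coffee, £1.00' 'Crisps, £1.00' |
format_foods
```

This version reads from standard input rather than foods.txt, so it could be fed any file in the same format.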

There you have it. We just used some simple parts of Awk to do some nifty text processing. If you’d like to learn more, such as how to write detailed pattern matching, functions, loops and arrays, then check out the sed & awk book. This has the added bonus of being written by someone who actually knows this stuff 😅

If you’ve got any awk scripts you use regularly that make your life easier I’d love to hear about them.

👋