The notes below are from my initial readings into Awk, and they demonstrate using it (and some other unix tools) to perform simple text processing. Full disclaimer, I'm still learning this stuff 🙂
Example 1: Tabulating Data
Below are some fruits and their prices.
$ cat fruits.txt
Apple £2.00
Banana £1.50
Kumquat £2.00
Peach £1.50
Strawberry £2.00
Raspberry £2.00
Kiwi £1.00
Pear £1.00
Tomato £1.50
It'd be really useful if I could take that list and print it in a tabulated way:
$ awk '{print $2 "\t" $1}' fruits.txt
£2.00 Apple
£1.50 Banana
£2.00 Kumquat
£1.50 Peach
£2.00 Strawberry
£2.00 Raspberry
£1.00 Kiwi
£1.00 Pear
£1.50 Tomato
Might be nice to sort them by price too:
$ awk '{print $2 "\t" $1}' fruits | sort
£1.00 Kiwi
£1.00 Pear
£1.50 Banana
£1.50 Peach
£1.50 Tomato
£2.00 Apple
£2.00 Kumquat
£2.00 Raspberry
£2.00 Strawberry
That looks great, but actually, seeing the duplicated price is making all this data look too noisy. If the price is the same as the previous fruit, let's just not print it:
$ awk '{print $2 "\t" $1 }' fruits.txt |
sort |
awk '{
price = $1
name = $2
if (price == previous_price) {
print "\t" name
} else {
previous_price = price
print price "\t" name
}
}'
You'll notice that's a bit of a mouthful to write all in one go. At this point we could move that to an executable file called fruit_formatting
if we want:
$ ls -la fruit_formatting
-rwxr--r-- 1 andy staff 190 7 Jul 22:29 fruit_formatting
$ cat fruit_formatting
awk '{print $2 "\t" $1 }' fruits.txt |
sort |
awk '{
price = $1
name = $2
if (price == previous_price) {
print "\t" name
} else {
previous_price = price
print price "\t" name
}
}'
$ ./fruit_formatting
£1.00 Kiwi
Pear
£1.50 Banana
Peach
Tomato
£2.00 Apple
Kumquat
Raspberry
Strawberry
Nice, we did it!
Example 2: Filtering
OK, here's a new problem. I'd like to find the 5 slowest response times on my site with a 200 status code.
Here's what my log file looks like:
$ tail -20 logfile
Started GET "/balance_forecasts/97?date=18-12-2016" for ::1 at 2017-05-21 13:33:12 +0100
Processing by BalanceForecastsController#show as */*
Parameters: {"date"=>"18-12-2016", "id"=>"97"}
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" = ? ORDER BY "users"."id" ASC LIMIT 1 [["id", 97]]
Balance Load (0.2ms) SELECT "balances".* FROM "balances" WHERE "balances"."user_id" = ? ORDER BY "balances"."on" DESC LIMIT 1 [["user_id", 97]]
Transfer Load (0.1ms) SELECT "transfers".* FROM "transfers" WHERE "transfers"."user_id" = ? [["user_id", 97]]
CACHE (0.0ms) SELECT "balances".* FROM "balances" WHERE "balances"."user_id" = ? ORDER BY "balances"."on" DESC LIMIT 1 [["user_id", 97]]
Rendered balance_forecasts/_blank_slate.html.erb (0.1ms)
Completed 200 OK in 22ms (Views: 14.2ms | ActiveRecord: 0.5ms)
Started GET "/balance_forecasts/97?date=26-6-2017" for ::1 at 2017-05-21 13:33:14 +0100
Processing by BalanceForecastsController#show as */*
Parameters: {"date"=>"26-6-2017", "id"=>"97"}
User Load (0.1ms) SELECT "users".* FROM "users" WHERE "users"."id" = ? ORDER BY "users"."id" ASC LIMIT 1 [["id", 97]]
Balance Load (0.3ms) SELECT "balances".* FROM "balances" WHERE "balances"."user_id" = ? ORDER BY "balances"."on" DESC LIMIT 1 [["user_id", 97]]
Transfer Load (0.1ms) SELECT "transfers".* FROM "transfers" WHERE "transfers"."user_id" = ? [["user_id", 97]]
CACHE (0.0ms) SELECT "balances".* FROM "balances" WHERE "balances"."user_id" = ? ORDER BY "balances"."on" DESC LIMIT 1 [["user_id", 97]]
Rendered balance_forecasts/_show.html.erb (1.2ms)
Completed 200 OK in 261ms (Views: 12.4ms | ActiveRecord: 0.6ms)
First, let's see if we can find all of the Completed 200 OK
rows:
$ tail -50 logfile | awk '/Completed 200 OK/'
Completed 200 OK in 490ms (Views: 12.4ms | ActiveRecord: 0.6ms)
Completed 200 OK in 388ms (Views: 11.7ms | ActiveRecord: 0.4ms)
Completed 200 OK in 32ms (Views: 15.3ms | ActiveRecord: 0.5ms)
Completed 200 OK in 22ms (Views: 14.2ms | ActiveRecord: 0.5ms)
Completed 200 OK in 261ms (Views: 12.4ms | ActiveRecord: 0.6ms)
Next up is extracting the time in ms from those rows. Since the times are always in column 5 we can use $5
to get the values.
$ tail -50 logfile | awk '/Completed 200 OK/{print $5}'
490ms
388ms
32ms
22ms
261ms
Now let's sort them in descending order to see the slowest values at the top
$ tail -50 logfile | awk '/Completed 200 OK/{print $5}' | sort -nr
490ms
388ms
261ms
32ms
22ms
I used sort -nr
there to sort by numerical value and reverse the order.
That's looking quite useful, but if I run that over a much larger log file I'd get a lot of text printed to my terminal. I'll run this script over 5000 rows and then use the head
command to only look at the top 5 slowest times.
$ tail -5000 logfile | awk '/Completed 200 OK/{print $5}' | sort -nr | head -5
8140ms
6257ms
5409ms
5382ms
5118ms
Example 3: More tabulation
Back to food again (naturally). I have a new price list:
$ cat foods.txt
Item name, Price
Rice pudding, £1.20
Jam sandwich, £1.75
Coffee, £1.00
Crisps, £1.00
Custard tart, £1.75
Red grapes, £1.20
Green grapes, £1.20
I'd like to format it in the same way as the fruits in example 1.
£1.00 Kiwi
Pear
£1.50 Banana
Peach
Tomato
£2.00 Apple
Kumquat
Raspberry
Strawberry
There's a slight problem here though. You'll notice that the top row of this file contains Item name
and Price
. We don't want this in our report, so we'll need a way to remove it. There's another issue here however, and it isn't immediately obvious. In our original fruit_formatting
script we said
awk '{print $2 "\t" $1}'
This printed the price followed by the name of the fruit. Unfortunately, this won't work here because some of our food names span two words, for example Jam sandwich
. This script would print sandwich
followed by a tab, followed by jam
. Let's run it to find out.
awk '{print $2 "\t" $1}' foods.txt
name, Item
pudding, Rice
sandwich, Jam
£1.00 Coffee,
£1.00 Crisps,
tart, Custard
grapes, Red
grapes, Green
First, let's look at ignoring the top line of the file:
$ awk 'NR != 1 {print $0}' foods.txt
Rice pudding, £1.20
Jam sandwich, £1.75
Coffee, £1.00
Crisps, £1.00
Custard tart, £1.75
Red grapes, £1.20
Green grapes, £1.20
NR
here tells us the number of the current record. In our script we are saying "if the current record is not the first record in the file, then print the full record".
Our next issue of printing the price and full food name correctly can be solved by choosing a new field separator. Instead of using the default character (a space) to delimit words, let's use a comma.
$ awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt
£1.20 Rice pudding
£1.75 Jam sandwich
£1.00 Coffee
£1.00 Crisps
£1.75 Custard tart
£1.20 Red grapes
£1.20 Green grapes
Putting all this together with our original fruit_formatting
script, we get the following:
$ cat food_formatting
awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt |
sort |
awk '{
price = $1
name = $2
if (price == previous_price) {
print "\t" name
} else {
previous_price = price
print price "\t" name
}
}'
$ ./food_formatting
£1.00 Coffee
Crisps
£1.20 Green
Red
Rice
£1.75 Custard
Jam
Almost there, but it still looks like we're cutting off the sandwich
part of Jam sandwich
. This is because our awk script
awk '{
price = $1
name = $2
if (price == previous_price) {
print "\t" name
} else {
previous_price = price
print price "\t" name
}
}'
is given this input:
awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt | sort
£1.00 Coffee
£1.00 Crisps
£1.20 Green grapes
£1.20 Red grapes
£1.20 Rice pudding
£1.75 Custard tart
£1.75 Jam sandwich
and the print "\t" name
part is going to only look at the second column of text (Jam
), but not the remaining line since we set name
to $2
.
Let's set $1
(the price) to the empty string and then use $0
, which gives us the full record, to print the full food name:
$ cat food_formatting
awk -F ", " 'NR != 1 {print $2 "\t" $1}' foods.txt |
sort |
awk '{
price = $1
name = $2
if (price == current_price) {
$1 = ""
print "\t" $0
} else {
current_price = price
$1 = ""
print price "\t" $0
}
}'
$ ./food_formatting
£1.00 Coffee
Crisps
£1.20 Green grapes
Red grapes
Rice pudding
£1.75 Custard tart
Jam sandwich
There you have it. We just used some simple parts of Awk to do some nifty text processing. If you'd like to know more things, like how to write detailed pattern matching, functions, loops, arrays, etc. then check out the sed & awk book. This has the added bonus of being written by someone that actually knows this stuff 😅
If you've got any awk scripts you use regularly that make your life easier I'd love to hear about them.
👋