Beauty of the Pipe
The Unix pipeline is a powerful and beautiful piece of software that is sometimes difficult to grasp for a command line beginner. We are used to using graphical interface apps that mostly interact with each other, if at all, by writing and reading files. The concepts of standard output (stdout) and standard input (stdin) take some time to learn and understand when one is learning programming and/or data tools on the command line.
The examples expect some understanding of the basics of the terminal, such as parameters and flags. Some of the commands can take a file as a parameter and as such don't require cat, but for the sake of education I will not go down that route.
Let’s start with the basics. cat is a command that writes the contents of a file to standard output. So running
$ cat file.txt
will print the entire contents. If we want only the first 10 lines, we pipe the output of cat into the head command. The pipe, |, feeds the standard output of one command into the standard input of another.
$ cat file.txt | head -n 10
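As noted above, head can also read the file directly, so the same result is available without cat; I will stick with cat throughout so that each pipeline keeps the same shape:

$ head -n 10 file.txt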
Printing the top 10 words in the file file.txt together with their frequencies is an exercise that is often thrown at students in basic courses. Let’s build a pipe for this exercise one step at a time.
- Let’s start with the simplest step: outputting the file to stdout
$ cat file.txt
- Next, we need to tokenize the output so that there is one word per line. For that purpose, tr is a useful command: it will replace all spaces with line breaks.
$ cat file.txt | tr ' ' '\n'
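You can check what tr does without needing a file by feeding it a string with echo (the sample text here is just an illustration):

$ echo 'Lorem ipsum dolor' | tr ' ' '\n'
Lorem
ipsum
dolor

Note that consecutive spaces would produce empty lines; tr's -s flag squeezes repeated line breaks into one, so tr -s ' ' '\n' is a handy variant if the text has irregular spacing.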
- Next, we need to do a little cleaning up. We want to remove all punctuation: commas, periods, exclamation points and so on. Let’s introduce sed, the stream editor. We will use its substitute command with the POSIX [:punct:] character class, which matches punctuation characters.
$ cat file.txt | tr ' ' '\n' | sed 's/[[:punct:]]//g'
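Again, echo lets us check the behaviour on a made-up line before trusting the step:

$ echo 'Hello, world! How are you?' | sed 's/[[:punct:]]//g'
Hello world How are you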
- Next, we want to lowercase everything, so that ‘Lorem’ and ‘lorem’ count as the same word.
$ cat file.txt | tr ' ' '\n' | sed 's/[[:punct:]]//g' | tr '[:upper:]' '[:lower:]'
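The same echo trick confirms the case folding (the words here are, again, just an illustration):

$ echo 'Lorem LOREM lorem' | tr '[:upper:]' '[:lower:]'
lorem lorem lorem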
- There is a command, uniq, which combined with the -c flag gives us the count of each word. However, for that to work, we need to sort the lines first, since uniq only collapses sequential duplicate lines, as the small demonstration below shows.
$ cat file.txt | tr ' ' '\n' | sed 's/[[:punct:]]//g' | tr '[:upper:]' '[:lower:]' | sort
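A tiny example shows why the sort matters; printf is used here just to produce a few lines of toy input, and the exact alignment of the counts varies between uniq implementations. Without sorting, uniq counts each run of adjacent lines separately:

$ printf 'b\na\nb\n' | uniq -c
      1 b
      1 a
      1 b
$ printf 'b\na\nb\n' | sort | uniq -c
      1 a
      2 b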
- Then, let’s apply uniq -c
$ cat file.txt | tr ' ' '\n' | sed 's/[[:punct:]]//g' | tr '[:upper:]' '[:lower:]' | sort | uniq -c
- To get the ten most used words, we need to sort again, this time in reverse numeric order with -rn, since the counts produced by uniq -c should be compared as numbers rather than as text, as demonstrated below
$ cat file.txt | tr ' ' '\n' | sed 's/[[:punct:]]//g' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
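The -n matters because plain sort -r compares the lines as text, where ‘9’ sorts after ‘10’. A quick check with printf:

$ printf '9\n10\n2\n' | sort -r
9
2
10
$ printf '9\n10\n2\n' | sort -rn
10
9
2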
- And finally, use head to get the first ten lines
$ cat file.txt | tr ' ' '\n' | sed 's/[[:punct:]]//g' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -n 10
Just looking at the final command pipe can seem really intimidating, but it is important to remember that the beauty of the pipe is the fact that you can build it one step at a time and always see what is going on between every step.
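In practice this means you can truncate the pipeline at any stage to inspect the intermediate output, for example checking the tokenizing step before adding the clean-up:

$ cat file.txt | tr ' ' '\n' | head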
With the great variety of command line tools for data manipulation, it is possible to do complex things, like scraping an HTML table from Wikipedia into a JSON file that only contains the wanted columns:
curl -s 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' | scrape -be 'table.wikitable > tr:not(:first-child)' | xml2json | jq -c '.html.body.tr[] | {country: .td[1][], border: .td[2][], surface: .td[3][], ratio: .td[4][]}' | head
The above example is from the great blog post 7 command-line tools for data science by Jeroen Janssens.