Command-Line-Fu

An introduction to the UNIX command line for munging data.

(Source on github, made with remark)

"UNIX philosophy"

small, sharp tools
that can be chained together
to solve complex problems
goes back to the introduction of the pipe and grep in 1973

where simple means: providing well-thought-out functionality, implemented to perfection (see the grep algorithm and implementation)

wikipedia

Command Line Keys

use the up-arrow (and down-arrow) to go back to previous commands
use the Tab key for file name completion: type just the first letter of a filename, then press Tab
use CTRL-A to move your cursor to the beginning of the line
use CTRL-E to move your cursor to the end of the line
use CTRL-K to delete from the cursor to the end of the line

Command Line History

!! repeats the last command
!$ repeats the last word of the last command

$ mkdir this_long_directory_name
$ cd !$

What kind of file is that?

first, let's look at the size:

$ ls -l parole 
-rw-r----- 1 bjelline admin 2286089 Feb 25 12:13 parole
$ ls -lh parole 
-rw-r----- 1 bjelline admin 2.2M Feb 25 12:13 parole

What kind of file is that?

you don't know what might be in the file? guess the file type:

$ file parole 
parole: ASCII text

if it's text, we can take a peek with less.

$ less parole

more or less

use these key commands in less (they also work in many other contexts)

q to quit
SPACE to go down one page
CTRL-B to go back one page
CTRL-F to go forward one page
G to jump to the end
CTRL-G to find out where you are (line number + percent)
50% to jump to the middle of the file (works for any number 0-100)
/cookie to search for cookie
- n to jump to the next occurrence
- N to jump back to the previous occurrence

How many lines are there?

$ wc -l parole
24291 parole
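
A minimal sketch of line counting on a throwaway file. The file sample.txt and its contents are made up for illustration and are not part of the parole data. When wc reads from standard input (the less-than sign, the counterpart of the output redirection covered later), it prints only the number, without a file name:

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical three-line sample file.
printf 'one\ntwo\nthree\n' > sample.txt

# Prints the line count followed by the file name.
wc -l sample.txt

# Reading from standard input prints just the number,
# which is handy for storing it in a variable.
lines=$(wc -l < sample.txt)
echo "$lines"
```

wc -l counts newline characters, so a last line without a trailing newline is not counted.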

rename and move stuff around

now that we know that 'parole' is a csv file, let's rename it:

$ mv parole parole.csv

grep

find certain lines, get rid of certain lines

grep will look for a string in each line, and print the line if it contains the string.

$ grep TRAMA parole.csv

grep

If there is a space in your search string you must quote it:

$ grep 'JAMES, MARK' parole.csv

grep is case sensitive, use -i to make it case insensitive:

$ grep -i trama parole.csv
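
To try quoting and -i without the parole data, here is a small sketch. The file people.txt and the names in it are hypothetical:

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical sample data, not taken from parole.csv.
printf '%s\n' 'JAMES, MARK' 'JAMES, ANNA' 'Smith, Mark' > people.txt

# The quotes keep the space inside the pattern, so grep sees one search string.
# Without them the shell would pass "JAMES," as the pattern and "MARK" as a file name.
grep 'JAMES, MARK' people.txt

# -i ignores case, so "mark" matches both MARK and Mark.
grep -i mark people.txt
```

The first command prints one line, the second prints two.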

grep inverted

the option -v turns grep around: all lines are printed, except those matching the pattern. this way you can get rid of certain lines

$ grep -v ....
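
A possible use of -v, sketched on made-up data (sample.csv and its rows are hypothetical): drop every line that contains DENIED and keep the rest.

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical sample data: a header line and three rows.
printf '%s\n' 'name,decision' 'BROWN,DENIED' 'PAYNE,GRANTED' 'DUQUIN,DENIED' > sample.csv

# Print every line that does NOT contain DENIED:
# the header and the PAYNE row remain.
grep -v DENIED sample.csv
```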

grep enhanced

to build more complex searches with search patterns (regular expressions), use egrep (the same as grep -E)

$ egrep 'TRAMA|TENNEY' parole.csv
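
A small sketch of two patterns, written with grep -E (the same as egrep; newer GNU grep versions print a warning for the egrep spelling). The file names-sample.txt and the first names are made up for illustration:

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical sample data.
printf '%s\n' 'TRAMA, ANNA' 'TENNEY, BOB' 'BROWN, CARL' > names-sample.txt

# The | inside the quotes means "or": lines matching either name are printed.
grep -E 'TRAMA|TENNEY' names-sample.txt

# ^ anchors the pattern at the start of the line:
# only lines beginning with BR or TE are printed.
grep -E '^(BR|TE)' names-sample.txt
```

The quotes matter here as well: without them the shell would treat the | as a pipe.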

sorting

unsorted stuff

$ cat names.txt
DUQUIN, JON
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
BROWN, STEPHANIE
BROWN, WILLIAM
BOATWRIGHT, MAURICE
PAYNE, RAYMOND
DUQUIN, JON
BROWN, WILLIAM
MCKINNEY, ROSS B
MORALES, VICTOR
DUQUIN, JON
MCALPINE, ERIC M
BYNUM, STEVEN

sorting

$ sort names.txt
BOATWRIGHT, MAURICE
BROWN, STEPHANIE
BROWN, WILLIAM
BROWN, WILLIAM
BYNUM, STEVEN
DUQUIN, JON
DUQUIN, JON
DUQUIN, JON
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
MCKINNEY, ROSS B
MORALES, VICTOR
PAYNE, RAYMOND

sort and make uniq

$ sort names.txt | uniq 
BOATWRIGHT, MAURICE
BROWN, STEPHANIE
BROWN, WILLIAM
BYNUM, STEVEN
DUQUIN, JON
MCALPINE, ERIC M
MCKINNEY, ROSS B
MORALES, VICTOR
PAYNE, RAYMOND

sort and count

$ sort names.txt | uniq -c
      1 BOATWRIGHT, MAURICE
      1 BROWN, STEPHANIE
      2 BROWN, WILLIAM
      1 BYNUM, STEVEN
      3 DUQUIN, JON
      4 MCALPINE, ERIC M
      1 MCKINNEY, ROSS B
      1 MORALES, VICTOR
      1 PAYNE, RAYMOND

sort and count, sort by number

$ sort names.txt | uniq -c | sort -n
      1 BOATWRIGHT, MAURICE
      1 BROWN, STEPHANIE
      1 BYNUM, STEVEN
      1 MCKINNEY, ROSS B
      1 MORALES, VICTOR
      1 PAYNE, RAYMOND
      2 BROWN, WILLIAM
      3 DUQUIN, JON
      4 MCALPINE, ERIC M

sort and count, sort by number, reversed

$ sort names.txt | uniq -c | sort -rn
      4 MCALPINE, ERIC M
      3 DUQUIN, JON
      2 BROWN, WILLIAM
      1 PAYNE, RAYMOND
      1 MORALES, VICTOR
      1 MCKINNEY, ROSS B
      1 BYNUM, STEVEN
      1 BROWN, STEPHANIE
      1 BOATWRIGHT, MAURICE

top 3 by number of occurrences

$ sort names.txt | uniq -c | sort -rn | head -3
      4 MCALPINE, ERIC M
      3 DUQUIN, JON
      2 BROWN, WILLIAM
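
Why sort first? uniq only collapses duplicate lines that are directly next to each other. The following sketch shows the difference on a made-up file (sample.txt and its contents are hypothetical):

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical sample with duplicates that are not all adjacent.
printf '%s\n' 'BROWN' 'DUQUIN' 'BROWN' 'DUQUIN' 'DUQUIN' > sample.txt

# uniq alone only merges the two adjacent DUQUIN lines: 4 lines remain.
uniq sample.txt

# After sorting, equal lines are adjacent, so only 2 lines remain.
sort sample.txt | uniq

# Most frequent entry first: DUQUIN with a count of 3.
sort sample.txt | uniq -c | sort -rn | head -1
```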

redirection

Here we use the greater than sign for output-redirection to a file.

This will overwrite the file if it already exists, so be careful when you use it!

$ grep DENIED parole.csv > parole-denied.csv

redirection

if you want to add to an existing file use double greater than signs:

$ grep GRANTED parole.csv  > parole-granted.csv
$ grep PAROLED parole.csv >> parole-granted.csv
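
The difference between the two operators can be tried safely on a scratch file. This is a minimal sketch; out.txt is a made-up file name:

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

echo 'first'  > out.txt     # creates out.txt
echo 'second' > out.txt     # overwrites it: only "second" is left
cat out.txt

echo 'third' >> out.txt     # appends: "second", then "third"
cat out.txt
```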

redirection

Using redirection you can get any program to write to a file.

How does this work? The UNIX command line tools send their output to a "channel" called "standard output" (written STDOUT or stdout, depending on the programming language). Redirection is the general way of sending this channel to a file.

See also man bash, search for REDIRECTION

pipeline

How many people were denied parole? we just need to count the lines:

$ grep DENIED parole.csv > parole-denied.csv
$ wc -l parole-denied.csv

pipeline

But there is no need to create the file at all: we can create a pipeline with the vertical bar symbol. The pipeline connects the output channel of grep to the input channel of wc:

$ grep DENIED parole.csv | wc -l

The vertical bar is often called "pipe", so you would read this command as: grep denied parole.csv pipe wordcount minus L

Using the pipe you can create complex programs without writing any permanent code.
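
To check that both approaches give the same answer, here is a sketch on made-up data (sample.csv, denied.csv and the variable names are hypothetical):

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical sample data.
printf '%s\n' 'BROWN,DENIED' 'PAYNE,GRANTED' 'DUQUIN,DENIED' > sample.csv

# Two steps, with an intermediate file ...
grep DENIED sample.csv > denied.csv
with_file=$(wc -l < denied.csv)

# ... and one step, with a pipe and no intermediate file.
with_pipe=$(grep DENIED sample.csv | wc -l)

# Both counts are 2.
echo "$with_file $with_pipe"
```

Some versions of wc pad the number with spaces, so the output may be indented.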

pipeline

When building pipes and working with large data sets, always put "| less" at the end to take a peek.

$ sort names.txt | less
$ sort names.txt | uniq -c | less
$ sort names.txt | uniq -c | sort -n | less
$ sort names.txt | uniq -c | sort -rn | less
$ sort names.txt | uniq -c | sort -rn | head -30 | less

cutting columns from files

cut, paste

handling csv files

You might want to use the commands 'cut' and 'paste' to handle comma-separated-values files. But beware: as soon as you have commas inside one of the columns, cut will not work for you!
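
The problem is easy to reproduce. In this sketch (sample.csv and its contents are made up), the name column is quoted because it contains a comma, and cut splits it anyway:

```shell
# Work in a temporary directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical CSV with two columns: name and decision.
printf '%s\n' 'name,decision' '"BROWN, WILLIAM",DENIED' > sample.csv

# Ask cut for column 2, using the comma as the delimiter.
# The header is fine, but the data row yields half of the name
# instead of DENIED, because cut knows nothing about quotes.
cut -d, -f2 sample.csv
```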

Here csvkit comes in: csvkit knows about strings and escaping in csv and will cut out the correct columns:

$ csvcut -c1,2 parole.csv

Now we can use the power of pipelines to learn more about a csv file

$ csvcut -c1,2 parole.csv | sort .....