Recall that in last week's lab, we saw how to use python and SQL to perform the count distinct and count group queries. We had two main takeaways:
- Python's pandas library requires $\Omega(n)$ memory, and so it is not suitable for large datasets.
- SQLite requires only $O(1)$ memory, and so is suitable for large datasets.
In this lab, we will see how to run those same queries in the shell. We'll also see how to make nice visualizations of those queries using the terminal program gnuplot, and parallelize those queries using MapReduce.
Fork this repo, clone your fork onto the lambda server, and cd into the repo directory. It will be important later that you are working on your forked repo because you'll be uploading images to the repo for your submission.
The uniq
and sort
commands are included in every POSIX compliant operating system.
They are commonly used together to perform the count distinct and count group operations.
The uniq
command (pronounced like "unique") filters its input to remove any line that is the same as its previous line.
We'll use the colors
file included in the repo as an example.
Compare the output of the following two commands.
$ cat colors
$ cat colors | uniq
Notice that every line in the output of uniq
is not necessarily "unique",
but no two neighboring lines are the same.
One common strategy for outputting only the distinct lines is to first sort the input.
$ cat colors | sort | uniq
Now notice that there are no duplicate colors in the output.
We can complete our count distinct query by combining with wc -l
:
$ cat colors | sort | uniq | wc -l
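To see why the sort step matters, here is a minimal sketch on a tiny input. The sample data and the temporary file are made up for illustration; they are not the contents of the repo's colors file.

```shell
# Hypothetical sample data (NOT the repo's colors file).
tmp=$(mktemp)
printf 'red\nred\nblue\nred\ngreen\nblue\n' > "$tmp"

# uniq alone only collapses adjacent duplicates, so 5 lines remain:
# red, blue, red, green, blue
adjacent=$(uniq < "$tmp" | wc -l)

# Sorting first makes all duplicates adjacent, leaving the 3 distinct colors.
distinct=$(sort < "$tmp" | uniq | wc -l)

echo "uniq only: $adjacent, sort then uniq: $distinct"
rm "$tmp"
```

The sort command also has a -u flag, so sort -u is a common shorthand for sort | uniq when you only need the distinct lines.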
Note: Unfortunately, the sort command above is rather expensive. Sorting requires $\Omega(n)$ memory and $\Omega(n\log n)$ compute, and the sort command is the only commonly used shell command that does not use $O(1)$ memory. Still, the sort command is about as efficient as theoretically possible. It uses an external sort algorithm, which is something that you probably did not study in data structures. External sort is very similar to merge sort, except that the intermediate steps are stored on the hard drive. This allows sort to sort extremely large files even on systems with limited RAM.
The count group query is almost as easy to do.
The uniq
program with the -c
flag prefixes each output line with the number of times that line was repeated consecutively in the input,
and so adding this flag performs the count group query.
$ cat colors | sort | uniq -c
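As a rough illustration of what the count group output looks like, the sketch below runs the same pipeline on made-up data (not the repo's colors file) and then picks individual counts out of the result with awk.

```shell
# Hypothetical sample data (not the repo's colors file).
counts=$(printf 'red\nblue\nred\nred\n' | sort | uniq -c)
echo "$counts"

# Each output line is a count followed by the value.
# The amount of leading padding before the count varies by platform.
blue=$(echo "$counts" | awk '$2 == "blue" {print $1}')
red=$(echo "$counts" | awk '$2 == "red" {print $1}')
echo "blue=$blue red=$red"
```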
Notice, however, that the output above is not sorted by the number of occurrences. But sorted output would be easier to read, so let's figure out how to get it.
You might think that another call to sort
would do the trick.
Try it:
$ cat colors | sort | uniq -c | sort
Technically, this output is now sorted ASCIIbetically.
(The 1 character comes before 2 in ASCII, and so 11 comes before 2.)
But this isn't what we really wanted.
We want to sort on the numerical values of the numbers in our table,
and not their ASCII values.
sort
has a flag -n
for this purpose.
Adding it to our command, we can now perform a nice count group query on colors
with the command
$ cat colors | sort | uniq -c | sort -n
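The difference between the two orderings is easiest to see on bare numbers. This is a small sketch with made-up values, not output from the colors file.

```shell
# Hypothetical counts, one per line.
ascii=$(printf '2\n11\n1\n' | sort | tr '\n' ' ')
numeric=$(printf '2\n11\n1\n' | sort -n | tr '\n' ' ')

echo "sort:    $ascii"     # 1 11 2  (compared character by character)
echo "sort -n: $numeric"   # 1 2 11  (compared as numbers)
```

Adding the -r flag as well (sort -nr) reverses the order, which puts the largest counts first.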
Exercise: Use output redirection to store the output of the command above in a file called
colors.dat
. We will use this file to experiment with plotting in the next section.
It is common to plot the results of a count group query as a bar chart. In this section, we'll see how to do that with gnuplot. Gnuplot is a popular program for generating charts on the terminal. It is the recommended tool for generating plots for Wikipedia, and most plots on Wikipedia were generated with gnuplot.
Note: The GNU project is an open source rewrite of the Unix operating system, and all of the terminal commands we have used so far are part of GNU. Gnuplot, however, is not affiliated with the GNU project. Therefore, even though GNU is pronounced with a hard G, gnuplot is canonically pronounced as "newplot" using the standard English pronunciation of gnu. The authors have a detailed FAQ answer about the origin of the name.
Start the gnuplot program.
$ gnuplot
You should get a large welcome message printed followed by a new prompt gnuplot>
indicating that you are now typing gnuplot commands instead of shell commands.
The simplest command is the plot
command,
which just takes as input a filename of data to plot.
Try it.
gnuplot> plot 'colors.dat'
You likely get an error similar to
qt.qpa.screen: QXcbConnection: Could not connect to display
Could not connect to any X display.
This error message references the X Window System, which is a popular system for displaying graphics on Linux machines. One of its main advantages is that it allows windows created by remote machines (like the lambda server) to be displayed on your local machine.
If your laptop is a Linux machine, then you already have X Windows installed.
You can solve this problem by enabling "X forwarding" with the -XY
flags in your ssh command.
There exist open source X Windows implementations for every operating system. Xming is the most popular one for Windows, and XQuartz for Mac. If you installed one of these on your laptop, then you would also be able to open windows on the lambda server and have them appear on your machine. For this lab, however, you don't need to install this software if you don't want to. We'll see alternative ways to get access to the plots.
Note: Many people think that X Windows gets its name from an allusion to the Microsoft Windows graphical interface. But the opposite is closer to the truth. The first version of X Windows was released in 1984, and the first version of Microsoft Windows was released in 1985.
The easiest way to view the plots is to plot them directly in the terminal with ASCII art.
This is a good time to mention that the word terminal has a different meaning in the context of gnuplot.
Recall that in most contexts, a terminal is the graphical program on your computer that you use when interacting with the shell.
In gnuplot, however, a terminal refers to the graphical engine used to render the plot.
The default terminal is called qt
because it uses the Qt library to render the plot to the X Windows system.
We can change the terminal to the dumb terminal to get the contents of our plot printed as ASCII art.
The following commands should work for everyone.
gnuplot> set terminal dumb
gnuplot> plot 'colors.dat'
  12 +---------------------------------------------------------------------+
     |                      +                       +                      |
     |                                                                     |
  10 |-+                                                                 +-|
     |                                                                     |
     |                                                                     |
     |                                                                     |
   8 |-+                                                                 +-|
     |                                                                     |
     |                                                                     |
   6 |-+                                                                 +-|
     |                                                                     |
     |                                                                     |
   4 |-+                                            A                    +-|
     |                                                                     |
     |                      A                                              |
     |                                                                     |
   2 |-+                                                                 +-|
     |                                                                     |
     |                      +                       +                      |
   0 +---------------------------------------------------------------------+
  yellow                  green                    red                   blue
This plot is hard to read, and the values for yellow
and blue
aren't even being displayed because they overlap with the axes.
We can make this plot a little bit nicer by adding some more formatting commands.
gnuplot> set style data histogram
gnuplot> set style fill solid border -1
gnuplot> plot 'colors.dat' using 1:xtic(2) notitle
  12 +---------------------------------------------------------------------+
     |             +             +             +             +             |
     |                                                     ******          |
  10 |-+                                                   *    *        +-|
     |                                                     *    *          |
     |                                                     *    *          |
     |                                                     *    *          |
   8 |-+                                                   *    *        +-|
     |                                                     *    *          |
     |                                                     *    *          |
   6 |-+                                                   *    *        +-|
     |                                                     *    *          |
     |                                                     *    *          |
   4 |-+                                     ******        *    *        +-|
     |                                       *    *        *    *          |
     |                         ******        *    *        *    *          |
     |                         *    *        *    *        *    *          |
   2 |-+                       *    *        *    *        *    *        +-|
     |           ******        *    *        *    *        *    *          |
     |           *    *        *    *        *    *        *    *          |
   0 +---------------------------------------------------------------------+
                 yellow        green          red           blue
The set style
commands change the formatting to use a bar plot instead of a line plot.
The using 1:xtic(2)
tells gnuplot that the first column in the datafile should be the height of the bars,
and the second column should be the label on the x-axis.
Finally, the notitle
command removes the legend.
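The column numbers in using 1:xtic(2) refer to the whitespace-separated fields on each line of the data file. As a rough sketch, the awk commands below pull out the same two fields from rows in the layout that uniq -c produces. The rows are made up for illustration and are not the contents of colors.dat.

```shell
# Hypothetical rows in the "count label" layout produced by uniq -c
# (these numbers are made up, not the contents of colors.dat).
rows=$(printf '      1 yellow\n     11 blue\n')

# awk splits on whitespace, which is also gnuplot's default field separator:
# field 1 is the count (bar height), field 2 is the label (x-axis tick).
heights=$(echo "$rows" | awk '{print $1}' | tr '\n' ' ')
labels=$(echo "$rows" | awk '{print $2}' | tr '\n' ' ')

echo "heights: $heights"
echo "labels:  $labels"
```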
This second plot is a little bit nicer,
but it's still not very good since the results are limited to ASCII art.
To generate a proper plot, we can use the png
terminal.
The following commands will replot the graph above and store it in the file colors.png
.
(There is no need to retype the set style
commands, as those will remain in effect.)
gnuplot> set terminal png size 800,400
gnuplot> set output 'colors.png'
gnuplot> plot 'colors.dat' using 1:xtic(2) notitle
The plot
command above should have no output.
Leave the gnuplot shell by typing ^D
and run ls
.
You should see the file colors.png
was created in your lab folder.
Note: png is officially pronounced like "ping", and so colors.png is pronounced as "colors dot ping".
Unfortunately, there's no way to view png files inside the terminal.
To view the colors.png
file, you will need to transfer it to your computer.
On Linux machines, this is once again trivial.
The sshfs
program lets you mount the lambda server's filesystem onto your own filesystem.
For example, if I run the following command on my laptop:
$ sshfs -p 5055 [email protected]:/home/csci143example ~/lambda
Then the folder ~/lambda
on my laptop will contain all of the contents of my home folder /home/csci143example
on the lambda server.
I can then navigate to that folder using my standard file explorer tools to view the file.
Note: There exist sshfs implementations for Mac and Windows. You are not required to download and install them, but they may make working on the lambda server easier for you.
If you don't have sshfs
installed, then the next best option is to use github to transfer the file.
You can upload the file to github with the following commands.
$ git add colors.png
$ git commit -m 'added colors.png'
$ git push origin master
If these commands worked successfully, then the image below should work.
If not, ensure that you're looking at your forked repo and not my repo.
Don't move on to the next steps until you're successfully able to view your image. Part of the submission for this lab will require that all of these broken image links are replaced with working images.
Working with gnuplot through the terminal interface is possible, but it's annoying to have to retype all of those commands anytime we want to make a plot. A better solution is to automate plotting with a script. We will now see how to write and use these scripts.
Create a file boxplot.gp
with the following contents.
set terminal png size 800,400
set output 'colors.png'
set style data histogram
set style fill solid border -1
plot 'colors.dat' using 1:xtic(2) notitle
Notice that these are just the commands that we previously typed directly into the gnuplot terminal.
We can now run all of these commands at once by passing the -c boxplot.gp
arguments to gnuplot.
First, delete the colors.png
file.
$ rm colors.png
$ ls
Then run the script and verify that it worked by seeing that it recreated the file.
$ gnuplot -c boxplot.gp
$ ls
Just like python functions are more useful when they take parameters that adjust how they work, scripts are also more useful when they take input. We will modify the script so that it can be used with the pipe. This will require two changes:
- Replace the hard coded output filename 'colors.png' with the variable ARG1. Gnuplot sets this variable to the first command line argument passed to the script.
- Replace the hard coded input filename 'colors.dat' with the special filename '/dev/stdin'. This will allow us to get our data from the pipe.
After making these changes, the final boxplot.gp
script should look like:
set terminal png size 800,400
set output ARG1
set style data histogram
set style fill solid border -1
plot '/dev/stdin' using 1:xtic(2) notitle
To test this new script, we will recreate the original colors.png
file.
$ rm colors.png
$ ls
$ cat 'colors.dat' | gnuplot -c boxplot.gp colors.png
$ ls
As before, you should verify that the output of the first ls
and second ls
differ only by the newly created colors.png
file.
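The same two ideas (an argument for the output filename, stdin for the input data) are not specific to gnuplot. Here is a minimal sketch of the pattern as a plain shell script, which may be easier to experiment with. The script name to_file.sh and the sample rows are hypothetical.

```shell
# Hypothetical helper script: reads data from stdin, writes to the file named by $1.
workdir=$(mktemp -d)
cat > "$workdir/to_file.sh" <<'EOF'
#!/bin/sh
# $1 plays the role of gnuplot's ARG1; /dev/stdin is the piped-in data.
sort -n /dev/stdin > "$1"
EOF

# Made-up rows are piped in; the output filename is passed as an argument.
printf '3 red\n1 yellow\n' | sh "$workdir/to_file.sh" "$workdir/out.dat"
result=$(cat "$workdir/out.dat")
echo "$result"
rm -r "$workdir"
```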
Recall that the file /data/Twitter dataset/geoTwitter20-01-01.zip
contains all of the geolocated tweets sent on January 1st 2020.
Let's do an analysis to see how many tweets were sent from each country on this day.
We can easily do this with the following shell 1-liner:
$ unzip -p /data/Twitter\ dataset/geoTwitter20-01-01.zip | jq '.place.country_code' | sort | uniq -c | sort -n | tail -n10 | gnuplot -c boxplot.gp top10.png
For long commands like this, it is common to break them up onto multiple lines.
In the shell (and most programming environments), any line that ends with a backslash \
is treated as continuing on to the next line.
Thus the following more readable shell command is 100% equivalent to the shell command above.
$ unzip -p /data/Twitter\ dataset/geoTwitter20-01-01.zip \
| jq '.place.country_code' \
| sort \
| uniq -c \
| sort -n \
| tail -n10 \
| gnuplot -c boxplot.gp top10.png
This command takes about 5 minutes to run.
Perhaps surprisingly, the bottleneck of this command is the jq
command which parses the JSON.
To verify this, press ^Z
while the program above is running,
then run the command ps
.
You should get output similar to
$ ps
PID TTY TIME CMD
10327 pts/3 00:00:00 bash
25755 pts/3 00:00:22 unzip
25756 pts/3 00:00:54 jq
25757 pts/3 00:00:00 sort
25758 pts/3 00:00:00 uniq
25759 pts/3 00:00:00 sort
25760 pts/3 00:00:00 tail
25761 pts/3 00:00:00 gnuplot
25825 pts/3 00:00:00 ps
Recall that ps
lists all of the processes that are currently running.
Notice that each of the commands in your shell 1-liner is actually running concurrently.
The time that is listed for each of these processes is the total amount of CPU time that process has used.
In the output above, unzip
has used 22 seconds, and jq
has used 54 seconds.
Fortunately, because the lambda server has so many CPUs available, each of these processes will be running on their own CPU and running in parallel.
The amount of time for the entire command to complete is therefore only the length of time for the slowest command,
and not the total length of time for all commands.
In Python, it is essentially impossible to have the unzip and JSON decoding happen in parallel due to the Global Interpreter Lock (GIL) (although there is some recent work to fix this problem).
In the shell, it is trivial to have these expensive tasks run in parallel.
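You can check the claim about pipeline timing without the Twitter data. In this sketch, sleep stands in for a slow pipeline stage (it ignores its input), and the whole pipeline finishes in roughly the time of one stage rather than the sum of all three.

```shell
# Three stages that each take about 1 second (sleep ignores its stdin).
start=$(date +%s)
sleep 1 | sleep 1 | sleep 1
end=$(date +%s)
elapsed=$((end - start))

# If the stages ran one after another, this would be about 3 seconds.
echo "pipeline took about ${elapsed}s"
```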
If you suspended the command with ^Z, resume it by running fg. When your command completes, upload the top10.png
file to github.
You should see it appear below.
Recall from the MapReduce homework that MapReduce is a parallel procedure for large scale data analysis. In MapReduce, the "mappers" analyze small parts of the dataset in parallel, and then the "reducers" combine those results into a final result. In the homework, you used (or will use) a combination of python and the shell to perform these tasks. In this lab, we'll see how to perform MapReduce entirely in the shell.
Our goal will be to generate a plot of how many tweets were sent from each country in the first 9 days of 2020.
The mapper is fairly simple. It's just a count group query like we did in the previous section, but without plotting the results. The following shell code runs these mappers in parallel.
$ for file in /data/Twitter\ dataset/geoTwitter20-01-0*.zip; do
unzip -p "$file" \
| jq '.place.country_code' \
| sort \
| uniq -c \
| sort -n \
> map.$(basename "$file").dat &
done
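The trailing & puts each mapper in the background, so the prompt returns while the mappers are still running. One way to block until they have all finished is the shell builtin wait. The sketch below uses sleep as a stand-in for a slow mapper, and the file names are hypothetical.

```shell
workdir=$(mktemp -d)
for day in 1 2 3; do
    # Stand-in for a slow mapper: pause, then write a result file.
    ( sleep 1; echo "done $day" > "$workdir/map.$day.dat" ) &
done

# wait with no arguments blocks until every background job has exited.
wait
finished=$(ls "$workdir" | wc -l)
echo "$finished map files written"
rm -r "$workdir"
```

Running the reducer before the mappers have finished would read incomplete map files, so it is worth confirming that the background jobs are done first.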
The reducer is more complicated. The code below performs the reduce step by merging all of the outputs from the map step into a single file.
cat map.geoTwitter20-01-01.zip.dat | while read line; do
country_code=$(echo "$line" | sed 's/[^a-zA-Z"]//g')
counts=$(cat map.* | grep "$country_code" | sed 's/[^0-9]//g')
sum=$(echo $counts | sed 's/ /+/g' | bc)
echo "$sum" "$country_code"
done | sort -n > reduce
The while loop above reads each line from stdin (i.e. the output of the cat
command) one at a time, storing it in the line
variable.
We then extract the country code, search all of the map files for that country code, and sum their totals together with the bc
command (bc stands for basic calculator and is the main tool for doing math in the shell).
The final output is re-sorted and stored in the file reduce
.
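For comparison, the same sum-by-key reduce can be sketched with awk's associative arrays. This is an alternative technique, not the procedure used in this lab, and the map files below are made-up examples in the same "count key" layout.

```shell
workdir=$(mktemp -d)
# Hypothetical mapper outputs in the "count key" layout of uniq -c.
printf '      2 "MX"\n      5 "US"\n' > "$workdir/map.day1.dat"
printf '      1 "MX"\n      7 "US"\n      4 "JP"\n' > "$workdir/map.day2.dat"

# Accumulate the count (field 1) per key (field 2), then print the totals.
cat "$workdir"/map.*.dat \
    | awk '{ sum[$2] += $1 } END { for (key in sum) print sum[key], key }' \
    | sort -n > "$workdir/reduce"

result=$(cat "$workdir/reduce")
echo "$result"
rm -r "$workdir"
```

One difference in behavior: the while loop above only visits the country codes that appear in the first day's map file, whereas this version includes every key that appears in any map file (here, "JP" appears only on the second day).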
Exercise: Run the map and reduce procedures above. Then create a plot of the results in the file
country_code_mapreduce.png
. Upload your plot to github and ensure that it appears below.
In this last section, you will have to write your own MapReduce procedure based on the code above. I want you to calculate how many tweets sent from the United States are written in each language over the period Jan 1 to Jan 9 2020.
Hint: You should modify the above commands to filter out tweets that don't come from the US with the
grep
command. You should also modify the command to do a count group by query on the language the tweet was written in. You'll have to examine the JSON objects in order to figure out where this information is stored.
Plot the top 20 languages into the file uslang-top20.png
and upload it to github.
Upload the url of your forked github repo to sakai. In order to get full credit for the lab, you'll need to have all of the images uploaded to github.
The lab is worth 4 points. Part 6 is worth 2 points, and the rest of the lab the other 2 points.