The assignment in the course website: https://www.cs.bgu.ac.il/~dsp162/Assignments/Assignment_2 Briefly, the assignment was to calculate the PMI of each 2 words in google ngram.
- To develop comfortably, install local hadoop. a good tutorial for mac: http://zhongyaonan.com/hadoop-tutorial/setting-up-hadoop-2-6-on-mac-osx-yosemite.html
- A nice tutorial to run your first mapreduce program can be found in hadoop's website: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
The app is separated to 2 modules: the local module and the mapreduce module. They have a mutual parent. In order to run the app (currently a wordcount app):
- Install the maven project (
mvn install
) - Locate the jar that was created by maven
- Run hadoop:
hadoop jar <jar-path> <wordcount-hdfs-input-path> <wordcount-hdfs-output-path>
For example:hadoop jar /Users/dsp-assignment-2/dsp-assignment-2-mapreduce/target/dsp-assignment-2-mapreduce-1.0-SNAPSHOT-job.jar wordcount/input wordcount/output
We are currently based on hadoop official tutorial and a tikal tutorial for maven + hadoop.
The map reduce steps are as follows:
- Split the n-grams into 2-grams. In addition a simple word count app to count how many times each word and each 2-gram appears in each decade.
- Grouping of each 2-gram count with its first word count, i.e. for the 2-gram
map reduce
, we will write the resultmap,<map-count>,map reduce,<map-reduce-count>
with the keyreduce
. In addition, in this step we will write each single word with its own count, i.e. for the wordmap
we will writemap,<map-count>
withmap
as a key. - Final calculation: Now we will get in the reducer, for the 2gram
map reduce
the count ofmap
, the count ofreduce
and the count ofmap reduce
, so we are ready to caculate the PMI, and this is what we will do in this part.
- We have used hadoop counters in order to count how many words appear in each decade (this is required for the calculation of the PMI)
- Before emitting a pair of words, we sort them lexicographically in order to prevent duplications.
- The method that we have introduced above, in which for the pair
w1,w2
, we write the count ofw1
andw1,w2
withw2
as a key saves us from performing a "join", which means that our application runs with runtime, instead of