Collaborators: Alolika Gon, Kinnri Sinha
Runs mapper and reducer on multiple worker processes. Incorporates fault tolerance by restarting a worker process if it is killed.
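The restart behaviour can be illustrated with a minimal Python sketch. It is not the project's C++ master: the function name `run_with_restart`, the `max_restarts` limit, and the choice to supervise one worker at a time are all assumptions made for illustration.

```python
import subprocess

def run_with_restart(cmd, max_restarts=3):
    """Run cmd until it exits cleanly; return how many times it was started.

    A non-zero return code (negative on POSIX when the process was killed
    by a signal) is treated as a failed worker that must be restarted.
    max_restarts is a hypothetical safety limit, not a project setting.
    """
    starts = 0
    while True:
        result = subprocess.run(cmd)
        starts += 1
        if result.returncode == 0:
            return starts
        if starts > max_restarts:
            raise RuntimeError("worker kept failing: " + " ".join(cmd))
```

A real master would supervise several workers concurrently and re-assign the unfinished task; the loop above only shows the detect-and-restart step.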
Application 1: Inverted Index
This application takes document IDs and the words contained in each document as input. It builds an inverted index, i.e., a mapping from each word to all the documents in which it appears.
Input:
Document ID \t Contents (words separated by spaces)
Output:
Word \t Document IDs in which this word is present
Code: src/UDF1.cpp
Input File: inputFile1.txt
Python Code: invertedIndex.py
Outputs from Python execution: true_outputFile1.txt
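As a rough illustration of the map and reduce steps for the inverted index, here is a Python sketch. It is not the code in src/UDF1.cpp or invertedIndex.py; the function names, the sample documents, and the choice to print document IDs sorted and space-separated are assumptions.

```python
from collections import defaultdict

def map_inverted_index(line):
    """Emit (word, doc_id) pairs for one 'doc_id<TAB>contents' line."""
    doc_id, _, contents = line.rstrip("\n").partition("\t")
    return [(word, doc_id) for word in contents.split()]

def reduce_inverted_index(pairs):
    """Group document IDs by word, dropping duplicate IDs."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample input, one document per line.
lines = ["1\tthe quick fox", "2\tthe lazy dog"]
pairs = [pair for line in lines for pair in map_inverted_index(line)]
index = reduce_inverted_index(pairs)
for word in sorted(index):
    print(word + "\t" + " ".join(index[word]))
```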
Application 2: Word Count
The goal of this application is to count the number of occurrences of each word in the document.
Input:
A text document
Output:
Word \t Number of occurrences of the word
Code: src/UDF2.cpp
Input File: inputFile2.txt
Python Code: spark_word_count.py
Outputs from Python execution: true_outputFile2.txt
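The word count can be sketched the same way in Python. This is an illustration only, not src/UDF2.cpp or spark_word_count.py; the function names are hypothetical, and splitting on whitespace with case-sensitive matching is an assumption about tokenization.

```python
from collections import Counter

def map_word_count(line):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]

def reduce_word_count(pairs):
    """Sum the emitted counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

# Hypothetical sample document.
text = ["to be or not", "to be"]
pairs = [pair for line in text for pair in map_word_count(line)]
counts = reduce_word_count(pairs)
for word in sorted(counts):
    print(word + "\t" + str(counts[word]))
```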
Application 3: k-mer Counter
In bioinformatics, k-mers are substrings of length k contained within a genome sequence of nucleotides (A, C, T and G). In this application, we find all k-length substrings of a genome sequence and count the number of occurrences of each. We use k=3 for this application.
Input:
A genome sequence containing the 4 nucleotides (A, C, T and G)
Output:
3-mer sequence \t Number of occurrences of the 3-mer sequence
Code: src/UDF3.cpp
Input File: inputFile3.txt
Python Code: kmerCount.py
Outputs from Python execution: true_outputFile3.txt
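Counting all k-length substrings amounts to sliding a window of width k over the sequence one position at a time. The Python sketch below shows this for a single in-memory sequence; it is not src/UDF3.cpp or kmerCount.py, the name `count_kmers` and the sample sequence are hypothetical, and it does not address how k-mers that span a split between two mappers are handled.

```python
from collections import Counter

def count_kmers(sequence, k=3):
    """Count every overlapping substring of length k in sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Hypothetical sample sequence.
kmers = count_kmers("ACGTACG")
for kmer in sorted(kmers):
    print(kmer + "\t" + str(kmers[kmer]))
```

A sequence of length n yields n - k + 1 windows, so the 7-letter sample produces five 3-mers, with ACG counted twice.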
testfile.py compiles the code and runs it with different config file attributes for UDF1, UDF2 and UDF3. It also runs the test cases for the three UDFs and tests the output against the true results generated by the Python files.
Run testfile.py (pass 1 as the command-line argument to test fault tolerance; otherwise pass 0):
pip install psutil
python testfile.py <0/1>
Three output files will be created, one per UDF: outputFile1.txt, outputFile2.txt and outputFile3.txt.
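One way to check a generated output against its true_outputFile counterpart is an order-insensitive comparison of records. The Python sketch below is an assumption about how such a check could be written, not a description of what testfile.py does; the helper name `same_records` is hypothetical.

```python
def same_records(produced_text, expected_text):
    """Return True if both texts hold the same non-empty lines, ignoring order."""
    def normalize(text):
        return sorted(line.strip() for line in text.splitlines() if line.strip())
    return normalize(produced_text) == normalize(expected_text)

# Hypothetical contents of an output file and its reference file.
print(same_records("ACG\t2\nCGT\t1\n", "CGT\t1\nACG\t2\n"))
```

Ignoring line order is useful when several reducers write their results in a non-deterministic sequence.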
config.txt is the config file that defines the input file, the output file, the number of mappers and reducers, and the UDF to run. It has the following format:
app.inputfilename=inputFile3
app.outputfilename=output_dir/outputFile3
app.N=3
app.class_name=UDF3
Changes to app.inputfilename: Type the name of the file in plaintext. Do not use file extensions or quotes.
Changes to app.outputfilename: Type the name of the file in plaintext after output_dir/. Do not use file extensions or quotes. All files generated will be in .txt format.
Changes to app.class_name: There are 3 choices for this: UDF1/UDF2/UDF3.
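The key=value format above is simple to read programmatically. The Python sketch below parses the example config shown earlier; it is illustrative only (the project reads config.txt in C++), and the names `parse_config` and `n_workers` are hypothetical.

```python
def parse_config(text):
    """Parse 'key=value' lines into a dict, skipping blank or malformed lines."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

sample = """app.inputfilename=inputFile3
app.outputfilename=output_dir/outputFile3
app.N=3
app.class_name=UDF3"""

config = parse_config(sample)
n_workers = int(config["app.N"])  # app.N is stored as text, so convert it
print(config["app.class_name"], n_workers)
```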
To compile and run the program:
- Run
g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp
in the main directory to create the .exe file.
- Run
./mapreduce.exe
To compile and run the program together with kill_process.py:
- Run
g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp
in the main directory to create the .exe file.
- Run
./mapreduce.exe
and kill_process.py concurrently.