Project 1: MapReduce on a single server

Collaborators: Alolika Gon, Kinnri Sinha

Runs mapper and reducer on multiple worker processes. Incorporates fault tolerance by restarting a worker process if it is killed.

Application Explanation

Application 1: Inverted Index

This application takes a set of document IDs, each with the words in that document, as input. It builds an inverted index: a mapping from each word to all the documents that contain it.

Input:

Document ID \t Contents, i.e., words separated by spaces

Output:

Word \t Document IDs in which this word is present

Code: src/UDF1.cpp
Input File: inputFile1.txt
Python Code: invertedIndex.py
Outputs from Python execution: true_outputFile1.txt
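The map and reduce steps for an inverted index can be sketched as below. This is a minimal illustration, not the contents of src/UDF1.cpp or invertedIndex.py; the function names `map_inverted_index` and `reduce_inverted_index`, the sample lines, and the space-separated list of document IDs in the output are assumptions.

```python
from collections import defaultdict


def map_inverted_index(line):
    """Map step: emit (word, doc_id) pairs for one 'doc_id<TAB>contents' line."""
    doc_id, _, contents = line.rstrip("\n").partition("\t")
    return [(word, doc_id) for word in contents.split()]


def reduce_inverted_index(pairs):
    """Reduce step: collect the distinct document IDs seen for each word."""
    index = defaultdict(list)
    for word, doc_id in pairs:
        if doc_id not in index[word]:
            index[word].append(doc_id)
    return dict(index)


# Hypothetical sample input: two documents with IDs 1 and 2.
lines = ["1\tapple banana", "2\tbanana cherry"]
pairs = [pair for line in lines for pair in map_inverted_index(line)]
index = reduce_inverted_index(pairs)

# Print in the 'Word \t Document IDs' layout (ID separator is an assumption).
for word in sorted(index):
    print(word + "\t" + " ".join(index[word]))
```

In a multi-worker run, each mapper would handle a subset of the lines and the pairs would be grouped by word before the reduce step; the sketch runs both steps in a single process for clarity.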

Application 2: Word Count

The goal of this application is to count the number of occurrences of each word in a document.

Input:

A text document

Output:

Word \t Number of occurrences of the word

Code: src/UDF2.cpp
Input File: inputFile2.txt
Python Code: spark_word_count.py
Outputs from Python execution: true_outputFile2.txt
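As a rough sketch of the word count computation: the mapper emits a (word, 1) pair per word and the reducer sums the pairs per word. The function names and the whitespace-based, case-sensitive tokenization are assumptions; src/UDF2.cpp and spark_word_count.py may tokenize differently.

```python
from collections import Counter


def map_word_count(line):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def reduce_word_count(pairs):
    """Reduce step: sum the emitted counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


# Hypothetical sample text.
text = "the cat saw the dog"
counts = reduce_word_count(map_word_count(text))

# Print in the 'Word \t Number of occurrences' layout.
for word in sorted(counts):
    print(word + "\t" + str(counts[word]))
```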

Application 3: k-mer counter

In bioinformatics, k-mers are substrings of length k contained within a genome sequence of nucleotides (A, C, T and G). This application finds all substrings of length k in a genome sequence and counts the occurrences of each one. We use k=3 for this application.

Input:

A genome sequence containing the 4 nucleotides (A, C, T and G)

Output:

3-mer sequence \t Number of occurrences of the 3-mer sequence

Code: src/UDF3.cpp
Input File: inputFile3.txt
Python Code: kmerCount.py
Outputs from Python execution: true_outputFile3.txt
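The k-mer count can be illustrated with a sliding window over the sequence. This sketch assumes overlapping windows that advance one nucleotide at a time; `count_kmers` and the sample sequence are hypothetical, and how src/UDF3.cpp treats k-mers that span a split between workers is not shown here.

```python
from collections import Counter


def count_kmers(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts


# Hypothetical sample sequence; its 3-mers are ACG, CGT, GTA, TAC, ACG.
counts = count_kmers("ACGTACG")

# Print in the '3-mer sequence \t Number of occurrences' layout.
for kmer in sorted(counts):
    print(kmer + "\t" + str(counts[kmer]))
```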

Running automated testing:

testfile.py compiles the code and runs it with different config file attributes for UDF1, UDF2 and UDF3. It also runs the test cases for the three UDFs and tests the output against the true results generated by the Python files.

Run testfile.py (pass 1 as the command-line argument to test fault tolerance; otherwise pass 0):
pip install psutil
python testfile.py <0/1>

Three output files are created, one for each UDF: outputFile1.txt, outputFile2.txt and outputFile3.txt.
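One way to check a produced output against the corresponding true output is to compare the two files as sets of lines, since the order in which reducers write records may vary between runs. This is a sketch of that idea, not the logic in testfile.py; `load_records` and `outputs_match` are hypothetical helpers and the order-insensitive comparison is an assumption.

```python
def load_records(text):
    """Return the non-empty lines of an output file as a sorted list."""
    return sorted(line for line in text.splitlines() if line.strip())


def outputs_match(produced, expected):
    """True when both texts contain the same records, ignoring line order."""
    return load_records(produced) == load_records(expected)


# Hypothetical file contents; real runs would read outputFile2.txt and
# true_outputFile2.txt from disk instead.
produced = "dog\t1\nthe\t2\ncat\t1\n"
expected = "cat\t1\ndog\t1\nthe\t2\n"
print(outputs_match(produced, expected))
```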

Running the system for one application:

Making changes to config.txt:

config.txt is the config file that defines the input file, the output file, the number of mappers and reducers, and the UDF to run. It has the following format:

app.inputfilename=inputFile3
app.outputfilename=output_dir/outputFile3
app.N=3
app.class_name=UDF3

Changes to app.inputfilename: Type the name of the file in plaintext. Do not use file extensions or quotes.
Changes to app.outputfilename: Type the name of the file in plaintext after output_dir/. Do not use file extensions or quotes. All files generated will be in .txt format.
Changes to app.class_name: There are 3 choices for this: UDF1/UDF2/UDF3.
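To make the key=value format concrete, the sketch below parses the sample config into a dictionary. It is only an illustration of the file format; `parse_config` is a hypothetical helper, and appending .txt to the input file name is an assumption based on the note that names are written without extensions.

```python
def parse_config(text):
    """Parse 'key=value' lines into a dict, skipping blank or malformed lines."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


# The sample config shown above.
sample = """app.inputfilename=inputFile3
app.outputfilename=output_dir/outputFile3
app.N=3
app.class_name=UDF3
"""

config = parse_config(sample)
n = int(config["app.N"])                          # number of mappers and reducers
input_path = config["app.inputfilename"] + ".txt"  # assumption: extension is added
print(config["app.class_name"], n, input_path)
```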

Compiling the code:

Run g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp in the main directory to create the .exe file.
Run ./mapreduce.exe.

Test fault tolerance by passing 1 as argument:

Run g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp in the main directory to create the .exe file.
Run ./mapreduce.exe and kill_process.py concurrently.
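The restart-on-kill behaviour can be illustrated with a small supervisor loop that reruns a worker whenever it exits abnormally. This is a sketch under stated assumptions, not the logic in src/master_fault_tolerance.cpp: `run_with_restart`, the `max_restarts` limit, and the stand-in worker script are all hypothetical, and the stand-in simulates being killed by exiting with a non-zero status on its first run.

```python
import os
import subprocess
import sys
import tempfile


def run_with_restart(command, max_restarts=3):
    """Run command; if it exits abnormally, start it again.

    Returns the number of times the command was started.
    """
    attempts = 0
    while True:
        attempts += 1
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempts
        if attempts > max_restarts:
            raise RuntimeError("worker failed after %d attempts" % attempts)


# Stand-in worker: fails on its first run (no marker file yet), succeeds after.
WORKER = (
    "import os, sys\n"
    "marker = sys.argv[1]\n"
    "if not os.path.exists(marker):\n"
    "    open(marker, 'w').close()\n"
    "    sys.exit(1)\n"
)

with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "attempted")
    attempts = run_with_restart([sys.executable, "-c", WORKER, marker])

print(attempts)  # the worker was started twice: one failure, one success
```

A worker terminated by a signal also reports a non-zero return code through `subprocess`, so the same check covers a process killed from outside.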

