Project 1: MapReduce on a single server

Collaborators: Alolika Gon, Kinnri Sinha

Runs the mapper and reducer on multiple worker processes, and incorporates fault tolerance by restarting a worker process if it is killed.

Application Explanation

Application 1: Inverted Index

This application takes as input multiple documents, each given as a document ID followed by the words it contains. We build an inverted index from this input, i.e., a mapping from each word to all the documents in which it appears.

Input:

Document ID \t Contents, i.e., words separated by spaces

Output:

Word \t Document IDs in which this word is present

Code: src/UDF1.cpp
Input File: inputFile1.txt
Python Code: invertedIndex.py
Outputs from Python execution: true_outputFile1.txt
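
For intuition, the inverted-index transformation can be sketched as a standalone C++ program. This is an illustration only, not the project's UDF: the actual code lives in src/UDF1.cpp and runs inside the mapper and reducer workers. Conceptually, the map step emits (word, document ID) pairs and the reduce step collects the document IDs for each word.

    // Standalone sketch of the inverted-index logic (illustrative only).
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>

    int main() {
        std::ifstream in("inputFile1.txt");
        std::map<std::string, std::set<std::string>> index;  // word -> document IDs
        std::string line;
        while (std::getline(in, line)) {
            std::size_t tab = line.find('\t');
            if (tab == std::string::npos) continue;
            std::string docId = line.substr(0, tab);
            std::istringstream words(line.substr(tab + 1));
            std::string word;
            // Conceptually the "map" step emits (word, docId); inserting into the
            // per-word set plays the role of the "reduce" step here.
            while (words >> word) index[word].insert(docId);
        }
        for (const auto& entry : index) {
            std::cout << entry.first << '\t';
            for (const auto& id : entry.second) std::cout << id << ' ';
            std::cout << '\n';
        }
    }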

Application 2: Word Count

The goal of this application is to count the number of occurrences of each word in the document.

Input:

A text document

Output:

Word \t Number of occurrences of the word

Code: src/UDF2.cpp
Input File: inputFile2.txt
Python Code: spark_word_count.py
Outputs from Python execution: true_outputFile2.txt
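
The word-count logic can likewise be sketched as a standalone program (illustrative only; the actual UDF is in src/UDF2.cpp). Conceptually, the map step emits (word, 1) for each word and the reduce step sums the counts per word.

    // Standalone sketch of the word-count logic (illustrative only).
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::ifstream in("inputFile2.txt");
        std::map<std::string, long> counts;  // word -> number of occurrences
        std::string word;
        while (in >> word) ++counts[word];   // map emits (word, 1); reduce sums the 1s
        for (const auto& entry : counts)
            std::cout << entry.first << '\t' << entry.second << '\n';
    }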

Application 3: k-mer Counter

In bioinformatics, k-mers are substrings of length k within a genome sequence composed of the nucleotides A, C, T, and G. In this application, we extract all k-length substrings of a genome sequence and count the number of occurrences of each. We have taken k=3 for this application.

Input:

A genome sequence containing the 4 nucleotides (A, C, T and G)

Output:

3-mer sequence \t Number of occurrences of the 3-mer sequence

Code: src/UDF3.cpp
Input File: inputFile3.txt
Python Code: kmerCount.py
Outputs from Python execution: true_outputFile3.txt
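
The 3-mer counting logic amounts to sliding a window of length 3 over the sequence and counting each substring. A standalone sketch (illustrative only; the actual UDF is in src/UDF3.cpp):

    // Standalone sketch of the 3-mer counting logic (illustrative only).
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::ifstream in("inputFile3.txt");
        std::string genome, chunk;
        while (in >> chunk) genome += chunk;   // concatenate the sequence, ignoring whitespace
        const std::size_t k = 3;
        std::map<std::string, long> counts;    // 3-mer -> number of occurrences
        for (std::size_t i = 0; i + k <= genome.size(); ++i)
            ++counts[genome.substr(i, k)];     // slide a window of length k
        for (const auto& entry : counts)
            std::cout << entry.first << '\t' << entry.second << '\n';
    }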

Running automated testing:

testfile.py compiles the code and runs it for the different config file attributes: UDF1, UDF2, and UDF3. It also runs the test cases for the three UDFs and checks the output against the true results generated by the Python scripts.

Run testfile.py (pass command-line argument 1 to test fault tolerance; otherwise pass 0):
pip install psutil
python testfile.py <0/1>

Three output files will be created, one per UDF: outputFile1.txt, outputFile2.txt, and outputFile3.txt.
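
Conceptually, the check against the true results is a comparison of the generated output file with the Python-generated reference file. Since line order may differ between runs, the files can be compared as sorted sets of lines. This is a sketch of the idea only; testfile.py performs the actual check, and the paths below are illustrative.

    // Sketch of comparing a generated output file against a Python-generated reference.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    std::vector<std::string> read_sorted_lines(const std::string& path) {
        std::ifstream in(path);
        std::vector<std::string> lines;
        std::string line;
        while (std::getline(in, line)) lines.push_back(line);
        std::sort(lines.begin(), lines.end());   // ignore ordering differences
        return lines;
    }

    int main() {
        bool same = read_sorted_lines("output_dir/outputFile1.txt")
                 == read_sorted_lines("true_outputFile1.txt");
        std::cout << (same ? "PASS" : "FAIL") << '\n';
    }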

Running the system for one application:

Making changes to config.txt:

config.txt is the configuration file that defines the input file, the output file, the number of mappers and reducers, and the UDF to run. It has the following format:

app.inputfilename=inputFile3
app.outputfilename=output_dir/outputFile3
app.N=3
app.class_name=UDF3

Changes to app.inputfilename: Type the name of the file in plain text. Do not use file extensions or quotes.
Changes to app.outputfilename: Type the name of the file in plain text after output_dir/. Do not use file extensions or quotes. All generated files will be in .txt format.
Changes to app.N: The number of mappers and reducers to use.
Changes to app.class_name: There are 3 choices for this: UDF1/UDF2/UDF3.
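
For illustration, the key=value format above can be parsed as shown below. This is a sketch only; the project's own config parsing may differ.

    // Illustrative key=value parser for config.txt (not the project's actual parser).
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::ifstream in("config.txt");
        std::map<std::string, std::string> config;
        std::string line;
        while (std::getline(in, line)) {
            std::size_t eq = line.find('=');
            if (eq == std::string::npos) continue;
            config[line.substr(0, eq)] = line.substr(eq + 1);  // e.g. "app.class_name" -> "UDF3"
        }
        std::cout << "UDF to run: " << config["app.class_name"] << '\n';
    }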

Compiling and running the code:

  1. Run g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp in the main directory to create the .exe file.
  2. Run ./mapreduce.exe.

Test fault tolerance by passing 1 as the argument:

  1. Run g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp in the main directory to create the .exe file.
  2. Run ./mapreduce.exe and kill_process.py concurrently.
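
The fault tolerance described at the top of this README (restarting a worker process if it is killed, e.g. by kill_process.py) can be pictured with the following minimal sketch. It is a hypothetical illustration using fork/waitpid, not the project's actual code, which lives in src/master_fault_tolerance.cpp.

    // Hypothetical sketch of worker restart on failure (not the project's actual code).
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    // Placeholder for the real worker entry point (a map or reduce task would run here).
    void run_worker(int /*id*/) { _exit(0); }

    pid_t spawn_worker(int id) {
        pid_t pid = fork();
        if (pid == 0) run_worker(id);   // child: run the task and exit
        return pid;                     // parent: remember the child's pid
    }

    int main() {
        const int num_workers = 3;
        pid_t pids[num_workers];
        for (int i = 0; i < num_workers; ++i) pids[i] = spawn_worker(i);

        for (int i = 0; i < num_workers; ++i) {
            int status = 0;
            waitpid(pids[i], &status, 0);
            if (WIFSIGNALED(status)) {                       // worker was killed
                std::printf("worker %d killed, restarting\n", i);
                pids[i] = spawn_worker(i);                   // restart it
                waitpid(pids[i], nullptr, 0);                // wait for the restarted worker
            }
        }
        return 0;
    }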
