This repository contains implementations of a document classification algorithm using both single-process and MPI-based parallel processing techniques. This project is part of the TU Sofia Parallel Processing of Information course.
The Document-Classification-Algorithm project demonstrates two different approaches to document classification:
- Single-Process Implementation: A straightforward implementation that processes documents one by one.
- Open MPI-Based Parallel Implementation: A parallelized version that uses MPI to distribute the workload among multiple processes, so that several documents are classified at the same time and total run time can drop on large document sets.
Both versions classify documents based on a catalog of topics and identifiers, reading from text files and determining the relevance of each document to different topics.
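The relevance step described above could look roughly like the sketch below. It assumes the catalog maps each topic to a set of identifier words and that relevance is simply the number of document tokens matching those identifiers. The names `Catalog` and `scoreDocument` are hypothetical and are not taken from the repository, which may score documents differently.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical catalog layout: topic name -> set of identifier words.
using Catalog = std::map<std::string, std::set<std::string>>;

// For each topic, count how many document tokens match one of the topic's
// identifiers. Assumption: relevance is a raw match count, nothing more.
std::map<std::string, std::size_t> scoreDocument(
    const Catalog& catalog, const std::vector<std::string>& tokens) {
    std::map<std::string, std::size_t> scores;
    for (const auto& entry : catalog) {
        const std::set<std::string>& identifiers = entry.second;
        std::size_t hits = 0;
        for (const auto& token : tokens) {
            if (identifiers.count(token) > 0) {
                ++hits;
            }
        }
        scores[entry.first] = hits;  // topics with no matches keep a score of 0
    }
    return scores;
}
```

Calling `scoreDocument(catalog, tokens)` returns one score per topic, so a caller could pick the highest-scoring topic or write all scores out.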
Single-process implementation:
- Reads a catalog of topics and identifiers.
- Tokenizes and processes each document sequentially.
- Outputs the classification results to a CSV file.
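As a rough illustration of the tokenizing and CSV steps listed above, the sketch below lowercases the text, splits it on any non-alphanumeric character, and formats one result row. The function names `tokenize` and `toCsvRow` and the column layout are assumptions made for this example; the repository's own tokenizer and CSV format may differ.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split text into lowercase alphanumeric tokens; any other character
// acts as a separator.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Format one CSV row: the document name followed by one score per topic,
// in the map's alphabetical topic order. Hypothetical layout; names that
// contain commas are not escaped in this sketch.
std::string toCsvRow(const std::string& document,
                     const std::map<std::string, std::size_t>& scores) {
    std::ostringstream row;
    row << document;
    for (const auto& entry : scores) {
        row << ',' << entry.second;
    }
    return row.str();
}
```

A sequential driver would call `tokenize` on each file's contents, score the tokens, and append the string from `toCsvRow` to the output file.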
MPI-based implementation:
- Utilizes MPI for parallel processing.
- Distributes document classification tasks across multiple processes.
- Aggregates and outputs the results.
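The README does not say how documents are divided among processes, so the sketch below only illustrates one common option: round-robin assignment by rank, followed by a merge of the per-rank results. It deliberately contains no MPI calls so that it compiles without an MPI installation. In a real MPI program, `rank` and `worldSize` would come from `MPI_Comm_rank` and `MPI_Comm_size`, and the partial results would have to be sent to one process before merging. `documentsForRank`, `Results`, and `mergeResults` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Indices of the documents a given rank would handle under round-robin
// assignment: rank r takes documents r, r + worldSize, r + 2 * worldSize, ...
// Returns an empty list for an invalid rank or world size.
std::vector<std::size_t> documentsForRank(std::size_t documentCount,
                                          int rank, int worldSize) {
    std::vector<std::size_t> indices;
    if (rank < 0 || worldSize <= 0 || rank >= worldSize) {
        return indices;
    }
    for (std::size_t i = static_cast<std::size_t>(rank); i < documentCount;
         i += static_cast<std::size_t>(worldSize)) {
        indices.push_back(i);
    }
    return indices;
}

// Hypothetical result type: document name -> assigned topic label.
using Results = std::map<std::string, std::string>;

// Combine the partial results produced by each rank into a single map.
// Each document is assumed to appear in exactly one partial result.
Results mergeResults(const std::vector<Results>& perRank) {
    Results merged;
    for (const auto& partial : perRank) {
        merged.insert(partial.begin(), partial.end());
    }
    return merged;
}
```

Round-robin keeps the per-rank document counts within one of each other, which suits documents of similar size; a block split or dynamic work queue might fit better when sizes vary widely.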
- C++: The core programming language used for both implementations.
- Open MPI (Message Passing Interface): Used for parallel processing in the MPI-based implementation.
- Standard Template Library (STL): Utilized for data structures and algorithms.
- Filesystem Library: Used for directory and file handling.
- Chrono Library: Used for measuring execution time.
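For readers unfamiliar with the last two libraries, here is a minimal sketch of how `std::filesystem` and `std::chrono` are typically used for this kind of task. The helper names `listTextFiles` and `elapsedMilliseconds` and the `.txt` filter are assumptions for illustration and are not taken from the repository. The code requires C++17.

```cpp
#include <algorithm>
#include <chrono>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Collect the regular files with a .txt extension directly inside
// `directory`. The result is sorted because directory iteration order
// is unspecified.
std::vector<fs::path> listTextFiles(const fs::path& directory) {
    std::vector<fs::path> files;
    for (const auto& entry : fs::directory_iterator(directory)) {
        if (entry.is_regular_file() && entry.path().extension() == ".txt") {
            files.push_back(entry.path());
        }
    }
    std::sort(files.begin(), files.end());
    return files;
}

// Run `work` once and return the elapsed wall-clock time in milliseconds.
// steady_clock is used because it never jumps backwards.
template <typename Work>
double elapsedMilliseconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the whole classification run in `elapsedMilliseconds` gives one number per run, which makes it easy to compare the single-process and MPI versions on the same input.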
- Clone the repository:
git clone https://github.com/yourusername/Document-Classification-Algorithm.git
- Navigate to the project directory:
cd Document-Classification-Algorithm
- Compile the single-process code:
g++ ./main.cpp -o single_classification
- Compile the MPI-based code:
mpicxx -I/usr/lib/x86_64-linux-gnu/openmpi/include main.cpp -o mpi_classification
- Run the single-process classification:
./single_classification
- Run the MPI-based classification (example with 4 processes):
mpirun -np 4 ./mpi_classification
Here 4 is the number of processes you want Open MPI to spawn.
This project is licensed under the MIT License - see the LICENSE file for details.