textract


A single-header, high-performance C++ image processing library that reads the text content of images and writes it out to text files.




Build from Source using CMake

Dependencies


brew install opencv openssl libomp folly tesseract

Build


cd textract && mkdir build && cd build
cmake ..
make

# using LLVM and Clang++ directly
cmake -DCMAKE_CXX_COMPILER=/path/to/clang++ -DCMAKE_C_COMPILER=/path/to/clang ..
make

# getting clang++ and clang paths
echo $(brew --prefix llvm)/bin/clang++
echo $(brew --prefix llvm)/bin/clang

Design


OpenCV and Tesseract

Used to process images and to extract text from them with Tesseract OCR.


OpenSSL

Used to generate SHA-256 hashes from image bytes and metadata.


OpenMP

Used to parallelize image processing on multi-core systems.
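As a rough illustration of the kind of loop OpenMP parallelizes, here is a minimal sketch. `extractText` and `processAll` are hypothetical names made up for this example and are not part of textract's API; the real per-image work would be the OCR step.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the per-image OCR step; not textract's API.
std::string extractText(const std::string& path) {
    return "text from " + path;
}

// Process every path in parallel. Each iteration writes only to its own
// slot in `results`, so the loop body needs no locking.
std::vector<std::string> processAll(const std::vector<std::string>& paths) {
    std::vector<std::string> results(paths.size());
    const long n = static_cast<long>(paths.size());

    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        results[i] = extractText(paths[i]);
    }
    return results;
}
```

`schedule(dynamic)` is one reasonable choice when per-image cost varies a lot; whether textract uses it is not stated here.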


Folly

textract uses Folly's AtomicUnorderedInsertMap for its cache implementation, which provides wait-free parallel access to the cache.

Folly::AtomicUnorderedInsertMap



Usage


Process Images and get their textual content


#include <string>
#include <vector>

#include "imgtotext.h"

int main() {
    imgstr::ImageTranslator app;

    // Extract the text from each image passed in
    std::vector<std::string> results = app.processImages("cs101_notes.png", "bio.jpeg");

    // Write the text extracted from cs101_notes.png to cs_notes.txt
    app.writeImageTextOut("cs101_notes.png", "cs_notes.txt");

    return 0;
}

Process all valid Image files from a directory and create text files


#include "imgtotext.h"

int main() {
    /* Process 10000 images using parallelism */
    imgstr::ImageTranslator app = imgstr::ImageTranslator(10000);

    app.processImagesWriteResults("/path/to/dir");

    return 0;
}
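
For readers wondering what "valid image files" could mean in practice, the sketch below shows one way to pick image files out of a directory with `std::filesystem` (C++17). The helper names and the extension list (`.png`, `.jpg`, `.jpeg`) are assumptions made for illustration; the library's actual filter may differ.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Assumed extension list; textract's real list may differ.
bool hasImageExtension(const fs::path& p) {
    std::string ext = p.extension().string();
    // Lowercase the extension so "A.PNG" and "a.png" are treated alike.
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return ext == ".png" || ext == ".jpg" || ext == ".jpeg";
}

// Collect regular files in `dir` (non-recursive) that look like images.
std::vector<fs::path> listImages(const fs::path& dir) {
    std::vector<fs::path> images;
    for (const auto& entry : fs::directory_iterator(dir)) {
        if (entry.is_regular_file() && hasImageExtension(entry.path())) {
            images.push_back(entry.path());
        }
    }
    std::sort(images.begin(), images.end());  // deterministic order
    return images;
}
```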


In Memory Cache Benchmarks


============================================================================
/textract/benchmarks/cache_benchmark.cc     relative  time/iter   iters/s
============================================================================
UnorderedMapMutexedSingleThreaded                         254.95ns     3.92M
UnorderedMapMutexedMultiThreaded                            3.23us   309.19K
UnorderedMapMutexedMaxThreads                               7.28us   137.27K
ConcurrentHashMapSingleThreaded                           859.52ns     1.16M
ConcurrentHashMapMultiThreaded                              3.41us   293.37K
ConcurrentHashMapMaxThreads                                26.40us    37.87K
ConcurrentHashMapComplexSingleThreaded                      1.43us   700.82K
ConcurrentHashMapComplexMultiThreaded                       4.81us   207.69K
ConcurrentHashMapComplexMaxThreads                         32.92us    30.38K
AtomicUnorderedMapSingleThreaded                          159.61ns     6.27M
AtomicUnorderedMapMultiThreaded                           403.47ns     2.48M
AtomicUnorderedMapMaxThreads                                1.63us   611.78K
AtomicUnorderedMapComplexSingleThreaded                   917.79ns     1.09M
AtomicUnorderedMapComplexMultiThreaded                      2.44us   409.85K
AtomicUnorderedMapComplexMaxThreads                        12.62us    79.25K
============================================================================

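In the multi-threaded benchmarks, AtomicUnorderedInsertMap is over 8x faster than ConcurrentHashMap (403.47ns vs 3.41us per iteration), and about 16x faster at max threads (1.63us vs 26.40us). In the Complex variants the gap narrows to roughly 2x to 2.6x.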

The Folly documentation for AtomicUnorderedInsertMap provides an overview of the tradeoffs.
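
To make the mutex-guarded baseline concrete, here is a minimal standard-library sketch of a locked `std::unordered_map` cache and a crude timing helper. This is not the project's benchmark code; `MutexedCache` and `timeInserts` are hypothetical names, and the numbers it produces are not comparable to the table above.

```cpp
#include <chrono>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Mutex-guarded map: every insert and size query takes the same lock,
// so threads serialize on it under contention.
class MutexedCache {
public:
    void put(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        map_.emplace(key, value);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return map_.size();
    }
private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, std::string> map_;
};

// Inserts `perThread` distinct keys from each of `threads` threads and
// returns the average wall-clock nanoseconds per insert.
double timeInserts(MutexedCache& cache, int threads, int perThread) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&cache, t, perThread] {
            for (int i = 0; i < perThread; ++i) {
                cache.put(std::to_string(t) + ":" + std::to_string(i), "v");
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    const std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / (static_cast<double>(threads) * perThread);
}
```

Calling `timeInserts(cache, 1, n)` and then `timeInserts(cache, k, n)` on fresh caches gives a feel for how lock contention grows with thread count, which is the effect the wait-free map is meant to avoid.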



Author: kuro337