Document Puppeteer is a project that manipulates documents in multiple formats (PDF, Doc, images, videos): it runs OCR on them, corrects the grammar, and possibly translates the text into other languages.
This project relies on existing tools such as Tesseract and Poppler. I use an Apple Silicon Mac for development, so here are the instructions I used to set up the environment:
brew install tesseract
brew install poppler
arch -arm64e brew install --build-from-source mecab
git clone https://github.com/sparksam/document_puppeteer
cd document_puppeteer
python3.10 -m venv venv --upgrade-deps
source venv/bin/activate
pip cache purge # Delete cache files from previous installations
ARCHFLAGS='-arch arm64' pip install --compile --use-pep517 -r requirements.txt
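After running the commands above, a quick sanity check can confirm that the command-line tools actually landed on the PATH. The sketch below is not part of the project; the `find_tools` helper and the choice of `pdftoppm` (one of the binaries that ships with Poppler) as the Poppler probe are assumptions made for illustration.

```python
import shutil

# Executables the brew commands above are expected to put on PATH.
# "pdftoppm" is one of the command-line tools shipped with Poppler;
# checking for it is an assumption, any Poppler binary would do.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "mecab")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND usually means the corresponding brew step failed or Homebrew's bin directory is missing from the shell's PATH.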
- Python script that takes documents as input and runs OCR on them
- Convert text to speech using Coqui TTS
- Spell-check and correct the text
- Convert the documents to other formats.
- Processing large documents with Poppler and Tesseract is time- and resource-consuming. Find ways to improve this. Reducing the DPI might help, but would that come at the cost of text accuracy?
- @sparksam ☕️
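
To make the OCR step and the DPI question in the list above more concrete, here is a minimal sketch. It assumes the `pdf2image` (Poppler wrapper) and `pytesseract` (Tesseract wrapper) packages, which may or may not be what `requirements.txt` actually pins. The names `raster_size`, `rgb_bytes` and `ocr_pdf`, and the default of 200 DPI, are hypothetical choices for illustration, not the project's code.

```python
from typing import Iterator, Tuple


def raster_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Return (width, height) in pixels for a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an 8-bit RGB image: three bytes per pixel."""
    return width_px * height_px * 3


def ocr_pdf(pdf_path: str, dpi: int = 200) -> Iterator[str]:
    """Yield the OCR text of a PDF one page at a time.

    Rendering a single page per call keeps only one page image in memory,
    instead of rasterizing the whole document up front.
    """
    # Imported lazily so the size helpers above work without these packages.
    # Assumption: pdf2image and pytesseract are installed in the venv.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    page_count = pdfinfo_from_path(pdf_path)["Pages"]
    for page in range(1, page_count + 1):
        image = convert_from_path(
            pdf_path, dpi=dpi, first_page=page, last_page=page
        )[0]
        yield pytesseract.image_to_string(image)


if __name__ == "__main__":
    # Compare the raster cost of a US Letter page (8.5 x 11 in) at several DPIs.
    for dpi in (150, 200, 300):
        w, h = raster_size(8.5, 11.0, dpi)
        mib = rgb_bytes(w, h) / 2**20
        print(f"{dpi} DPI: {w} x {h} px, about {mib:.1f} MiB uncompressed")
```

Pixel count grows with the square of the DPI, so halving it from 300 to 150 cuts the per-page image to a quarter of the size. Whether accuracy survives that reduction depends on the font size and scan quality of the input, so it is worth measuring on a few representative documents before settling on a value.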