sparksam/document-puppeteer


Document Puppeteer

Document Puppeteer is a project that manipulates documents in multiple formats (PDF, Doc, images, videos): it runs OCR on them, corrects the grammar, and possibly translates them to other languages.

Requirements

This project relies on existing tools such as Tesseract and Poppler. I develop on an Apple Silicon Mac, so these are the commands I use to set up the environment:

brew install tesseract
brew install poppler
arch -arm64e brew install --build-from-source mecab
git clone https://github.com/sparksam/document-puppeteer
cd document-puppeteer
python3.10 -m venv venv --upgrade-deps
source venv/bin/activate
pip cache purge # Delete cached files from previous installations
ARCHFLAGS='-arch arm64' pip install --compile --use-pep517 -r requirements.txt
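After running the commands above, a quick check that the command-line tools are on the PATH can save some debugging. This is a minimal sketch, not part of the project: the function name `missing_tools` and the list of binary names are assumptions (`pdftoppm` is one of the executables shipped with Poppler).

```python
import shutil

# Assumed binary names: tesseract (Tesseract), pdftoppm (Poppler), mecab (MeCab).
REQUIRED_TOOLS = ["tesseract", "pdftoppm", "mecab"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```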

TODO

  • Python script that takes documents as input and runs OCR on them
  • Convert text to speech using Coqui TTS
  • Fix spelling errors (spellcheck)
  • Convert the documents to other formats
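The first TODO item could be sketched roughly as follows, calling the Poppler and Tesseract command-line tools directly. This is only an illustration under assumptions, not the project's implementation: the function names, the default of 300 DPI, and the `eng` language code are all hypothetical choices.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the Poppler command that renders each PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Extract the page number from a file name such as page-03.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render a PDF page by page and return the OCR text of all pages."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
        # Sort numerically so that page 10 does not come before page 2.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                tesseract_command(image, lang),
                check=True, capture_output=True, text=True,
            )
            texts.append(result.stdout)
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` (with a hypothetical file name) would need both tools installed as described in Requirements. Building the commands in separate functions keeps them easy to inspect without running anything.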

Issues

  • Processing large documents with Poppler and Tesseract is time- and resource-consuming. Find ways to improve this. Reducing the DPI is one option, but it may trade away text accuracy.
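To put a rough number on the DPI question: the pixel count of a rendered page grows with the square of the DPI, so halving the DPI leaves about a quarter of the pixels to render and OCR. The sketch below only does that arithmetic for a US Letter page (8.5 x 11 inches, an assumed page size); it says nothing about accuracy, which would have to be measured on sample documents.

```python
def page_pixels(width_in, height_in, dpi):
    """Number of pixels in a page of the given size (inches) rendered at `dpi`."""
    return round(width_in * dpi) * round(height_in * dpi)


high = page_pixels(8.5, 11, 300)  # 2550 x 3300 = 8415000 pixels
low = page_pixels(8.5, 11, 150)   # 1275 x 1650 = 2103750 pixels
print(high / low)                 # 4.0
```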

Authors

  • @sparksam ☕️
