Document Puppeteer

Document Puppeteer is a project that manipulates documents in multiple formats (PDF, Doc, images, videos): it runs OCR on them, corrects the grammar, and possibly translates them into other languages.

Requirements

This project relies on existing tools such as Tesseract and Poppler. I develop on an Apple Silicon Mac, so these are the instructions I used to set up the environment:

brew install tesseract
brew install poppler
arch -arm64e brew install --build-from-source mecab
git clone https://github.com/sparksam/document_puppeteer
cd document_puppeteer
python3.10 -m venv venv --upgrade-deps
source venv/bin/activate
pip cache purge # Delete cache files from previous installations
ARCHFLAGS='-arch arm64' pip install --compile --use-pep517 -r requirements.txt
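
After running the commands above, it can help to confirm that the command-line tools are actually reachable from the virtual environment. The following is a minimal sketch using only the standard library. The helper name `missing_tools` is hypothetical, and the default tool list is an assumption: `tesseract` and `mecab` are the binaries installed by the Homebrew formulas above, and `pdftoppm` is one of the utilities that ships with Poppler.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm", "mecab")):
    """Return the names in `tools` that cannot be found on PATH."""
    # shutil.which returns None when an executable is not on PATH.
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

If anything is reported missing, re-running the corresponding `brew install` line is the first thing to try.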

TODO

Python script that takes documents as input and runs OCR on them
Convert text to speech using Coqui TTS
Spell-check and fix errors
Convert the documents to other formats
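
The first TODO item could start from something like the sketch below. It is not the project's implementation. It assumes the `pdf2image` and `pytesseract` packages (Python wrappers around Poppler and Tesseract) are installed, which may or may not match `requirements.txt`, and the function names `build_parser`, `ocr_pdf`, and `main` are hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface: one PDF in, plain text out."""
    parser = argparse.ArgumentParser(description="OCR a PDF into plain text (sketch).")
    parser.add_argument("pdf", type=Path, help="input PDF file")
    parser.add_argument("--dpi", type=int, default=300, help="rasterization DPI")
    parser.add_argument("--lang", default="eng", help="Tesseract language code")
    return parser


def ocr_pdf(pdf: Path, dpi: int = 300, lang: str = "eng") -> str:
    """Rasterize every page with Poppler, then OCR each page with Tesseract."""
    # Imported lazily so the argument parser works without the OCR stack installed.
    from pdf2image import convert_from_path  # assumed dependency
    import pytesseract  # assumed dependency

    pages = convert_from_path(str(pdf), dpi=dpi)  # list of PIL images, one per page
    return "\n\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)


def main(argv: list[str]) -> None:
    args = build_parser().parse_args(argv)
    print(ocr_pdf(args.pdf, dpi=args.dpi, lang=args.lang))
```

To use it as a script, call `main(sys.argv[1:])` at the entry point. Images and videos would need their own loading step before the OCR call, and this sketch does not cover them.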

Issues

Processing large documents with Poppler and Tesseract is time- and resource-consuming. Find ways to improve this. Reducing the DPI would probably help, but does it come at the cost of text accuracy?
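
To put rough numbers on the DPI question: the pixel count of a rasterized page grows with the square of the DPI, so halving the DPI cuts the data Tesseract has to process to about a quarter. The sketch below only does that arithmetic. The helper name `page_pixels` is hypothetical, US Letter (8.5 x 11 in) is an assumed page size, and the effect on accuracy is not modelled here and would still have to be measured on real documents.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in one page rasterized at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


# Assumed page size: US Letter, 8.5 x 11 inches.
full = page_pixels(8.5, 11, 300)
half = page_pixels(8.5, 11, 150)

for dpi in (150, 200, 300):
    print(dpi, "DPI:", page_pixels(8.5, 11, dpi), "pixels per page")

print("300 DPI vs 150 DPI:", full / half, "times more pixels")
```

A practical experiment would be to OCR a few representative pages at each DPI and compare both the run time and the recognized text before settling on a lower value.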

Authors

@sparksam ☕️
