pdf_split.py
- Splits a PDF into smaller PDFs of arbitrary length.
run_ocr.py
- Runs OCR on a PDF and writes the extracted text to a file.
chapters.py
- Splits a text document into multiple text documents at each occurrence of a repeated string such as "CHAPTER". Useful for creating small amounts of dev data.
doc.prep.py
- Turns a text file, or a directory of text files, into reduced bags of words ready for use with the https://pypi.python.org/pypi/lda package.
run_lda.py
- Runs LDA using the https://pypi.python.org/pypi/lda package.
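
As a rough illustration of the two text-preparation steps above (splitting on a repeated string, then reducing each piece to a bag of words), here is a minimal standard-library sketch. It is not the code in chapters.py or doc.prep.py. The function names, the stopword list, and the tokenization rule are all assumptions made for this example, and "reduced" is read here as "lowercased with stopwords dropped", which may differ from what the real script does.

```python
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def split_on_marker(text, marker="CHAPTER"):
    """Split text into pieces, each beginning at an occurrence of marker."""
    starts = [m.start() for m in re.finditer(re.escape(marker), text)]
    if not starts:
        return [text]
    # Keep any text before the first marker as its own piece.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(text)]
    pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    # Drop pieces that contain only whitespace.
    return [p for p in pieces if p.strip()]


def bag_of_words(text, stopwords=STOPWORDS):
    """Count lowercase alphabetic tokens, skipping stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


def doc_term_matrix(docs):
    """Return (vocab, matrix) with one row of integer counts per document."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags)) if bags else []
    # Counter returns 0 for words a document does not contain.
    matrix = [[bag[w] for w in vocab] for bag in bags]
    return vocab, matrix
```

Calling `doc_term_matrix(split_on_marker(text))` yields a sorted vocabulary and a list of count rows, one per piece. The lda package fits on a document-term matrix of integer counts, so rows like these would need to be converted to an array (for example with NumPy) before being passed to it.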