Building Blocks

Below is a list of the corpus tools we use at Mind Mimic Labs. They are intended as building blocks for general research in our lab and as a source of publication boilerplate. Each tool is stand-alone and includes both code (~/code) and documentation (~/docs). A single requirements.txt covering all the tools is in the root of the repo. Each tool's documentation explains what the code is for, how to run it, and what publication boilerplate to put in the Methods and Materials section.

Scripts

Unless otherwise noted, all scripts follow the same execution path.

1. Open a command prompt.
2. Change into the ~/code folder.
3. Run python {{scriptname}}.py -in d:/corpus_in -out d:/corpus_out, changing the input and output paths as needed.
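As a rough illustration of the shared command line, the sketch below shows one way a script could read the -in and -out arguments with Python's standard argparse module. The function name parse_args and the attribute names folder_in and folder_out are assumptions made for this example; the actual scripts in ~/code may parse their arguments differently.

```python
import argparse
import pathlib


def parse_args(argv=None):
    """Parse the -in / -out folder arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Read a corpus from one folder and write the result to another."
    )
    # `in` is a Python keyword, so an explicit dest is needed for both options.
    parser.add_argument(
        "-in", dest="folder_in", type=pathlib.Path, required=True,
        help="folder containing the input corpus",
    )
    parser.add_argument(
        "-out", dest="folder_out", type=pathlib.Path, required=True,
        help="folder that will receive the output corpus",
    )
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Calling parse_args(["-in", "d:/corpus_in", "-out", "d:/corpus_out"]) returns a namespace whose folder_in and folder_out attributes hold the two paths.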

The current scripts are listed below. In general, run documents-to-corpus first, then the other scripts. Individual papers, projects, and repos give the exact order in the Tabula Rasa section of their README.md.
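When several scripts are run in sequence, one possible arrangement is to feed each script's output folder to the next script as its input. The sketch below only builds the command lines for such a chain. The helper name build_commands, the numbered per-step output folders, and the choice of scripts in the example are assumptions for illustration, not something this repo prescribes.

```python
import sys
from pathlib import Path


def build_commands(script_names, corpus_in, work_dir):
    """Build one command per script, chaining each output folder to the next input."""
    commands = []
    current_in = Path(corpus_in)
    for step, name in enumerate(script_names, start=1):
        # Hypothetical layout: one numbered output folder per step.
        current_out = Path(work_dir) / f"{step:02d}-{name}"
        commands.append(
            [sys.executable, f"{name}.py", "-in", str(current_in), "-out", str(current_out)]
        )
        current_in = current_out
    return commands


if __name__ == "__main__":
    # Example order only; follow the Tabula Rasa section of the relevant README.md.
    for command in build_commands(
        ["documents-to-corpus", "lowercase-corpus"], "d:/corpus_in", "d:/work"
    ):
        print(" ".join(command))
```

Each command could then be executed from the ~/code folder, for example with subprocess.run(command, check=True, cwd=...).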

Data Pre-Processing

remove-stop-words
lowercase-corpus
stem-corpus
remove-whitespace-from-corpus
remove-punction-from-corpus
remove-numbers-from-corpus

Formatting

documents-to-corpus
flatten-corpus
re-encode-corpus

Deep Learning

vectorize-corpus
normalize-corpus-by-padding
normalize-corpus-by-truncation
normalize-corpus-by-windowing
normalize-corpus-by-zipfs-law
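For readers unfamiliar with the terms, padding, truncation, and windowing are common ways to bring token sequences to a fixed length. The sketch below illustrates the general techniques on plain Python lists. The function names, the pad value of 0, and the stride parameter are assumptions for this example and do not describe how the scripts above are implemented.

```python
def pad(tokens, length, pad_value=0):
    """Append pad_value until the sequence has at least `length` items."""
    return list(tokens) + [pad_value] * max(0, length - len(tokens))


def truncate(tokens, length):
    """Keep only the first `length` items."""
    return list(tokens[:length])


def windows(tokens, size, stride):
    """Split a sequence into fixed-size windows taken every `stride` items.

    Sequences no longer than `size` come back as a single window; trailing
    items that do not fill a complete window are dropped.
    """
    if len(tokens) <= size:
        return [list(tokens)]
    return [
        list(tokens[start:start + size])
        for start in range(0, len(tokens) - size + 1, stride)
    ]
```

A short sequence is typically padded and a long one either truncated or split into windows, depending on whether the text beyond the cut-off should be kept.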

Misc

corpus-to-reading-level
(un)nest-corpus
